Multicomponent Spatial-Temporal Graph Attention Convolution Networks for Traffic Prediction with Spatially Sparse Data

Predicting traffic data on traffic networks is essential to transportation management. It is a challenging task due to the complicated spatial-temporal dependency. The latest studies mainly focus on capturing temporal and spatial dependencies with spatially dense traffic data. However, when traffic data become spatially sparse, existing methods cannot capture sufficient spatial correlation information and thus fail to learn the temporal periodicity sufficiently. To address these issues, we propose a novel deep learning framework, Multicomponent Spatial-Temporal Graph Attention Convolutional Networks (MSTGACN), for traffic prediction, and we successfully apply it to predicting traffic flow and speed with spatially sparse data. MSTGACN mainly consists of three independent components to model three types of periodic information. Each component in MSTGACN combines dilated causal convolution, a graph convolution layer, and a weight-shared graph attention layer. Experimental results on three real-world traffic datasets, METR-LA, PeMS-BAY, and PeMSD7-sparse, demonstrate the superior performance of our method in the case of spatially sparse data.


Introduction
Traffic prediction is one of the most essential tasks in Intelligent Transportation Systems [1]. The goal of this task is to predict future traffic conditions (e.g., traffic speed and traffic volume) by analyzing historical traffic data. Accurate and timely traffic prediction is essential to many real-world applications. For example, if traffic data could be predicted accurately in advance, the transportation department could dynamically adjust the timing of traffic lights; moreover, the navigation system could change routes in time to reduce congestion. However, traffic prediction is very challenging because of the dynamic spatial correlations and nonlinear temporal correlations. Early traffic prediction methods [2,3] can be divided into classic statistical methods and machine learning models, which are limited by the stationarity assumption and fail to capture spatial correlations.
Recently, many deep learning models have been proposed for traffic prediction. For spatial modeling, graph convolutional neural networks (GCN) [4][5][6] are widely used in graph-based data. Diffusion convolutions [7] and attention mechanism [8][9][10][11][12][13] are also adopted by researchers to capture spatial dependencies. From the perspective of periodicity, some methods use time information of sample data as additional input features [5,7] to learn periodic information, and some attempts [8,14,15] divide the data and model into multiple components to capture the correlations under different periods. However, existing methods have the following shortcoming.
Existing approaches mainly focus on capturing the temporal and spatial dependencies based on spatially dense data. In the real world, some regions lack enough available detectors due to underdeveloped transportation infrastructure or abnormal conditions (equipment maintenance, extreme weather, etc.) [16]. As the experiment in Figure 1 shows, as the number of detectors decreases, the prediction results of existing models become worse. The reason is that most existing methods [5,7,17] calculate the spatial dependence only once in each module, so the attention mechanism is not fully utilized. Therefore, when detectors are insufficient or adjacent detectors are far apart, the spatial correlation features cannot be fully captured by existing methods. In this article, we propose a novel deep learning architecture to address this problem. To the best of our knowledge, this work is the first to raise the problem of spatially sparse data.
To address the problem, we propose a novel framework called Multicomponent Spatial-Temporal Graph Attention Convolutional Networks (MSTGACN), which consists of three relatively independent modules; each module is composed of multiple spatial-temporal graph attention convolution blocks to capture spatial correlations efficiently in the case of spatially sparse data. We sample a part of the detector data from two public datasets, METR-LA and PeMS-BAY [7], and we construct a sparse dataset by sampling detectors in district-7 of the Caltrans Performance Measurement System (PeMS). We evaluate MSTGACN on three sparse datasets, and experimental results demonstrate that MSTGACN outperforms existing methods.
Overall, the contributions of our work can be summarized as follows: (1) We propose a spatial-temporal graph attention convolution block consisting of dilated causal convolution, a graph convolution layer, and a graph attention layer. The parameters of the two GAT layers in one block are shared. MSTGACN can thus capture the spatial-temporal features more effectively in the case of spatially sparse data. (2) We adopt two strategies to capture multiple types of periodic information effectively. First, day-of-week and time-of-day information is extracted as additional features. Second, the input data and the model are divided into three components, which are used to capture the weekly, daily, and recent periodic features of the data. (3) We raise the problem of spatially sparse data in traffic prediction. Moreover, we evaluate our model MSTGACN on three real-world sparse datasets. Experimental results validate that the proposed model is superior to existing methods in the case of spatially sparse data.
To better present our work, the rest of this article is arranged as follows. We describe related work and the task definition in Sections 2 and 3. Then, our method is detailed in Section 4. We present our experimental results in Section 5. Finally, we conclude the article in Section 6.

Related Work

Temporal Modeling.
Recurrent neural networks (RNNs) have been proven to be able to extract temporal information. Recently, many RNN-based models [18][19][20][21][22] have been proposed for traffic prediction, whose performance is superior to traditional statistical methods [23][24][25] and machine learning models [26,27]. However, when the sequence length is long, RNN-based models become inefficient and their gradients may explode. On the contrary, CNNs have the advantages of parallel computation and stable gradients. Therefore, a CNN-based model [17] has been proposed to capture temporal dependencies. Besides, some methods [5,28,29] borrowed dilated causal convolution from the speech processing field to expand the receptive field. To capture periodic information in the time series, the authors of [8] constructed three different time series segments as the input to capture the periodic features in traffic data, but they did not utilize the time period information of each sample, such as "time of day" and "day of week."

Spatial Modeling.
For spatial modeling, some previous methods [30][31][32] converted the road network at different times into a regular 2D grid and utilized traditional convolution to capture spatial correlations, while the non-Euclidean correlations in road networks were ignored. Recent studies further explored the effectiveness of GCN in modeling non-Euclidean spatial structures, which is more in line with the structure of real-world road networks. Many researchers have proposed new approaches for effective spatial modeling based on GCN. Yu et al. [17] proposed Spatial-Temporal GCN, which is entirely composed of convolutional structures in the spatial and temporal dimensions. Li et al. [7] proposed the Diffusion Convolutional Recurrent Neural Network and applied bidirectional random walks on graphs to capture the spatial dependency. Wu et al. [5] also adopted diffusion convolution, but they developed a novel adaptive dependency matrix to capture hidden spatial dependencies without relying on prior knowledge. In these methods, the adjacency matrix represents the relationships between the nodes, but edges are much more complicated and interact with each other. Chen et al. [33] constructed the edgewise graph according to various edge interaction patterns and implemented the interactions between nodes and edges using bicomponent graph convolution. However, we found that the datasets used by existing methods have a commonality: the nodes are relatively dense in space, and adjacent nodes belong to the same road and are close to each other, so there is an obvious upstream and downstream relationship. For secondary roads in cities or roads in villages, the collection equipment is not as dense as on the main roads of big cities. Abnormal equipment also causes the problem of sparse node distribution. To study the traffic prediction problem in the sparse scenario, we sampled the existing datasets and constructed a new dataset with sparse points.

Attention Mechanism.
The core idea of the attention mechanism is to dynamically focus on the most crucial information in the input data. Many attention-based models have been proposed to solve traffic forecasting problems. Yin et al. [34] applied an internal attention mechanism to capture the interactions among multiple time series and a dynamic neighborhood-based attention mechanism to model the complex spatial correlations. Guo et al. [8,35] applied temporal attention and spatial attention to capture dynamic spatial-temporal correlations. To stabilize the learning process, the researchers in [36] replaced the traditional attention mechanism with a multi-head attention mechanism. Velickovic et al. [13] introduced an attention mechanism into graph structures to dynamically adjust the importance of adjacent nodes. Guo et al. [28,37] replaced GCN with graph attention networks (GAT), and [37] used meta-knowledge to generate the weights of GAT. GAT has achieved or matched state-of-the-art results across several benchmarks for graph-related tasks [38]. Considering that spatial correlations are difficult to capture in the case of spatially sparse data, we employ multiple stacked GAC blocks, each containing one GCN layer and one GAT layer, for better relation exploitation and prediction. The application of GAT with shared parameters in the block may also help alleviate the oversmoothing of GCN.
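As a concrete sketch of the attention step described above, the following toy Python computes single-head GAT coefficients for one node's neighborhood. The function names, list-based linear algebra, and the LeakyReLU slope are illustrative assumptions, not the implementation used in any of the cited models.

```python
import math

def leaky_relu(x, slope=0.2):
    # Slope 0.2 is an assumption (the value used in the original GAT paper).
    return x if x > 0 else slope * x

def gat_attention(h_i, neighbors, w, a):
    """Single-head GAT attention coefficients of node i over its neighbors.

    h_i / neighbors: raw feature vectors (lists of floats);
    w: shared weight matrix (list of rows); a: attention vector applied to
    the concatenation of the two transformed features.
    """
    def transform(h):
        return [sum(wr[k] * h[k] for k in range(len(h))) for wr in w]

    z_i = transform(h_i)
    scores = []
    for h_j in neighbors:
        z_j = transform(h_j)
        concat = z_i + z_j  # [W h_i ‖ W h_j]
        scores.append(leaky_relu(sum(a[k] * concat[k] for k in range(len(concat)))))
    # Softmax over the neighborhood so the coefficients sum to one.
    exp_s = [math.exp(s) for s in scores]
    total = sum(exp_s)
    return [e / total for e in exp_s]
```

The returned coefficients are the per-neighbor weights used when aggregating neighbor features; "shared parameters" in our blocks means `w` and `a` are reused by both GAT layers of a block.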

Preliminaries
The task of traffic prediction is to predict future traffic conditions (e.g., speed and volume) based on the historical traffic measurements of sensors in the road network. We define the road network as a weighted graph G = (V, E, A) with N nodes, where V is the set of nodes, E is the set of edges indicating the connectivity between the nodes, and A ∈ R^{N×N} is the weighted adjacency matrix of graph G. Suppose V′ ⊆ V is the subset of available nodes; some nodes in the graph cannot be used as input when the corresponding sensors in the road network are abnormal.
Problem. Given traffic data over the past P time slices, the traffic data observed on G can be denoted as X ∈ R^{N′×M×P} (N′ ≤ N), where N′ is the number of available nodes and M is the number of traffic measures of interest (e.g., traffic volume and traffic speed); N′ ≪ N indicates that the data are spatially sparse. The goal of this task is to predict the traffic data (speed or flow) of the next T_P time slices.

Materials and Methods
In this section, we first present the overall framework of MSTGACN and the method for capturing periodic temporal information; then, we describe the spatial-temporal graph attention convolution (ST-GAC) block of our framework. Finally, we present the multicomponent fusion method.

Overall Framework.
As shown in Figure 2, the MSTGACN proposed in this article consists of three independent components with the same structure, which are designed to model the recent, daily-periodic, and weekly-periodic dependencies of the data. Each component is composed of a convolution layer, several stacked ST-GAC blocks, and an output block. The headmost convolution layer captures the correlations between input features and generates multiple feature maps. The ST-GAC blocks model the spatial-temporal dependencies, and each ST-GAC block is skip-connected to avoid oversmoothing. The outputs of the three components, Y_R, Y_D, and Y_W, are fused into the final output Y by the multicomponent fusion module.

Details about Three Time Series Segments.
The spatial-temporal correlations vary across different periods. We adopt two strategies to capture multiple types of periodic information effectively. First, we construct two meta-features, "time of day" and "day of week," as external attributes. These additional features are concatenated with the original input data along the feature axis. Second, we intercept three time series segments (T_R, T_D, and T_W) along the temporal dimension to construct the inputs of the recent, daily-period, and weekly-period components, respectively. Suppose the sampling frequency is q times per day, the current time is denoted as t_0, and T_P is the length of the sequence to be predicted. T_R, T_D, and T_W represent the lengths of the input data for the different components, and they are all integer multiples of T_P. The details of the three segments are as follows.
(1) Recent segment: the adjacent sequence, which is closest to the period to be predicted. The traffic data (speed, flow) at a specific location change continuously with time; thus, the data to be predicted are affected by the data of the immediately preceding period.
(2) Daily-period segment: it consists of the segments at the same time period in the past few days. This segment is used to provide information for modeling the daily periodicity.
(3) Weekly-period segment: it is composed of the segments in the last few weeks that have the same week attributes and time intervals as the predicting period. This segment is constructed for modeling the weekly periodicity.
The three components share the same network structure described in the next section. The output of each component is denoted as Y_R, Y_D, and Y_W, respectively. These three outputs are merged by the multicomponent fusion module to obtain the final prediction result.
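The construction of the three segments reduces to index arithmetic over the historical series. The helper below is a hypothetical sketch under our own naming; the exact alignment convention of the daily and weekly windows is our assumption based on the descriptions above.

```python
def build_segments(t0, q, T_P, T_R, T_D, T_W):
    """Index sets for the recent, daily-period, and weekly-period inputs.

    t0: index of the current time step; q: samples per day;
    T_P: prediction horizon; T_R/T_D/T_W: input lengths, all multiples of T_P.
    """
    # Recent: the T_R steps immediately before the prediction window.
    recent = list(range(t0 - T_R, t0))
    daily, weekly = [], []
    # Daily: windows at the same clock time on the previous T_D/T_P days.
    for d in range(T_D // T_P, 0, -1):
        daily.extend(range(t0 - d * q, t0 - d * q + T_P))
    # Weekly: windows at the same clock time and weekday, previous weeks.
    for w in range(T_W // T_P, 0, -1):
        weekly.extend(range(t0 - 7 * w * q, t0 - 7 * w * q + T_P))
    return recent, daily, weekly
```

For example, with 5-minute sampling (q = 288), t_0 at the start of day 7, and T_P = 12, the weekly segment picks the same hour exactly one week earlier.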

ST-GAC Block.
We first construct a GAC block, which contains a GAT layer and a GCN layer. The ST-GAC block is composed of two G-TCN blocks and two GAC blocks, as shown in Figure 3. Moreover, the two GAT layers in the same ST-GAC block share the same weights. The G-TCN block is used for capturing the temporal dependencies, and the GAC block is used to learn the correlations between nodes in the case of spatially sparse data. To make Figure 2 more concise, we combine a G-TCN layer and a GAC layer into a T-GAC module.

Dilated Causal Convolution.
Dilated causal convolution skips a fixed step between inputs when performing the convolution operation. By stacking multiple dilated causal convolution layers, the receptive field can be made to grow exponentially. Meanwhile, when processing input sequences of the same length, dilated convolution has fewer parameters and trains faster than an RNN. Because the spatial distribution of observation points is scattered and the points are far apart, we believe that the relationships between observation points can have a delay effect of one hour or even longer, and dilated convolution extends to such ranges more easily.
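The growth of the receptive field can be checked with a one-line calculation. A kernel size of 2 per dilated layer is an assumption here (the text does not state it); each layer then adds `dilation` steps of history.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in time steps) of stacked dilated causal convolutions.

    Each layer extends the field by (kernel_size - 1) * dilation steps.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Under this assumption, four stacked blocks whose two temporal layers use dilations 1 and 2 give `receptive_field([1, 2] * 4) == 13`, which covers a 12-step (one-hour) input window.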

Gated Temporal Convolution
To learn the temporal information better, we utilize a gated mechanism based on dilated causal convolution (G-TCN) to control the information flow. Suppose X ∈ R^{N′×M×P} is the input data; the output of a gated convolution is

h = tanh(Θ_1 ⋆ X + b) ⊙ σ(Θ_2 ⋆ X + c),

where Θ_1, Θ_2, b, and c are learnable parameters, ⋆ is the convolution operator, ⊙ is the elementwise product, and σ is the sigmoid activation function, which controls the ratio of information passed to the next layer.
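A minimal sketch of the G-TCN computation on a single univariate sequence, assuming a tanh filter branch gated by a sigmoid branch (a common parameterization; the layer's actual tensor shapes and biases are omitted here):

```python
import math

def dilated_causal_conv(x, kernel, dilation):
    """1-D dilated causal convolution: output[t] depends only on x[<= t]."""
    out = []
    for t in range(len(x)):
        s = 0.0
        for k, coeff in enumerate(kernel):
            idx = t - k * dilation  # look back k * dilation steps
            if idx >= 0:
                s += coeff * x[idx]
        out.append(s)
    return out

def gated_tcn(x, k_filter, k_gate, dilation=1):
    """h = tanh(filter branch) ⊙ sigmoid(gate branch)."""
    f = dilated_causal_conv(x, k_filter, dilation)
    g = dilated_causal_conv(x, k_gate, dilation)
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    return [math.tanh(fi) * sigmoid(gi) for fi, gi in zip(f, g)]
```

The causal indexing (`t - k * dilation`) is what prevents future time steps from leaking into the prediction.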

Graph Attention Convolution Block.
Because GAT allows aggregating information from other nodes by assigning them different importance, and GCN is an efficient variant of the convolutional neural network that can be applied to non-Euclidean spatial structures, we construct a graph attention convolution (GAC) block with a GAT layer and a GCN layer to learn the spatial dependencies. In this work, we adopt the GCN layer proposed in Graph WaveNet to further model hidden spatial dependencies based on GAT. If the graph is directed, the diffusion process has two directions: let P_f^k denote the forward transition matrix, P_b^k the backward transition matrix, and k the order of diffusion. The GCN formulation is

Z = Σ_{k=0}^{K} (P_f^k X W_{k1} + P_b^k X W_{k2} + A_apt^k X W_{k3}),  (2)

where A_apt ∈ R^{N′×N′} represents the normalized self-adaptive adjacency matrix, X ∈ R^{N′×D} denotes the input data, N′ is the number of available nodes, D is the feature dimension, each W ∈ R^{D×M} is a learnable parameter matrix, and Z ∈ R^{N′×M} denotes the output. It is worth noting that when the road network is denoted as an undirected graph, equation (2) changes into

Z = Σ_{k=0}^{K} (Ã^k X W_{k1} + A_apt^k X W_{k2}),  (3)

where Ã is the normalized adjacency matrix with self-loops.
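For the undirected case, the diffusion-style aggregation can be sketched with toy list-based matrices. This illustrative code (our own naming and shapes) sums K diffusion steps of a row-normalized adjacency with one weight matrix per step; the full layer additionally uses forward and backward transition matrices for directed graphs.

```python
def matmul(a, b):
    """Naive matrix product for small list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, x, w, order=2):
    """Diffusion-style graph convolution: Z = sum_k adj^k X W_k.

    adj: row-normalized adjacency with self-loops (N x N);
    x: node features (N x D); w: list of order+1 weight matrices (D x M).
    """
    z = matmul(x, w[0])            # k = 0 term: identity diffusion
    h = x
    for k in range(1, order + 1):
        h = matmul(adj, h)         # one further diffusion step
        zk = matmul(h, w[k])
        z = [[z[i][j] + zk[i][j] for j in range(len(z[0]))]
             for i in range(len(z))]
    return z
```

Each additional power of the adjacency mixes information from one hop further away, which is why stacking diffusion steps matters when detectors are far apart.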

Multicomponent Fusion.
In this section, we discuss how to integrate the outputs of the three components, Y_R, Y_D, and Y_W. In the multicomponent fusion block, Y_R, Y_D, and Y_W are concatenated along the feature axis and regarded as feature vectors of different spatial-temporal dependencies. Then, we use two convolution layers with the ELU activation function to learn the correlations among the three components and the characteristics of each prediction time step. The outputs of the three components are fused as

Y = W_2 * ELU(W_1 * (Y_R ‖ Y_D ‖ Y_W)),

where ‖ denotes the concatenation operation, * denotes the convolution operation, and W_1 and W_2 are the kernels of the two convolution layers.

Datasets. PeMS-BAY contains statistics on traffic speed ranging from Jan 1st, 2017, to May 31st, 2017, collected by 325 sensors in the Bay Area [39]. PeMSD7-sparse: we selected 42 sensors in Los Angeles and collected two months of data ranging from Mar 1st, 2012, to July 2nd. The selected 42 sensors can be divided into 21 pairs; the two sensors in a pair are very close but face opposite directions. Thus, the undirected graph can be considered to contain 21 nodes, each carrying traffic data in two directions. We use the Euclidean distance to calculate the distance between two nodes. The sensor distributions are visualized in Figure 4.

Experiments and Results
In all of these datasets, we aggregate traffic data into 5-minute intervals and apply Z-score normalization. The datasets are split in chronological order, with 70% for training, 10% for validation, and 20% for testing.
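The preprocessing step can be sketched as follows. Computing the normalization statistics on the training split only is our assumption (standard practice to avoid leakage); the text does not state which statistics are used.

```python
def zscore_and_split(series, train=0.7, val=0.1):
    """Z-score normalize a series and split it chronologically.

    Mean and standard deviation are estimated on the training portion
    only, then applied to the whole series (an assumption, not stated
    in the paper). Returns (train, val, test) lists.
    """
    n = len(series)
    n_train = int(n * train)
    n_val = int(n * val)
    mean = sum(series[:n_train]) / n_train
    var = sum((v - mean) ** 2 for v in series[:n_train]) / n_train
    std = var ** 0.5
    if std == 0.0:
        std = 1.0  # guard against a constant training series
    norm = [(v - mean) / std for v in series]
    return (norm[:n_train],
            norm[n_train:n_train + n_val],
            norm[n_train + n_val:])
```

The chronological (rather than random) split matters for traffic data: shuffling would let the model peek at future periods of the same day.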

Baselines.
We compare MSTGACN with the following models.
HA: Historical Average, which models the traffic flow as a seasonal process and uses a weighted average of previous seasons as the prediction.
VAR [40]: Vector Auto-Regression, a more advanced time series model that captures the pairwise relationships among all traffic flow series.
DCRNN [7]: Diffusion Convolutional Recurrent Neural Network, which integrates diffusion convolution with recurrent neural networks.
Graph WaveNet [5]: a convolutional architecture that combines graph convolution with dilated causal convolution and introduces a self-adaptive adjacency matrix.
STGCN [17]: a spatial-temporal graph convolution model for traffic speed prediction.
ASTGCN [8]: Attention-Based Spatial-Temporal Graph Convolutional Networks, which combine the spatial-temporal attention mechanism with spatial-temporal convolution.
ST-MetaNet [37]: a model with graph attention networks (GAT), using meta-knowledge to generate the weights of GAT.

Experimental Settings.
Our experiments are conducted on a 64-bit Linux server with one Intel(R) Core(TM) i7-7800X CPU @ 3.50 GHz and one NVIDIA Titan Xp GPU card. All the tests use 60 minutes as the historical time window; in other words, 12 data points are used to predict the traffic data in the next 5, 15, and 30 minutes. To cover the input sequence length, we use four ST-GAC blocks, and the dilation factors of the two T-GAC blocks in each are set to 1 and 2, respectively. We adopt the Adam optimizer to train our model. The initial learning rate is set to 0.001. The dropout rate is 0.5 in GCN and 0.3 in GAT, and we set the output dimensions of both the GAT layer and the GCN layer to 32. We use equation (2) as our graph convolution layer, and the diffusion step K is set to 2. The adjacency matrix is constructed from road network distances with a thresholded Gaussian kernel.
A_{ij} = exp(−d(v_i, v_j)^2 / σ^2) if exp(−d(v_i, v_j)^2 / σ^2) ≥ ε, and A_{ij} = 0 otherwise,

where d(v_i, v_j) is the distance between points i and j, σ is the standard deviation of the distances, and ε is a threshold value that controls the sparsity of the matrix; we set ε = 0.1 in our model. To evaluate the performance of different methods, we compare MSTGACN with HA, VAR, DCRNN, STGCN, ST-MetaNet, and Graph WaveNet. For these seven models on METR-LA, PeMS-BAY, and PeMSD7-sparse, we adopt Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) as the evaluation metrics.
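The thresholded Gaussian kernel construction can be sketched directly; the helper below is illustrative (our own naming), and whether the diagonal is additionally forced to self-loops is left to the caller.

```python
import math

def gaussian_adjacency(dist, sigma, eps=0.1):
    """Thresholded Gaussian kernel adjacency matrix.

    dist: pairwise distance matrix (list of lists); sigma: kernel width
    (e.g., the standard deviation of the distances); eps: entries whose
    kernel value falls below this threshold are zeroed, which controls
    the sparsity of the resulting matrix.
    """
    n = len(dist)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            v = math.exp(-(dist[i][j] ** 2) / (sigma ** 2))
            if v >= eps:
                w[i][j] = v
    return w
```

Raising `eps` prunes weak long-distance edges, so in the sparse-detector setting the threshold directly trades graph connectivity against noise.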

Quantitative Experimental Results
Tables 1 and 2 demonstrate the average results of MSTGACN and the baseline methods on PeMS-BAY and METR-LA with different numbers of nodes. Although MSTGACN is second only to Graph WaveNet on the complete dataset, as the number of nodes decreases, the performance of our model gradually exceeds that of the other methods. Tables 1 and 2 also show that STGCN performs worst among the deep-learning-based models. This may be because STGCN defines the road network as an undirected graph, whereas these two datasets are defined as directed graphs in DCRNN [7]; the lack of direction leads to a decrease in STGCN's performance. To verify this view and further test the validity of MSTGACN, we evaluate these models on the PeMSD7-sparse dataset. Table 3 gives the results of MSTGACN and the baseline methods for 5-minute, 15-minute, and 30-minute ahead prediction on the PeMSD7-sparse dataset. We observe the following: (1) On this traffic dataset with a scattered location distribution, whether for short-term prediction within 5 minutes or mid-term prediction of 30 minutes, HA and VAR predict poorly. HA does not mine the spatial or temporal features of the data, and because of VAR's limited modeling ability and inability to learn mid- and long-term changes, VAR does not perform well in mid- and long-term prediction. (2) The accuracy of STGCN and ASTGCN is lower than that of Graph WaveNet and the proposed method. Comparing their original datasets, the datasets used in those two articles exhibit apparent upstream-downstream relationships. Therefore, in the case of a scattered point distribution, STGCN and ASTGCN cannot effectively learn the spatial-temporal dependencies in the data. The attention-based method of capturing spatial-temporal dependencies proposed in ASTGCN is conducive to long-term prediction.
Therefore, at the prediction horizon of 30 minutes, the performance of ASTGCN is better than that of STGCN.
(3) Single Recent is a degraded version of MSTGACN with only one recent component. Owing to the designed GAC block, even this single-component model performs better than the baselines.
The proposed model uses the GAC block to learn the spatial dependencies, and the multicomponent structure helps capture the correlations under different periods. Thus, our MSTGACN achieves the best performance on PeMSD7-sparse in terms of all evaluation metrics. To verify the effectiveness of the multicomponent division, we investigate our model with different component settings. As shown in Table 3, the degraded models with only the weekly component or only the daily component do not perform well. It is difficult to learn the temporal and spatial features of spatial-temporal data using only these two periods of data, which indicates that spatial-temporal features change dynamically, so it is hard to make accurate predictions for future data using historical data that are far apart. The single recent model performs much better than the single weekly and single daily models, and the triple recent model performs better than the single recent model. The components in the triple recent model can be considered to be combined in the manner of bagging, which effectively improves the performance of ensemble models. The proposed model, which consists of recent, daily, and weekly components, achieves the lowest prediction errors.

Comparative Experiment of GAC Block.
In traffic prediction tasks, different spatial nodes are correlated. Accurately capturing the correlations between sensors in road networks is necessary to predict the traffic data. Because the PeMS dataset is a highway dataset, there are many adjacent points on the same road when the points are densely distributed.
These points have an obvious upstream and downstream relationship. When the number of nodes decreases, few adjacent points are located on the same road, and there are many intersections between different nodes. Therefore, the upstream-downstream relationship between the points is not obvious and the spatial correlation between different points decreases. Because there are multiple intersections between different detectors, we believe that using GAT or GCN alone is insufficient to capture the spatial relationships.
Before each GCN module, we use an extra GAT module. Although the original intention was to increase the model's capacity by extracting features through GAT and summarizing information through GCN on sparse data, applying this attention mechanism with shared parameters inside the block can also help alleviate the oversmoothing of GCN. We conducted comparative experiments with different modules on PeMSD7-sparse. As shown in Table 4, GAT + GCN shows the best result. Figure 5 demonstrates the average results on PeMS-BAY. Although MSTGACN is second only to Graph WaveNet on the complete dataset, as the number of nodes decreases, the performance of our model gradually exceeds that of the other methods. To study the influence of prediction time on model performance, the prediction horizon is gradually increased from 5 minutes to 1 hour at intervals of 5 minutes. As shown in the figure, the model proposed in this article achieves good results in both short-term and long-term prediction.

Visualization of Attention Matrix.
To test the performance of stacked GAC blocks, we show different spatial attention matrices among detectors in PeMS-BAY with 8 nodes. As shown in Figure 6, as the number of GAT computations increases, the spatial attention matrix coefficients also increase. This is reasonable, since the stacked GAC blocks increase the receptive field and distant points can also be highly correlated. As shown in Figure 7, we selected four days of data for visual comparison and found that our model predicts the same trends as the real data.

Conclusions
In this article, we propose a deep learning framework for traffic prediction in the case of spatially sparse data. The model combines dilated causal convolution, a graph convolution layer, and a weight-shared graph attention layer. The parameters of the two GAT layers in one block are shared to capture the spatial-temporal dependencies of traffic data in the case of sparse points. To capture multiple types of periodic information, we extract the day-of-week and time-of-day information as additional features. Moreover, we divide the input data and model structure into three components.
Experiments on three real-world datasets show that the prediction accuracy of our model is superior to that of the baselines. In general, traffic data are affected by many external factors, such as weather, events, and holidays. In the future, these external factors should be taken into consideration to further improve prediction performance in the case of spatially sparse data.

Data Availability
Previously reported METR-LA and PeMS-BAY data were used to support this study and are available at https://dblp.org/rec/conf/iclr/LiYS018. These prior studies (and datasets) are cited at relevant places within the text as references. Previously reported PeMS-D7 data were used to support this study and are available at https://dblp.org/rec/conf/aaai/SongLGW20. These prior studies (and datasets) are cited at relevant places within the text as references.

Conflicts of Interest
The authors declare no conflicts of interest.