Multibranch Adaptive Fusion Graph Convolutional Network for Traffic Flow Prediction

Urban road networks have complex spatial and temporal correlations, driving a surge of research interest in spatial-temporal traffic flow prediction. However, prior approaches often overlook the temporal-scale differentiation of spatial-temporal features, limiting their ability to extract complex structural information. In this work, we design the multibranch adaptive fusion graph convolutional network (MBAF-GCN), which explicitly exploits the prior spatial-temporal characteristics at different temporal scales; each branch is responsible for extracting spatial-temporal features at a specific scale. In addition, we design the spatial-temporal feature fusion (STFF) module to refine the prediction results. Based on the multibranch complementary features, the module adopts a coarse-to-fine fusion strategy, incorporating different spatial-temporal scale features to obtain recalibrated prediction results. Finally, we evaluate MBAF-GCN on two real-world traffic datasets. Experiments show that the newly designed multibranch structure can effectively exploit the prior information of different temporal scales. MBAF-GCN outperforms the comparative models, indicating its potential and validity.


Introduction
The rise in vehicle numbers has increased pressure on urban traffic and travelers. To improve traffic efficiency and reduce congestion, it is crucial to develop an effective and accurate traffic flow forecasting method [1]. Accurate traffic forecasts enable transportation agencies to improve road network capacity by setting up tidal sections and adjusting traffic signals dynamically to alleviate congestion. Travelers can plan their routes with foresight of road traffic conditions, while ride-hailing platform companies (such as DiDi and Uber) can anticipate traffic demand [2]. Traffic speed prediction has a wide range of applications in traffic management and control centers, including traffic monitoring, road condition broadcasting, and traffic control. It helps traffic management centers better understand the current traffic situation and take corresponding measures in a timely manner to reduce congestion and improve the travel experience. It can also be applied to navigation and route planning to improve traffic efficiency and reduce energy consumption. Therefore, traffic state prediction is necessary.
Traffic flow prediction methods can be categorized into multistep and one-step prediction, depending on the forecasting temporal interval. Multistep prediction aims to predict long-term future road network information, while one-step prediction only indicates the state at the next temporal step. Both require modeling the spatial-temporal correlation among nodes in a dynamic road network. Despite efforts to improve prediction accuracy, several challenges still require further exploration [3][4][5]. For instance, how can we better model complex spatial-temporal movement patterns in traffic data? How can we achieve high precision in long-term or multistep-ahead traffic forecasting? How can we use complementary features or factors, such as weather changes, holidays, and traffic accidents, to enhance forecasting accuracy and robustness? This paper aims to effectively use complementary temporal-scale features to improve the modeling of complex spatial-temporal correlations and promote multistep prediction accuracy.
There are two main types of forecasting methods, namely, model-driven and data-driven. Traditional time-series model-driven (parametric) models, such as the autoregressive and ARIMA (autoregressive integrated moving average) models [6], are based on mathematical modeling and assumptions. However, they are insufficient to capture the complex spatial-temporal correlations in raw data. These methods are also unsuitable for predicting data with non-Euclidean structure or other complex topologies. In comparison, deep learning models have an advantage in handling high-dimensional and nonlinear data. Figure 1 shows a typical deep learning pipeline: the input data flow is processed through stacked or recurrent hidden layers and nonlinear activations and then mapped to a high-dimensional feature space [7]. The network structure and feature fusion model enhance abstract features and produce final results, with learnable weights updated by backpropagation during training; knowledge is extracted directly from the raw data.
Current deep learning models used for traffic flow prediction include graph convolutional networks (GCN), recurrent networks (RNN, LSTM), and transformers [8]. RNN-based methods capture all sequential states but struggle to identify local hidden features and suffer from accumulating errors, a common issue with recurrent structures. Transformer-based methods excel at modeling long-term dependencies and spatial correlation but require high computational and memory costs and long training times. GCN-based methods have gained popularity for their use of the graph structure of road maps, enabling them to capture non-Euclidean data structures and complex spatial-temporal movement patterns. This simple architecture design is well suited to challenging problems such as long-term time-series prediction.
Although studies have been conducted to predict traffic status using the primary GCN method, the following issues remain overlooked: (1) Intuitively, time-series data can be translated to the frequency domain. Take a one-dimensional timing signal as an example. The low-frequency part always represents the overall trend, and the high-frequency component represents fine-grained fluctuations, as shown in Figure 1. The upper figure illustrates the traffic flow variation captured by a sensor during a week in Los Angeles County, USA, and the lower figure shows the low-frequency part of the same data, obtained by a Daubechies-8 wavelet decomposition in the frequency domain. Assuming the traffic flow is observed in hours, the traffic flow is likely to show a relatively flat trend in the approaching period (these are the regular hours, ignoring rush-hour conditions). However, if observations are made in minutes, the traffic flow will fluctuate sharply around the general trend. That is to say, different sampling windows correspond to different frequency-filtering operations. Figure 1 shows that temporal-scale information exists in both the frequency and time domains, and different temporal scales carry additional semantic information. Therefore, we can use the temporal-scale prior and design a multibranch forecasting method to simplify the learning target: first predict the trend, then predict the fluctuation around that trend. We can design adaptive weights to fuse the complementary temporal-scale features to further improve prediction accuracy. (2) By decomposing the time series into different temporal scales and using a multibranch structure to fully leverage the spatial-temporal features of the corresponding scales, we can obtain spatial-temporal representations at different hierarchies. The extracted features of each branch are then fused efficiently to generate the final prediction results.
For this purpose, we designed the STFF (spatial-temporal feature fusion) module, which uses a coarse-to-fine strategy to adaptively fuse the spatial-temporal features of different scales. In this way, the temporal "trends" can be enhanced and recalibrated, and the spatial-temporal "details" can be refined.
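The temporal-scale prior can be illustrated with a simple low-pass view (a NumPy sketch of our own, not the paper's wavelet pipeline): widening the averaging window filters out fine fluctuations and leaves the trend, and the residual carries the details, which is exactly the trend/fluctuation split the branches target.

```python
import numpy as np

def decompose(series: np.ndarray, window: int):
    """Split a 1-D series into a smooth trend and a fluctuation residual.

    A moving average acts as a low-pass filter; a wider window removes more
    high-frequency content. Edges are zero-padded by np.convolve.
    """
    kernel = np.ones(window) / window               # moving-average filter
    trend = np.convolve(series, kernel, mode="same")
    fluctuation = series - trend                    # high-frequency remainder
    return trend, fluctuation

# Synthetic stand-in for a day of 5-minute traffic flow readings.
t = np.arange(288)
flow = 50 + 20 * np.sin(2 * np.pi * t / 288) \
       + np.random.default_rng(0).normal(0, 3, t.size)
trend, detail = decompose(flow, window=12)          # hour-scale vs minute-scale
```

By construction, `trend + detail` reproduces the original series, so predicting the trend first and the residual second decomposes, rather than loses, information.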
Our main contributions in methodological and theoretical aspects are summarized as follows: (1) We propose a multibranch prediction framework. Feature extraction branches are designed to take advantage of the complementary features at different temporal scales. This design decomposes the complex time-series forecasting task and reduces the model's learning difficulty by explicitly exploiting the structural information in the temporal dimension; (2) We propose the STFF module, which uses a coarse-to-fine strategy to fuse spatial-temporal features of different scales and obtain the final refined prediction results; (3) We validate this work on real-world data. There is a significant improvement on the two real-world datasets, and the prediction accuracy of the MBAF-GCN model exceeds the baselines.
The rest of this paper is organized as follows: Section 2 summarizes existing research on traffic flow prediction closely related to this paper. Section 3 defines the traffic forecasting problem and presents the methodology and details of the proposed multibranch adaptive fusion graph convolutional network. Section 4 details the dataset and experimental setup and analyzes the experimental results on real data, and Section 5 concludes our work.

Related Works
Graph neural network-based methods have become dominant in traffic forecasting research due to their ability to capture spatial dependency in non-Euclidean graphs. Many methods have been proposed to fully leverage spatial-temporal characteristics. This section reviews relevant literature on GCN structure design and complementary feature exploitation.
Most GCN-based traffic prediction methods follow a straightforward single-branch network design. Tang et al. proposed a spatial-temporal graph convolutional network (STGCN) to address the chronology of traffic data [5]. It overcomes the shortcomings of traditional traffic forecasting methods in handling the nonlinearity and complexity of traffic data, integrating graph convolution and gated temporal convolution through spatial-temporal convolutional blocks. Qin et al. proposed the NDGCN model, which uses node information features to generate node embeddings for unlabeled data. They learn a function that creates embeddings by sampling and aggregating features from a node's local neighborhood, achieving a combined increase in performance and speed [9]. Yu et al. introduced 3D temporal graph convolutional networks (3D-TGCN), which verified the model's effectiveness in simplifying traffic data training and capturing the spatial-temporal characteristics of traffic data [10]. Song et al. used Graph WaveNet for spatial-temporal graph modeling. It addresses the shortcomings of traditional models in capturing long-term series trends by combining graph convolution with dilated causal convolution [11]. Zhu et al. proposed the AST-GCN algorithm. They used the framework to aggregate and transform features during training, reducing training time and memory complexity. They further optimized the GCN so that it can automatically adjust per layer, further reducing the training time by half [12]. Chen et al. also used a semantic-interactive graph convolutional structure [13]. Wang et al. addressed how to determine appropriate neighborhoods to improve the graph structure. They proposed the GraphHeat concept to enhance the smoothness of the signal on the graph structure [14]. Jepsen et al.
offered relational fusion networks (RFNs) with a different adjacency matrix at different levels. The adjacency matrix can be continuously learned during training [15]. Bai et al. introduced external factors and proposed an attribute-augmented spatial-temporal graph convolutional network (A3T-GCN). They separate the dynamic and static attributes of external factors to verify that the model can effectively perceive the influence of external factors [16]. Lee and Rhee proposed the distance, direction, and positional relationship graph convolutional network (DDP-GCN) model for spatial node association. The model can automatically extract node-related features in traffic data, and the removed parts can be dynamically adjusted [17]. Lin et al. proposed spatial-temporal fusion graph neural networks, an extension of the GNN model that captures the complex spatial dependencies and dynamic trends of road networks [18]. Guo et al. also extended the GNN-based model and proposed an attention-based spatial-temporal graph neural network (ASTGNN) [19].
A few methods have begun to pay attention to using multibranch GCNs to exploit complementary spatial-temporal features. Guo et al. construct a two-stream graph network to consider complementary micro- and macro-traffic information [19], where micro refers to traffic sensors and macro refers to traffic regions. The difference from our work is that they conduct the splitting in the spatial dimension, while we focus on exploiting complementary temporal features. Ioannidis et al. also focus on using the hierarchical characteristics of spatial-temporal features, simultaneously predicting the fine-grained and coarse-grained traffic conditions over a road network [20]. Although considering coarse-grained and fine-grained structural information in the spatial-temporal dimension, they explicitly build the multiscale network by topological closeness and traffic flow similarity.
In contrast, the scale-specific features in our work are learned and recalibrated in an end-to-end form without manual intervention. Ke et al. introduce deformable convolution to enhance the modeling capability of spatial nonstationarity and design a multibranch network to model temporal dependency, including weekly trend, daily periodicity, and hourly closeness [21]. Like Jeon and Hong [22], they follow an artificial assumption in splitting spatial-temporal scales. All these works proved the effectiveness of a multibranch network structure design that takes full advantage of complementary spatial-temporal hierarchical characteristics.
There is also much work that uses transformer-based model structures for traffic prediction tasks. For example, [23] uses multiple spatiotemporal attention blocks to construct the encoder and decoder of the model and applies a transform attention layer between the encoder and decoder to perform feature transformation. In [24], spatial and temporal transformers construct the base feature extraction module, which addresses existing flaws in modeling spatiotemporal dependency: a novel self-attention mechanism utilizes the local context in the temporal dimension, and a dynamic graph convolutional module incorporates self-attention in the spatial dimension. In [25], Zhang et al. use self-attention to capture both short-term and long-term temporal correlations, and the proposed temporal fusion transformer has great advantages over traditional prediction models when the prediction horizon is longer than one hour. Although these transformer-based works have great advantages on long-term traffic prediction problems, transformer structures generally have high time costs, which limits the scalability of the algorithms to edge devices with insufficient computational power and to applications with strict real-time requirements.
While previous research has incorporated spatial-temporal scales, most models only use a single-branch approach and do not consider the variability of scales. Although some studies have used multibranch GCN approaches, they do not effectively decompose temporal trends. Furthermore, existing models face challenges in optimizing learning speed and capturing spatial-temporal characteristics of long-term trend data. This paper addresses these challenges by incorporating coarse-to-fine integration of different spatial and temporal scales, improving model learning effectiveness.

Problem Definition.
The mathematical definition of the traffic prediction problem is described first. We then introduce our framework and its two key components, namely, the spatial-temporal feature extraction branch and the spatial-temporal feature fusion module.
We solve the multistep traffic forecasting problem. We use X^c_(n,t), with X ∈ R^(N×T×C), to denote the series of traffic data collected by N sensors in a region during a period, where t ∈ {1, 2, ..., T} denotes the temporal sampling interval of the sensors and c ∈ {1, ..., C} indexes the dimension of traffic information of interest (e.g., speed, volume, and flow). Then χ = {X^c_(:,0), X^c_(:,1), ..., X^c_(:,t), ...} represents all traffic data collected in the same region up to time t. As mentioned before, we predict future values of the traffic sequences based on historical observations. Therefore, we can formulate this aim as finding a function F to predict the subsequent τ steps based on the past T steps of historical data:

[X_(t+1), ..., X_(t+τ)] = F_θ([X_(t−T+1), ..., X_t]),
where θ is the learnable parameter of the model. Due to the complex spatial and temporal correlations between the nodes in the region, we adopt a GCN-based model and form the graph structure G = (V, E, A), where V, E, and A represent the set of vertices, the set of edges, and the adjacency matrix of G, respectively. Following spectral graph theory, the graph Laplacian matrix, its eigenvalues, and its eigenvectors are the theoretical basis of graph convolution. The different types of Laplacian matrices can be divided into the following categories: (a) the unnormalized Laplacian, also called the combinatorial Laplacian, formulated as L = D − A; (b) the normalized Laplacian, the form commonly used in GCNs, formulated as L = I_N − D^(−1/2) A D^(−1/2). Here, A ∈ R^(N×N) is the adjacency matrix and D ∈ R^(N×N) is the diagonal degree matrix. The eigendecomposition of the Laplacian matrix yields its eigenvector matrix U and eigenvalue matrix Λ, so the Laplacian matrix can be expressed as L = UΛU^T, where Λ ∈ R^(N×N) is a diagonal matrix and U ∈ R^(N×N) is the Fourier basis.
We formulate a graph convolutional filter g_θ = diag(θ) parameterized by θ ∈ R^N. Hence, the graph convolution of a signal x defined in the Fourier domain is g_θ *G x = U g_θ U^T x, where *G denotes the graph convolution operation and U^T x is the graph Fourier transform of x. However, the computational complexity of this operation is too large, so the Chebyshev approximation is generally used, with T_k denoting the Chebyshev polynomial of order k. The graph convolution can then be expressed with the Chebyshev approximation filter as

g_θ *G x ≈ Σ_(k=0..K) θ'_k T_k(L̃) x, with L̃ = (2/λ_max) L − I_N,

where λ_max is the largest eigenvalue of L. In general, we only take K = 1 to avoid further complexity; that is, the first-order approximation of spectral graph convolution is used. The formula becomes

g_θ *G x ≈ θ'_0 x + θ'_1 (L − I_N) x = θ'_0 x − θ'_1 D^(−1/2) A D^(−1/2) x.

Since θ is the filter's parameter, we set θ = θ'_0 = −θ'_1, and the formula becomes

g_θ *G x ≈ θ (I_N + D^(−1/2) A D^(−1/2)) x.

Model.

As shown in Figure 2, the framework of our model mainly contains two parts, namely, the multibranch spatial-temporal feature extraction part and the spatial-temporal feature fusion module. Each branch corresponds to a specific temporal scale. We achieve this by carefully setting the temporal receptive field and designing a global attention module to guide each branch's GCN layers. Specifically, the GCN layers in each branch are guided by different high-level attention maps. We construct the global graph attention module to generate a different adjacency matrix for each branch, making each branch specific to feature extraction and prediction at a particular scale. We classify the feature branches into three types, namely, the tendency branch, coarse branch, and fine branch, based on the temporal-scale setting from coarse to fine. To balance the trade-off between prediction accuracy and efficiency, we first use a self-attention structure to extract the global correlation from the input data and provide global guidance for the subsequent graph convolutional modules. Then, each branch uses CNN-based blocks to extract local spatial-temporal features, ultimately achieving effective prediction.
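The first-order propagation rule above can be sketched in a few lines (a minimal NumPy illustration under our own naming, not the paper's implementation):

```python
import numpy as np

def first_order_gcn(A: np.ndarray, x: np.ndarray, theta: float) -> np.ndarray:
    """First-order spectral graph convolution:
    g_theta * x ≈ theta (I + D^{-1/2} A D^{-1/2}) x."""
    deg = A.sum(axis=1)                                   # node degrees
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    support = np.eye(A.shape[0]) + d_inv_sqrt @ A @ d_inv_sqrt
    return theta * support @ x

# Toy path graph with three nodes: 0 - 1 - 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
x = np.array([1.0, 2.0, 3.0])
y = first_order_gcn(A, x, theta=0.5)
```

Each output value mixes a node's own signal with its degree-normalized neighbors, which is why stacking such layers propagates information over the road graph.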
The original input of each branch is extracted into scale-specific features, which are then fused by the adaptive spatial-temporal fusion module. As Figure 2 shows, we follow the coarse-to-fine fusion strategy: the tendency branch features and coarse branch features are fused first, and the enhanced features are then fused with the fine branch. Through the spatial-temporal fusion layer, the initial prediction information is recalibrated and refined. We describe these modules in detail below.

Spatial-Temporal Feature Extraction Branch.
The entire spatial-temporal feature extraction branch consists of several stacked feature extraction blocks and one global attention guidance module. Unlike Graph WaveNet [26], we use the gated TCN without a dilation-ratio setting as the temporal feature extraction module. We do not use dilated convolutions to expand the receptive field. Instead, we design a global spatial-temporal module to provide global guidance, as the receptive field is critical for modeling complex correlation patterns between nodes. Our approach involves self-attention, allowing any node in the sequence to attend to any other node based on the long-term correlation matrix, so there is no need for a dilation ratio to expand the receptive field. The feature extraction block is formally expressed as follows. Given the raw historical traffic data χ ∈ R^(N×D×S), we extract temporal features in the form

h = g(Θ_1 ⋆ χ + b) ⊙ σ(Θ_2 ⋆ χ + c),

where Θ_1, Θ_2, b, and c are the learnable weights and biases of ordinary temporal convolutional layers, ⊙ is the element-wise product, g(·) is the tanh activation function, and σ(·) is the sigmoid activation function that acts as a temporal gate, determining the ratio of information that passes to the next layer. To exploit spatial correlation, the GCN module guided by global attention is used in the spatial feature extraction part. We follow the default setting of Graph WaveNet, except that the global attention matrix is used as an additional adjacency matrix. Formally,

Z = AXW,

where Z is the spatial feature output, A is the adjacency matrix, X is the input feature, and W is the learnable weight matrix.
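The gated temporal convolution above can be sketched on a single node's sequence (illustrative kernels and names of our own; the actual layers operate on batched multi-node tensors):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def gated_tcn(x, k1, k2, b=0.0, c=0.0):
    """h = tanh(Theta1 * x + b) ⊙ sigmoid(Theta2 * x + c).

    k1/k2 stand in for the learned temporal kernels Theta1/Theta2.
    """
    filt = np.tanh(np.convolve(x, k1, mode="valid") + b)   # filter path
    gate = sigmoid(np.convolve(x, k2, mode="valid") + c)   # gate in (0, 1)
    return filt * gate                                     # element-wise gating

x = np.array([0.1, 0.4, -0.2, 0.3, 0.5])                   # toy node sequence
h = gated_tcn(x, k1=np.array([0.5, 0.5]), k2=np.array([1.0, -1.0]))
```

Because the sigmoid gate lies strictly between 0 and 1, it scales how much of the tanh-filtered signal passes to the next layer, which is the "temporal gate" role described above.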
We have designed three branches specific to different spatial-temporal scales, namely, the tendency branch, coarse branch, and fine branch. The tendency branch uses a temporal convolutional kernel size equal to the length of the entire historical sequence (12 in our work) to extract global information for tendency features. In contrast, smaller kernel sizes are used in the coarse and fine branches to extract more detailed information from the local neighborhood. Each branch also receives intermediate supervision specific to its scale. To achieve this, we remove the high-frequency components of the labels to different degrees.

Global Graph Attention Module.
To provide scale-aware attention guidance in each branch, as shown in Figure 3, we embed a global graph attention module, which applies convolutional filtering with different kernel sizes in the temporal dimension. To improve performance, we use multihead attention to establish dependencies among all elements and express the information of different subspaces. This stabilizes the learning process and ensures the suitability of the graph convolutional network for our purposes, since the adjacency matrix can significantly affect performance.
Given the input X_(t−T+1:t) = [X_(t−T+1), ..., X_t] ∈ R^(T×N×P), we simplify the notation as X. In the initial step, we use a 1-D temporal convolutional block to convert the input features into higher-dimensional features at each node. Notably, different branches have different kernel sizes to achieve scale specificity. As demonstrated in Figure 4, the tendency branch has the largest temporal kernel size, and the fine branch has the smallest:

X̂ = W_k ⊗ X,
where ⊗ refers to the convolutional operation, W_k refers to the learnable parameters of the convolutional layers, and k refers to the kernel size. Then, three subspaces are obtained, namely, the query subspace spanned by Q ∈ R^(N×d_q), the key subspace K ∈ R^(N×d_k), and the value subspace V ∈ R^(N×d_v). The latent subspace learning process can be formulated as

Q_s = X̂ W^q_s, K_s = X̂ W^k_s, V_s = X̂ W^v_s,

where W^q_s ∈ R^(d_G×d^s_A), W^k_s ∈ R^(d_G×d^s_A), and W^v_s ∈ R^(d_G×d_G) are the weight matrices for Q_s, K_s, and V_s, respectively.
Scaled dot-product attention is used to compute the global attention matrix:

Attention(Q, K, V) = softmax(QK^T / √d_k) V.

After using the multihead design to obtain richer latent information, we concatenate all attention heads and project again to get the final values:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O.

The attention map generated by the global attention module provides global guidance for the subsequent spatial-temporal extraction blocks.
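A minimal sketch of the scaled dot-product attention that produces the global attention matrix (illustrative shapes of our own; the paper additionally uses multiple heads, whose outputs would be concatenated and projected):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # N x N node affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows are distributions
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))                # 4 nodes, d_k = 8
out, attn = scaled_dot_product_attention(Q, K, V)
```

The `attn` matrix plays the role of the additional, learned adjacency matrix: each row says how strongly one node attends to every other node, regardless of graph distance.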

Spatial-Temporal Feature Fusion Module.
We designed the multibranch spatial-temporal fusion module to achieve coarse-to-fine feature fusion. The coarse-to-fine strategy exploits the temporal-scale prior to refine prediction results. The fusion between the tendency branch and the coarse branch follows the same design as that between the coarse and fine branches.
As previously mentioned, we obtain scale-aware features from the multibranch network. These features correspond to different temporal semantic meanings based on their sampling windows. We first fuse the features extracted from the tendency branch with those extracted from the coarse branch. Then, we fuse the resulting features with the features extracted from the fine branch. The complementary features from different branches are adaptively merged and enhanced through this fusion process.
The specific fusion module design is shown in Figure 5. We illustrate our fusion method by fusing the coarse and fine branches. We use the temporal attention matrix extracted from the coarse branch to enhance the relevant parts of the features in the fine branch, and residual links avoid any loss of detailed information.
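This attention-weighted fusion can be sketched as a shared-MLP channel attention over pooled descriptors (our reading of the described design; shapes and weight values below are illustrative, not trained parameters):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W0, W1):
    """M_c = sigmoid(W1·ReLU(W0·AvgPool(F)) + W1·ReLU(W0·MaxPool(F))).

    The MLP (W0 then W1) is shared across both pooled inputs, with ReLU
    after W0, as described in the fusion module.
    """
    avg = F.mean(axis=(1, 2))                   # C-dim average-pooled vector
    mx = F.max(axis=(1, 2))                     # C-dim max-pooled vector
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0)  # shared bottleneck MLP
    return sigmoid(mlp(avg) + mlp(mx))          # per-channel weights in (0, 1)

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                          # r is the reduction ratio
F = rng.normal(size=(C, H, W))                   # coarse-branch feature map
W0 = rng.normal(size=(C // r, C))
W1 = rng.normal(size=(C, C // r))
Mc = channel_attention(F, W0, W1)
F_refined = Mc[:, None, None] * F                # element-wise recalibration
```

The attention vector `Mc` re-weights each channel of the finer branch's features, enhancing trend-relevant channels while the residual link (not shown) preserves the original details.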
The formal expression of the fusion module is as follows. Given the coarse branch feature map F ∈ R^(C×H×W) as input, the fusion sequentially infers a 1-D temporal attention map M_c ∈ R^(C×1×1), as illustrated in Figure 5. The overall attention process can be summarized as

F' = M_c(F) ⊗ F,

where ⊗ denotes element-wise multiplication. The attention module is computed as

M_c(F) = σ(W_1(W_0(AvgPool(F))) + W_1(W_0(MaxPool(F)))),

where σ denotes the sigmoid function and W_0 ∈ R^(C/r×C) and W_1 ∈ R^(C×C/r) are the learnable parameters. Note that the MLP weights W_0 and W_1 are shared for both inputs, and the ReLU activation function follows W_0. After the final fusion, supervision between the prediction and the ground-truth labels is applied in the training phase.

Datasets.

Both METR-LA and PEMS-BAY provide real-time traffic data, enabling researchers to use machine learning and deep learning techniques to predict and analyze traffic flow. We extracted traffic speeds from both datasets, aggregated them into 5-minute intervals, and applied Z-score normalization. The sensor distribution of the datasets is visualized in Figure 6.

Experiment Settings.
The experiments were conducted on a Linux operating system running on an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40 GHz and an NVIDIA GeForce GTX 1080 GPU. The entire work was built using the open-source Python machine-learning library PyTorch. To obtain the best parameters and avoid overfitting, we adopted data augmentation strategies such as injecting small noise/outliers into the time series using random noise perturbations to improve the model's robustness. We also performed a grid search to locate the optimal parameters. During the training phase, we used the Adam optimizer and mean square error as the loss function. The initial learning rate was set to 10^−4, with a decay rate of 0.7 after every 15 epochs. The model's input was the historical traffic speed data, and the output was the predicted traffic speed values for a certain time interval (15 min/30 min/1 hour) in the future.
We evaluated the performance of the MBAF-GCN model on two real-world traffic datasets, namely, METR-LA and PEMS-BAY. The datasets contain key attributes of the transportation network and time-stamped geographic information. The sensor distribution of the datasets is shown in Figure 6. We aggregated the traffic speed data into 5-minute intervals and applied Z-score normalization. The datasets were split chronologically into 70% for training, 10% for validation, and 20% for testing.
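The preprocessing just described can be sketched as follows (normalizing with training-set statistics is a common convention that the text does not spell out, so treat that detail as an assumption):

```python
import numpy as np

def chronological_split(series: np.ndarray, train: float = 0.7, val: float = 0.1):
    """Z-score normalize with training statistics, then split 70/10/20 in time order."""
    n = len(series)
    n_train, n_val = int(n * train), int(n * val)
    mean = series[:n_train].mean()             # statistics from training span only
    std = series[:n_train].std()
    norm = (series - mean) / std               # Z-score normalization
    return (norm[:n_train],
            norm[n_train:n_train + n_val],
            norm[n_train + n_val:])

speeds = np.arange(100, dtype=float)           # stand-in for 5-minute speed readings
train_x, val_x, test_x = chronological_split(speeds)
```

A chronological (rather than shuffled) split matters here: it prevents future readings from leaking into the training statistics or the model itself.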

Evaluation Metric.
To evaluate the performance of different methods, this work employs the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE) metrics, defined as

MAE = (1/n) Σ_t |v_t − v̂_t|,
MAPE = (100%/n) Σ_t |v_t − v̂_t| / v_t,
RMSE = sqrt((1/n) Σ_t (v_t − v̂_t)^2),

where v_t is the detected vehicle speed and v̂_t is the predicted vehicle speed.

Baselines.
We compare our framework MBAF-GCN with the following baselines: (1) HA: the historical average method, which models historical traffic as a seasonal process and uses the weighted average of historical seasons as the forecast value [27]; (2) ARIMA: with a Kalman filter [28]; (3) SVR: based on historical data, SVR uses linear support vector machines to train models, establish input-output relationships, and make predictions [29]; (4) FC-LSTM: recurrent neural networks with fully connected LSTM hidden units [30]; (5) WaveNet: its main component is causal convolution, a convolutional network architecture for sequence data [31]; (6) DCRNN: a diffusion convolutional recurrent neural network that captures spatial dependencies with graph convolutions formalized by a diffusion process and temporal dependencies with an encoder-decoder framework [32]; (7) GGRU: a graph-gated recurrent unit network [30]; (8) STGCN: a spatial-temporal graph convolutional model based on a fixed Laplacian matrix to capture spatial-temporal features [33]; (9) Graph WaveNet: integrates diffusion convolution and 1-D dilated convolution to capture spatial-temporal correlations [34].

We also visualized prediction curves during periods of smooth and severe traffic changes, respectively. The prediction curves generated by our proposed method and Graph WaveNet are both close to the ground truth during periods of smooth traffic changes [34]. However, during periods of severe traffic changes, our method outperforms Graph WaveNet in accurately fitting the ground-truth curves. This outcome can be attributed to the effective extraction of complementary and discriminative features by the temporal multiscale structure in our method.

Experimental Results
Analysis. We validated our model and nine baselines on the METR-LA and PEMS-BAY datasets for 15-minute, 30-minute, and 60-minute-ahead predictions and present the results in Table 1. Table 1 shows that traditional time-series prediction techniques (HA, VAR, and SVM) have the lowest accuracy and are inadequate for modeling nonlinear and complex spatial-temporal relationships. CNN-LSTM, an early deep learning technique, significantly enhances prediction accuracy by avoiding artificial assumptions and learning valuable features from data, but it is still incapable of modeling intricate spatial-temporal correlations. GCN-based schemes (STGCN and MSTGCN) offer higher accuracy due to their superiority in modeling complex nonlinear non-Euclidean structures and simultaneously modeling spatial and temporal dependencies. Recent studies, such as Graph WaveNet, still have limitations in modeling complex spatial-temporal movement patterns. In contrast, MBAF-GCN, which captures complementary temporal dependencies and utilizes the coarse-to-fine fusion design, outperforms these baseline schemes in prediction accuracy. The model significantly outperforms Graph WaveNet on the 30- and 60-minute-ahead predictions and is on par with it on the 15-minute horizon. To better illustrate the predictive power of the different models, we visualized the MAE, RMSE, and MAPE prediction errors of the different models on the METR-LA and PEMS-BAY datasets. The prediction errors of Graph WaveNet and MBAF-GCN are significantly lower than those of the other methods. At the same time, MBAF-GCN is more advantageous at different time intervals, especially at 30 min and 60 min, where it reaches the lowest prediction error.

Ablation Experiments.
To verify the effectiveness of the proposed modules in our scheme, we conducted two sets of ablative experiments as follows: (1) Effect of the Multibranch Structure. To verify the efficacy of the proposed multibranch structure, we established a control group using a global graph attention module and spatial-temporal layers in single-branch form. The objective was to examine whether the accuracy enhancement resulted from the complementary temporal-scale features. We created a single-branch model with nearly identical parameters to the multibranch settings (notably, we adjusted the convolution parameters in each module to ensure the output dimensionality was consistent with the multibranch model), while keeping the other parameters constant. For the experimental group, which featured the MBAF-GCN prototype structure, we eliminated the fusion module from the experimental setup and replaced it with an addition operation, to prevent the additional covariates introduced by the fusion module from interfering with the experimental outcomes. We used the addition operation to merge features from the multiple branches. We conducted controlled experiments on the two datasets, and the experimental results are shown in Table 2.
The experimental findings suggest that prediction accuracy is enhanced by explicitly exploiting the structural information of spatial-temporal features through a multibranch design with a parallel multiscale structure, which outperforms a sequential homogeneous network with comparable parameters. In addition, the results highlight the effectiveness of using the temporal scale as prior knowledge.
In this study, a global graph attention module is integrated into each branch to provide global spatial-temporal attention guidance and scale-aware attention. To investigate the impact of scale-specific spatial-temporal modeling, we visualize the heatmap of the adjacency weight matrix learned in each branch. Specifically, Figure 9 depicts the heatmaps of the global graph attention module in the tendency, coarse, and fine branches, learned from the METR-LA dataset under the same input settings.
The results demonstrate that the attention module in the tendency branch (Figure 9(a)) exhibits high values on distant nodes, suggesting its ability to model long-term spatial-temporal dependency. This is critical, as predicting the tendency of a time series requires merging all available information, even from distant nodes. The heatmaps of the coarse and fine branches (Figures 9(b) and 9(c)) show stronger correlations on the diagonal and are sparse in long-range connections, owing to the smaller receptive fields of these branches; the learned attention is therefore more focused on local correlations. Notably, although these attention modules sit in different branches, they provide scale-specific, complementary features that can be leveraged in subsequent steps.
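The heatmaps in Figure 9 are row-normalized attention matrices: each node distributes a unit of attention over all other nodes. A minimal sketch of how such a matrix arises is shown below; the pairwise score matrix and the row-softmax normalization are an assumed, generic formulation, not the paper's exact attention mechanism.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_matrix(scores):
    """Row-wise softmax: each row sums to 1, like an attention heatmap."""
    return [softmax(row) for row in scores]

# Toy 3-node score matrix (higher score = stronger correlation);
# a locality-biased branch would score nearby nodes higher.
scores = [[2.0, 0.5, 0.1],
          [0.5, 2.0, 0.8],
          [0.1, 0.8, 2.0]]
A = attention_matrix(scores)
print([round(v, 2) for v in A[0]])  # [0.73, 0.16, 0.11]
```

A diagonal-dominant score pattern, as in the coarse and fine branches, yields a heatmap concentrated near the diagonal, while a flatter score pattern, as in the tendency branch, spreads attention to distant nodes.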
(2) Effect of the Fusion Module. To validate the effectiveness of the proposed fusion module in fusing spatial-temporal features of different scales and to verify the coarse-to-fine fusion strategy, we conducted a series of ablation experiments. The control groups used concatenation and addition to fuse features from the different-scale branches, without distinguishing features of different scales or using any attention mechanism to adaptively adjust the fusion weights [34]. By comparing the results of these experiments, we can demonstrate the effectiveness of the proposed coarse-to-fine fusion strategy. The experiments were conducted on two datasets. As shown in Table 3, the multiscale fusion strategy is significantly better than the addition and concatenation feature-combination approaches. The concatenation or addition fusion strategy merges all branch features with equal weights and ignores the intrinsic correlation of temporal features at different scales, whereas the coarse-to-fine mechanism can enhance and recalibrate the temporal trend and refine the detailed prediction of the time series.
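The contrast between equal-weight addition and adaptive fusion can be sketched as follows. This is an assumed, simplified form of scale-aware fusion with one learned gate score per branch, not the paper's actual STFF module; the gate scores and feature vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def coarse_to_fine_fuse(tendency, coarse, fine, gate_scores):
    """Softmax-weighted sum of branch features: unlike plain addition,
    the normalized gates let the model emphasize one temporal scale."""
    w = softmax(gate_scores)  # one scalar weight per branch
    return [w[0] * t + w[1] * c + w[2] * f
            for t, c, f in zip(tendency, coarse, fine)]

# Hypothetical per-branch features and gate scores favoring the trend.
tendency = [0.6, 0.6]
coarse   = [0.3, 0.2]
fine     = [0.1, 0.05]
fused = coarse_to_fine_fuse(tendency, coarse, fine, [1.0, 0.5, 0.2])
print([round(v, 2) for v in fused])  # [0.4, 0.36]
```

Because the weights are normalized, the fused value stays a convex combination of the branch features; the gain over addition comes entirely from learning where each scale should dominate.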

Conclusion
The study proposes the MBAF-GCN model as a novel approach for traffic forecasting. This model employs a multibranch structure with a coarse-to-fine fusion design, offering advantages over comparative models. First, unlike traditional single-branch network designs, the proposed model leverages prior knowledge of spatial-temporal characteristics across different temporal scales to estimate traffic conditions in real time, capturing both temporal patterns and complex spatial dependencies. Second, each branch in the multibranch framework has its own loss supervision, which facilitates the learning process and enhances prediction accuracy.
Our study conducted extensive comparison experiments on two real-world datasets to evaluate the performance of the MBAF-GCN model. The results demonstrate the following: (1) The MBAF-GCN model outperforms the traditional single-branch prediction structure in accuracy. In particular, it shows significant improvements over the Graph WaveNet model in predicting values 30 and 60 minutes ahead. (2) Our study provides novel insights into the use of multibranch complementary temporal features in graph convolutional networks and the coarse-to-fine fusion of spatial-temporal features. The MBAF-GCN model achieves competitive results on real data compared with other models and is capable of continuously correcting the predicted traffic trends.
In conclusion, while the MBAF-GCN model has demonstrated high prediction accuracy and validity, it remains subject to external factors that affect real-world traffic conditions, such as weather changes, social events, and air quality. In future research, we plan to investigate how to incorporate these external factors into the model in a principled way to improve the realistic prediction accuracy of the multibranch network design.

Data Availability
The data used to support the findings of this study are available at https://gitee.com/zhouchena1/MTGNN/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.