A Multifeature Fusion Short-Term Traffic Flow Prediction Model Based on Deep Learnings

Short-term traffic flow prediction is an important component of intelligent transportation systems, supporting both trip planning and traffic management. Although existing prediction methods have been applied to traffic flow prediction, they cannot capture the complex multiple features of traffic flow, which leads to unsatisfactory short-term prediction results. In this paper, a multifeature fusion model based on deep learning methods is proposed. It consists of three modules: a CNN-Bidirectional GRU module with an attention mechanism (CNN-BiGRU-attention) and two Bidirectional GRU modules with an attention mechanism (BiGRU-attention). The CNN-BiGRU-attention module extracts local trend features and long-term dependent features of the traffic flow, and the two BiGRU-attention modules extract its daily and weekly periodic features. A feature fusion layer then fuses the features extracted by each module. The number of neurons in the model, the loss function, and other parameters such as the optimization algorithm are discussed and set through simulation experiments. Finally, the multifeature fusion model is trained and tested on training and test sets built from data collected in the field. The results indicate that the proposed model predicts traffic flow well and has good robustness. Furthermore, the multifeature fusion model is compared with baseline models on the same dataset, and the experimental results show that it has superior predictive performance.


Introduction
With the development of urbanization, the population and the number of motor vehicles in cities are increasing. The demand for travel, especially in the morning and evening rush hours, often saturates the road network, resulting in urban "traffic diseases." To address these problems, intelligent transportation systems (ITS) have been developed [1][2][3][4]. With the development of big data technology, ITS has started to change into data-driven ITS [5]. Short-term traffic flow prediction is one of the core components of ITS: it provides the basis for traffic management, traffic control, and traffic guidance, and it supports the travel decisions of travelers. However, short-term traffic flow has complex stochastic and nonlinear characteristics, which bring great challenges to traffic flow prediction, and how to accurately predict short-term traffic flow has long been a topic of concern for scholars in the field of traffic engineering. The methods proposed in early studies on short-term traffic flow forecasting fall into three main groups: parametric methods, nonparametric methods, and combined methods, which include both parametric and nonparametric methods. Parametric methods include the autoregressive integrated moving average model (ARIMA) and its variants [6,7]. Nonparametric methods include K-nearest neighbor nonparametric regression (KNN) [8], Kalman filters (KF) [9], support vector regression (SVR) [10], and artificial neural networks (ANN) [11]. Combined methods are a combination of two or more methods [12][13][14].
However, the development of data-driven ITS, and especially the development and widespread use of traffic information collection technologies such as induction detectors, geomagnetic detectors, radio frequency identification technology, radar detection, video detection, and floating vehicle detection [15][16][17][18][19], provides a large amount of data for traffic flow prediction. Parametric and nonparametric methods have difficulty dealing with such big traffic data. Therefore, deep learning methods [20][21][22], which have powerful data feature mining and nonlinear data fitting capabilities, have been applied to traffic flow prediction and have achieved some results [23][24][25][26]. However, existing deep learning-based methods for traffic flow prediction mainly consider the spatial and temporal correlation of traffic flow, without fully considering complex characteristics of traffic flow such as daily and weekly periodicity. In addition, although some combined deep learning methods use several different single models to extract multiple features of traffic flow, such as spatiotemporal correlation and periodicity, the spatiotemporal correlation and periodicity of traffic flow form a whole and should be considered jointly in the prediction model. Based on this, this paper designs a multifeature fusion model based on deep learning methods that considers the periodic features of traffic flow for traffic flow prediction. The main contributions are summarized as follows:
(1) A fusion feature model considering the periodic features of traffic flow is proposed, namely, the multifeature fusion model. In the model, the CNN-BiGRU module is designed, which treats the spatiotemporal features of traffic flow as a whole, where 1DCNN and BiGRU are used to extract the local trend features and the long-term temporal dependency features of traffic flow, respectively.
(2) In the multifeature fusion model, two two-layer BiGRU modules are designed to extract the daily and weekly periodicity features of traffic flow, respectively.
(3) In order to improve the prediction performance of the multifeature fusion model, an attention mechanism is designed for the CNN-BiGRU module and the two-layer BiGRU modules so that each module adaptively attends to the importance of the temporal and periodic features at different times.
(4) The multifeature fusion model is validated on traffic flow data collected in the field, and the experimental results show that the prediction performance of the multifeature fusion model is better than that of the baseline models.

Literature Review
In general, existing traffic flow prediction methods can be classified into parametric methods, nonparametric methods, deep learning methods, and combined methods.

Parametric Methods.
The parametric method is a modelling approach in which the structure of the model is predetermined based on theory, and the parameters of the model are calibrated with real traffic flow data. Levin and Tsao [27] applied a time series analysis method to predict the morning peak period traffic on a motorway and found that the ARIMA (0,1,1) model was statistically significant. Zhang et al. [28] developed a hybrid model in which spectral analysis techniques are used to extract the daily and weekly periodicity of traffic flows, and the ARIMA model is used to extract the general time trend characteristics of traffic flows. Subsequently, a number of ARIMA variants were applied to traffic flow prediction. For instance, Kohonen self-organizing ARIMA, a seasonal autoregressive moving average model, and a spatiotemporal autoregressive moving average model were used for traffic flow forecasting and achieved good results [29][30][31].

Nonparametric Methods.
Due to the strong randomness and nonlinearity of the state changes in traffic flow, the prediction results of parametric methods deviate to some degree from the actual traffic flow. Therefore, nonparametric methods have gradually replaced parametric methods in traffic flow prediction. Specifically, Ryu et al. [32] proposed a traffic flow prediction model that considers the spatiotemporal information associated with the predicted road section. The spatiotemporal information with the highest correlation to the predicted road section is first selected using a greedy algorithm, and then the traffic flow is predicted using KNN. Yan and Lv [33] proposed a hybrid classification and regression tree k-nearest neighbor model to predict short-term taxi demand. Okutani and Stephanedes [34] proposed two prediction models based on Kalman filter theory to predict traffic flow on streets within Nagoya. Guo et al. [35] proposed a hierarchical Kalman filter-based autoregressive moving average and generalized autoregressive conditional heteroskedasticity model for traffic flow speed prediction. Hu et al. [36] proposed a hybrid model to forecast short-term traffic flow based on particle swarm optimization (PSO) and support vector regression (SVR), in which PSO is used to find the optimal parameters of the SVR model. Lu and Zhou [37] proposed a Kalman filter traffic flow prediction model that takes structural deviations into account, where a polynomial is used to describe the evolutionary trend of structural deviations in traffic flow, and a Kalman filter model is used to describe the historical trend of traffic flow. Jiang et al. [38] proposed a support vector machine model with radial basis functions as kernel functions to predict traffic flow speed, and the experimental results showed that the prediction accuracy of the model was better than that of the traditional model. Wang and Shi [39] proposed a chaotic wavelet analysis-support vector machine model (C-WSVM), and the results showed that the C-WSVM model has better prediction performance and practicality. Feng et al. [40] proposed a new short-term traffic flow prediction model based on an adaptive multi-kernel support vector machine with spatiotemporal correlation. Wang et al. [41] proposed a combined support vector machine model to forecast short-term metro ridership, which includes a support vector machine overall online model (SVMOOL) and a support vector machine partial online model (SVMPOL). The SVMOOL model captures the periodic characteristics of passenger flow, and the SVMPOL model captures the nonlinear characteristics of traffic flow.
ANN [42] has been regarded as another popular method for traffic flow prediction due to its ability to handle large amounts of multidimensional data, the flexibility of its model structure, and its learning and generalization capabilities. ANN combined with the error backpropagation algorithm, i.e., the backpropagation neural network (BPNN) [43], was gradually applied to traffic flow prediction. Subsequently, a model incorporating wavelet analysis and a BP neural network [44] was applied to short-term traffic flow prediction. Then, a BPNN optimized by an adaptive differential evolution algorithm [45] was applied to short-term traffic flow prediction. All these methods have achieved good results.

Deep Learning Methods.
With the development of data collection and processing technology, traffic big data has emerged. However, traditional nonparametric methods have difficulty processing multisource data [46], and short-term traffic flow prediction methods have started to shift from nonparametric methods to deep learning methods [24,26,47,48]. For instance, Huang et al. [49] designed a combined prediction model consisting of a deep belief network (DBN) with unsupervised learning at the bottom and a multitask learning (MTL) layer for supervised prediction at the top, in which the multitask learning layer can leverage the weight sharing in the DBN to provide better prediction results. Lv et al. [50] proposed a stacked autoencoder model that is trained in a greedy layer-wise manner to learn traffic flow features.
One of the difficulties in short-term traffic flow prediction is to capture the spatiotemporal correlation in traffic flow data. In terms of temporal characteristics, recurrent neural networks (RNNs) are a deep learning structure mainly applied to time series data. RNNs have a temporal memory and can be applied to the prediction of correlated time series data [51]. However, traditional RNNs cannot capture the long-term dependence in traffic flow data due to the vanishing and exploding gradient problems, so Ma et al. [52] applied long short-term memory (LSTM) to traffic flow prediction. Subsequently, Zhao et al. [53] proposed a two-dimensional LSTM network consisting of many memory units that considers spatiotemporal correlations, and the experimental results showed that the proposed network had better prediction performance than traditional prediction methods. Wang et al. [54] proposed a deep learning framework based on paths. In the framework, the road network is divided into critical paths, and a bidirectional long short-term memory network is then used to model the traffic flow of each critical path. Cui et al. [55] proposed a stacked bidirectional and unidirectional LSTM network structure for predicting road network traffic with missing values. Zheng and Huang [56] proposed a traffic flow prediction model based on LSTM, and experimental results showed that the proposed model outperformed the classical model. GRU, a well-known variant of the LSTM, has also been applied to traffic flow prediction [57].
In terms of spatial properties, CNN is also a typical structure in deep learning. It is a feedforward neural network for problems with grid-like structured data, which can reduce the complexity of the model while accurately extracting data features and can also better extract spatial correlations in traffic flow data [58]. Zhang et al. [59] proposed a CNN model for short-term traffic flow prediction, where the optimal input to the model is determined by a spatial-temporal feature selection algorithm, and experimental results showed that the model outperformed the baseline model. An et al. [60] proposed a traffic flow prediction method based on a fuzzy convolutional neural network, which for the first time applied CNN to uncertain traffic incident information and used a fuzzy approach to generalize traffic incident characteristics. Tian et al. [61] proposed a hybrid lane occupancy prediction model called 2LayersCapsNet, which combines an improved capsule network and CNN.

Combined Methods.
Combined models are useful when a single model fails to exhibit good predictive performance, which is a common situation in complex data forecasting [46]. It is difficult for a single forecasting model to capture both the strong complexity and the strong variability of traffic flow, so a combined prediction model is necessary. Specifically, to exploit the good linear fitting capability of ARIMA models and the powerful nonlinear mapping capability of artificial neural network models, Li et al. [62] proposed a combined ARIMA and radial basis function artificial neural network model to predict short-term traffic flows. Yao et al. [63] proposed a linear hybrid method and a nonlinear hybrid method to predict short-term traffic flows and decomposed the traffic flow data into similar, fluctuating, and irregular components. Autoregressive integrated moving average and generalized autoregressive conditional heteroskedasticity models were used to predict the similar and fluctuating components, and Markov models with state membership and wavelet neural networks were used to predict the irregular component. Li et al. [64] analyzed the correlation between the predicted and historical time windows based on the grey correlation coefficient method and used the rank index method to establish a combined prediction model based on ARIMA, BPNN, and SVR. A neural network training algorithm combining exponential smoothing and the Levenberg-Marquardt algorithm was proposed to improve the generalization of neural networks previously used for short-term traffic prediction [65]. Liu et al. [13] proposed a hybrid forecasting model based on a combination of neural network and KNN methods for short-term traffic prediction. Gu et al. [66] proposed a model incorporating deep learning to predict lane-level speeds. In the model, entropy-based grey correlation analysis is first used to select the lanes with the highest correlation with the predicted lanes in order to extract spatial features, and LSTM and GRU are then combined to build a two-layer deep learning framework that extracts temporal features of traffic flow. The experimental results showed that the model outperformed the baseline model in prediction. Ma et al. [67] proposed a novel deep learning-based approach to daily traffic flow prediction incorporating contextual factors. First, a specific CNN is used to extract inter-day and intraday traffic flow features; second, the extracted features are used as input to an LSTM to learn the temporal features of the traffic flow; finally, the traffic flow is predicted by combining the contextual information of historical days. Experimental results showed that the robustness and prediction performance of the model outperformed the benchmark model.
With the development of deep learning, and especially since the proposal and successful application of the attention mechanism [68], attention has received interest from scholars in the field of traffic, and some results that combine it with CNN or RNN variants (LSTM and GRU) for short-term traffic flow prediction have emerged. For example, Liu et al. [69] proposed a CNN model based on an attention mechanism to predict traffic flow speed. The input to the model is a three-dimensional data matrix consisting of traffic flow speed, flow rate, and time occupancy, the spatiotemporal features are extracted by convolutional units, and simulation experiments showed better prediction performance than existing models. Wu et al. [70] proposed a traffic flow prediction model that includes a data preprocessing module and a traffic flow prediction module. The data preprocessing module repairs missing values in the dataset, and the traffic flow prediction module is a combined LSTM deep learning model based on an attention mechanism. Experimental results showed that the prediction performance of the model outperforms other deep learning methods (RNN and CNN). Ma et al. [71] proposed a fuzzy logic-based hybrid model that exploits the complementary advantages of nonparametric and deep learning methods. First, the model uses two submodels, KNN and LSTM, to extract features of the spatiotemporal correlation of traffic flow and of the influence of specific contextual factors on traffic flow. Second, dynamic weights based on the fusion mechanism are used to optimize the hybrid model. Simulation experiments showed that the model has better prediction performance and robustness than other state-of-the-art models. Ren et al. [72] proposed a combined deep learning prediction (CDLP) model, which consists of two parallel single deep learning models, that is, a CNN-LSTM-attention model and a CNN-GRU-attention model. In addition, a dynamic optimal weighting combination algorithm was proposed to combine the outputs of the two single models, and experimental results showed that this model has better prediction performance and robustness than state-of-the-art prediction models.
In summary, as research on short-term traffic flow prediction continues to grow, combined prediction models have received more and more attention, and the application of combined deep learning models in particular has achieved considerable success. However, most studies either fuse multiple single models or only build a fusion model of simple spatiotemporal characteristics of traffic flow, which cannot reflect the spatiotemporal correlation and periodicity of traffic flow as a unified whole.
In this paper, we analyze the complex characteristics of traffic flow, including the relationship between spatiotemporal and periodic features, and apply CNN, bidirectional GRU, and an attention mechanism to build a multifeature fusion model for short-term traffic flow prediction.

Method
CNN.
A CNN is a deep feedforward neural network, which mainly consists of convolutional layers, pooling layers, and fully connected layers [73]. The convolutional layer is the most important part of the CNN: the local features of the input data are obtained by sliding filters over the input, and the number of convolutional kernels in the layer equals the number of output features of the layer. Typically, CNN models contain multiple convolutional layers, and the network can generate an excessive number of parameters. To reduce the number of parameters, the pooling layer usually performs a downsampling operation on the output features of the convolutional layer while keeping the overall features unchanged, in order to extract important features and prevent overfitting of the model. The fully connected layer is usually at the end of the CNN, and its main role is to flatten the features obtained by convolution and pooling into a feature vector for classification or regression.
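The sliding-filter and downsampling operations described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `conv1d_valid` and `max_pool1d`, the kernel values, and the sample flows are assumptions made for the example and are not taken from the model in this paper.

```python
def conv1d_valid(series, kernel, bias=0.0):
    """Slide `kernel` over `series` (stride 1, no padding) and apply ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(series) - k + 1):
        window = series[start:start + k]
        activation = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, activation))
    return out


def max_pool1d(features, pool_size=2):
    """Downsample by keeping the maximum of each non-overlapping window."""
    return [max(features[i:i + pool_size])
            for i in range(0, len(features) - pool_size + 1, pool_size)]


flows = [10.0, 12.0, 15.0, 14.0, 9.0, 11.0]   # illustrative values only
trend_kernel = [-1.0, 1.0]                     # responds to local increases
feature_map = conv1d_valid(flows, trend_kernel)
pooled = max_pool1d(feature_map)
print(feature_map, pooled)
```

A kernel of size 2 produces one output per pair of adjacent samples, so a sequence of length 6 yields 5 local trend features, and pooling with a window of 2 keeps the strongest response in each pair.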

Bidirectional GRU.
In order to address the shortcoming of traditional RNNs, which fail to capture the long-term dependence of time series, LSTM and GRU were proposed in succession. GRU and LSTM networks have not only a short-term memory but also a long-term memory. In particular, the GRU is a further simplification of the LSTM [74]: the three gating units of the LSTM are reduced to two gates (update gate and reset gate), which improves the computational efficiency of the network. The structure of the GRU unit is shown in Figure 1, where the purple line indicates the update gate and the red line indicates the reset gate, defined as $z_t$ and $r_t$, respectively. The role of the update gate in the GRU is to determine whether the hidden layer state $h_{t-1}$ is updated to a new hidden layer state $h_t$, and the role of the reset gate is to control the extent to which the hidden layer state $h_{t-1}$ at moment $t-1$ is discarded. Equation (1) represents the computation process for each state within each time step in the GRU,
where $\circ$ represents the Hadamard product, $X_t$ represents the input at moment $t$, $W_z$, $W_h$, and $W_g$ represent the weight matrices associated with the input, $U_z$, $U_h$, and $U_g$ represent the weight matrices associated with $h_{t-1}$, and $b_z$, $b_r$, $b_h$, and $b_g$ represent the biases.
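As an illustration of the gating described above, the following minimal sketch implements one GRU time step in the widely used standard form. It is an assumption that this form matches Equation (1) term by term: the reset-gate parameters `W_r` and `U_r`, the convention that an update gate close to 1 selects the candidate state, and the parameter dictionary `p` are choices made for the example, and the parameters $W_g$, $U_g$, and $b_g$ are not modeled here.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-based matrices and vectors."""
    return [sum(wi * xi for wi, xi in zip(W_row, x))
            + sum(ui * hi for ui, hi in zip(U_row, h)) + b_i
            for W_row, U_row, b_i in zip(W, U, b)]


def gru_step(x_t, h_prev, p):
    """One GRU time step. `p` maps names such as "W_z" to parameters."""
    z = [sigmoid(a) for a in affine(p["W_z"], x_t, p["U_z"], h_prev, p["b_z"])]
    r = [sigmoid(a) for a in affine(p["W_r"], x_t, p["U_r"], h_prev, p["b_r"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]   # Hadamard product r_t * h_{t-1}
    cand = [math.tanh(a)
            for a in affine(p["W_h"], x_t, p["U_h"], gated, p["b_h"])]
    # Assumed convention: z_t close to 1 means "take the candidate state".
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


hidden, inputs = 2, 1
params = {}
for gate in ("z", "r", "h"):
    params["W_" + gate] = [[0.5] * inputs for _ in range(hidden)]
    params["U_" + gate] = [[0.1] * hidden for _ in range(hidden)]
    params["b_" + gate] = [0.0] * hidden

h = [0.0] * hidden
for x in ([0.2], [0.4], [0.6]):      # illustrative normalized flows
    h = gru_step(x, h, params)
print(h)
```

With only two gates, each step needs three affine maps instead of the four used by an LSTM cell, which is the source of the efficiency gain mentioned above.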
Based on the GRU network, the bidirectional GRU network has been further developed [75]. A bidirectional GRU network is made up of two GRU layers that process the sequence in opposite directions, as shown in Figure 2. In the figure, $x_t$ is the input to the GRU, $h_f$ is the output of the forward GRU layer, and $h_b$ is the output of the reverse GRU layer. The input to the BiGRU network contains two time series, from the past and from the future. At each moment, the input time series is fed into the two opposite GRU layers, and the outputs $[h_1, h_2, h_3, h_4]$ are jointly determined by these two layers.
At each time node $x_t$, this network has two hidden layers with opposite orderings. The neurons in one hidden layer are ordered from left to right, and those in the other hidden layer are ordered from right to left. Because there are two hidden layers at any moment $t$, the network consumes twice the amount of storage for parameters such as weights and offsets. The final output of the network is the fusion of the outputs of the two hidden layers. In addition, there is no information interaction between the two opposite hidden layers: they are computed independently, and their state output vectors are combined only at the final output, which ensures that the unfolded graph is acyclic.
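The two independent passes and their combination at the output can be sketched as follows. This is a sketch under stated assumptions: `run_direction`, `bidirectional`, and `toy_step` are hypothetical names, the toy recurrent step (an exponential moving average) merely stands in for a GRU step, and pairing the two states per time step is only one possible way to fuse them.

```python
def run_direction(step, sequence, h0):
    """Apply a recurrent `step(x, h) -> h` along `sequence`; return all states."""
    states, h = [], h0
    for x in sequence:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(step_fwd, step_bwd, sequence, h0=0.0):
    """Return [(h_f, h_b), ...], one pair of states per time step."""
    forward = run_direction(step_fwd, sequence, h0)
    # The backward layer reads the sequence from right to left; its states
    # are reversed again so that index i refers to the same time step.
    backward = run_direction(step_bwd, sequence[::-1], h0)[::-1]
    return list(zip(forward, backward))


# Toy stand-in for a GRU step: an exponential moving average of the input.
def toy_step(x, h):
    return 0.5 * h + 0.5 * x


outputs = bidirectional(toy_step, toy_step, [1.0, 2.0, 3.0, 4.0])
print(outputs)
```

The two passes never exchange information while they run, so the forward state at a given step summarizes only earlier inputs and the backward state summarizes only later ones.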

Attention Mechanism.
The attention mechanism assigns different weights to the input features of a model in order to highlight the important factors that influence the model. The function of the attention mechanism can be understood as the process of filtering important information from multiple pieces of information, focusing on the important information and ignoring the unimportant information. The process of focusing on the important information is also the process of calculating the weight coefficients, and the more important the information, the larger the weight coefficient assigned. The process of calculating the context vectors and weights when the attention mechanism is applied to a deep learning model is as follows: Assuming that the output states of the hidden layer of the deep learning model are $h_1, h_2, \ldots, h_i, \ldots, h_t$, the context vector can be calculated as $C_t$: where $\alpha_{t,i}$ denotes the attention parameter, the corresponding weight of $h_i$, and the sum of the weights is 1. The attention parameter can be calculated as where $e_{t,i}$ is the alignment model, which scores the input at moment $i$ and the output at moment $t$. It is calculated as follows: where $W_a$, $U_a$, and $b_a$ are the parameters of the feedforward neural network, and $s_{t-1}$ can be calculated as follows: where $g(\cdot)$ denotes the deep learning network. Based on (5), the output of the attention mechanism can be calculated as where softmax is the activation function.
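The weighting process described above can be sketched as follows. The sketch assumes an additive alignment model of the form $e_{t,i} = \tanh(W_a s_{t-1} + U_a h_i + b_a)$ with vector-valued $W_a$ and $U_a$, followed by a softmax over the scores; this form, the function names, and all numeric values are assumptions for illustration and may differ from the exact equations used in the model.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the returned weights sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_context(hidden_states, s_prev, W_a, U_a, b_a):
    """Return (weights alpha_{t,i}, context vector C_t)."""
    # Assumed additive alignment model:
    # e_{t,i} = tanh(W_a . s_{t-1} + U_a . h_i + b_a)
    scores = [math.tanh(dot(W_a, s_prev) + dot(U_a, h) + b_a)
              for h in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: weighted sum of the hidden states.
    context = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
               for d in range(dim)]
    return alphas, context


states = [[0.1, 0.0], [0.4, 0.2], [0.9, 0.5]]    # illustrative h_1..h_3
alphas, context = attention_context(states, s_prev=[0.3, 0.3],
                                    W_a=[0.5, 0.5], U_a=[1.0, 1.0], b_a=0.0)
print(alphas, context)
```

Because the weights are normalized, the context vector is a convex combination of the hidden states, so states with higher alignment scores contribute more to it.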

Model
Real short-term traffic flow often exhibits complexity and randomness, which requires traffic flow prediction models that can capture multiple features of traffic flow. CNNs can extract local trend features of traffic flows, while bidirectional GRU networks can obtain long-term dependent features of traffic flows not only from the past but also from the future, and they achieve temporal feature extraction by fusing past and future features. At the same time, the attention mechanism enables the model to focus on important features. Based on this, this paper proposes a short-term traffic flow prediction model based on a deep learning method of multifeature fusion, which consists of a CNN-BiGRU-attention module and two BiGRU-attention modules. The model structure is shown in Figure 3.

In addition, from a layer perspective, the model consists of an input layer, a hidden layer, a feature fusion layer, and an output layer. The input layer contains a parallel composition of the historical time series and the daily and weekly periodic series. The historical time series $X_T$ is the sequence of traffic flows from time $t-n$ to $t$ and can be represented as

$$X_T = [x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}, x_t],$$

where $x_t$ is the traffic flow at time $t$. The daily periodic traffic flow sequence $X_T^d$ can be expressed as

$$X_T^d = [x_{t-n}^d, x_{t-n+1}^d, \ldots, x_{t-1}^d, x_t^d],$$

where $x_t^d$ indicates the traffic flow corresponding to $x_t$ on the previous day. The weekly periodic traffic flow sequence $X_T^g$ can be expressed as

$$X_T^g = [x_{t-n}^g, x_{t-n+1}^g, \ldots, x_{t-1}^g, x_t^g],$$

where $x_t^g$ indicates the traffic flow corresponding to $x_t$ in the previous week. The hidden layer contains three parallel branches: one CNN-BiGRU-attention layer and two BiGRU-attention layers. The 1DCNN is chosen as the convolution layer of the model due to the one-dimensional and periodic nature of the traffic flow sequence. The dropout layer is followed by the feature fusion layer, where the features of the traffic flow are fused and output to the output layer for prediction.
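The three parallel input sequences can be built from a single flow series by index shifting. The sketch below assumes 5-minute sampling, so that one day corresponds to 288 samples and one week to 2,016 samples; the function name `build_inputs` and the stand-in series are hypothetical, and aligning the periodic windows on exactly the same time-of-day range is an assumption of the example.

```python
SAMPLES_PER_DAY = 24 * 60 // 5          # 288 five-minute intervals per day
SAMPLES_PER_WEEK = 7 * SAMPLES_PER_DAY  # 2016 intervals per week


def build_inputs(series, t, n):
    """Return (X_T, X_T^d, X_T^g) for the window ending at index t."""
    if t - n - SAMPLES_PER_WEEK < 0:
        raise ValueError("not enough history for the weekly window")
    recent = series[t - n:t + 1]
    daily = series[t - n - SAMPLES_PER_DAY:t + 1 - SAMPLES_PER_DAY]
    weekly = series[t - n - SAMPLES_PER_WEEK:t + 1 - SAMPLES_PER_WEEK]
    return recent, daily, weekly


series = list(range(10 * SAMPLES_PER_DAY))   # stand-in: value equals its index
recent, daily, weekly = build_inputs(series, t=2500, n=11)
print(recent[-1], daily[-1], weekly[-1])
```

Each window holds n + 1 values, so n = 11 gives three sequences of length 12. The first week of any dataset cannot serve as a prediction target under this scheme because it has no weekly counterpart.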

Data Processing and Dataset.
The cross-sectional traffic flow collected at the intersection of Shandong Road and Minjiang Road in Qingdao, China, is used as the dataset. It contains 101 consecutive days of traffic flow data from February 1 to May 12, 2019, for a total of 29,088 raw records at 5-minute intervals. Lagrange interpolation is used to repair missing and abnormal data. The data are then normalized using min-max normalization to obtain the dataset for the model. The 87 days of data from February 1 to April 28 are used as the training set, and the 14 days of data from April 29 to May 12 are used as the test set.
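The preprocessing steps described above can be sketched as follows. The helper names, the choice of four neighbouring points for the interpolation, and the synthetic series are assumptions for illustration; only the record counts (101 days of 288 records each, split 87/14) follow the description of the dataset.

```python
def lagrange_fill(xs, ys, x):
    """Evaluate at `x` the Lagrange polynomial through the points (xs, ys)."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                basis *= (x - xm) / (xj - xm)
        total += yj * basis
    return total


def min_max_normalize(values):
    """Scale values to [0, 1]; also return (min, max) for de-normalization."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)


def split_by_days(values, train_days, samples_per_day=288):
    """Split a series into training and test parts at a whole-day boundary."""
    cut = train_days * samples_per_day
    return values[:cut], values[cut:]


# Fill a missing reading at index 2 from its four neighbours (toy values).
known_idx, known_val = [0, 1, 3, 4], [0.0, 1.0, 9.0, 16.0]
filled = lagrange_fill(known_idx, known_val, 2)

raw = [float(i % 288) for i in range(101 * 288)]   # stand-in for 29,088 records
scaled, (lo, hi) = min_max_normalize(raw)
train, test = split_by_days(scaled, train_days=87)
print(filled, len(train), len(test))
```

Keeping the minimum and maximum allows predictions to be mapped back to vehicle counts before the error metrics are computed.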

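Experimental Environment and Model Evaluation Index Selection.
The software and hardware conditions of the experimental environment in this paper are shown in Table 1.
In order to evaluate the performance of the fused feature model, three evaluation metrics were chosen, namely, the mean absolute percentage error (MAPE), the mean absolute error (MAE), and the root mean square error (RMSE), which are calculated as follows:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2},$$

where $n$ is the total number of samples in the test set, $y_i$ is the actual value of the $i$th sample, and $\hat{y}_i$ is the predicted value of the $i$th sample.

Model Parameter
Loss functions such as the mean square error function are generally used. Due to the convenience of calculation, the mean square error function is chosen as the loss function in the fusion feature model, and the calculation formula is as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,$$

where $y_i$ is the actual value of the $i$th sample, $\hat{y}_i$ is the predicted value of the $i$th sample, and $n$ is the number of samples.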
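The three evaluation metrics and the mean square error loss can be computed as in the following sketch, which uses their standard definitions. The function names and the sample values are illustrative assumptions and are not results from the experiments.

```python
import math


def mse(actual, predicted):
    """Mean square error, used as the training loss."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((y - p) / y)
                       for y, p in zip(actual, predicted)) / len(actual)


y_true = [100.0, 200.0, 400.0]     # illustrative values, not from the paper
y_pred = [110.0, 190.0, 400.0]
print(mape(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
```

MAPE is undefined when an actual value is zero, so intervals with zero flow would have to be excluded or handled separately before this metric is applied.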

Setting the Number of Neurons in the Model.
Before the model is trained, the number of neurons in the input and hidden layers of the model should be set. (The model in this paper predicts the traffic flow value at the next moment from a sequence of historical traffic flows, so the number of neurons in the output layer of the model is set to 1; refer to Section 4 for details.) The following is the process of setting the number of neurons in the input and hidden layers.
Figure 3: Structure of the multifeature fusion model.

Journal of Advanced Transportation
To obtain the appropriate number of neurons for the input layer, we select 6, 12, 18, and 24 as candidate numbers of input neurons, train the model with each, and choose the best value by error analysis on the test set. Similarly, for the number of neurons in the BiGRU layer, the four candidates 16, 32, 64, and 128 are used to train the model. The optimal number of neurons in the input and hidden layers of the neural network is determined by error analysis on the test set. Meanwhile, in the 1DCNN layer, feature extraction is implemented through convolutional kernels, and the kernel size is set to 2 × 1, i.e., filters = 64 and kernel_size = 2. The ReLU function is chosen as the activation function for the convolutional layer. It is calculated as follows:

$$f(x) = \begin{cases} x, & x > 0, \\ 0, & x \le 0, \end{cases}$$

where $x$ is the input to the activation function.
In the Dropout layer, the dropout rate is set to 20%. In addition, the number of epochs is set to 300, and the batch size is set to 256.
For the error analysis of the test set, MAPE is selected as the main evaluation metric, and MAE and RMSE are selected as auxiliary evaluation metrics. The results of the evaluation metrics for the test set with different numbers of neurons in the model input layer and the bidirectional GRU network, including MAPE, MAE, and RMSE, were obtained, as shown in Table 2.
From Table 2, it can be found that the model has the strongest generalization ability when the number of neurons in the input layer is 12, and the number of neurons in the BiGRU network is 128, so we choose 12 and 128 as the numbers of neurons in the input layer of the model and the BiGRU network.
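The selection procedure described above amounts to a search over the candidate sizes with the test-set MAPE as the criterion. The sketch below assumes that all combinations are evaluated as a full grid; `select_sizes` is a hypothetical helper, and `fake_mape` is a purely synthetic scoring function that stands in for training and testing the model. It does not reproduce the values in Table 2.

```python
from itertools import product

INPUT_SIZES = (6, 12, 18, 24)      # candidate numbers of input neurons
BIGRU_UNITS = (16, 32, 64, 128)    # candidate numbers of BiGRU neurons


def select_sizes(evaluate):
    """Return the (input_size, units) pair with the lowest test-set MAPE."""
    results = {(n_in, units): evaluate(n_in, units)
               for n_in, units in product(INPUT_SIZES, BIGRU_UNITS)}
    best = min(results, key=results.get)
    return best, results


# Placeholder scorer: a synthetic bowl-shaped function, NOT the paper's Table 2.
def fake_mape(n_in, units):
    return abs(n_in - 18) + abs(units - 32) / 16


best, results = select_sizes(fake_mape)
print(best, len(results))
```

In practice `evaluate` would train the model with the given sizes and return the MAPE on the test set, with MAE and RMSE recorded alongside as auxiliary metrics.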

Optimization Algorithm Setup.
In the training process of deep learning models, optimization algorithms are used to iteratively update the model parameters in order to reduce the value of the loss function, so that the training process becomes stable as the number of iterations increases. The mainstream optimization algorithms include RMSProp and Adam. Both are used to train the fused feature model, and the optimization algorithm is selected according to the generalization capability of the resulting model. The results of the three evaluation metrics obtained with the RMSProp algorithm and the Adam algorithm are shown in Table 3.
As can be seen in Table 3, MAPE, MAE, and RMSE are all smaller when the multifeature fusion model is trained using the Adam algorithm than when it is trained using the RMSProp algorithm.
The results indicate that the Adam algorithm is more effective than the RMSProp algorithm, so it is selected as the optimization algorithm for the multifeature fusion model.
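The difference between the two optimizers can be seen in their update rules. The following sketch applies the standard RMSProp and Adam updates to a one-dimensional toy loss; the learning rates, the number of steps, and the toy loss itself are assumptions for illustration and say nothing about the settings used to train the model.

```python
import math


def rmsprop_minimize(grad, w, steps, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: scale each step by a running mean of squared gradients."""
    v = 0.0
    for _ in range(steps):
        g = grad(w)
        v = rho * v + (1.0 - rho) * g * g
        w -= lr * g / (math.sqrt(v) + eps)
    return w


def adam_minimize(grad, w, steps, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: RMSProp-style scaling plus a momentum term and bias correction."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)             # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def grad(w):
    return 2.0 * (w - 3.0)      # gradient of the toy loss (w - 3)^2


w_rmsprop = rmsprop_minimize(grad, w=0.0, steps=2000)
w_adam = adam_minimize(grad, w=0.0, steps=2000)
print(w_rmsprop, w_adam)
```

Both rules normalize the step by a running estimate of the squared gradient; Adam additionally smooths the gradient itself, which is one common explanation for its more stable behaviour on noisy losses.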

Results and Analysis.
After determining the parameters of the model, the designed training and test sets are used to validate the predictive performance of the multifeature fusion model. e loss function curves generated by the model during the training process are shown in Figure 4. From Figure 4, it can be found that as the epoch increases, the loss function curves of the training and test sets decrease rapidly and steadily and nally converge to a constant 0, indicating that the design of the multifeature fusion model is reasonable. Figure 5 shows the prediction results of the multifeature fusion model in the test set. It can be found that the multifeature fusion model can t the actual tra c ow in the test set very well; speci cally, the absolute error of the model at each moment is found to be between [−60,60] from the error curve graph.
In addition, to further verify the robustness of the multifeature fusion model, Figure 6 shows the MAPE plot of the model on the test set. As can be seen from the graph, the MAPE curve gradually decreases from its maximum value to an inflection point, then slowly increases and gradually converges to 5.52%, which indicates that the multifeature fusion model has good robustness and low error and can therefore achieve traffic flow prediction well.
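For reference, the three evaluation metrics can be computed as in the sketch below, which uses their standard definitions. The `cumulative_mape` helper reflects one possible reading of a MAPE curve that converges to a final value, namely the MAPE over the first k test samples; whether Figure 6 is constructed this way is an assumption, and all function names are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def cumulative_mape(actual, predicted):
    """MAPE over the first k samples for k = 1..n (assumed reading of a MAPE curve)."""
    curve, running = [], 0.0
    for k, (a, p) in enumerate(zip(actual, predicted), start=1):
        running += abs(a - p) / abs(a)
        curve.append(100.0 * running / k)
    return curve
```

Under this reading, the last point of the cumulative curve equals the MAPE over the whole test set.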
To further validate the feasibility of the multifeature fusion model, four groups of comparison models are considered. First, the ability of the model to extract long-term dependent features and local features of the traffic flow is examined. A variant in which the CNN-BiGRU module is replaced by a Conv-BiGRU module (with the other modules retained) is selected as the comparison model. The Conv-BiGRU module consists of a convolutional layer and a BiGRU network arranged in parallel, so that it extracts the local trend features and the long-term dependent features of the traffic flow individually. This variant then fuses the long-term dependent features, local trend features, and periodic features (daily and weekly) of the traffic flow through the feature fusion layer and makes the prediction. Second, the impact of periodic features on the multifeature fusion model is verified. Short-term traffic flows usually exhibit strong periodicity, and an advantage of the model is that it accounts for this by using two BiGRU-attention modules to extract the daily and weekly periodicity of traffic flows, respectively. A model containing only the CNN-BiGRU-attention module is used as the comparison model for validation. Third, because periodicity includes both daily and weekly components, models that consider only the daily periodicity and only the weekly periodicity, respectively, are used as comparison models. Fourth, a model without the attention mechanism in each module is used as a comparison model. These comparison models and the multifeature fusion model described above were trained and tested, and the corresponding MAPE results are shown in Figure 7.
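A comparison of this kind can be summarized per model variant by the minimum, median, and maximum of its MAPE values, which are the statistics discussed for Figure 7. The sketch below is a minimal illustration under the assumption that each variant yields a list of MAPE values (for example, from repeated runs); the function names and the input layout are hypothetical, and no values from the paper are reproduced.

```python
from statistics import median


def summarize_mape(mape_runs):
    """Map each model variant to the (min, median, max) of its MAPE values."""
    return {
        name: (min(values), median(values), max(values))
        for name, values in mape_runs.items()
    }


def variants_worse_on_all_stats(summary, reference):
    """List the variants whose min, median, and max MAPE all exceed the reference's."""
    ref_stats = summary[reference]
    return [
        name
        for name, stats in summary.items()
        if name != reference and all(s > r for s, r in zip(stats, ref_stats))
    ]
```

Calling `summarize_mape` on a dictionary keyed by variant name and then `variants_worse_on_all_stats` with the full model as the reference returns the variants that the full model outperforms on all three statistics.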
From Figure 7, it can be found that the maximum, minimum, and median MAPE values of the multifeature fusion model containing the CNN-BiGRU module are smaller than those of the model containing the Conv-BiGRU module, indicating that the feature extraction capability of the CNN-BiGRU module is better than that of the Conv-BiGRU module. This is because the local trend features and long-term dependent features of the traffic flow are intertwined and interact with each other, so extracting them individually, as the Conv-BiGRU module does, is less effective. Furthermore, the maximum, minimum, and median values of the multifeature fusion model are smaller than those of the model with only the CNN-BiGRU-attention module, because periodic features play an important role in short-term traffic flow prediction. In addition, Figure 7 shows that the MAPE of the multifeature fusion model is smaller than that of the model without the attention mechanism. This indicates that the attention mechanism in the multifeature fusion model improves the prediction accuracy by focusing on the important features extracted by each module.
Finally, the proposed multifeature fusion model is compared with existing baseline models: the LSTM model, GRU model, CNN-LSTM-attention model, CNN-GRU-attention model, and CDLP model [72]. The LSTM and GRU models are each composed of one input layer, two hidden layers (LSTM layers and GRU layers, respectively), and one output layer. The CNN-LSTM-attention model is composed of an input layer, a hidden layer, and an output layer, where the hidden layer consists of a convolutional layer, two LSTM layers, and an attention mechanism layer connected sequentially; the CNN-GRU-attention model has the same structure as the CNN-LSTM-attention model. The parameters of the five baseline models are set as in the multifeature fusion model. The prediction errors of the different models in terms of the performance metrics are shown in Table 4, from which it can be found that the multifeature fusion model has the lowest prediction error.
This is because the LSTM and GRU models mainly consider the temporal characteristics of the traffic flow, that is, the long- and short-term dependence, whereas the CNN-GRU-attention and CNN-LSTM-attention models consider both the spatial and temporal characteristics and therefore achieve lower prediction error than the LSTM and GRU models. The CDLP model is a combined prediction model based on the CNN-LSTM-attention model and the CNN-GRU-attention model, which also considers only the spatiotemporal characteristics of the traffic flow. The multifeature fusion model extracts the spatiotemporal, weekly, and daily characteristics of the traffic flow by using three different modules of the combined deep learning method, so its prediction performance is better than that of the baseline models.
In addition, the training times of the different models are shown in Table 5. Among the combined models with higher prediction accuracy, the training time of the multifeature fusion model is the same as that of the CNN-GRU-attention model, while its MAPE, RMSE, and MAE are lower by 0.19%, 0.71, and 0.35, respectively. Furthermore, the training time of the multifeature fusion model is shorter than that of the CNN-LSTM-attention model and the CDLP model, while the prediction accuracy is improved in both cases, as reflected in Table 4. This is because the model uses the CNN-BiGRU-attention module, and the GRU is a simplification of the LSTM, so the multifeature fusion model takes less time to train than the CNN-LSTM-attention model and the CDLP model (which uses the CNN-LSTM-attention module). Therefore, the multifeature fusion model has superior prediction performance.
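The statement that the GRU is a simplification of the LSTM can be illustrated by counting trainable parameters in the textbook formulations: an LSTM layer has four weight blocks (input, forget, and output gates plus the cell candidate), whereas a GRU layer has three (update gate, reset gate, and candidate state). The sketch below assumes one bias vector per block and an input dimension of 1, both of which are illustrative assumptions; some library implementations add extra bias terms, so exact counts in practice may differ.

```python
def lstm_params(input_dim, hidden_units):
    """Parameters of one LSTM layer: four blocks of input weights, recurrent weights, and bias."""
    return 4 * (hidden_units * (input_dim + hidden_units) + hidden_units)


def gru_params(input_dim, hidden_units):
    """Parameters of one GRU layer: three blocks of input weights, recurrent weights, and bias."""
    return 3 * (hidden_units * (input_dim + hidden_units) + hidden_units)


def bidirectional(params_one_direction):
    """A bidirectional wrapper runs two independent copies of the layer."""
    return 2 * params_one_direction


hidden_units = 128  # BiGRU width selected in the text
input_dim = 1       # assumption: one traffic flow value per time step

gru_to_lstm_ratio = gru_params(input_dim, hidden_units) / lstm_params(input_dim, hidden_units)
bigru_total = bidirectional(gru_params(input_dim, hidden_units))
```

Under these assumptions, a GRU layer has three quarters of the parameters of an LSTM layer of the same width, which is consistent with the shorter training time reported for the GRU-based models.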

Conclusion and Future Work
Short-term traffic flow prediction is one of the core components of intelligent transportation systems. To address the failure of existing methods to extract the multiple features of traffic flow, this paper proposes a multifeature fusion model consisting of a CNN-BiGRU module with an attention mechanism and two BiGRU modules with an attention mechanism. The parameters of the multifeature fusion model, including the number of neurons and the optimization algorithm, are obtained by experimental calibration. The experiments show that the CNN-BiGRU-attention module can effectively capture the local trend features and long-term dependent features of the traffic flow, and the two BiGRU-attention modules can effectively capture the daily and weekly periodic features of the traffic flow.

At the same time, the attention mechanism improves the prediction accuracy of the model by focusing on the importance of the features acquired in each module, and the feature fusion layer of the model allows the features extracted from each module to be fused to predict future traffic flow trends. Finally, extensive experimental results have shown that the predictive performance of the multifeature fusion model is superior to that of the baseline models for the same dataset.
In this work, we investigate traffic flow prediction using only cross-sectional traffic flows as the object of study. However, in real life, road network traffic flows usually exhibit extremely complex characteristics, and it is difficult for traditional CNN and BiGRU networks to capture short-term traffic flow features in complex road networks. Graph neural networks, such as spatiotemporal synchronous graph convolutional neural networks [76], offer an alternative approach to short-term traffic flow prediction in large, complex road networks, a problem that is difficult to solve with traditional combined CNN-GRU models; this will be explored in our future work. In addition, short-term traffic flows are often influenced by weather, traffic accidents, and major events, so short-term traffic flow prediction that considers such special events is also left for future research.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.