Traffic Safety Oriented Multi-Intersection Flow Prediction Based on Transformer and CNN



Introduction
Urban traffic is an important factor in urban functional layout, and it strongly affects the development of the social economy and the improvement of people's living standards. As consumers' purchasing power grows, the number of private cars keeps rising, and road density is increasing along with traffic safety concerns. Thus, in-depth research on traffic congestion and on effective measures to improve traffic efficiency has become a research focus.
According to the National Urban Car PARC Report, as of June 2019 there were 250 million cars in China, and 66 cities across the country had more than 1 million cars. As the number of cars on the road rises, morning and evening rush-hour overcrowding and minor traffic accidents become more and more common. Traffic congestion and accidents are very detrimental to urban development. They not only increase the time needed for people's travel but also adversely affect people's work efficiency and quality of life. Moreover, congestion leads to increased vehicle exhaust emissions and damages the environment.
In a complex traffic environment, traffic flow prediction can improve the utilization of urban road resources and reduce the likelihood of car accidents. It can also provide accurate traffic guidance information for urban traffic signal control [1] and for the design of data dissemination techniques based on the travel of traffic participants [2]. In the near future, 6G will be crucial for communication, resource allocation, and computation offloading [3,4]. It will also help to collect data for traffic prediction. By extracting characteristics from the collected traffic data, traffic prediction methods can improve road safety and the construction of intelligent transportation systems.
In the past decades, many scholars have put forward various traffic flow prediction models and achieved a series of theoretical and applied research results [5]. Most of these methods are based on statistical models or shallow machine learning methods that describe the evolution of traffic network flow, such as ARIMA [6], ANN [7], and SVR [8]. However, these methods are suitable only when the data are relatively stable and linear. Actual traffic flow is extremely variable and is affected by weather, date, traffic accidents, and other factors. As a result, traffic-related time series data typically exhibit nonlinear or rapidly changing characteristics and are interdependent. In addition, due to the complex traffic network and the increasing number of vehicles, the spatiotemporal traffic data collected through Internet of Vehicles technology are large in scale and high in dimension, and are exposed to many security threats [9]. Therefore, traditional methods struggle to mine the deep relationships within traffic spatiotemporal series data and face a major bottleneck when applied in practice. In recent years, deep learning has been proven able to effectively extract deep features and has made breakthroughs in image processing, speech recognition, natural language processing, and other fields [10]. Because of the complex nonlinear spatiotemporal correlation between different traffic time series, deep learning is a good choice for traffic flow forecasting tasks.
Because intersections are often interrelated, especially in cities with heavy traffic flow and short intersection spacing, congestion at one intersection may affect the traffic distribution and capacity of the whole region. At the same time, relieving congestion at a single intersection may aggravate congestion at adjacent intersections and does not necessarily improve overall traffic efficiency. Therefore, it is necessary to predict the traffic flow of multiple intersections. However, there are not many studies on multi-intersection traffic flow prediction, and most of their results are not particularly good.
This paper presents a traffic flow prediction method based on transformer and CNN, called CNNformer. In addition, we have made improvements to CNNformer, added the learning of the periodic characteristics of traffic flow, and proposed CNNformer+. The main contributions are as follows:

Related Work
With the development of intelligent transportation systems, many cameras, sensors, and other information collection devices have been deployed on the road. This equipment has accumulated a large amount of traffic time series data with spatial information, such as traffic flow, vehicle speed, and lane occupancy rate, providing a good data foundation for traffic flow prediction.

Shallow Machine Learning Methods.
For a long time, researchers have proposed a large number of traffic flow prediction models to improve the congestion analysis and management decision-making capabilities of intelligent transportation. Williams and Hoel used ARIMA [6] to model traffic flow. This method models the univariate traffic flow sequence as an autoregressive moving average process in order to predict the traffic flow. ... [18]. To learn the spatiotemporal feature representation, the traffic flow between various roads in the road network is modeled as a multichannel matrix, which is comparable to the RGB pixel matrix of an image. Furthermore, deep learning methods, such as the Deep Q-learning Network (DQN), can also be used to find optimal offloading strategies in intelligent connected vehicles [19].
The intersection is the most complex part of the road network because it involves a variety of different objects, such as vehicles and pedestrians. With the increase in traffic demand, traffic congestion at urban intersections is becoming more and more serious. Short-term traffic flow forecasting at intersections has therefore been the subject of numerous studies. For example, Qu et al. established a two-layer stacked model for intersection short-term traffic flow prediction by integrating k-nearest neighbor (KNN) and Elman neural network modeling methods [20]. Kim and Jeong proposed a collaborative traffic signal control method based on multi-intersection traffic flow prediction (TFP-CTSC) [21]. Li et al. proposed a new deep intersection spatiotemporal network (DISTN) for traffic flow prediction. Drawing on the spatial and temporal modeling capabilities of the convolutional neural network (CNN) and long short-term memory (LSTM), this deep learning method was applied to intersection traffic volume prediction [22]. Furthermore, digital twins have been used to facilitate the design, evaluation, and deployment of IoV-based systems [23,24]. However, the research is still at an initial stage.

Methodology
In traffic flow prediction, roadside cameras are usually used to count the number of passing cars. If multiple intersections in a certain area are considered, data from different cameras contain geolocation and time information. Therefore, we can regard traffic flow prediction as a spatiotemporal sequence problem; namely, we can use the time and space information contained in the data to predict the traffic flow of different intersections. The structure of the model is shown in Figure 1. The input of CNNformer+ contains the traffic flow data of three time windows: the data of the previous time window $(X_{t-H}, X_{t-H+1}, \ldots, X_t)$, the data of the same time window one week earlier $(X_{t-H-week}, X_{t-H-week+1}, \ldots, X_{t-week})$, and the data of the same time window one month earlier $(X_{t-H-month}, X_{t-H-month+1}, \ldots, X_{t-month})$. Each time window contains H time steps, and the traffic flow data of each time step can be described as a two-dimensional matrix. The three input time windows are processed separately: the H two-dimensional matrices of each time window are stacked and fed to the CNN. After the CNN extracts the spatial features of the data, the convolved data are fed into the transformer. After the transformer extracts the temporal characteristics of the data, the data of all time steps are output. Then, the outputs $(Z_{now}, Z_{week}, Z_{month})$ of the three time windows are stacked and passed to the average pooling layer. The final output of the model is the predicted traffic flow in the next time window.
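The selection of the three input windows can be sketched as index arithmetic over a single series sampled every 15 minutes (the statistical interval used later in the experiments). This is only an illustrative sketch: the function names are hypothetical, a "month" is assumed here to be 30 days, and the window follows the index range as written above ($t-H$ through $t$, inclusive).

```python
STEPS_PER_DAY = 24 * 60 // 15   # 96 samples per day at a 15-minute interval
WEEK = 7 * STEPS_PER_DAY        # offset of one week, in time steps
MONTH = 30 * STEPS_PER_DAY      # assumption: one "month" is taken as 30 days


def window(series, end, H):
    """Return the entries series[end - H] .. series[end], inclusive."""
    if end - H < 0 or end >= len(series):
        raise IndexError("window falls outside the series")
    return series[end - H:end + 1]


def build_inputs(series, t, H):
    """Collect the windows ending at t, t - WEEK and t - MONTH."""
    return {
        "now": window(series, t, H),
        "week": window(series, t - WEEK, H),
        "month": window(series, t - MONTH, H),
    }
```

In a full model each entry of `series` would be the N × 12 flow matrix of one time step rather than a scalar; the indexing is unchanged.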

Extracting Spatial Features Using CNN.
Because intersections are often interrelated, the upstream and downstream intersections may affect the traffic flow prediction of the target intersection. In this paper, CNN is used to extract the spatial features of associated intersections. The input data has the shape [H, N, D], where H stands for the number of time steps, N for the number of intersections, and D for the number of traffic flow directions at each intersection, with D equal to 12.
With the great success of convolutional neural networks in the field of image processing, other fields are also trying to use deep learning methods to solve practical problems. In traffic flow prediction, because region-based or station-based traffic flow can be organized into a two-dimensional matrix or a one-dimensional vector, the convolutional neural network is considered an effective way to mine the spatial characteristics of traffic volume data.
For instance, at time step t, the historical flow data of a given road network can be described as the following matrix:

$$X_t = \begin{bmatrix} x_t^{1\text{-}1} & x_t^{1\text{-}2} & \cdots & x_t^{1\text{-}12} \\ x_t^{2\text{-}1} & x_t^{2\text{-}2} & \cdots & x_t^{2\text{-}12} \\ \vdots & \vdots & \ddots & \vdots \\ x_t^{N\text{-}1} & x_t^{N\text{-}2} & \cdots & x_t^{N\text{-}12} \end{bmatrix}$$

For each element in the matrix, the superscript has the format (intersection number - traffic flow direction number), and the subscript represents the time step t. The nth row of the matrix represents the traffic flow of all traffic flow directions at the nth target intersection at time t. Each intersection has 12 traffic flow directions. Therefore, there are 12 columns of traffic flow data, and each column of the matrix represents the traffic volume in a certain traffic flow direction for intersections 1 through N.
When the data of a time step can be described as a matrix, it is natural to use the matrix as the input of a CNN. The convolution model in this paper is shown in Figure 2. The spatial features of associated intersections are extracted by using two-dimensional convolution layers with a convolution kernel size of (2, 2) and a padding size of (2, 1). After convolution, a ReLU layer and a Dropout layer are added.
The output of the Nth convolution layer at time t, denoted $X_t^N$, then passes through a residual connection. Finally, through the fully connected layer, it is transformed into a one-dimensional spatial feature vector $Y_t$. This vector is used as the input of the transformer network to capture the temporal correlation.
The output shape of the CNN is [H, M], where H represents the number of time steps and M represents the total number of traffic flow directions over all relevant intersections at each time step.
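As a quick check on the shapes involved, the sketch below applies the standard convolution output-size formula to one [N, D] time-step matrix with the kernel size (2, 2) and padding (2, 1) given above. The stride of 1 and the absence of dilation are assumptions, since they are not stated in the text, and the function names are illustrative.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Output length along one axis of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_output_shape(n_intersections, n_directions=12,
                      kernel=(2, 2), padding=(2, 1), stride=(1, 1)):
    """Shape of one feature map produced from an [N, D] input matrix."""
    rows = conv_output_size(n_intersections, kernel[0], padding[0], stride[0])
    cols = conv_output_size(n_directions, kernel[1], padding[1], stride[1])
    return rows, cols


def flattened_size(n_intersections, n_directions=12):
    """M: total number of traffic flow directions over all intersections."""
    return n_intersections * n_directions
```

With three intersections, M is 36, which is the size the fully connected layer maps each time step back to after the convolution has enlarged the feature map.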

Extracting Time Characteristics Using Transformer.
The task of predicting traffic flow is a typical time series prediction problem that uses historical observation data to forecast future traffic flow data. Since transformer is an excellent sequence model, this paper takes the output of the CNN as the input of the transformer and uses the transformer to extract the temporal characteristics of traffic flow.
Currently, the majority of traffic flow forecasting tasks use RNN and its derivatives, LSTM and GRU. RNN and its variants must process data sequentially during training. The calculation at time step t depends on the result at time t − 1, so parallel training is not possible. In addition, the encoding of the traffic flow by RNN and its variants is passed only to the next time step, which means that the encoding of the current time step strongly affects only the representation of the next time step, and its influence fades after a few time steps. Although gating mechanisms such as those in LSTM alleviate the problem of long-term dependence to some extent, LSTM still struggles with particularly long dependencies. The transformer model avoids recursion, allows parallel computing to reduce training time, and reduces the performance degradation caused by long-term dependencies. Compared with RNN and its variants, the transformer model has stronger structural flexibility and versatility and can capture a wider range of information relevance. In addition, in the NLP field, the transformer model processes sentences in a nonsequential manner: sentences are processed as a whole rather than word by word.
The transformer does not rely on past hidden states to capture dependencies on previous words but processes a sentence as a whole, so there is no risk of losing or forgetting past information. Based on these advantages, this paper attempts to apply the transformer to the task of traffic flow prediction.
The input of the transformer is a sequence of spatial feature vectors covering H time steps, expressed as $(Y_{t-H}, Y_{t-H+1}, \ldots, Y_t)$, where $Y_t$ is the spatial feature vector obtained from the flow data of time step t after n convolution layers, and $t-H$ to $t$ are the historical time steps. The network is trained to predict the traffic flow of all associated intersections in the next H time steps.

Figure 2: Convolution model structure.
Transformer is a seq2seq model. The encoder layer receives the input, and the decoder layer produces the output.

Encoder. The encoder layer of the transformer includes two sublayers:
(i) The first sublayer is multihead attention, which computes the self-attention of the input.
(ii) The second sublayer is a feed-forward layer, which is a simple fully connected network.
After each sublayer, a residual connection is applied, and the result of each sublayer is as follows:

$$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$

where Sublayer(x) represents the mapping applied by the sublayer to the input x. To facilitate the residual connections, the output dimensions of all sublayers and embedding layers are the same. The structure of the encoder layer is shown in Figure 3. The encoder input consists of the following three parts:

(i) Input embedding: In the original transformer model, the input of the model is a high-dimensional feature vector. The feature vector is obtained by converting the input text through a word embedding method such as Word2Vec [25] and is called an embedded vector. This paper uses a fully connected layer in place of the word embedding method to encode the input data. After the fully connected layer, the shape of the input data becomes [H, E], where H represents the number of input time steps and E represents the feature size of the input data.

(ii) Position encoding: The transformer adds an additional positional encoding vector to the input of the encoder layer. The dimension of this vector is the same as that of the embedded vector, and it is used to provide relative position information. This vector determines the position of the current time step in the time window, and the transformer can learn the position information of the time step through it. The formula of the position encoding is as follows:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where pos refers to the position of the current time step in the time window, i indexes the dimensions of the vector, and $d_{model}$ refers to the size of the input dimension. Sine encoding is used for the even dimension indices ($2i$), and cosine encoding is used for the odd dimension indices ($2i+1$).

(iii) Global time encoding: Based on the transformer model, this paper not only uses position encoding for local position embedding but also takes into account the usefulness of timestamp information in practical applications. The time encodings are extracted from the timestamps corresponding to the time series data. The global time encoding is built from five timestamp embeddings: $X_{mon}$ is the month embedding, $X_{dow}$ is the day-of-week embedding, $X_d$ is the day embedding, $X_h$ is the hour embedding, and $X_{min}$ is the minute embedding. These five vectors are combined and fed into a fully connected layer to generate a learnable embedding.
Finally, the model adds the three embedded vectors described above and sends the sum to the next layer as input.
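The sinusoidal position encoding described above can be sketched in a few lines of plain Python. This follows the standard formulation of the original transformer (sine on even dimension indices, cosine on odd ones); the function name and the list-of-lists layout are illustrative choices, not taken from the paper's implementation.

```python
import math


def positional_encoding(num_steps, d_model):
    """Return a num_steps x d_model table of sinusoidal position codes."""
    table = [[0.0] * d_model for _ in range(num_steps)]
    for pos in range(num_steps):
        for k in range(d_model):
            i = k // 2  # dimensions 2i and 2i + 1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # even dimension index -> sine, odd dimension index -> cosine
            table[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return table
```

Row `pos` of the table is the vector that would be added to the embedded input of the time step at position `pos` in the window.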
Multihead attention is equivalent to combining M self-attention operations. The specific process of self-attention is as follows:
(i) Self-attention uses the input embedded vector to calculate three new vectors. The dimension of each vector is the same as that of the embedded vector. These three vectors are named Query, Key, and Value, respectively. They are obtained by multiplying the embedded vector by a randomly initialized matrix. The dimension of the matrix is [64, E], where E represents the feature size of the input data.
(ii) Calculate the self-attention score, which determines how much attention is paid to the input data of other time steps when the model encodes the data of the time step at a given position. The score is calculated as the dot product of Query and Key.
(iii) Next, divide the dot product by a constant. The constant selected in this paper is 8, which is the square root of the first dimension of the matrix (64). Then, apply a Softmax to the result. The output is the correlation between the data of each time step and the data of the time step at the current position.
(iv) Finally, multiply the result by Value to obtain the self-attention output.

This method of determining the weight distribution of the values through the similarity between Query and Key is called scaled dot-product attention. The calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ represents the dimension of the Query, Key, and Value vectors. Multihead attention performs scaled dot-product attention M times: instead of one group of Q, K, and V matrices, M groups are initialized, and M matrices are output.
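Steps (ii) to (iv) can be illustrated with a small, dependency-free sketch of scaled dot-product attention for a single head. The Query, Key, and Value matrices are taken as given (lists of row vectors), so the learned projection matrices are outside the sketch, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)  # equals 8 when d_k is 64
    output = []
    for q in Q:
        # (ii) dot product of the query with every key, (iii) scaled
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # (iv) weighted sum of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output
```

When two keys are identical, the query weights them equally, so the output is the plain average of the corresponding value vectors.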

Security and Communication Networks
However, the feed-forward neural network cannot take multiple matrices as input. Therefore, the M matrices need to be reduced to one. Specifically, the M matrices are concatenated into one large matrix, which is then multiplied by a randomly initialized weight matrix to obtain the final matrix.
In the transformer, each sublayer is followed by a residual module and a layer normalization. There are many normalization methods, but the purpose of each is to normalize the input data so that the mean is 0 and the variance is 1. The data should be normalized before entering the activation function so that the input data do not fall in the saturation region of the activation function.
Unlike batch normalization, which calculates the mean and variance along the batch direction, layer normalization calculates the mean and variance on each sample. Therefore, layer normalization is usually used to normalize sequence models.
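The residual connection followed by per-sample normalization can be sketched as below. This is a simplified illustration: the learnable gain and bias that practical layer normalization layers usually include are omitted, and the small constant `eps` is an assumed value added for numerical stability.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one sample's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])
```

Because the statistics are computed over the features of a single sample, the result does not depend on the other samples in the batch.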

Decoder. The transformer decoder layer includes three sublayers:
(i) The first sublayer is masked multihead attention, which also computes the self-attention of its input. However, since future information cannot be known at generation time, it is necessary to mask future information. For a sequence at time step t, the decoding output should depend only on the outputs before t, not on those after t. Therefore, a mask operation is required.
(ii) The second sublayer is encoder-decoder attention. The output of the encoder layer and the output of the masked multihead attention sublayer are used for the attention calculation.
(iii) The third sublayer is a feed-forward layer, which is the same as in the encoder layer.
The structure of the decoder layer is shown in Figure 4. The traffic flow input to the decoder layer is composed of a portion of the historical data that is close to the predicted data and an empty vector. The length of the empty vector is the length of the data to be predicted. The same encoding technique as in the encoder layer is applied to the input traffic volume.
The masked multihead attention sublayer of the decoder layer needs a mask so that the decoder cannot see future information. Specifically, a matrix is generated whose upper-triangular entries (those above the diagonal, which correspond to future time steps) are all 0, and this matrix is applied to each sequence to hide the future positions.
The encoder-decoder attention sublayer of the decoder layer uses the output of the encoder to calculate the current decoded output. The difference between this part and self-attention lies in the three vectors Q, K, and V: Q comes from the decoder, while K and V are taken from the final output of the encoder layer. The attention itself is calculated in the same way as self-attention. Through this method, the decoder can capture the output information of the encoder.
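The masking used by the masked multihead attention sublayer can be sketched as follows. The mask has zeros above the diagonal, as described above; replacing the masked scores with negative infinity before the Softmax is a common convention and is an assumption here, as are the function names.

```python
def causal_mask(size):
    """mask[i][j] is 1 when step i may attend to step j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]


def apply_mask(scores, mask):
    """Replace masked attention scores with -inf so Softmax gives them zero weight."""
    return [[s if m else float("-inf") for s, m in zip(score_row, mask_row)]
            for score_row, mask_row in zip(scores, mask)]
```

Row i of the mask lets time step i attend only to itself and to earlier steps, which is what prevents the decoder from using future information.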

Learn the Periodicity of Traffic Flow Using Average Pooling.
When the decoder layer has been fully executed, the final outputs of the three time windows are $Z_{now}$, $Z_{week}$, and $Z_{month}$. Then, we stack these three vectors and feed them to the average pooling layer, whose output $\hat{q}_{t+1}$ represents the predicted flow data. As shown in Figure 5, average pooling combines feature points from different neighborhoods and averages their values to create new features. Compared with a fully connected layer, average pooling can greatly reduce the number of network parameters, thus reducing overfitting.
The final output of the average pooling layer is the predicted traffic volume of the next time window. The Gaussian error linear unit (GELU) is used as the activation function of the average pooling layer. It is a high-performance neural network activation function because its nonlinearity is the expected transformation of a stochastic regularizer, and the formula is as follows:

$$\mathrm{GELU}(x) = x\,\Phi(x)$$

where $\Phi(x)$ refers to the cumulative distribution function of the standard Gaussian distribution evaluated at x. GELU introduces the idea of stochastic regularization into the activation, which gives a probabilistic description of the neuron input and is more intuitive and natural.
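A minimal sketch of this final stage is given below. It averages the three window outputs element by element and then applies the exact GELU, $x\,\Phi(x)$, computed with the error function. The order of the two operations, the element-wise form of the pooling, and the function names are assumptions made for illustration; the text above does not fix them.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def average_windows(z_now, z_week, z_month):
    """Element-wise mean of the three stacked window outputs."""
    return [(a + b + c) / 3.0 for a, b, c in zip(z_now, z_week, z_month)]


def predict(z_now, z_week, z_month):
    """Average pooling over the three windows followed by GELU (assumed order)."""
    return [gelu(v) for v in average_windows(z_now, z_week, z_month)]
```

Note that this pooling step has no trainable parameters, which is the property the comparison with a fully connected layer refers to.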

Loss Function.
The loss function, also known as the error function, is used to measure how well the algorithm performs. In the loss formulation, α represents the learning rate, Loss(·) represents the loss function, $\hat{q}_{t+1}$ represents the predicted flow data, and $q_{t+1}$ represents the actual flow data. The loss function measures the error between the predicted traffic flow in the following time window and the actual traffic flow in that time window, that is, how close the predicted output value is to the actual value.

Optimization Algorithm.
The application of machine learning is a highly empirical process. It involves a large number of iterations, and many models need to be trained to find the right one. When training a neural network, we frequently use a large dataset, which makes training extremely slow. Therefore, an appropriate optimization algorithm can effectively speed up model training. Gradient descent is a method for optimizing the objective function, that is, for minimizing the loss function. It uses gradient information to approach the optimum by iteratively adjusting the parameters. It is one of the most widely used optimization algorithms in neural networks. This paper uses Adam as the optimization algorithm of the model, because it essentially combines the momentum and RMSprop algorithms and then corrects their bias. The momentum algorithm accumulates gradients in a way similar to momentum in physics, and the RMSprop algorithm makes convergence faster while making fluctuations smaller. Therefore, combining the two is expected to perform better. Adam uses the first-moment mean of the gradient and, like RMSprop, its second-moment mean to compute adaptive per-parameter learning rates. Specifically, the algorithm computes exponential moving averages of the gradient and of the squared gradient, using the hyperparameters beta1 and beta2 to control the decay rates of these moving averages. Because the moving averages are initialized at 0 and beta1 and beta2 are close to 1, the moment estimates are biased toward 0. This bias is removed by first computing the biased estimates and then the bias-corrected estimates.
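The update described above can be written out for a single scalar parameter as follows. The default values of `lr`, `beta1`, `beta2`, and `eps` are the commonly used Adam defaults and are assumptions here; the paper does not report the values it used.

```python
import math


def adam_step(theta, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment moving average
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment moving average
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step the raw averages are only a small fraction of the gradient (for example `m` is 0.1 times the gradient), and dividing by `1 - beta ** t` restores their scale, which is the bias correction discussed above.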

Simulation Experiment of Regional Traffic Flow Prediction Based on AnyLogic
AnyLogic is a professional virtual prototyping environment for designing complex systems with discrete, continuous, and mixed behaviors. Using AnyLogic, one can easily create a simulation model of the intended system and its surrounding environment, including its physical equipment and operators. The Road Traffic Library in AnyLogic allows users to model, simulate, and visualize vehicle traffic. The library supports detailed and efficient physical-level modeling of vehicle motion. AnyLogic can be applied to model the vehicles, roads, and lanes of highway traffic, street traffic, production site transportation, parking lots, or any other system.

Data Description.
In the experiment, AnyLogic is used to build a microscopic model of a regional road network that simulates actual road conditions and records intersection traffic flow data. The area is a real region composed of three associated intersections, and each intersection has 12 lanes, as shown in Figure 6. The simulation data cover three months of traffic flow. The statistical interval is 15 minutes; that is, the traffic flow data of all intersections are collected every 15 minutes. Each data point represents the number of vehicles passing in one traffic flow direction within 15 minutes. In the simulation, external factors such as the morning peak, weekends, and holidays are considered to enhance the randomness, making the simulation data closer to real data.

Data Preprocessing.
Before inputting the data into the model, it is necessary to standardize the data, that is, to scale the attributes of a sample to a specified range. The influence of attributes with different orders of magnitude needs to be eliminated because:
(i) The difference in orders of magnitude leads to the dominance of attributes with larger orders of magnitude;
(ii) The difference in orders of magnitude slows down the convergence of the iteration;
(iii) Algorithms that depend on sample distance are very sensitive to the order of magnitude of the data.
In this paper, min-max standardization, also known as normalization, is used as the data standardization method. Specifically, after the data x are shifted by the minimum value, they are scaled by the range (maximum value minus minimum value), so that the data fall within [0, 1]. After normalization, the range of the optimization process becomes smaller, the optimization process becomes smoother, and it is easier to converge correctly to the optimal solution. The calculation formula is as follows:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Evaluation Metrics.
This paper measures the prediction performance of the model using the mean square error (MSE) and the mean absolute error (MAE), in order to evaluate the algorithm more thoroughly.

The experiments were run on an Intel(R) Xeon(R) W-2133 CPU @ 3.60 GHz with 32 GB of memory, an NVIDIA GeForce GTX 1080 Ti GPU, and the Ubuntu operating system.
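The min-max normalization and the two evaluation metrics named above can be sketched with plain Python lists. The function names are illustrative, and mapping a constant series to zeros is an assumed convention for the degenerate case where the range is 0.

```python
def min_max_normalize(values):
    """Scale values linearly into [0, 1] using their own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # assumption: a constant series is mapped to all zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mse(predicted, actual):
    """Mean square error between two equally long sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def mae(predicted, actual):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

MSE penalizes large individual errors more heavily than MAE, which is one reason the two are often reported together.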

Simulation Results and Analysis.
The proposed CNNformer+ is compared with several baseline models, including CNN, LSTM, DISTN [22], CNNformer, transformer, and informer [26]. Table 1 compares the performance of the baseline models and CNNformer+ in the traffic flow prediction task at the associated intersections.
... prediction tasks in many fields, which proves that the improvement made by informer over transformer is effective. However, in this experiment, its forecast accuracy is lower than that of transformer, which might be because informer has difficulty capturing the details of traffic flow data.
(iv) Compared with transformer, CNNformer has higher prediction accuracy, thanks to CNN's ability to extract the spatial features of traffic flow data at associated intersections.
(v) The prediction accuracy of CNNformer+ is higher than that of CNNformer, which verifies that learning the periodic characteristics of traffic flow helps to improve the prediction accuracy.
(vi) The model proposed in this paper achieves the best results in the traffic flow prediction task, which shows that the model is superior to some of the most advanced traffic flow prediction methods in the literature.
It can be seen from Figure 7 that the dimension of the hidden layer inside the model also affects the performance to a certain extent. A larger hidden layer can play a positive role. However, when the number of hidden layer units is greater than 512, the model performance begins to decline.
Figure 8 shows the comparison between the real traffic volume and the traffic volume predicted by CNNformer+ at single time steps, i.e., 10 a.m., 2 p.m., and 5 p.m. Each time step includes 36 (12 × 3) traffic movements. As can be observed, the model successfully captures the changing trend of the actual traffic volume in the majority of traffic flow directions, where the predicted value is near the real value.
Figure 9 shows the comparison of the real traffic volume at 10 a.m. with the traffic volume predicted by informer and transformer. It can be seen from the marks in the figure that CNNformer+, informer, and transformer all have a large deviation when predicting the traffic volume of movement number 3. However, when predicting the traffic volume of movement numbers 23-28, CNNformer+ fits the real traffic volume better than informer and transformer, which reflects the superiority of the algorithm used in this model.
Having presented the overall performance of the proposed model, we now give the prediction accuracy for individual traffic movements. Table 3 provides the prediction accuracy for each movement at the second intersection. The MSE of the traffic movements from the east and west is better than that of the movements from the north and south. This is because intersection 2 is located in the middle of the three intersections. Since the volume of traffic leaving toward the north and south is lower than that leaving toward the east and west, its trends are more varied, which makes the traffic flow in these directions more difficult to predict.
Table 4 provides the prediction accuracy of each of the three intersections. It can be seen that intersection 2 has the lowest MSE. Since intersection 2 is located in the center of the main road, the flow data of this intersection are also related to the traffic conditions of intersection 1 and intersection 3. Intersection 1 and intersection 3 are located at the boundary of the main road. Each has only one upstream or downstream intersection and is therefore less affected. As a result, the change in traffic volume is more regular, reducing the difficulty of prediction.
Figure 10 shows the forecast results of traffic volume for different time step sizes. It can be seen that MSE and MAE also begin to decrease significantly with the increase of time step size. This is because the transformer requires a large

Figure 11 shows the influence of different input sequences on prediction accuracy. "now" means that the input uses only the traffic flow data of the previous time window $(X_{t-H}, X_{t-H+1}, \ldots, X_t)$. "now + week" means that the input contains the traffic flow data of the previous time window and the data of the same time window one week earlier $(X_{t-H-week}, X_{t-H-week+1}, \ldots, X_{t-week})$. "now + week + month" means that the input contains the traffic flow data of the previous time window, the data of the same time window one week earlier, and the data of the same time window one month earlier $(X_{t-H-month}, X_{t-H-month+1}, \ldots, X_{t-month})$. The findings demonstrate that the minimal MSE and MAE are reached by taking into account all three time windows.

Conclusion
Transformer has advantages in dealing with time series tasks. Many current research works build models for various sequence tasks on the transformer architecture and have achieved results beyond those of traditional models in many application fields. To tackle the problem of traffic-safety-oriented multi-intersection flow prediction, this research proposes a new architecture integrating CNN and transformer with the aim of improving accuracy, making it more suitable for the traffic flow prediction task of associated intersections. The comparative experiment with informer and other baseline models demonstrates the superiority of the new architecture.
In the research work of this paper, the following results have been achieved:
(i) A new intersection traffic flow prediction model, CNNformer+, is proposed. It treats the traffic flow data at associated intersections as a group of spatiotemporal sequences; using CNN to extract the spatial features of the data can significantly improve the prediction accuracy of the transformer model.
(ii) The average pooling layer successfully learns the periodicity of the traffic flow data, increasing the model's forecast accuracy. Experiments on the simulated network dataset demonstrate the superiority of the proposed method.

Experimental Setup.
This paper uses the Python 3.7 simulation environment and the deep learning framework PyTorch to build the model.

Figure 6: The road network in the simulation contains three intersections (Yinzhou district, Ningbo city, China).

Figure 7: Effect of the hidden layer's dimension on the effectiveness of the task of predicting traffic flow at related intersections. (a) MSE. (b) MAE.

Table 1: Comparison of simulation results.

Table 3: Volume prediction results for each intersection.

With the increase of time step size, the number of input samples of the model decreases, and so does the number of learned traffic volume features, which increases the difficulty of prediction. The results show that a small time step size should be selected as far as possible in traffic flow prediction.

Table 4: Volume prediction results for each movement of intersection 2.