Short-Term Traffic Prediction considering Spatial-Temporal Characteristics of Freeway Flow

This paper presents a short-term traffic prediction method that takes into consideration the historical data of the upstream points and the prediction point itself, together with their spatial-temporal characteristics. First, the Gaussian mixture model (GMM), based on the Kullback-Leibler divergence and the grey relation analysis coefficient calculated from the data of the corresponding period, is proposed. It selects the upstream points that have a great impact on the prediction point so as to reduce computation and increase accuracy in the subsequent prediction. Second, the hybrid model constructed from long short-term memory and the K-nearest neighbor algorithm (LSTM-KNN) using transformed grey wolf optimization is discussed. Parallel computing is used in this part to reduce complexity. Third, experiments are carried out on real data with different upstream points, time steps, and prediction model structures. The results show that GMM can improve the accuracy of multifactor models such as support vector machines, KNN, and multi-LSTM. Compared with other conventional models, the TGWO-LSTM-KNN prediction model has better accuracy and stability. Since the proposed method is able to export the prediction datasets of the upstream and prediction points simultaneously, it can be applied to collaborative management and also has good prospects for application in freeway networks.


Introduction
Intelligent transportation systems (ITS) have become an effective way to reduce pollution and improve the performance of freeways, and short-term traffic flow prediction is an important part of supporting the smart management and control of freeways. The trend of short-term traffic flow prediction is changing from parametric statistical models to nonparametric models and mixed models. Time-series methods were widely used in parametric statistical models, including exponential smoothing [1][2][3], moving average [4,5], and the autoregressive integrated moving average (ARIMA) model [6][7][8]. Kalman filtering was also used for traffic flow prediction, such as the adaptive Kalman filter [9][10][11], the hybrid dual Kalman filter [12], and the noise-identified Kalman filter [13]. With the rapid development of ITS and the improvement of data quality, more nonparametric prediction methods are used in traffic flow prediction. K-nearest neighbor (KNN) nonparametric regression, a nonlinear prediction method, was used to calculate the Euclidean distance to find the nearest neighbors for prediction [14]. The improved Bayesian combination model was proposed to increase the accuracy of prediction [15]. Support vector machines (SVM) were also used considering their weak sensitivity to outliers [16]. The combined algorithm based on wavelet packet analysis and least-squares support vector machines was used to resolve the uncertainty and randomness of the data [17]. Particle swarm optimization (PSO) and other optimization algorithms were applied to SVM because of the small model calculation and good prediction performance [18]. With the development of artificial intelligence (AI), deep learning models have been widely used in traffic prediction. Smith and Demetsky [19] used a backpropagation (BP) neural network for prediction. Optimization algorithms such as PSO and the genetic algorithm (GA) were also applied to BP, with obvious effect [20,21].
Recurrent neural networks (RNN) can realize long-term memory calculation and were used in prediction, but they had the problem of gradient explosion [22]. The long short-term memory (LSTM) network was proposed to solve this by using a forget gate [23,24]. It has been used not only in natural language processing [25], for example, language generation [26], text classification [27], and phoneme classification [28], but also in prediction fields such as short-term traffic flow prediction [29], housing load prediction [30], and pedestrian trajectory prediction [31]. Furthermore, improvements and combinations with other models have been proposed in many fields, from application to large-scale data problems [32] to the prediction of traffic flow, such as using GA to optimize the LSTM hyperparameters for better performance [33]. The comparison of typical machine learning models is shown in Table 1.
Deep learning models are widely used in traffic flow prediction, especially short-term prediction [41]. However, traffic flow has strong spatial-temporal characteristics on its time series [42,43]. More attention has been paid to this characteristic in recent research on short-term traffic flow prediction [44][45][46]. Luo et al. [40] proposed a spatial-temporal traffic flow prediction model with KNN and LSTM to screen highly correlated upstream points and produce the prediction. Ma et al. put forward a method to select input data for daily traffic flow forecasting through contextual mining and intraday pattern recognition [47] and produced the daily traffic flow forecast with CNN and LSTM [48]. Supervised learning was used to mine the relationship between the factors of historical data and the current traffic flow to train the predictor in advance so as to reduce the prediction time [49]. In addition, the match-then-predict method [50] and the fuzzy hybrid framework [51] with dynamic weights obtained by mining spatial-temporal correlations were both proposed. Attention mechanisms were also combined with LSTM to increase the accuracy of prediction [52]. These methods, which combine various factors using the attention mechanism, can reasonably allocate limited resources, increase efficiency, and reduce computation.
In this paper, we propose a short-term traffic flow prediction model considering the spatial-temporal characteristics, using LSTM and KNN under the concept of the attention mechanism. First, the Gaussian mixture model (GMM) is used to select the upstream detection points for the prediction. Two parameters are used for the classification: one is the Kullback-Leibler divergence (KL), also known as the relative entropy, which reflects the difference in the distribution of two datasets through approximate calculations, especially for large-sample traffic data. The other is the grey relation analysis (GRA) coefficient, which reflects the correlation between two groups of normalized data after similarity analysis. Second, the hybrid model of LSTM and KNN is proposed to produce the prediction using the selected data. LSTM is used to predict the traffic flow of the upstream points as the training dataset of KNN. To solve the problem of time lag, the input time of the upstream data is shifted in the model according to the spatial distance between the input point and the prediction point and the average speed of the traffic flow. Moreover, the transformed grey wolf optimizer (TGWO) is used to optimize key parameters, and Savitzky-Golay (SG) filter smoothing is used to reduce the noise in the model to improve performance.
The proposed TGWO-LSTM-KNN prediction model gives greater consideration to the spatial-temporal characteristics of freeway traffic flow to improve the accuracy of prediction, and reduces the complexity of computation by selecting and preprocessing the input data. The rest of this paper is organized as follows. Section 2 introduces the methodology of the proposed model. Section 3 carries out the experiments and analysis of the proposed model with real-world traffic flow data. Section 4 presents the conclusions and the prospects of the research. The abbreviations used in the rest of the paper are listed in Table 2.

Framework.
This paper proposes TGWO-LSTM-KNN with the GMM classification model, which includes two parts: data preparation and prediction. GMM is used to choose the input data in the data preparation part, considering the spatial-temporal characteristics of freeway traffic flow, while the prediction part is composed of the LSTM parallel computing module, the KNN module, and the TGWO module. The framework of the proposed model is shown in Figure 1.

Data Preparation.
In the freeway network, the traffic flow of the upstream points correlates with the prediction point flow, which is considered as spatial correlation. Moreover, the traffic flow of the prediction point changes over time both inter- and intraday, and each specific period may have different patterns, such as the morning peak hour and the off-peak hour. Therefore, the time series is divided into different parts according to the flow patterns, which helps to improve the accuracy of prediction. The temporal characteristic of the prediction point flow is observed through a flow chart; then the midpoints of adjacent extreme points are set to complete the time-series division.
Related upstream sections are analyzed and selected using GMM binary classification. In this paper, two parameters are used as the classification criteria. One is the KL divergence, a commonly used measure in information science to quantify the difference between two datasets; in a large-sample traffic dataset with a complex distribution, the difference can be reflected simply and quickly. The other is the GRA coefficient, which can analyze the linear similarity between two datasets with a small amount of data. These two parameters reflect well the correlation between the upstream sections and the predicted section. The steps of the classification part are as follows:
Step 1: use the time step and speed to determine the spatial range, and note the upstream points within the scope as O1 ∼ Om.
Step 2: divide the day into T1 ∼ Tη.
Step 3: construct dataset. Divide the dataset into working days and nonworking days. Change the input time of upstream data to meet the time lag of the prediction point considering the distance and travel speed.
Step 4: calculate the KL divergence and the GRA coefficient in T i of working days and nonworking days.
Step 5: input the KL divergence and the GRA coefficient into GMM for binary classification. O1 ∼ Om are divided into two groups in each Ti. The group of points with the KL divergence close to 0 and the GRA coefficient close to 1 is used as the strongly related sections of the prediction point for the next prediction.
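The selection in Step 5 can be sketched with scikit-learn's GaussianMixture; the feature values below are illustrative, not taken from the paper's data:

```python
# Sketch of the GMM-based upstream-point selection (Step 5), assuming
# scikit-learn is available; feature rows are hypothetical examples.
import numpy as np
from sklearn.mixture import GaussianMixture

# One row per upstream point: [KL divergence to D0, GRA coefficient].
features = np.array([
    [0.05, 0.92],   # O1: low KL, high GRA -> strongly related
    [0.04, 0.95],   # O2
    [0.80, 0.41],   # O3: high KL, low GRA -> weakly related
    [0.75, 0.38],   # O4
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
labels = gmm.predict(features)

# Keep the cluster whose mean is closest to the ideal point (KL = 0, GRA = 1).
ideal = np.array([0.0, 1.0])
best = np.argmin(np.linalg.norm(gmm.means_ - ideal, axis=1))
selected = [i for i, lab in enumerate(labels) if lab == best]
print(selected)   # indices of the strongly related upstream points
```

Because GMM labels clusters arbitrarily, the code picks the component whose mean is nearest to (KL = 0, GRA = 1) rather than relying on a fixed label.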

Prediction.
The prediction part consists of three modules: the LSTM module, the KNN module, and the TGWO module. KNN is selected as the bottom of the model considering the spatial features of freeway flow, with the advantages of fast calculation speed and no lag. LSTM is used to predict the short-term traffic flow of the upstream sections, and then the prediction results of the upstream sections are put into KNN to predict the prediction point traffic flow. Because the relationships among upstream points are ignored in the model, multithreaded LSTM parallel computing (LSTMs) is used to reduce the time consumption of prediction. Also, to improve the performance of LSTM-KNN, TGWO is used to optimize the parameters of LSTM and KNN.
Step 1: use TGWO to optimize the steps and epochs in LSTM and the K value in KNN.
Step 2: multithreaded LSTM parallel computing is used to reduce calculation by ignoring the relationships among upstream points. Each Oi is input into the corresponding LSTM module, and then the output set P1 ∼ Pm and D0 together form a new dataset.
Step 3: input the dataset into the KNN module to predict the traffic flow and output D0^p.
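The three-step pipeline above can be sketched as follows; `predict_point` is a hypothetical stand-in for a trained per-point LSTM, and the flow values are illustrative:

```python
# A minimal sketch of Step 2's multithreaded parallel structure: each
# upstream point gets its own predictor thread, and the outputs P_i are
# joined with D_0 into the input set for the KNN module.
from concurrent.futures import ThreadPoolExecutor

def predict_point(series):
    # Placeholder for an LSTM forecast of one upstream point:
    # here, a naive "last value" forecast stands in.
    return series[-1]

upstream = {                       # O_1 ... O_m historical flows (toy data)
    "O1": [110, 118, 125],
    "O2": [90, 95, 99],
}
d0_history = [130, 140, 150]       # D_0, the prediction point

with ThreadPoolExecutor() as pool:
    preds = dict(zip(upstream, pool.map(predict_point, upstream.values())))

knn_input = {**preds, "D0": d0_history[-1]}
print(knn_input)   # {'O1': 125, 'O2': 99, 'D0': 150}
```

The thread pool mirrors the paper's point that per-point LSTMs are independent, so their forecasts can be computed concurrently.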

Data Preparation.
There are three steps in data preparation: determination of the spatial scope, time-series division, and GMM classification.

Determination of Spatial Scope.
Since the time step of short-term traffic flow prediction (Tstep) is usually less than one hour and the highway speed (V) is limited, the radius of the spatial scope can be calculated as R = V × Tstep. (1) The accesses and ramps within this radius of the prediction point are selected as upstream points.

SG Calculation Method.
SG includes two parameters: the window length n and the order number k. As n increases, the deviation between the processed data and the real data increases, and so does the smoothness. As k increases, the deviation between the processed data and the real data decreases, and so does the smoothness. According to the characteristics of highway traffic flow and existing research, n and k are chosen as 31 and 1, respectively, in this paper.
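Assuming a SciPy environment, the smoothing with the paper's parameters (n = 31, k = 1) can be sketched as follows; the flow series is synthetic:

```python
# Smoothing a day of 5-minute flow counts with the Savitzky-Golay
# parameters used in this paper (window length n = 31, order k = 1).
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.arange(288)                              # 288 five-minute intervals
flow = 100 + 50 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 8, 288)

smoothed = savgol_filter(flow, window_length=31, polyorder=1)
print(smoothed.shape)   # (288,)
```

The smoothed series keeps the daily flow pattern while suppressing the detector noise, which makes the extreme points used for time-series division easier to locate.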
Let x denote the length of the data. Select the window length n and the order number k. The data in a window are fitted with a polynomial of order k by the least-squares method: y(i) = a0 + a1·i + ... + ak·i^k. (2) This yields n equations in the k + 1 unknown coefficients. If n > k, the over-determined system has a least-squares solution A = (X^T X)^(-1) X^T Y, (3) where A is the least-squares fitting solution of each window, and the smoothed value Om′ is Om′ = X·A. (4)
GMM Classification.
GMM binary classification is used to judge the correlation between the upstream and predicted traffic flow, while EM is used to obtain the maximum likelihood estimation of GMM [55]. In this paper, the KL divergence and the GRA coefficient are the two parameters used for the binary classification.
(1) KL Divergence. KL divergence [56], which is also known as relative entropy, is broadly used as the measurement of the dissimilarity between two probabilistic models [57].
The closer the KL divergence is to 0, the more similar the two distributions are.
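One way to estimate the KL divergence between two flow datasets is to bin them into discrete distributions; the helper below is an illustrative sketch, not the paper's exact computation:

```python
# Estimating KL divergence between two traffic-flow samples by binning
# them into discrete distributions; scipy.stats.entropy(p, q) computes
# sum(p * log(p / q)) after normalizing p and q.
import numpy as np
from scipy.stats import entropy

def kl_divergence(a, b, bins=20):
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    eps = 1e-12                      # avoid log(0) and division by zero
    return entropy(p + eps, q + eps)

rng = np.random.default_rng(1)
x = rng.normal(100, 10, 5000)
y = rng.normal(100, 10, 5000)        # same distribution -> KL near 0
z = rng.normal(160, 10, 5000)        # shifted distribution -> larger KL
print(kl_divergence(x, y) < kl_divergence(x, z))   # True
```

A small epsilon keeps empty histogram bins from producing infinite divergence, which matters for large-sample traffic data with long tails.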
(2) GRA Coefficient. GRA is a method to judge the similarity of different datasets. Compared with the traditional Pearson correlation [58], it can use a smaller amount of data to reflect the linear similarity between traffic flows [59]. The steps of GRA are as follows.
Since there is little difference in the magnitude of the traffic flow at the same point, the data are initialized by dividing by the initial values of the flows, oi(1) and d0(1): oi′(t) = oi(t)/oi(1) and d0′(t) = d0(t)/d0(1). The grey relational coefficient at time t is γi(t) = (min Δ + ξ·max Δ)/(Δi(t) + ξ·max Δ), where Δi(t) = |oi′(t) − d0′(t)|, min Δ and max Δ are the minimum and maximum of Δi(t), and ξ denotes the coefficient to control the degree of differentiation, which is generally 0.5 [58]. The mean value of γi(t) over t is defined as the GRA coefficient of D0 and Oi.
(3) GMM Classification. GMM classifies by calculating probability. The two-dimensional Gaussian mixture model is p(x) = ϑ1·R(x|μ1, Σ1) + ϑ2·R(x|μ2, Σ2), where μk denotes the expectation, Σk denotes the covariance, n denotes the data dimension, R(x|μk, Σk) denotes the kth component of the hybrid model, and ϑk denotes the mixture coefficient, with the ϑk summing to 1. The EM algorithm is then used to estimate ϑ1, μ1, Σ1, ϑ2, μ2, Σ2. The EM-GMM pseudocode is given in Algorithm 1.
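The GRA steps above (initial-value normalization, relational coefficient with ξ = 0.5, then averaging) can be sketched as follows; the flow series are toy data:

```python
# A sketch of the GRA coefficient computation: initial-value
# normalization, grey relational coefficient with distinguishing
# coefficient xi = 0.5, averaged over time.
import numpy as np

def gra_coefficient(o_i, d_0, xi=0.5):
    o = np.asarray(o_i, dtype=float)
    d = np.asarray(d_0, dtype=float)
    o = o / o[0]                      # divide by the initial value o_i(1)
    d = d / d[0]                      # divide by the initial value d_0(1)
    delta = np.abs(o - d)             # pointwise difference of normalized flows
    if delta.max() == 0.0:            # identical shapes -> perfect relation
        return 1.0
    coef = (delta.min() + xi * delta.max()) / (delta + xi * delta.max())
    return float(coef.mean())         # GRA coefficient of D_0 and O_i

d0 = [100, 110, 120, 130]
o1 = [50, 55, 60, 65]                 # proportional to d0 -> coefficient 1
o2 = [50, 70, 40, 90]                 # dissimilar shape  -> lower coefficient
print(gra_coefficient(o1, d0), gra_coefficient(o2, d0))
```

Because the series are divided by their initial values first, a series exactly proportional to the prediction point's flow scores 1 regardless of its magnitude.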

Journal of Advanced Transportation
The steps and epochs in LSTM are the parameters with great influence [25], and the coefficient K in the KNN module is also a key parameter. The TGWO module is then used to optimize these parameters.

LSTMs Module.
LSTM is a special RNN [24] with a forget gate. The sigmoid function is used to prevent gradient explosion and vanishing. The traffic flow data of the upstream points are trained in different LSTM threads in parallel, which is defined as LSTMs. The data input time is also shifted to reduce the lag of the module, which makes LSTM more accurate. The memory unit of the module is shown in Figure 5. The calculation processes of the memory unit om(n) and its parameters (see Table 3) are as follows:
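Since the memory-unit equations refer to Figure 5 and Table 3, the sketch below uses the standard LSTM formulation (forget, input, and output gates); the weights are random placeholders:

```python
# One step of a standard LSTM memory unit: forget, input, and output
# gates with sigmoid activations and a tanh cell-state update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W: (4*n_unit, n_unit + n_in) stacked gate weights; b: (4*n_unit,)
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n])            # forget gate
    i = sigmoid(z[n:2 * n])        # input gate
    g = np.tanh(z[2 * n:3 * n])    # candidate cell state
    o = sigmoid(z[3 * n:4 * n])    # output gate
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_unit = 1, 4
W = rng.normal(0, 0.1, (4 * n_unit, n_unit + n_in))
b = np.zeros(4 * n_unit)
h, c = lstm_step(np.array([0.5]), np.zeros(n_unit), np.zeros(n_unit), W, b)
print(h.shape, c.shape)   # (4,) (4,)
```

The forget gate f controls how much of the previous cell state survives, which is what lets LSTM retain long-term traffic patterns without the gradients exploding.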

KNN Module.
In this paper, the KNN model is used to predict D0^p by using the data of O1 ∼ Om, P1 ∼ Pm, and D0. This method not only has high efficiency and low complexity but also meets the needs of multiple factors. The Euclidean distance is calculated as follows [14]: where dn denotes the Euclidean distance between Pi at the current time and the O(x) vector at time x, Pi denotes the current traffic flow vector of the different upstream points, wi(xn) denotes the weight, and d0^p denotes the prediction. The LSTM-KNN pseudocode is shown in Algorithm 2. The number of units in the hidden layer is n_unit, and the data dimension D of the LSTM in each thread is 1, so the time complexity is 4 × (n_unit² + 2·n_unit). On account of the parallel calculation structure, the total time does not change much as the number of threads increases. Compared with the original complexity 4 × (n_unit² + n_unit × D + n_unit), this reduces a lot of computation and improves efficiency. The time complexity of KNN is O(n), which is related only to the amount of data, so the calculation speed is fast.
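A minimal sketch of the KNN step described above follows: Euclidean distance from the current upstream vector to historical ones, then an inverse-distance-weighted average; the weighting scheme and data are illustrative:

```python
# KNN prediction of D_0 from upstream-flow vectors: find the K
# historical moments closest (Euclidean distance) to the current
# vector, then take an inverse-distance-weighted average of the
# matching historical D_0 values.
import numpy as np

def knn_predict(history_X, history_y, current, k=2):
    d = np.linalg.norm(history_X - current, axis=1)   # Euclidean distances
    idx = np.argsort(d)[:k]                           # K nearest neighbors
    w = 1.0 / (d[idx] + 1e-9)                         # inverse-distance weights
    return float(np.dot(w, history_y[idx]) / w.sum())

# Rows: historical upstream vectors [P1, P2]; y: matching D_0 flows.
X = np.array([[100.0, 90.0], [120.0, 95.0], [200.0, 180.0]])
y = np.array([130.0, 150.0, 260.0])
current = np.array([110.0, 92.0])

print(knn_predict(X, y, current, k=2))   # between 130 and 150
```

Because the prediction is an average of observed neighbor values, it cannot drift far outside the historical range, which is the "avoid large deviation" robustness noted below.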
LSTM has good robustness in traffic prediction [60] and also improves performance through a reasonable time lag [61] or a forgetting layer. KNN can directly avoid large deviations because it produces the prediction from the closest Euclidean distances [62]. These two positive aspects of robustness help the proposed method gain more adaptability to data fluctuations and environmental change.

TGWO Module.
The grey wolf optimizer was put forward by Mirjalili et al. in 2014. Grey wolf packs are divided into four grades according to their social hierarchy. Each wolf represents a candidate solution: the optimal solution is αT, the suboptimal solution is βT, the third optimal solution is δT, and the rest are ωT. In each iteration, the positions are updated according to the three optimum solutions αT, βT, and δT [63]. By using the improved adaptive convergence factor [64], the extremum can be found quickly when the step size is large in the global search, and missing the extremum can be avoided when the step size becomes smaller in the local search. The weighted step-size formula [64] adds a weight-decreasing strategy, which reduces unnecessary iterations and improves efficiency. The calculation method is as follows.
(1) Initialize the Population. The upper bound Ub and the lower bound Lb are defined, respectively. The number of wolves is N, and the number of dimensions is S. M(N×S) denotes an N × S two-dimensional matrix, which is the search field.
There are 2m + 1 key parameters to be optimized in each element of the field array: the steps and epochs in the LSTMs module and the K value in the KNN module. Integers are generated randomly within the respective upper and lower bounds to form an element of the field array: Mi = [step1, step2, . . ., stepm, epochs1, epochs2, . . ., epochsm, K]. (14) Each parameter has a different Ub and Lb; two vectors are set up to record the bounds.
(2) Calculate Fitness. Input the element corresponding to each wolf into LSTM-KNN and compare the errors. Define the three best solutions as αT, βT, and δT, respectively.

(3) Update location with a T , A T , C T :
where D denotes the distance between the grey wolves and their prey, t denotes the current iteration number, WP(t) denotes the position of the grey wolf at iteration t, W(t) denotes the position of the prey at iteration t, A and C denote the coefficient vectors, r1 and r2 denote random coefficients between 0 and 1 (generally 0.5), a denotes the convergence factor, φ denotes the inertia weight, φmax denotes the maximum inertia weight (generally 0.9), and φmin denotes the minimum inertia weight (generally 0.4).
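A sketch of one position update per wolf using the quantities defined above; since the exact transformed update of [64] is not fully specified here, this follows the canonical GWO form with a linearly decreasing inertia weight:

```python
# One GWO-style position update for a single wolf: convergence factor
# a decreasing from 2 to 0, coefficients A = 2*a*r1 - a and C = 2*r2,
# moves toward the three leaders, scaled by a decreasing inertia
# weight phi in [phi_min, phi_max]. Illustrative, not the exact TGWO.
import numpy as np

def gwo_step(wolf, alpha, beta, delta, t, t_max,
             phi_max=0.9, phi_min=0.4, rng=np.random.default_rng(0)):
    a = 2.0 * (1.0 - t / t_max)                       # convergence factor
    phi = phi_max - (phi_max - phi_min) * t / t_max   # inertia weight
    candidates = []
    for leader in (alpha, beta, delta):
        r1, r2 = rng.random(wolf.shape), rng.random(wolf.shape)
        A, C = 2 * a * r1 - a, 2 * r2
        D = np.abs(C * leader - wolf)                 # distance to the leader
        candidates.append(leader - A * D)
    return phi * np.mean(candidates, axis=0)          # weighted average move

wolf = np.array([3.0, 100.0, 16.0])                   # [step, epochs, K]
alpha = np.array([5.0, 120.0, 25.0])
beta = np.array([6.0, 100.0, 25.0])
delta = np.array([5.0, 110.0, 20.0])
new_pos = gwo_step(wolf, alpha, beta, delta, t=10, t_max=50)
print(new_pos.shape)   # (3,)
```

For the hyperparameter search here, the continuous positions would be rounded back to integers within [Lb, Ub] before evaluating LSTM-KNN.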

Experimental Data.
Whitemud Drive is an in-city highway across Edmonton, Alberta, Canada. It is 28 kilometers long with a basic speed limit of 80 kilometers per hour. As a test road, Whitemud Drive is equipped with seven traffic video cameras and seven loop detectors (VDS1017, VDS1037, VDS1034, VDS1031, VDS1029, VDS1027, and VDS1019) from west to east on the main road and gate road to observe the vehicle flow, the vehicle speed, and the vehicle density. In this paper, data of 15 working days are used as historical data for experiments.

GMM Selection Test.
VDS1019 is set as the prediction point D0, and the change of traffic flow within one working day is plotted at 5-min intervals (see Figure 6). To better carry out the time-division work, the data are smoothed by SG and the curve is reconstructed (see Figure 7). The extreme values are found, the midpoints are set, and the time series is divided (see Figure 8). The time-division results are shown in Table 4.

Reconstructing the Dataset by Time-Division.
VDS1017, VDS1037, VDS1034, VDS1031, VDS1029, and VDS1027 are recorded as O1 ∼ O6. Their historical data on working days are divided into parts according to T1 ∼ T4, and the data in the same time part are put into the same column. Since the road is 28 km long and the speed limit is 80 km/h, 60 km/h is chosen as the test speed, so it takes about 30 mins to travel from VDS1017 to VDS1019. Considering the continuity of the road network, a vehicle passes one point roughly every 5 mins, so O2 ∼ O6 are delayed by 5-25 mins for input, and the prediction point D0 is delayed by 30 mins for input. In this way, the data of each day are input with delays to form a dataset. The KL divergence and the GRA coefficient are then calculated (see Table 5).
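The delayed-input alignment described above can be sketched with simple array shifts; the series and lag values are toy data (one index = one 5-min interval):

```python
# Aligning upstream series to the prediction point by time lag: each
# successive detector is delayed by a further interval, so its series
# is shifted before forming the training columns.
import numpy as np

flows = {                             # one value per 5-min interval (toy data)
    "O1": np.arange(100, 112),        # farthest upstream detector
    "O2": np.arange(200, 212),
    "D0": np.arange(300, 312),        # prediction point
}
lags = {"O1": 0, "O2": 1, "D0": 6}    # delay in intervals relative to O1

# Trim all series to a common length so each row pairs O1 at time t
# with O2 at t + 1 interval and D0 at t + 6 intervals.
length = len(flows["O1"]) - max(lags.values())
dataset = {name: flows[name][lag:lag + length] for name, lag in lags.items()}
print({k: v.tolist() for k, v in dataset.items()})
```

Each aligned row then describes one "platoon" of traffic as it moves downstream, which is what makes the KL/GRA comparison between upstream and prediction-point flows meaningful.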
Input the result table into the GMM module for classification.
Taking T2 as an example, the GMM classification results are shown in Figure 9.
The classification results fall into two types. The closer the GRA coefficient is to 1, the better the correlation; the closer the KL divergence is to 0, the more similar the distribution. So the yellow-marked points are chosen as the input points (see Figure 9). The final classification results of the four datasets are given in Table 6. The upstream points selected in T2 and T3 are the same, so T2 and T3 can be regarded as one time part, 7:00-13:00, and the dataset T2+3 can be reconstructed. The following experiments use T2+3.

Comparative Experiments at Different Upstream Points.
Different upstream points are selected for model prediction and mean absolute percentage error (MAPE) comparison so as to verify the effect after classification. Results are shown in Table 7.
In this dataset, the abandoned upstream points differ little from the selected points (see Table 5), so the accuracy improvement is not obvious, which is reasonable. In datasets with larger differences, the effect of the GMM classification would be more evident.

LSTM-KNN Structural Test.
Considering the operation time and efficiency, the following structures are constructed for testing. Since the training data cover 15 days of traffic flow, the batch size is set to 15 and the data are divided into 15 groups, which ignores the relationship between groups and reduces the risk of overfitting; if the batch size is too small, it is not conducive to training and overfitting occurs easily. The test step is 3, the number of epochs is 100, and the comparison results are shown in Table 8.
The results show that increasing the number of LSTM layers can improve the prediction accuracy, but it adds a lot of computing time and overfits easily. Adding the forgetting (dropout) layer with a rate of 0.2 reduces the accuracy. In the single-layer structure, there is only a small difference between 256 units and 128 units; therefore, the single-layer 128-unit structure is selected for the following LSTM predictions.

LSTM-KNN Time Steps Test.
The model is tested under different time steps (5 mins, 10 mins, 15 mins, and 30 mins). The results show that the model has good accuracy and stability (see Figure 10). The overall prediction accuracy shows a downward trend as the time step grows, and the absolute error shows an upward trend. Notably, when the time step is 10 mins, the trend of accuracy changes and the error is lower than at 15 mins. In general, even with different time steps, the model still has good prediction accuracy.

TGWO Optimization.
When the MAPE reaches 10.04 (green mark in Figure 11), the optimal solution is Mbest = [5, 6, 5, 6, 120, 100, 120, 100, 25]: the steps of O2, O5 and O4, O6 are 5 and 6, the training epochs are 120 and 100, and K is 25.

Comparison of TGWO-LSTM-KNN and Other Models.
The comparison models are set as follows. The step of LSTM-KNN is 3, K is 16, and the number of epochs is 100. The double-hidden-layer LSTM has 256 units in the first layer and 128 units in the second layer, with a step of 3 and 100 epochs. SVM is fitted with three kernels: linear, poly, and RBF. The BP neural network is a three-layer structure with two fully connected layers and a forgetting (dropout) layer (rate 0.2) in between, trained for 100 epochs. Results are shown in Figure 12.
The accuracy of LSTM-KNN reaches the level of the popular models. The accuracy of TGWO-LSTM-KNN is improved by 15.27% compared with single-LSTM, 9.47% compared with BP, and 43.12% compared with poly-SVM (see Table 9). Besides, the main advantage of the hybrid model is not its accuracy but its ability to output the prediction set of the upstream points for collaborative management.

Conclusions
In this paper, the TGWO-LSTM-KNN prediction model with GMM classification, which considers spatial-temporal characteristics under the concept of the attention mechanism, is proposed. The time series is divided into parts by using the temporal characteristic of the prediction point, and GMM with the KL divergence and the GRA coefficient is used for further classification.
The upstream points with a small difference in distribution and high linear similarity are selected to increase the accuracy and reduce the complexity. Then the hybrid model TGWO-LSTM-KNN is used for training and prediction. Parallel computing is used in the LSTM module to improve efficiency. GMM, as an unsupervised model, classifies very flexibly and can be applied to multifactor models to reduce complexity. KNN, as the stage after LSTM, fully combines the upstream-point data and the prediction-point data for prediction. Compared with SVM, KNN has the characteristics of fast speed and greater data-processing ability, which makes it more suitable for the multiple factors and complex data of freeways. The LSTMs module can ignore the relationships between upstream points to perform parallel computation, which reduces the operation time and makes the model more practical. TGWO is less likely to fall into a local optimum and also has fast speed and good performance. To sum up, TGWO-LSTM-KNN with GMM classification can be used on real freeways with complex, multifactor data, with high accuracy, fast calculation speed, and strong adaptability, and can be applied to real freeways to achieve the purpose of collaborative management.

Data Availability
The data used to support the findings of this study are openly available in the OpenITS platform for noncommercial purposes only at https://www.openits.cn/openData1/700.jhtml.

Conflicts of Interest
The authors declare that they have no conflicts of interest.