Traffic Status Prediction of Arterial Roads Based on the Deep Recurrent Q-Learning



Introduction
With the development of urbanization, the contradiction between transportation infrastructure and the growing vehicle population has become prominent, and traffic congestion has grown more serious, inevitably leading to increased travel time, intensified environmental pollution, and economic loss [1]. Prevention is the first way to control traffic congestion. Based on the existing traffic states, the changing trend over a short horizon is predicted, and an information platform is then used to issue an early warning that diverts traffic to avoid or ease congestion [2][3][4]. Therefore, how to establish a long-term model for the timely warning of traffic congestion is the research focus of urban intelligent transportation system optimization [5][6][7].
A variety of methods, including time series, machine learning, and artificial neural networks, have been proposed for traffic congestion prediction. Since the time-series characteristics of traffic flow data were discovered [8], some scholars used autoregressive integrated moving average models [9] to predict the traffic flow on expressways [10,11]. Because the temporal distribution of traffic flow data is interrelated, other scholars used nonparametric regression methods to build macroscopic traffic models and found that the prediction results are better than those of time-series algorithms [12][13][14]. However, these methods based on statistics and traffic models require a large amount of historical data and rest on many assumptions, so they are difficult to apply to nonlinear traffic flow [15][16][17][18].
In recent years, machine learning algorithms such as the back propagation neural network [19,20] have gradually been applied to traffic prediction, with the advantage of handling nonlinearity. Because of the long training time of the back propagation neural network and its tendency to fall into local optima, some scholars have instead used the Support Vector Machine (SVM) [21][22][23] and K-Nearest Neighbor (KNN) [24][25][26] to predict the traffic status. Moreover, some scholars found that the time series of short-term traffic flow has chaotic characteristics. To deal with this issue, methods such as combined support vector machines [27] and phase space reconstruction [28] have been proposed and achieve better results. However, most of these machine learning-based methods lack the robustness to handle massive data, so the resulting models generally lack long-term effectiveness and scalability [29][30][31].
Faced with the large volume of traffic flow data, scholars have gradually turned to deep learning, a class of algorithms that simulates the multilayered perceptual structure of the human brain to recognize data patterns. Breakthroughs have been made in many fields, such as computer vision, speech recognition, and natural language processing, and deep learning has been adopted by Stanford University, Google, Baidu Research Institute, and other authoritative organizations as a strategic direction for the development of data mining and artificial intelligence [32,33]. Kuremoto et al. [34] combined the restricted Boltzmann machine with time-series laws to obtain a prediction model that fits the sample data with the minimum model energy. Lv et al. [35] proposed a deep learning model to predict traffic flow based on an autoencoding network that applies compression coding to the input data. Zhao et al. [36] proposed a traffic congestion prediction model based on an improved SVM, which learns the characteristics of traffic flow parameters through a deep structure by digitizing different environmental and human factors. The abovementioned methods speed up data processing by applying deep learning models but do not take into account the dimensional disaster caused by the high-dimensional states of traffic flow parameters. To address this problem, some scholars used data compression techniques based on the LSTM, Principal Component Analysis (PCA) [37], the CUR matrix decomposition algorithm [38], and the Discrete Cosine Transform (DCT) method [39] to reduce the data dimension.
Q-Learning can efficiently store and extract data to support traffic prediction. The LSTM network reduces the frequency of gradient explosion and disappearance, so it is suitable for capturing the spatiotemporal evolution of traffic state parameters [40][41][42][43]. In this paper, considering the time sequence of traffic flow parameters and the continuity of traffic congestion effects, a recurrent neural network model is used to train the extracted features and obtain low-dimensional vectors of historical information; the resulting vectors are then stitched together for classification training. Finally, an urban road traffic state prediction model based on the optimized deep recurrent Q-Learning method is established. The model proposed in this paper makes the following contributions: (1) the model effectively solves the problem of gradient explosion and gradient disappearance in the LSTM prediction process; (2) the model effectively extracts the associated features of the traffic data, so it has better prediction efficiency and accuracy; (3) owing to its efficiency and feasibility, the model provides a feasible prediction method for the construction of an intelligent transportation system. The rest of this paper is organized as follows. Section 2 points out the problems to be solved and the corresponding methods. Sections 3 and 4 introduce the principles and steps of Q-Learning and the LSTM. After that, the deep recurrent Q-Learning network model is constructed in Section 5. The example analysis in Section 6 proves the stability and feasibility of the method. Finally, Section 7 concludes the paper.

Specific Problems.
The problems with urban traffic data are high repeatability, high loss rate, and poor correlation. The existing prediction methods mainly discuss the results of independent analysis and whether they meet the needs of further verification. Therefore, the following problems exist in data preprocessing and optimization prediction.
Regarding the problem of data relevance: the states at the previous moment and the next moment lack an effective connection. The information at different states is therefore disconnected, and the timeliness of the data cannot be fully exploited. As a result, the prediction results are insufficiently correlated with the data at the previous moment and lack persuasiveness.
Regarding the problem of data storage: with the existing analysis methods, the storage capacity of the database quickly reaches its threshold, which is not conducive to long-term, durable prediction. Besides, repeated analysis steps increase the feedback delay and cannot fulfill the requirements of low-latency traffic prediction.
Concerning the problem of comprehensive data analysis: the existing analysis focuses on fixed types of data, whereas the traffic environment is an integrated system. Therefore, even if the prediction results are accurate, they cannot reflect the objective situation.

Solutions.
For the abovementioned three research problems, this paper proposes the corresponding solutions. For the problem of data relevance: based on the optimized LSTM model, the effective correlation and information accumulation of different data types are strengthened, and the degree of correlation between data at different moments is increased.
For the problem of data storage: Q-Learning functionalizes the data information, so each data cell can be realized through the expression of functions. This method not only reduces the pressure of data storage but also improves analysis efficiency and accuracy.
For the problem of comprehensive data analysis: traffic conditions are affected by multiple factors. Therefore, when selecting the characteristic data types, climate and temperature are considered in addition to the basic parameters of traffic flow. That is, a multidimensional data analysis system is established, making the prediction results more accurate and objective.

Q-Learning Principle and Application Steps
The steps of Q-Learning are as follows: the state of the agent in the environment E is S, and the actions taken by the agent constitute the action space A. The agent takes different actions to transfer between states, and the reward function obtained is R. To achieve the optimal strategy, Q-Learning estimates the value of each action choice in each state. Q-Learning uses Q(S, A) to represent the state-action value function and continuously updates the value of Q(S, A) according to the state transitions. Finally, Q-Learning obtains the optimal strategy based on Q(S, A). The value function Q(S, A) of the traffic state is updated as follows: assuming the state of the agent at time t is s_t and the action is a_t, the state then transitions at time t + 1 to s_{t+1} with reward r_{t+1}. Finally, the agent updates the value of Q(s, a) according to all records (s_t, a_t, r_{t+1}, s_{t+1}) to find the optimal strategy. The corresponding update function is shown in the following equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + λ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],

where Q(s_t, a_t) is the current Q-table, α is the learning rate, r_{t+1} is the benefit at the next moment, λ is the greed coefficient, and max_a Q(s_{t+1}, a) is the best benefit in memory. The deep Q-Learning network combines deep learning and Q-Learning. The network uses the perceptive ability of deep learning to transform the state to high dimensions and uses the decision-making ability of Q-Learning to map the high-dimensional state representation to the low-dimensional action space [44,45]. In the Q-Learning algorithm, a table is used to store the value of Q(s, a). In deep Q-Learning, the state dimension of the agent is high, and the table obviously cannot meet the demand. This problem is solved by using f(s, a) to approximate Q(s, a) [46,47].
Therefore, based on the corresponding value-function neural network model, approximate values can be obtained, thereby reducing the storage pressure of the Q-table and providing ideas and methods for applying Q-Learning to traffic state prediction. Finally, the network obtains the action value of congestion and dissipation according to the accumulated experience pool. Figure 1 shows a schematic diagram of the principle of approximating the "state-action" value through the neural network. The network helps solve the problem of processing huge data volumes. Because traffic data form a strong time series, the application of this network makes the analysis results more reliable. Further demonstrations and experiments are discussed in the following sections.
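As an illustration, the tabular update rule above can be sketched in a few lines of Python. The state and action encodings here are hypothetical stand-ins; the paper's actual states are traffic parameter vectors.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, lam=0.9):
    # Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [r_{t+1} + lam * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
    td_target = r + lam * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 3 traffic states (free flow, slow, congested) and 2 actions.
Q = np.zeros((3, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

With an all-zero table, one update moves Q[0, 1] toward the received reward by a step of size alpha; in deep Q-Learning the table lookup is replaced by the network approximation f(s, a).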

Overview of the Recurrent Neural Network.
The recurrent neural network is one of the optimized variants of deep neural networks. Its characteristic is that the output of the neurons at a certain moment is fed back as part of the input at the next moment, so the network can memorize the information of the previous moment and realize the persistence of information. As shown in Figure 2, the neural network reads the input x_t at the current time t and produces the output h_t. At the same time, the information state is returned to the network as one of the inputs at the next time point. To show the execution more intuitively, it can be expressed as s_t = f(U x_t + W s_{t−1}) and h_t = g(V s_t), so the output h_t at each moment depends on the hidden state carried over from the previous moment. The recurrent neural network is the most natural structure for processing sequence data, which is exactly what is needed to handle the historical and real-time data in this paper.
s_t: the state of the hidden layer at time t, also known as the memory unit of the recurrent neural network.
U: the weight parameter matrix from the input sequence X to the hidden layer state S.
W: the weight parameter matrix between the hidden layer states S.
V: the weight parameter matrix from the hidden layer state S to the output sequence H.
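Using the notation above, a single recurrent step can be sketched as follows. The tanh hidden activation and linear output are illustrative choices; the paper does not fix them.

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    # s_t = tanh(U x_t + W s_{t-1}): new hidden state (the memory unit)
    s_t = np.tanh(U @ x_t + W @ s_prev)
    # h_t = V s_t: output at time t, which depends on the carried-over state
    h_t = V @ s_t
    return s_t, h_t

# One step with 3 input features, 4 hidden units, and 2 outputs.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
s_t, h_t = rnn_step(rng.normal(size=3), np.zeros(4), U, W, V)
```

The returned s_t is fed back as s_prev at the next time step, which is how information persists across the sequence.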

Recurrent Neural Network LSTM.
If the dependency interval between sequences is long, gradient disappearance occurs in an ordinary RNN applied to traffic data, making it difficult to retain information from earlier times. The LSTM network remembers long-term historical information through the design of its network structure, in which the output of the network at time t is fed back into the cell at time t + 1 to avoid gradient disappearance; the network is unrolled along the time axis. The schematic diagram and the detailed diagram of the three gate layers are shown in Figures 3 and 4.

Figure 1: Schematic diagram of the neural network approximation of the "state-action" value.

It can be seen from Figure 3 that the LSTM defines the key concept of the cell state, drawn as the horizontal line. There is little information interaction in the cell, and the purpose of memorizing long-term information is achieved through cell-state transmission. As shown in Figure 4, the cell is made up of three gate layers, the first of which is the forget gate. This gate is determined by the input of the current moment and the output of the previous moment, which pass through a Sigmoid layer to obtain the result. It determines how much of the cell state from the previous moment is retained at the current moment. The expression of the forget gate is shown in the following equation:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),

Journal of Advanced Transportation
where f_t represents the output of the forget gate and σ represents the Sigmoid function. W_f and b_f represent the weight matrix and the bias term, respectively.
Here, [h_{t−1}, x_t] represents the concatenation of two vectors into a longer vector. The second gate is the input gate, which determines how much of the network input is saved to the cell state at the current moment. The expressions of the input gate are shown in the following equations:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i),
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c),

where i_t is the output of the input gate and c̃_t is the candidate cell state. The cell information can then be updated from the results of the forget gate and the input gate as follows:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,

where c_t is calculated by multiplying the previous cell state c_{t−1} element-wise by the forget gate f_t, multiplying the candidate state c̃_t element-wise by the input gate i_t, and adding the two products. The last gate is the output gate, which controls how much of the cell state is passed to the current output value of the LSTM. As shown in Figure 4, the output is composed of two parts: the cell state processed by tanh and the input information processed by the Sigmoid. The expressions of the output gate are listed in the following equations:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o),
h_t = o_t ⊙ tanh(c_t),

where o_t represents the output of the output gate, and W_o and b_o represent the weights and offsets, respectively.

Deep Recurrent Q-Learning Network

Step 2: because of the influence of the traffic states before and after training, it is necessary to determine whether an action can obtain good feedback before execution. The action a is performed according to the strategy p, and the cumulative return is calculated after the strategy is executed. The state value function is expressed as follows:

V^p(s) = Σ_{s′} p(s, s′)[R(s, s′) + c V^p(s′)],

where V^p(s) represents the return obtained by following the strategy p from state s, p(s, s′) represents the probability of the state transition, R(s, s′) represents the reward obtained from s → s′, and c is a discount coefficient.
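A minimal NumPy sketch of one LSTM cell step, following the gate equations described above. The weight shapes are illustrative; here all-zero weights are used so that the output can be checked by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input gate
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # cell-state update
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(c_t)            # new hidden output
    return h_t, c_t

# One step with 3 input features and 2 hidden units (all-zero weights for illustration).
hid, inp = 2, 3
zW, zb = np.zeros((hid, hid + inp)), np.zeros(hid)
h_t, c_t = lstm_step(np.ones(inp), np.zeros(hid), np.ones(hid),
                     zW, zb, zW, zb, zW, zb, zW, zb)
```

With zero weights, every gate outputs sigmoid(0) = 0.5 and the candidate state is 0, so the cell simply halves its previous state; in a trained network the gates learn which historical information to keep.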

Reward Actions.

The reward after training in the previous state s is represented by the difference in delay time. The neural network uses s as the input, Q(a, n_features) is the storage table, and n_features denotes the number of input neurons. Therefore, the output vector dimension is 2^{n_features} = 32. The structure of the memory storage pool after the reward is [n_features, a, r, n_features].
During the prediction of the future situation, the current state is input and the Q-values under the various possibilities are output, with the largest one being selected. As the reward iterations deepen, the target results gradually approach the actual situation; the Q-value here refers to the traffic delay index.

Training Method.
Based on the construction of the abovementioned state space and reward actions, we train on the datasets from Wanjiali Road to Shuangtang Road on the elevated Wanjiali Road in Changsha. The main steps of the training method are as follows:
Step 1: preprocessing of the traffic data and weather data (culling abnormal data, Lagrange interpolation, and normalization).
Step 2: selection of the training and test sets (the time interval of the training sets is from 0:00 on May 17, 2019, to 24:00 on May 24, 2019; the time interval of the test sets is from 0:00 to 12:00 on May 25, 2019).
Step 3: determining the input and output variables and the number of network layers (the input variables are speed, delay time, travel time, temperature, and precipitation probability; the output variable is the delay index; the number of hidden neurons lies in the interval [4, 13]; and there are 3 network layers).
Step 4: determining the initial weights, thresholds, learning rate, activation function, and training function (the interval of the initial weights and thresholds is [0, 1], the learning rate is 0.01, the activation function is the Sigmoid function, and the training function is Adam).
Step 5: training the neural network model and stopping the training when the feedback reaches the optimal state of the Q-value table. If this is not satisfied, the parameter values (learning rate and training function) must be modified and adjusted.
Step 6: adjusting the parameters and feeding in the test set data to obtain the best prediction results.
Step 7: analyzing the prediction results to get the final experimental results.
In this paper, the activation functions of the LSTM forget, input, and output gates are all Sigmoid functions; their output interval of [0, 1] is consistent with human thinking. The pseudocode for building the deep recurrent Q-Learning network is shown in Algorithm 1.
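As a hedged illustration of the configuration in Steps 3 and 4, the forward pass of such a network can be sketched with NumPy. The hidden widths and the random seed are assumptions; the paper only specifies three hidden layers with widths in the interval [4, 13], Sigmoid activations, and initial weights in [0, 1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 5 inputs (speed, delay time, travel time, temperature, precipitation
# probability) -> three hidden layers (widths chosen from [4, 13], here 8)
# -> 1 output (delay index). Initial weights and thresholds drawn from [0, 1].
sizes = [5, 8, 8, 8, 1]
weights = [rng.uniform(0, 1, (n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
biases = [rng.uniform(0, 1, n_out) for n_out in sizes[1:]]

def predict(x):
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)   # Sigmoid activation per Step 4
    return x

y = predict(np.array([0.5, 0.2, 0.3, 0.7, 0.1]))   # normalized inputs
```

In the actual training loop the weights would be updated with Adam at a learning rate of 0.01, as listed in Step 4.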

Data Description.
This paper selects a section of an arterial road in Changsha, running from Wanjiali Road to Shuangtang Road from north to south, as the research case. A crawler script written in Python 3.7 was used to capture real-time traffic information from the big data platform of Gaode Map. The data were collected from 0:00 on May 17, 2019, to 12:00 on May 25, 2019, with a 5-min sampling interval. The collected data types include actual time, speed, delay time, travel time, temperature, probability of precipitation, and delay index. A sample of the data set is shown in Table 1.
After preprocessing, the data of this case are divided into training sets and test sets. The time interval of the training sets is from 0:00 on May 17, 2019, to 24:00 on May 24, 2019, and the time interval of the test sets is from 0:00 to 12:00 on May 25, 2019.

Data Preprocessing.
Data preprocessing includes three steps: culling abnormal data, Lagrange interpolation, and normalization.
The detailed information is shown in Figure 5. The first step is to cull abnormal data, that is, data that deviate significantly from the normal interval. By deleting such data, the experimental data become more realistic and the analysis results more reasonable. Some samples of abnormal data are shown in Table 2. The second step is Lagrange interpolation, which is used to fill in missing data based on the neighboring traffic datasets to improve the value of the data.
This step is applied to achieve data integrity and rationality. The data filling function for this step is listed as follows:

y(x) = Σ_{i=1}^{n} y_i Π_{j≠i} (x − x_j)/(x_i − x_j),

where y(x) is the interpolation polynomial of degree n − 1 and (x_i, y_i) are the known neighboring data points. The third step is data normalization. The purpose of this step is to keep the magnitude of the data within a small fluctuation range, reduce the impact of differing magnitudes across the horizontal data, and improve prediction accuracy. The function is listed as follows:

x′ = (x − min)/(max − min),

where max is the maximum value of the sample data and min is the minimum value of the sample data. The preprocessed data are transformed into a list to form a matrix and finally transformed into a three-dimensional array, which serves as the input of the LSTM unit and forms the basic unit of the hidden layer. Every 15 rows of valid data are used as one training sample, and training is repeated 100 times. The test sets are predicted based on the training memory to obtain the prediction results.
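The normalization and windowing steps above can be sketched as follows; the random matrix stands in for the real preprocessed traffic records.

```python
import numpy as np

def min_max_normalize(x):
    # x' = (x - min) / (max - min), applied column-wise
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def to_windows(data, window=15):
    # Stack every `window` consecutive rows into one sample, producing the
    # three-dimensional array (samples, window, features) fed to the LSTM.
    n = len(data) - window + 1
    return np.stack([data[i:i + window] for i in range(n)])

raw = np.random.default_rng(1).uniform(0, 60, size=(100, 5))
data = min_max_normalize(raw)
X = to_windows(data, window=15)
```

Each 15-row slice becomes one LSTM input sample, so 100 rows of 5 features yield an array of shape (86, 15, 5).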

Prediction Module Construction.
According to the data analysis, visualization, and platform requirements, this paper introduces NumPy, Pandas, and Matplotlib as analysis tools, and TensorFlow, an open-source deep learning library, is used to build the basic library. According to the needs of the model, a variety of modules are constructed, including the traffic environment module, deep reinforcement learning module, memory module, behaviour selection module, neural network module, training main program module, loss curve module, and visualization module. The first step is to initialize the traffic data, network environment, and training parameters to build a neural network for prediction. The second step is to feed the training and test sets to the input layer; multidimensional data introduction is performed in the hidden layer, and data prediction is performed based on the experience pool in the output layer. Meanwhile, the structural dimensions of the input and output are displayed at each stage. The detailed flowchart is shown in Figure 6.

Parameter Impact Analysis.
For traffic prediction, the most critical indicators are prediction efficiency and accuracy. Therefore, the parameter impact analysis, the optimization index analysis, and the accuracy analysis are performed in the following sections. The parameter impact analysis and optimization index analysis evaluate prediction efficiency, while the accuracy analysis evaluates prediction accuracy.
For the neural network used in traffic prediction, the key parameters affecting the efficiency of the experiment are studied, including the learning rate, reward decay, greedy, memory size, replacement interval, and batch size. The batch size is fixed, and the remaining items are variable parameters. Group 6 is the initial parameter group used for comparison and analysis with the other groups; the parameter groups are therefore divided into weakened-state parameter groups (groups 1-5) and strengthened-state parameter groups (groups 7-11). Note: the road delay index, defined as the ratio of the actual travel time to the free-flow travel time for city residents, is used as the evaluation index of the urban congestion degree.
(1) Initialize the network structure with parameter q; initialize the target network with parameter q′ = q.
(2) Initialize the greedy parameter epsilon, learning rate, reward attenuation coefficient gamma, number of iteration rounds Episodes, number of iterations per episode T, training batch size, and neural network parameter replacement cycle transfer_cycle.
(3) for each episode in Episodes do
(4) Initialize the traffic state s_t = s_0.
(5) for t from 0 to T:
(6) Select a behavior (output an integer in the range 0 to 2^{n_features} − 1): select a_t = argmax_a Q(s, a, θ) with probability 1 − epsilon, and randomly select the behavior a_t with probability epsilon.
(7) After the behavior is determined, find all states s_all in the data table that match this behavior, and randomly select one from s_all as s_{t+1} (if no match is found in s_all, redetermine the behavior).
(8) Put the experience (s_t, a_t, r_t, s_{t+1}) into the memory pool.
(9) Randomly take out batch_size data and calculate q_eval and q_next, respectively.
(10) Construct y = r_t + gamma · max_a Q(s_{t+1}, a | q′) → q_target.
(11) According to q_eval and q_target, back-propagate to improve the network q.
(12) If the number of iterations is an integer multiple of transfer_cycle, update q′ = q.
(13) Set the current state to s_{t+1}.
(14) When the maximum iteration number T of a single round is reached, stop the training of this round and return the traffic state to the initial trial.

Each group only weakens or strengthens one parameter for comparison with group 6. To improve the discrimination of the experimental results, the parameter values are selected with an obvious gradient; the specific values are shown in Table 3.
In order to reflect the differences in the experimental results of each group, this paper selects the indicators with obvious discrimination for analysis: the highest loss index, the lowest loss index, the maximum volatility, the training time, the prediction time, and the total time. The detailed index distribution of each group is shown in Table 4. The highest loss index and maximum volatility of group 4 deviate seriously from the other groups and exceed the normal fluctuation range, which shows that the experimental efficiency analysis of group 4 has no research value. Therefore, group 4 is eliminated before the comparative analysis. The qualitative analysis is performed first. Since the highest loss index, the lowest loss index, the maximum volatility, and the total time are important parameters of experimental efficiency, they are visualized as shown in Figure 7.
From this, the following conclusions are reached: (1) the indicators of group 6 are all at the highest level, so either gradient weakening or gradient strengthening of the parameters can optimize the experimental results, but attention must be paid to extreme parameter combinations, whose prediction results have no practical value (e.g., group 4).
(2) With the gradient adjustment of the parameters, all indicators fluctuate within a relatively small range without a sharp rise or decline, which confirms the stability of the model proposed in this paper.
(3) The weakened and strengthened states of the memory size and replacement interval show only slight fluctuations compared with the initial state, indicating that these two parameters have little effect on the experimental efficiency. The quantitative analysis is performed next. According to the optimization degree of each parameter, the effect on experimental efficiency is determined. The parameter optimization degree function is shown in the following equation:

O = Σ_i w_i (x_{0,i} − x_i)/x_{0,i},

where O represents the optimization degree of each group, x_{0,i} represents the parameter value of the initial group, x_i represents the parameter value of the variable group, and w_i represents the optimization weight of the corresponding parameter.
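Assuming the optimization degree takes the weighted relative-improvement form, it can be computed as follows. The index values and weights below are hypothetical, not the figures from Tables 4 and 5.

```python
def optimization_degree(x0, xi, w):
    # O = sum_i w_i * (x0_i - xi_i) / x0_i: weighted relative improvement of
    # the variable group xi over the initial group x0.
    return sum(wi * (a - b) / a for wi, a, b in zip(w, x0, xi))

# Hypothetical indices: [maximum volatility, total time], equal weights.
O = optimization_degree(x0=[2.0, 4.0], xi=[1.0, 2.0], w=[0.5, 0.5])
```

Halving both indicators relative to the initial group yields O = 0.5, i.e., a 50% weighted improvement.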
From the perspective of forecasting efficiency, the maximum volatility and total time are the most representative, followed by the highest loss index and the lowest loss index. Therefore, the initial weight distribution of each parameter is shown in Table 5.
Based on equation (12) and the weight distribution, optimization calculations are performed for the weakened groups and the strengthened groups relative to the initial group. The following conclusions are drawn from the quantitative results in Table 6.
(1) The optimization effect of the groups adjusted by the memory size and replacement interval is weaker than that of the other groups, which further confirms conclusion (3) of the qualitative analysis. (2) The optimization effects of the parameters on the experimental efficiency differ considerably, indicating that the parameters have different emphases. (3) Both the weakened and strengthened groups improve the experimental efficiency, indicating that group 6 is already at or near the worst parameter combination; from this, the lower limit of the parameter combination can be determined.
This section analyses the effect of five parameters on the experimental efficiency. The results show that the memory size and replacement interval have a small effect on experimental efficiency, while the learning rate, reward decay, and greedy have a significant influence. Therefore, these three parameters are analysed for efficiency in the next section.

Optimization Index Analysis.
In this section, we performed optimization index analysis on learning rate, reward decay, and greedy. Based on the abovementioned analysis, the replacement interval is set to 300 and the memory size is set to 500.
This part uses orthogonal experiments with three factors and three levels for evaluation. The three factors are the learning rate, reward delay, and greedy, recorded as A/B/C; the corresponding levels are A1/A2/A3, B1/B2/B3, and C1/C2/C3, respectively. First, fix A and B at the levels A1 and B1, and match the three levels of C to obtain A1B1C1, A1B1C2, and A1B1C3. If A1B1C3 is optimal, fix the level C3. Then, with A1 and C3 fixed, match the remaining two levels of B to obtain A1B2C3 and A1B3C3. After these tests, if A1B2C3 is optimal, fix the levels B2 and C3 and try the two tests A2B2C3 and A3B2C3. If A3B2C3 is optimal, it is the best level combination.
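The stepwise search just described (fix two factors, vary the third, keep the winner) can be sketched as follows. The loss function here is a hypothetical stand-in for the measured optimal loss coefficients.

```python
def sequential_search(loss, levels=(1, 2, 3)):
    # Start from A1 B1 C1 and tune one factor at a time: C, then B, then A,
    # fixing the best level found at each stage.
    best = [levels[0], levels[0], levels[0]]
    for factor in (2, 1, 0):
        trials = []
        for lv in levels:
            trial = list(best)
            trial[factor] = lv
            trials.append(trial)
        best = min(trials, key=lambda t: loss(tuple(t)))
    return tuple(best)

# Hypothetical loss surface whose minimum lies at A3 B2 C3.
toy_loss = lambda t: abs(t[0] - 3) + abs(t[1] - 2) + abs(t[2] - 3)
```

Unlike a full 3×3×3 grid (27 runs), this sequential scheme evaluates far fewer combinations, which is the point of the orthogonal design.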
When the loss curve is more stable and the optimal loss coefficient is lower, the corresponding training is better. The stability of the loss curve is reflected by the amplitude of the curve fluctuation, which can be seen visually from the curve formed by the ratio of the loss difference to the time difference; the optimal loss coefficient is obtained directly from the experimental results. The parameter combinations and corresponding results for the first test are shown in Table 7 and Figure 8. The first test adjusts the value of greedy. The weakening adjustment of greedy significantly improves the stability of the prediction and the optimal loss parameter; therefore, a reasonable adjustment of greedy helps smooth the fluctuations of the network and keeps the training within a reasonable range. Based on the abovementioned experimental results and analysis, the optimal combination in the first test is group c. The parameter combinations and corresponding results for the second test are shown in Table 8 and Figure 9. The second test evaluates changes in the reward delay. The effect of the reward delay on the stability of the system is more significant than that of greedy, indicating that the system is more sensitive to the reward delay. Therefore, the tuning of this parameter can be combined with greedy to achieve the optimal stability of the system; the optimal combination in the second test is group c.
The parameter combinations and corresponding results for the third test are shown in Table 9 and Figure 10. The object of the third test is the learning rate, which not only puts forward higher requirements for system stability but also drives the optimization of the loss index. Therefore, of the three factors, the learning rate has the greatest impact on the system. Meanwhile, the optimal loss index does not decrease as the learning rate increases, which indicates that the system optimization has a threshold and the relationship is not monotonically negative. The optimal combination for the third test is group f; therefore, the best combination obtained over all the tests is group f. The following conclusions are drawn from the analysis of the experimental results: (1) the learning rate, greedy, and reward delay all affect the stability of the system, among which the reward delay has the greatest impact, while the learning rate is the only parameter that effectively improves the optimal loss index; (2) the values of the three parameters have corresponding valid intervals, and when an interval is exceeded, the prediction process fluctuates sharply and the experimental efficiency is affected.

Accuracy Analysis.
The accuracy analysis is divided into two stages: the comparison of the predicted delay index with the actual delay index, and the accuracy comparison between the traditional methods and the method proposed in this paper. The first stage is the comparison between the predicted delay index and the actual delay index. Taking group f as the standard group, a comparison chart is first drawn based on the predicted and actual delay indices, and the prediction accuracy of groups 1-11 and A-G is then calculated. This enables a preliminary accuracy evaluation.
It is known from Figure 11 that the degree of agreement between group f and the actual delay index is extremely high, and the prediction performs better in the first half than in the second half. It is known from Tables 10 and 11 that group f still has the highest prediction accuracy. Therefore, the following conclusions are obtained in the first stage: (1) The neural network under group f is the best choice in terms of experimental efficiency and accuracy, which provides a strong guarantee for short-term traffic prediction. (2) The prediction accuracy weakens over the course of the prediction process.
If the neural network is used for long-term prediction, the network needs to be further optimized. (3) There is no absolute correlation between experimental efficiency and accuracy, so the two must be analyzed separately. For example, the experimental efficiency of group 4 deviates seriously, yet its accuracy remains within a reasonable range.

The second stage: the accuracy analysis of the traditional methods and the method proposed in this paper. To further verify the superiority of the proposed method, its accuracy is compared with that of the LSTM, KNN, SVM, exponential smoothing, and BP neural network. All prediction processes are based on the data used in this paper. Finally, two representative indicators, prediction accuracy and MSE, are used to measure the effectiveness of the forecast. In the LSTM, the data from 0:00 on May 17th to 24:00 on May 24th are used as the training set, and the data from 0:00 to 12:00 on May 25th are used as the test set. The prediction results are shown in Figure 12.
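The sliding-window preparation implied by this train/test split can be sketched as follows; the window length and the use of a single delay-index feature are illustrative assumptions, as the paper does not specify them:

```python
import numpy as np

def make_windows(series, window=12):
    """Turn a 1-D delay-index series into (X, y) supervised pairs for a
    recurrent model: X[i] holds `window` past values, y[i] the next one."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X[..., None], y  # add a feature axis expected by LSTM layers

# Hourly delay indices for the 8 training days, May 17th-24th
# (values here are synthetic placeholders).
train = np.random.rand(8 * 24)
X, y = make_windows(train, window=12)
```

Each training sample thus pairs 12 hours of history with the delay index of the following hour; the May 25th test half-day is windowed the same way.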
In the KNN and SVM predictions, the prediction accuracy of the SVM and KNN on the same data is shown in Table 12.
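A minimal sketch of such KNN and SVM regression baselines with scikit-learn is given below; the feature choice, hyperparameters, and synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic feature matrix (e.g. speed, delay time, travel time, temperature)
# with a linear target standing in for the delay index.
X = rng.random((200, 4))
y = X @ np.array([0.5, 1.0, -0.3, 0.2]) + 0.01 * rng.standard_normal(200)

X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]

for name, model in [("KNN", KNeighborsRegressor(n_neighbors=5)),
                    ("SVM", SVR(kernel="rbf", C=10.0))]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name} MSE: {mse:.4f}")
```

Both models are fitted on the same split so their MSE values are directly comparable, mirroring the comparison reported in Table 12.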
In the exponential smoothing forecast, the data from 0:00 to 22:00 on May 20th are used to predict the traffic state from 23:00 to 24:00. Second-order and third-order exponential smoothing are performed and finally compared with the actual delay index. The experimental results show that the fitting curve under second-order exponential smoothing gives the better prediction result, as shown in Figure 13.
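Second-order (Brown's) exponential smoothing builds a one-step-ahead forecast from two successive smoothings of the series. A minimal sketch is shown below; the smoothing coefficient is an illustrative assumption:

```python
def double_exponential_smoothing(series, alpha):
    """Brown's second-order exponential smoothing: smooth the series twice,
    then combine the two smoothed values into a one-step-ahead forecast."""
    s1 = s2 = series[0]
    for x in series:
        s1 = alpha * x + (1 - alpha) * s1   # first smoothing
        s2 = alpha * s1 + (1 - alpha) * s2  # second smoothing
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + trend                    # forecast for the next step

# A constant series should forecast its own value.
assert abs(double_exponential_smoothing([3.0] * 10, 0.4) - 3.0) < 1e-9
```

Third-order smoothing repeats the same recursion a third time and adds a quadratic term, which the comparison in Figure 13 shows is unnecessary for this data.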
In the BP neural network prediction, speed, delay time, travel time, air temperature, and precipitation probability are used as the input matrix, and the delay index is used as the output matrix. 90% of the data are used as the training set, 5% as the validation set, and 5% as the prediction set, and 10 hidden layers are used to construct the BP neural network. Training with the Levenberg-Marquardt algorithm is performed until the best effect is achieved. Finally, the error distribution and prediction results are obtained, as shown in Figures 14 and 15.
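A rough sketch of this setup with scikit-learn follows. Two substitutions are assumptions on our part: MLPRegressor does not implement the Levenberg-Marquardt algorithm, so the default Adam solver stands in for it, and a single hidden layer of 10 neurons stands in for the exact hidden-layer configuration; the feature values are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Columns: speed, delay time, travel time, air temperature, precipitation
# probability; the target stands in for the delay index (synthetic data).
X = rng.random((400, 5))
y = X @ np.array([-0.8, 1.2, 0.9, 0.1, 0.3])

n = len(X)
n_train, n_val = int(0.9 * n), int(0.05 * n)  # 90% / 5% / 5% split
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]

# Assumed architecture: one hidden layer of 10 neurons, Adam solver
# in place of Levenberg-Marquardt (not available in scikit-learn).
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("validation R^2:", net.score(X_val, y_val))
```

The held-out validation set plays the role described in the text: training continues until performance on it stops improving, and the 5% prediction set yields the error distribution of Figure 14.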
It can be seen from Table 13 that the method proposed in this paper is more accurate than the LSTM alone. Comparing the proposed method with the other representative prediction methods, its prediction effect is clearly better. Therefore, this further confirms the superiority of the proposed method, which can meet the demand for high efficiency and precision in traffic prediction and is feasible for practical application.
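The two evaluation indicators used above can be computed as follows. The paper does not give its exact accuracy formula; 1 minus the mean absolute percentage error is a common choice and is an assumption here:

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between actual and predicted delay indices."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean((a - p) ** 2))

def accuracy(actual, predicted):
    """Prediction accuracy as 1 - MAPE (assumed formula; requires
    nonzero actual values)."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(1.0 - np.mean(np.abs((a - p) / a)))
```

Applying both functions to each method's forecasts on the same test interval reproduces the kind of side-by-side comparison summarized in Table 13.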

Conclusions
This paper proposed a short-term traffic flow prediction model for urban roads based on the LSTM and Q-Learning, aiming to solve the problems of low temporal correlation of traffic data, large data volume, poor comprehensive analysis, and slow feedback of prediction results. The analysis results showed that the model has excellent stability and prediction accuracy. Therefore, the model is feasible to apply in actual traffic scenarios and can provide accurate information guidance to reduce traffic congestion and accident rates. Moreover, it could provide substantial methodological support for the development of active safety.
At the same time, a limitation of this model is that the volume and dimensionality of the data used in this training are not large enough. With sufficient data volume and dimensionality, the training and prediction results would be more mature. Therefore, the next research goal is to pursue more multidimensional research directions based on deep mining of effective traffic data.
In the future, we will focus on exploring more efficient prediction methods based on the results of this paper. In addition, a series of traffic conditions, such as future traffic flow, accident trends, and driving behavior trends, will be predicted by introducing more relevant data.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.