Reinforcement learning can learn on its own in complex multidimensional spaces because it needs neither an accurate mathematical model nor much prior knowledge of the environment. This paper takes the single intersection, the arterial, and the regional road network formed by a group of multiple intersections as its research objects. Based on the three key parameters of cycle length, arterial coordination offset, and green split, a set of hierarchical control algorithms based on reinforcement learning is constructed to optimize and improve the current signal timing scheme. The results show that the traffic signal optimization strategy based on reinforcement learning is suitable for complex traffic environments (high flows and multiple intersections), where it outperforms current optimization methods under high-flow conditions at single intersections, on arterials, and in regional multi-intersection networks. In short, the paper studies the problem of insufficient traffic signal control capability and applies the hierarchical control algorithm based on reinforcement learning to traffic signal control, so as to provide new ideas and methods for traffic signal control theory.
Traffic congestion has become a problem of worldwide concern. With the increasing number of vehicles, it deeply affects people’s daily lives and social and economic development. Traffic control is one of the most important technological means of regulating traffic flow, relieving congestion, improving safety, and even saving energy and reducing emissions. At present, signal-controlled networks not only suffer prolonged congestion at peak times but also show insufficient ability to disperse traffic during those periods. Rational analysis and control are therefore regarded as important tools for easing traffic pressure. Progress in traffic control has always kept pace with the times, accompanied by advances in information technology, computer technology, and system science.
According to the system’s ability to adapt to the environment and the level of intelligent decision-making, Gartner proposed the evolution of urban transport control system development level in 1996 [
The digital and information infrastructure of urban road traffic and the related systems have developed rapidly in the past ten years, and urban traffic control is moving from a “data-poor” era to a “data-rich” era. Meanwhile, intelligent connected vehicles (ICVs) and autonomous vehicles will jointly shape the future traffic environment; they differ significantly from conventional manually driven vehicles in individual information acquisition, perception ability, reaction time, interactive behavior, and so on. These new requirements create a high-level demand for the next generation of traffic control [
Under the conditions of limited cross-section traffic flow data, many existing adaptive traffic control systems have adopted traffic models to actively predict the evolution of network traffic flows and then adopted the aggregative indicator method to optimize and solve timing parameters. However, the real-time detection of the spatiotemporal data based on urban road network traffic status can provide rich and high-quality basic data and fine-grained assessment of control effects for traffic control. In the face of the main defects encountered in the existing self-adaptive traffic control system, a closed-loop feedback self-adaptive control system with better uncertainty response capability and higher intelligent decision-making level is an inevitable result of the objective needs of the development and application technologies [
By collecting states, rewards, and punishments in real time, reinforcement-learning signal control at a single intersection can, through interaction with the environment, find a signal control strategy suited to the characteristics of the traffic flow. In recent years, more and more domestic scholars have studied the principles of reinforcement learning and discussed the application of reinforcement learning algorithms to traffic control. Reinforcement learning has developed rapidly in optimization control [
Scholars have done a lot of research on reinforcement learning theory, algorithms, and applications and have obtained many famous research results. Ma D proposed a new control method that brings significant and positive effects to the bottleneck link itself and to the entire test area [
Reinforcement learning control offers real-time, online feedback control, which fits the idea of adaptive signal control at urban intersections particularly well. However, it remains an open question whether a traffic signal optimization strategy based on reinforcement learning is applicable to all traffic environments.
Reinforcement learning is a typical data-driven control method. This paper proposes a method for improving signal control schemes. Subregions are divided according to their different traffic flow characteristics. Based on the three key parameters of cycle length, arterial coordination signal offset, and green split, a set of hierarchical control algorithms based on reinforcement learning is constructed to optimize and improve the current signal timing scheme.
In regional coordination control, the first task is to divide the coordination subregions. In a signal-controlled road network, each intersection has a range of influence, and the intersections and sections within this range are strongly affected by it. To quantify this impact and define the scope of influence, the literature defines direct relevance to describe the relationship between adjacent intersections: the relevance is strong when the traffic flowing from the upstream node into the downstream node is close to or greater than the approach capacity of the downstream node. The path correlation is found to depend mainly on the topology of the traffic network and the OD distribution between the two intersections. The more OD paths pass through both nodes, the stronger the correlation between the nodes. The higher the flow on the OD paths passing through both nodes, the stronger the correlation. The more exclusive the OD paths passing through both nodes are, the stronger the correlation.
The optimization scope is the region-level road network. The control subregions are divided using characteristic parameters such as the average travel time and the vehicle OD volume between intersections, and the traffic coordination control subregions are then finally determined.
The signal cycle is the time required for the signal indications to complete one full sequence in the set phase order, that is, the sum of the durations of all control steps in one cycle. The signal cycle is the key control parameter that determines the effectiveness of traffic signal control. If the cycle is too short, vehicles in all directions cannot pass through the intersection smoothly, which causes frequent stops and lowers the utilization rate of the intersection. If the cycle is too long, drivers wait too long and vehicle delay increases greatly. In green wave control, the maximum signal cycle, that of the key intersection of the arterial, is taken as the common cycle; at the remaining intersections, the common cycle is reallocated to each phase according to the traffic flow ratio.
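The common-cycle rule described above can be illustrated with a minimal sketch. The function names, the lost-time parameter, and the example numbers are hypothetical and chosen only for illustration; the paper does not give its implementation.

```python
def common_cycle(cycles):
    """Common cycle for green wave control: the largest intersection cycle (s)."""
    return max(cycles)


def allocate_green(cycle, flow_ratios, lost_time=0.0):
    """Split the effective green of one cycle among phases in proportion
    to their traffic flow ratios. lost_time is an assumed total lost time (s)."""
    effective_green = cycle - lost_time
    total = sum(flow_ratios)
    return [effective_green * y / total for y in flow_ratios]


# Hypothetical arterial with three intersections (cycle lengths in seconds).
cycle = common_cycle([90, 120, 100])
greens = allocate_green(cycle, flow_ratios=[0.30, 0.20, 0.10], lost_time=12)
print(cycle, [round(g, 1) for g in greens])
```

A proportional split is only one possible reading of "according to the traffic flow ratio"; minimum green times and pedestrian constraints are deliberately left out of the sketch.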
According to different evaluation indexes, the optimal cycle is obtained by using a model-based algorithm. To evaluate traffic efficiency at intersections, traffic capacity, saturation, service level, travel time, number of stops, and queue length are commonly used both domestically and internationally. Delay is mainly the travel time lost to traffic friction and traffic control. It is closely related to the cycle length, green split, and saturation. It is an important indicator for evaluating the traffic service level and operational efficiency of signalized intersections and includes queue delay, stopped delay, control delay, and approach delay.
The phase offset is also called the time offset or the green time offset. It includes the absolute phase offset and the relative phase offset. The absolute phase offset is the offset between the start or end of the green (or red) signal in the coordinated direction of the arterial at each intersection and the start or end of the green (or red) signal in the coordinated direction at a reference intersection (generally a key intersection). The relative phase offset is the time offset between the starts or ends of the green (or red) signal in the coordinated direction of the arterial at adjacent intersections. The relative phase offset equals the difference between the absolute phase offsets of the two intersections and is determined by the actual vehicle speed.
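The relation between relative and absolute offsets can be sketched in a few lines. This is a minimal illustration under assumptions that are not in the paper: the first intersection is taken as the reference, the relative offset is taken as the link travel time wrapped into one common cycle, and all names and numbers are hypothetical.

```python
def relative_offset(distance_m, speed_mps, cycle_s):
    """Relative offset (s) between adjacent intersections: travel time
    at the actual average speed, wrapped into one common cycle."""
    return (distance_m / speed_mps) % cycle_s


def absolute_offsets(distances_m, speeds_mps, cycle_s):
    """Absolute offsets relative to the first (reference) intersection,
    accumulated from the relative offsets of consecutive links."""
    offsets = [0.0]
    for d, v in zip(distances_m, speeds_mps):
        offsets.append((offsets[-1] + relative_offset(d, v, cycle_s)) % cycle_s)
    return offsets


# Hypothetical arterial: two links of 400 m and 600 m, 10 m/s, 120 s cycle.
print(absolute_offsets([400, 600], [10, 10], 120))
```

Before wrapping, the difference between two consecutive absolute offsets reproduces the relative offset of the link between them, which is the relation stated in the text.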
According to the coordination effect between the intersections, the road network is divided into several control subregions, and internal coordination control is implemented in each subregion according to its traffic characteristics. The basic principles of control subregion division are as follows:
The following lines with inconsistent coordination effects should not be included in a subregional coordination:
Bayesian optimization belongs to the family of sequential model-based optimization (SMBO) algorithms. It determines the value of the next (optimal) sample set by analyzing historical observations of a loss function
Calculate the posterior expectation of the loss function
Generate a new set of samples
Repeat the above steps until the preset convergence condition is reached. End the optimization process.
The algorithm will be described in detail below and the process will be summarized.
To calculate the posterior expectation of the loss function
For the prior distribution, we assume that the loss function f can be described by a Gaussian process (GP). A Gaussian process is essentially the generalization of the multivariate Gaussian distribution to a distribution over functions. Therefore, just as a Gaussian distribution is determined by its expectation and variance, a Gaussian process is completely determined by its expectation function and its covariance function.
One of the most widely used acquisition functions is the expected improvement (EI) function. The EI function is defined as
where
where
After the above analysis and introduction, the whole principle and process of Bayesian optimization can be summarized to form a Bayesian optimization algorithm:
Given the observed value
Solve the expected improvement function (EI function) to find the new best sample set:
Calculate the value of the loss function at
Repeat the above steps until the preset number of repetitions (i.e., the number of iterations) is reached or the convergence condition is met.
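The loop summarized above can be sketched in pure Python as follows. This is a minimal illustration, not the authors' implementation: the squared-exponential kernel, its length scale, the centering of observations, the fixed candidate grid, the closed-form EI for a minimization problem, and the toy loss are all assumptions made here. In the paper the loss would come from simulated travel time rather than from a formula.

```python
import math


def rbf(a, b, length):
    """Squared-exponential covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, length=10.0, noise=1e-4):
    """Posterior mean and standard deviation at x_new (observations centered)."""
    mu0 = sum(ys) / len(ys)
    K = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new, length) for a in xs]
    alpha = solve(K, [y - mu0 for y in ys])
    beta = solve(K, k_star)
    mean = mu0 + sum(k * a for k, a in zip(k_star, alpha))
    var = 1.0 - sum(k * b for k, b in zip(k_star, beta))
    return mean, math.sqrt(max(var, 1e-12))


def expected_improvement(mean, std, best):
    """EI for minimization: E[max(best - f, 0)] with f ~ N(mean, std^2)."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def bayes_opt(loss, candidates, init, n_iter=10):
    """Repeatedly evaluate the unvisited candidate with the largest EI."""
    xs, ys = list(init), [loss(x) for x in init]
    for _ in range(n_iter):
        best = min(ys)
        scored = [(expected_improvement(*gp_posterior(xs, ys, c), best), c)
                  for c in candidates if c not in xs]
        if not scored:  # every candidate has been evaluated
            break
        _, x_next = max(scored)
        xs.append(x_next)
        ys.append(loss(x_next))
    i = min(range(len(ys)), key=ys.__getitem__)
    return xs[i], ys[i]


def toy_loss(offset):
    """Hypothetical stand-in loss with its minimum at an offset of 40 s."""
    return ((offset - 40.0) / 30.0) ** 2


grid = [float(o) for o in range(0, 121, 5)]
print(bayes_opt(toy_loss, grid, init=[0.0, 120.0], n_iter=10))
```

Restricting the search to a grid keeps the acquisition step trivial; a continuous optimizer over the EI surface would be the usual choice when the parameter space has more than one dimension.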
In (
On the basis that the parameters such as the optimal cycle length are determined, the phase offset of the intersections after deduplication
In the urban transportation system, the traffic flow, vehicle speed, and traffic density are the most intuitive reflections of traffic conditions. They are the three characteristic parameters of traffic flow and the research focus and foundation of traffic flow theory. Among them, the traffic flow refers to the number of vehicles passing through per unit time; the vehicle speed refers to the distance that the vehicle passes per unit time; and the traffic density refers to the number of vehicles on the section per unit length. The traffic flow theory is the basis for the establishment of urban traffic signal control system.
The traffic model uses a discrete-time difference equation or a continuous-time differential equation to describe the dynamic relationship among traffic volume Q, vehicle speed V, and traffic density K; these quantities summarize the state of the traffic network and describe the collective average behavior of a large number of vehicles. In free flow, the interaction between vehicles can be neglected, and the traffic flow increases linearly with the vehicle density. Wide moving jam flow is usually characterized by stop-and-go traffic, that is, a series of jams; within the jammed region, the vehicle density is high and the average speed and flow are low. In synchronized flow, the average speed is significantly lower than in free flow.
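As a small worked illustration of how the three quantities relate, the sketch below uses the standard fundamental relation Q = K × V. The relation is stated here as textbook background rather than quoted from the paper, and the function names, units, and free-flow speed are hypothetical.

```python
def flow(density_veh_per_km, speed_km_per_h):
    """Fundamental relation Q = K * V, in vehicles per hour."""
    return density_veh_per_km * speed_km_per_h


def free_flow(density_veh_per_km, free_speed_km_per_h=60.0):
    """In free flow the speed stays near the free-flow speed, so the flow
    grows linearly with density (assumed constant speed)."""
    return flow(density_veh_per_km, free_speed_km_per_h)


# Doubling the density doubles the flow while traffic remains in free flow.
for k in (5, 10, 20):
    print(k, free_flow(k))
```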
At present, Q-learning algorithm is one of the most frequently used methods in the fields of reinforcement learning, proposed by Watkins in 1989 [
In Q-learning, the solution formula of the mainstream value function is as follows.
According to the formula, at the moment of t, the state of Q-learning is
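For reference, the standard tabular Q-learning update can be sketched as follows. This is the textbook rule rather than a reproduction of the paper's exact formula; the class name, the hyperparameter values, and the state and action labels are hypothetical.

```python
import random
from collections import defaultdict


class QLearner:
    """Tabular Q-learning with epsilon-greedy exploration."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)  # Q[(state, action)], defaults to 0
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def choose(self, state):
        """Pick a random action with probability epsilon, else the greedy one."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Hypothetical use: states are discretized density levels, actions adjust
# the green split, and the reward is the negative of the observed delay.
agent = QLearner(actions=["extend_green", "keep", "shorten_green"])
agent.update("high_density", "extend_green", reward=-12.0,
             next_state="medium_density")
print(agent.q[("high_density", "extend_green")])
```

Using the negative delay as the reward is one common design choice; the source does not say which reward signal the authors used.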
Whether in theoretical research or in engineering practice, road traffic density is an effective indicator of the degree of traffic congestion. Traffic operation on a section is affected by the signal control at the upstream and downstream intersections. The release signal at the upstream intersection directly changes the density of the section, which indirectly affects the capacity and saturation at the stop line and the density of the queue section. This mutual influence is especially noticeable in the supersaturated state. Since the penetration rate of connected vehicles on different sections is unknown, the actual flow on the road cannot be inferred directly from the number of discrete connected vehicles. Even with an expanded sample, accuracy is difficult to guarantee, although the sample can clearly reflect the speed of the overall traffic flow. Therefore, this paper uses traffic density as the core parameter to provide a basis for green split optimization.
First, according to different evaluation indexes, the optimal cycle is obtained by using a model-based algorithm. Arterial coordination control is then set by combining the average travel time of vehicles with Bayesian optimization based on a Gaussian process, a method commonly used to tune machine learning algorithms. The phase offset is optimized using the two-way flow ratio of the upstream and downstream roads and a reasonable setting of the pedestrian crossing phase. Different green wave bandwidths are then set to match the upstream and downstream traffic of the morning rush hour and the tidal phenomenon with uneven travel speeds. Finally, an intelligent algorithm such as Q-learning is used to optimize the green split of each intersection using the key traffic flow parameters at that intersection.
In conclusion, the flow of the traffic signal control strategy based on reinforcement learning is as shown in Figure
The flow chart of the hierarchical control algorithm based on reinforcement learning.
An intelligent algorithm such as Q-learning is used to optimize the green split of a single intersection using the key traffic flow parameters at that intersection. We compared the Q-learning control method adopted in this paper with traditional timing signal control and with the adaptive control method of a second-generation traffic signal control system. The intersection delay is the average delay of all vehicles passing through all lane groups at the intersection in the same cycle. The results are shown in Figure
Comparisons of traffic control methods of single intersections.
The comparison between traditional timing control and the Q-learning-based traffic signal control applied in this paper shows that the Q-learning method performs well. Relative to traditional timing control, the optimization effects of Q-learning-based control reached 31.68%, 30.10%, 37.59%, 38.07%, 40.69%, and 43.89%, respectively. Relative to the existing adaptive traffic control strategy, however, the corresponding optimization effects were -4.21%, -5.28%, 3.14%, 6.23%, 13.11%, and 9.72%. The optimization effects of traffic signal control based on reinforcement learning are therefore inferior under low-flow conditions but better under medium-flow and high-flow conditions.
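The sign of these percentages is easier to read with a small sketch. It assumes, without confirmation from the source, that the optimization effect is the relative reduction in average delay against a baseline; the function name and the delay values are hypothetical and are not the paper's data.

```python
def improvement(baseline_delay, new_delay):
    """Relative reduction in average delay, in percent (positive = better)."""
    return 100.0 * (baseline_delay - new_delay) / baseline_delay


# Hypothetical delays in seconds per vehicle.
print(improvement(50.0, 40.0))  # new method has less delay than the baseline
print(improvement(40.0, 42.0))  # new method has more delay than the baseline
```

Under this reading, a negative value such as those reported for the lowest flows means the Q-learning controller produced more delay than the adaptive baseline.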
Green wave coordinated control requires three conditions: the signal clocks of all intersections must be synchronized; the signal cycles must be the same; and a phase offset must be set between adjacent intersections (the travel time between them, calculated from the actual average speed). Only when all three conditions hold can the validity of the green wave be guaranteed.
Under traditional conditions, discrete connected vehicle trajectory data cannot directly provide the data required for signal timing and intersection channelization schemes, but they do provide more detailed and complete trajectory-level data. The complete physical trajectory of a vehicle reflects not only its driving path on the road network but also how its speed changes over time and space. It is the most comprehensive and complete expression of the traffic flow operating state and contains a wealth of traffic flow information, such as travel speed, queue length, delay, and number of stops, that serves as key parameters for offset optimization.
For the coordinated control problem in the supersaturated state during the morning rush hour, the above research basically relies on models built on strong mathematical assumptions, but the control system deviates from its intended trajectory because such models are inaccurate and because of disturbances from outside the control system. In response to these shortcomings, some scholars have further proposed predictive control, which enables the system to correct trajectory deviations in real time and thus achieve optimal control. However, the optimal control model is still built on a centralized processing idea, and in intersection control applications it focuses on single-point intersections. From the perspective of the structure of the control algorithm, a hierarchical control structure can integrate more of the control personnel’s design ideas, which helps greatly in handling the complex control states of the road network. With the development of intelligent control technology, Bayesian optimization based on Gaussian processes, fuzzy control, reinforcement learning, and neural networks have also been widely used in traffic control. However, these applications share a common weakness: they are only loosely integrated with actual traffic conditions. The computational speed of online control systems remains a major obstacle, which makes practical application more difficult. Therefore, network-level traffic control for rush hours should rest on offline large-scale optimization that combines a traffic model (based on travel time, from which the relative phase offset between adjacent intersections is obtained) with an intelligent algorithm (Bayesian optimization based on Gaussian processes), seeking a system-optimal phase offset timing scheme.
On the other hand, in the traditional arterial coordinated control scheme, the green wave speed, the forward green wave bandwidth, and the reverse green wave bandwidth are the same or almost the same from the starting point to the ending point, and little or no consideration is given to the differences in speed distribution between sections or to the tidal nature of the traffic flow. Therefore, we adopted different green wave speed optimization methods for different road sections, taking the tidal phenomenon into account. In the direction of large traffic flow, based on the calculated green wave bandwidth, the bandwidth of the reverse green wave is appropriately increased to match the traffic demand of rush hour. At the same time, at signal-controlled pedestrian crossings, the vehicle phase is set to cover both the forward and the reverse green waves, which minimizes the probability that vehicles stop at the pedestrian crossing.
As is shown in Figure
Evaluation of the traffic control method of arterial multiple intersections.
The concept of regional signal control can be divided into a narrow sense and a broad sense. In a narrow sense, regional signal control is a signal control method that unifies several intersections with strong correlation and carries out mutual coordination, namely, the so-called regional signal coordination control. In a broad sense, regional signal control refers to the monitoring of all intersections within the region under the management of a command and control center. It is a comprehensive signal control for single isolated intersection, multiple intersections of the arterial and the highly connected intersection group. It can be classified according to control strategy (timed offline control system, adaptive online control system), control mode (scheme selection, solution generation), and control structure (centralized, distributed).
The purpose of vehicle path feature extraction is to obtain the information of the nodes (i.e., intersections and OD points) that each vehicle passes through, so as to be able to calculate other dynamic features of the sections and road networks (such as traffic, average speed, and road network OD matrix). However, the trajectory data does not contain information such as when the vehicle passed through which node, and we only know the coordinate points of the vehicle trajectory. Then, can we extract the vehicle trajectory by using the trajectory coordinate points and the node information?
At first, we tried clustering methods to group the coordinate points by road segment. However, after experimenting with various mainstream clustering methods, we found that clustering cannot solve the problem of tagging coordinate points. How, then, can the vehicle coordinate points be marked with node tags? After further experimentation, we found that vehicle path extraction can be carried out with the inpolygon function that comes with MATLAB. The core idea is as follows:
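The authors' exact procedure is not reproduced here, so the following is only a plausible sketch of polygon-based tagging, written in Python rather than MATLAB. It assumes that each node is represented by a polygon around it, that each trajectory point is tagged with the node whose polygon contains it, and that consecutive repeats are removed; the ray-casting test stands in for inpolygon, and all names and coordinates are hypothetical.

```python
def in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def extract_path(trajectory, node_polygons):
    """Tag each trajectory point with the node whose polygon contains it,
    skipping untagged points and consecutive repeats of the same node."""
    path = []
    for x, y in trajectory:
        for node, polygon in node_polygons.items():
            if in_polygon(x, y, polygon):
                if not path or path[-1] != node:
                    path.append(node)
                break
    return path


# Two hypothetical square node areas and a track that crosses both.
nodes = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(50, 0), (60, 0), (60, 10), (50, 10)],
}
track = [(2, 5), (6, 5), (30, 5), (52, 5), (58, 5)]
print(extract_path(track, nodes))
```

The resulting node sequence is the kind of path feature from which section volumes and an OD matrix could then be counted.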
In this paper, a method for improving signal control schemes is proposed. Subregions are divided according to their different traffic flow characteristics. Based on the three key parameters of cycle length, arterial coordination signal offset, and green split, a set of hierarchical control algorithms based on reinforcement learning is constructed to optimize and improve the current signal timing scheme. First, according to different evaluation indexes, the optimal cycle is obtained by using a model-based algorithm. Arterial coordination control is then set by combining the average travel time of vehicles with Bayesian optimization based on a Gaussian process, a method commonly used to tune machine learning algorithms. The phase offset is optimized using the two-way flow ratio of the upstream and downstream roads and a reasonable setting of the pedestrian crossing phase. Different green wave bandwidths are then set to match the upstream and downstream traffic of the rush hour and the tidal phenomenon with uneven travel speeds. An intelligent algorithm such as Q-learning is used to optimize the green split of each intersection using key traffic flow parameters such as traffic flow, density, and speed at each intersection. Finally, the hierarchical traffic signal control algorithm based on reinforcement learning is combined with traffic engineering knowledge and engineering project experience to fine-tune the phase offset and green split. This addresses the green wave bottleneck points of the arterial and the signal interference caused by right-turning vehicles, and yields the optimal solution.
Four key evaluation indexes are used, including the number of vehicles leaving the road network by the end of the simulation; the total delay time, the total travel time, and the total number of stops have all been reduced to varying degrees, as shown in Figure
Evaluation of the traffic control method of regional road network.
For the different traffic flow states, the optimization effect achieved by the Q-learning signal control method proposed in this paper under high-flow conditions (32.73%) is better than that under medium-flow and low-flow conditions (22.32% and 17.11%, respectively). This also indicates that the traffic control strategy based on reinforcement learning is more suitable for complex traffic environments (medium and high traffic flows, multiple intersections).
Through the above comparisons and analysis, it can be concluded that the traffic signal optimization strategy based on reinforcement learning is not applicable to all traffic environments. At single intersections and on arterials, its control effects are inferior to those of the current adaptive traffic signal control strategy under low-flow conditions. However, the strategy is suitable for complex traffic environments (high flows and multiple intersections), and under high-flow conditions its effects are better than those of current optimization methods both at single intersections and on arterials. In the future, we will focus on the network level, continue to study network traffic signal optimization methods based on reinforcement learning, and compare their effects with those of traditional optimization algorithms.
This paper combines the hierarchical traffic signal control algorithm based on reinforcement learning with traffic engineering knowledge and engineering project experience to fine-tune the phase offset and green split. This addresses the green wave bottleneck points of the arterial and the signal interference caused by right-turning vehicles, and yields the optimal solution.
In terms of the temporal dynamics of traffic control, reinforcement learning has no complex optimization module and can make immediate decisions in response to the uncertainty of time-varying traffic flow, based on traffic flow characteristics observed in real time, which matches actual conditions. This paper therefore focuses on the application of reinforcement learning to traffic control and concludes that the traffic control method based on reinforcement learning is better suited to complex traffic environments (high flows and multiple intersections) but is not applicable to all traffic conditions. Furthermore, unlike single-intersection signal control, integrated control of arterial-level and network-level traffic still requires further analysis of data models and samples, of coordination optimization techniques and multi-agent strategies, and of the interaction between heuristic guidance and higher-level optimization mechanisms such as pure stochastic optimization and hierarchical algorithms.
Regarding data support, the authors obtained the data in the field and simulated them in VISSIM, using C# for secondary development to implement the kernel algorithm.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The authors would like to acknowledge the Intelligent Transportation System Research Center of Tongji University for data support. The research is supported by Project of National Natural Science Foundation of China (Project No. 61773293 and No. 61773288) and Key Project of National Natural Science Foundation of China (Project No. 51238008).