Adaptive Traffic Signal Control Model on Intersections Based on Deep Reinforcement Learning

Controlling traffic signals to alleviate increasing traffic pressure is a concept that has received public attention for a long time. However, existing systems and methodologies for controlling traffic signals are insufficient for addressing the problem. To this end, we build a truly adaptive traffic signal control model in a traffic microsimulator, i.e., "Simulation of Urban Mobility" (SUMO), using the technology of modern deep reinforcement learning. The model is based on a deep Q-network algorithm that precisely represents the elements associated with the problem: agents, environments, and actions. The real-time state of traffic, including the number of vehicles and the average speed, at one or more intersections is used as input to the model. To reduce the average waiting time, the agents provide an optimal traffic signal phase and duration to be implemented in both single-intersection and multi-intersection cases. Cooperation between agents enables the model to improve overall performance in a large road network. By testing with data sets pertaining to three different traffic conditions, we show that the proposed model outperforms other methods (e.g., the Q-learning method, the longest queue first method, and the Webster fixed timing control method) in all cases. The proposed model reduces both the average waiting time and travel time, and it becomes more advantageous as the traffic environment becomes more complex.


Introduction
People's living standards have risen all over the world, leading to an increase in private vehicle ownership. While private vehicles have improved people's traveling experience, they have also contributed to traffic congestion, particularly in urban areas. According to data released by China's Ministry of Communications, economic losses caused by static traffic problems account for 20% of the disposable income of urban residents, equivalent to a 5%-8% GDP loss. The residents of 15 large cities in China spend 2.88 billion minutes more than the residents of developed European countries to get to work. Further, indirect losses (such as those associated with traffic accidents, social security, and environmental pollution) incurred as a consequence of traffic delays are even more difficult to quantify.
Two types of solutions are commonly employed to address the problems of traffic congestion, travel delays, and vehicle emissions. The first type of solution involves increasing capacity by expanding roads, which can be quite expensive and is too static to address rapid changes in traffic conditions. The second and more reliable type of solution involves increasing the efficiency of the existing road structure. As an important part of the road network, traffic signal control is one of the most essential measures for improving operational efficiency and traffic safety at intersections [1]. With the rise of connected and automated vehicles (CAVs), many researchers believe that they may introduce great opportunities for reforming conventional traffic signal operation, e.g., multivehicle cooperative driving around nonsignalized intersections [2]. However, we believe that traffic signal control will remain critical in the near future, as CAVs and traditional vehicles will coexist in a mixed environment for a long time [3]. Many traffic networks worldwide still use fixed signal timings, i.e., they periodically change the signal in a round-robin manner. Although such a strategy is easy to implement, it does not consider the actual traffic conditions and may result in more congestion. Thus, it is vital to control the traffic signal intelligently and dynamically.
In industrial circles, most existing systems that optimize the specific settings of a traffic controller are based on complex mathematical models. The well-known Split Cycle Offset Optimization Technique (SCOOT, England) [4] and the Sydney Coordinated Adaptive Traffic System (SCATS, Australia) [5] are examples of such systems that have improved traffic conditions in many countries. However, they handle emergent traffic conditions inefficiently, owing to a lack of real-time adaptability and flexibility [6], especially when undesirable human interventions such as accidents or important events occur. Even systems that solve dynamic optimization problems in real time, such as the Real-Time Hierarchical Optimizing Distributed Effective System (RHODES) [7], suffer from exponential complexity that prevents them from being deployed on a large scale [8]. Longest queue first (LQF) has been shown to be a robust adaptive algorithm that gives the green signal to the direction with the highest number of cars [9]. However, LQF may be unfair to vehicles waiting in a short queue that cannot accumulate enough length to be scheduled [10].
With the rapid development of artificial intelligence and computer technology, reinforcement learning (RL) has been widely used in academic circles as a method for achieving traffic signal control. Trial-and-error search and delayed reward are the two most important distinguishing features that make RL suitable for traffic signal control. RL can precisely represent the elements associated with the problem: agent (traffic signal controller), environment (state of traffic), and actions (traffic signals) [11].
Balaji et al. [12] proposed a Q-learning based traffic signal control model for optimizing green timing in an urban arterial road network to reduce the total travel time and delay experienced by vehicles. Owing to the discrete and limited action space of traffic signal control, Q-learning has become the most common RL algorithm in this area. Related works [13][14][15][16][17][18][19] using this algorithm all achieve satisfactory results. However, with an increase in the complexity of the environment, a computer may run out of memory; further, searching for a certain state in a large Q-table is time-consuming. Fortunately, in machine learning, neural networks are effective for overcoming these drawbacks. Wan and Hwang [20] applied a deep Q-network (DQN) to 8-phase traffic signal control and efficiently reduced the average system total delay. A DQN algorithm is a type of RL that combines the benefits of Q-learning and neural networks. Previous studies [11,[21][22][23][24][25][26][27][28] achieved good results when applying DQN methods using continuous state representations.
With regard to state space, action space, and reward function, choices vary. In general terms, the definitions and representations of state space in existing papers (e.g., total number of queued vehicles [12, 19-21, 27, 29], length of queued vehicles [12], speed of vehicles [11,18,23,27], or traffic flow [15,30]) can be modified to convey more effective information about the environment, leading to more accurate judgments about the actions. The action space has been defined as all available signal phases [11,18,20,27,30,31], or, alternatively, defined so as to maintain a sequence [22]. As for the reward function, most studies choose a reduction in vehicle travel time [11,22,23], vehicle queue length [13,15], or queuing time delay [11,19,20,26,28,30]. Others use the increase in throughput as the reward [18], or the difference in queue length between directions [24,25,27].
However, current research has common problems and still requires improvement. First, most of these works [11,13,15,[22][23][24][25] focus on improvements at a single intersection and will not be satisfactory when used in real-life situations. They may not result in an overall performance improvement, as such a policy focuses only on a small range and can cause congestion on upstream and downstream roads. Second, even the studies that consider a larger network [13,14,18,19] all use relatively static synthetic data. Traffic conditions often show cyclical changes, and the flow rate also varies across directions in different time periods. The synthetic data used in most research studies until now are assumed to be uniformly distributed, implying that the flow rates in all directions are equal. Even if an agent performs well under such traffic conditions, it cannot handle more complex environments, e.g., congestion in the north-south lanes with no vehicles in other directions. Third, the action options are not set properly. A traffic signal usually changes in a round-robin manner, which is set with respect to the principles of transportation, as well as in line with people's habits and a guarantee of fairness. However, most previous research studies [11,18,30,31] randomly choose one phase in each step, regardless of the sequence. This hopping phase design can be confusing for drivers, who cannot prepare for the next phase in advance. Moreover, as the agent always chooses the optimal action, a loss of fairness may occur: a lane with a minimum number of vehicles may never see a green light. In addition, the traffic signal phase setting in some previous works [19,[22][23][24][25][26], consisting of only two phases, is too simple to represent a real road environment. Fourth, the interval of decision-making is not realistic.
For example, studies such as [20,27] choose an optimal action every second, which may lead to chaos and even accidents, because very few drivers can react to such rapid changes. Other works choose a fixed time interval (e.g., 8 s or 15 s) without verifying its reasonableness. Different traffic conditions may need different intervals, and an interval that is either too long or too short can degrade model performance.
This study proposes a truly adaptive traffic signal control agent, using DQN technology in the traffic microsimulator "Simulation of Urban Mobility" (SUMO). The function of the agent is defined as follows: given the state of traffic at one or more intersections, the agents provide an optimal traffic signal phase and duration to be implemented. Based on the above analysis of previous studies, our approach offers several important contributions: (1) A multiagent model that controls a large road network: both the single-agent case and the multiagent case are demonstrated in this work. In particular, four agents representing four adjacent intersections are trained at the same time, so as to achieve collaborative work and maximize the efficiency of the entire network. (2) Global state and information sharing between agents: in the multi-intersection case, each agent can not only observe the global traffic situation but also obtain the current actions of the other agents. This is used to achieve cooperation between the agents. (3) Action options that match the actual situation: the traffic signal in our approach contains four complete phases, while the action space contains only two options: change to the next phase or maintain the current phase. The agent must change to the next phase if the current phase has been maintained for three rounds. The action options in our approach match actual situations, and habituation and fairness are simultaneously guaranteed. Three different traffic conditions are tested in simulation, covering uniform and nonuniform distributions, sudden changes in traffic direction, and even more complex environments. The rest of this paper is organized as follows. Section 2 describes related knowledge on the RL and DQN methods. Section 3 defines the general framework of the system, including the agent, state space, action space, and rewards. The experimental results are presented in Section 4, and Section 5 provides concluding statements on our work.

Introduction to Reinforcement Learning and Deep Q-Network (DQN)
Inspired by behaviourist psychology, RL is concerned with how software agents should take actions in an environment so as to maximize expected benefits [31]. Unlike most machine learning methods, learners in RL are not told which action to take; instead, they must discover which behaviours produce the highest return by trying them [32]. In the most interesting and challenging cases, actions affect not only the immediate reward but also the subsequent situation and all subsequent rewards. The framework of RL is shown in Figure 1. An agent is composed of three modules: a state sensor I, a learning machine L, and an action selector P. State sensor I maps an environmental state s to an agent-internal perception i; action selector P selects an action a to act on the environment W according to the current strategy; learning machine L updates the agent's strategy based on the reward value r and the internal perception i; and, finally, environment W transitions to a new state s′ under action a. The basic principle of RL is that if a certain action of the agent leads to a positive environmental reward (a strengthening signal), then the tendency of the agent to produce this action is strengthened. Conversely, if it leads to a negative reward, the tendency of the agent to produce this action is weakened [33]. Q-learning (Watkins and Dayan) [34] is a form of model-free, value-based, off-policy reinforcement learning. It works by learning an action-value function that ultimately gives the expected utility of taking a given action a in a given state s and following the optimal policy thereafter. The policy π is the rule that the agent follows when choosing an action, given the state it is in [35]. Once this action-value function is learned, the optimal strategy can be constructed by selecting the action with the highest value in each state. The core of the algorithm is a simple value iteration update, as shown in equation (1), using a weighted average of the old value and the new information.
The learning rate α (0 ≤ α ≤ 1) determines to what extent newly acquired information overrides old information, whereas the discount factor γ (0 ≤ γ ≤ 1) determines the importance of future rewards [36].
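The body of equation (1) was lost in extraction; the standard Q-learning value iteration consistent with the surrounding description (a weighted average of the old value and the new information, with learning rate α and discount factor γ) is:

```latex
Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \right]
\tag{1}
```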
In Q-learning, a Q-table stores each state and the Q-value of each action in that state. However, as discussed above, maintaining a Q-table is quite expensive when the environment becomes very complex. DQN, which combines the benefits of Q-learning and convolutional neural networks (CNNs), overcomes this problem very well. Receiving states and actions as input, the neural network can analyse and return the Q-value of each action [37], so that there is no need to record Q-values in a table. A CNN is a class of deep, feed-forward artificial neural networks that has been successfully employed to analyse visual imagery. Since the state space in our model includes several large matrices that can be regarded as pictures, CNNs are the best choice: they excel at extracting spatial features from images and can thus fully capture the spatial characteristics around the intersection. A CNN consists of an input and an output layer, multiple convolutional layers, and optional hidden layers such as pooling layers, fully connected layers, and normalization layers. Figure 2 demonstrates how these layers can be combined to build a CNN according to the requirement. Convolutional layers apply a convolution operation to the input and pass the result to the next layer, so as to achieve feature extraction [38]. DQN modifies standard Q-learning in two ways to make it suitable for training large neural networks without diverging. First, we use a technique known as experience replay, in which the agent's experience at each time step, e_t = (s_t, a_t, r_t, s_{t+1}), is stored in a data set D_t = {e_1, ..., e_t} pooled over many episodes (where an episode ends when a terminal state is reached) into a replay memory.
During the inner loop of the algorithm, we apply Q-learning updates, or mini-batch updates, to samples of experience (s, a, r, s′) ~ U(D) drawn uniformly at random from the pool of stored samples. The second modification to Q-learning is aimed at further improving the stability of the neural network: a separate network is used for generating the targets y_j in the Q-learning update. More precisely, after every C updates, we clone the network Q to obtain a target network Q̂ and use Q̂ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable than standard online Q-learning [39].
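The two modifications above can be sketched in a few lines. This is an illustrative outline, not the paper's implementation; the names `ReplayMemory` and `sync_target` are our own:

```python
import random
from collections import deque

class ReplayMemory:
    """Finite-capacity pool of experiences e_t = (s_t, a_t, r_t, s_{t+1});
    when full, old experiences are evicted to make room for new ones."""
    def __init__(self, capacity):
        self.pool = deque(maxlen=capacity)  # oldest entries dropped automatically

    def store(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform random draw (s, a, r, s') ~ U(D)
        return random.sample(list(self.pool), batch_size)

def sync_target(eval_params, target_params):
    """Every C updates, clone Eval_net parameters into Target_net."""
    target_params.update(eval_params)
```

A short usage trace: storing a fourth experience into a capacity-3 memory evicts the first, and sampling returns an i.i.d. mini-batch from the survivors.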

Approach
Our truly adaptive traffic signal control system is divided into three modules: the signal control core algorithm, the interaction and control module, and the simulation module. The flowchart of information transfer between them is shown in Figure 3. First, the interaction and control module feeds the current environment state to the core algorithm. Second, the core algorithm passes the optimal action back according to the ε-greedy strategy. Third, the interaction and control module changes the traffic signal, and the results are passed to the simulation module to be displayed in the SUMO GUI. Fourth, the interaction and control module calculates the rewards and passes them to the core algorithm. Fifth, the core algorithm learns and updates the policy according to the rewards received.

Agent Design.
The three most essential parts of the agent are the state space S, the action space A, and the reward R.
The definitions and representations of the state space are very important, as the accuracy of judgments depends on the effectiveness of the information received about the environment. Thus, the system has very high requirements for the detector. Besides the two most common methods for acquiring traffic data, loop and video detectors, CAVs can be utilized as "mobile detectors" to overcome the limitations of fixed detectors in the near future. CAVs can provide real-time vehicle location, speed, acceleration, and other vehicle information [40]. To take advantage of the CNN, the environment is processed as four pictures in our model: a map of vehicle locations, a map of vehicle speeds, a map of the trained intersection's signal phase, and a map of the remaining signal phases. It is worth noting that the map of the remaining signal phases is specific to the multi-intersection case; it separates the signal of the intersection that the agent controls from the signals of the other intersections. A representation of this process is shown in Figure 4, with triangles representing vehicles traveling on the road and the red line on the far right of Figure 4(a) representing the traffic signal. Vehicles are assumed to have a standard length, and the dotted lines in Figure 4(a) show how the picture is divided into grid cells, each one standard vehicle length long and one lane width wide. Figure 4(b) shows the presence or absence of a vehicle at each location, and the corresponding speeds (m/s) are shown in Figure 4(c). A vehicle spanning two grid cells is assigned to the cell containing its centre point. In Figure 5, the map of the signal phase is processed as follows: traffic lanes with a green signal are set to 1, and those with a red signal are set to 0. To enable information sharing between all agents in the multi-intersection case, the global traffic situation is used to achieve cooperation between the agents.
The four input pictures are processed in the same way; only the size of the picture is larger. These settings ensure that the environment is accurately and sufficiently represented while keeping the state space from becoming too complex.
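The gridding described above can be sketched as follows. This is a minimal illustration under assumed cell dimensions (5 m standard vehicle length, 3.2 m lane width); the function name `encode_state` and the flat coordinate layout are our own simplifications, not the paper's code:

```python
import numpy as np

def encode_state(vehicles, grid_shape, cell_len=5.0, lane_w=3.2):
    """Build the position map (Figure 4(b)) and speed map (Figure 4(c)).
    vehicles: list of (x_centre, y_centre, speed) in metres and m/s.
    Each grid cell is one standard vehicle length long and one lane width wide."""
    pos = np.zeros(grid_shape)
    spd = np.zeros(grid_shape)
    for x, y, v in vehicles:
        # a vehicle spanning two cells is assigned to the cell holding its centre
        i, j = int(y // lane_w), int(x // cell_len)
        if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1]:
            pos[i, j] = 1.0   # presence / absence of a vehicle
            spd[i, j] = v     # corresponding speed in m/s
    return pos, spd
```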

Action Space.
In consideration of people's driving habits, a signal should be changed in a round-robin manner: NSG ⟶ NSLG ⟶ EWG ⟶ EWLG (Figure 6). The action is defined as a = 1: change the signal to the next phase; and a = 0: maintain the current phase. A decision is made every 15 s, and according to the simulation results, the action time interval Δt has a negligible influence on performance as long as it is between 8 and 25 s (discussed in detail in Section 4). No phase is allowed to be maintained for more than three rounds, and a yellow light is shown for 3 s whenever a phase change occurs.
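The rules above amount to a small state machine. A sketch, with the class name and return convention our own (the paper does not publish this code):

```python
PHASES = ["NSG", "NSLG", "EWG", "EWLG"]  # round-robin order

class PhaseController:
    """Applies a = 1 (advance to next phase) or a = 0 (hold current phase),
    forcing an advance once a phase has been held for three rounds."""
    MAX_ROUNDS = 3

    def __init__(self):
        self.phase_idx = 0
        self.rounds_held = 1

    def step(self, action):
        if action == 1 or self.rounds_held >= self.MAX_ROUNDS:
            self.phase_idx = (self.phase_idx + 1) % len(PHASES)
            self.rounds_held = 1
            yellow = True   # a 3 s yellow light precedes every phase change
        else:
            self.rounds_held += 1
            yellow = False
        return PHASES[self.phase_idx], yellow
```

With a decision every 15 s, holding NSG twice and then attempting a third hold still advances the signal to NSLG, which is how fairness is enforced.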

Reward.
In each time step, all of the vehicles in the network are iterated over. As shown in equation (2), if the speed v_i of vehicle i is below 2 m/s, it is regarded as low-speed driving or waiting, and its waiting time W_i is incremented by one. Once its speed reaches 2 m/s, W_i is reset to zero.
The reward is calculated by equation (3) so as to make it inversely related to the average waiting time of each vehicle, which satisfies the target of RL, i.e., maximizing the reward. As Figure 7 shows, the reward r_i decreases faster as W_i increases. When W_i reaches a threshold value W_m, r_i becomes negative, indicating that vehicle i has waited too long and a green signal should be scheduled. The constant c is a parameter that controls the upper bound of r_i. To test the performance more comprehensively, the average travel time (from departure to arrival) and average speed of all vehicles are also output as indicators.
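A sketch of the waiting-time update of equation (2) and the reward of equation (3). The exact reward expression is not recoverable from this text, so the form used here, r_i = c(1 − (W_i/W_m)²), is an assumption chosen only to match Figure 7's description (bounded above by c, decreasing faster as W_i grows, negative beyond W_m); consult the original paper for the true equation:

```python
def update_waiting(W, v, v_min=2.0):
    """Equation (2): increment waiting time while speed < 2 m/s, else reset."""
    return W + 1 if v < v_min else 0

def reward(W_i, W_m=100.0, c=1.0):
    """Hypothetical reward shape consistent with Figure 7: upper bound c,
    accelerating decrease in W_i, crossing zero at the threshold W_m."""
    return c * (1.0 - (W_i / W_m) ** 2)
```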

Signal Control Algorithm Using DQN.
The process of using a DQN for optimal signal control (the signal control core algorithm) is given in Algorithm 1. At each step t, the agent stores the observed experience e_t = (s_t, a_t, r_t, s_{t+1}) in the replay memory pool D. When D, which has finite capacity N, is full, old experiences are replaced by new ones. For decision-making, the agent chooses the action following an ε-greedy strategy. In the initial stage, exploration (randomly exploring the environment) is often better than exploitation (a fixed behavioural model choosing the action with the highest value), so a parameter ε is introduced to control the level of greediness: a random action is taken with probability ε and the optimal action with probability 1 − ε. As training proceeds, the probability of choosing the optimal action gradually increases until the agent acts almost entirely greedily. Before the training process begins, the agent observes without training for n steps until the replay memory reaches a certain size, so as to guarantee a diverse sample of interactions for training. Once the training process begins, the input data set is drawn randomly from the memory pool D. As mentioned in Section 2, the corresponding target y_j in line 21 is generated by a separate Target_net with parameters θ⁻. After collecting training data, the network parameters θ are updated by performing a stochastic gradient descent step, in which the loss function (mean squared error) defined in equation (4) is minimized by the Adam optimization algorithm [41]. Every fixed C steps, the Target_net updates its parameters θ⁻ to θ. In the multiagent case, each agent is trained individually, i.e., each keeps its own neural network parameters.
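The body of equation (4) was lost in extraction; in the standard DQN formulation, with Target_net parameters θ⁻ and experiences drawn uniformly from the replay memory D, the mean squared error loss is:

```latex
L(\theta) = \mathbb{E}_{(s,a,r,s') \sim U(D)}
  \left[ \left( r + \gamma \max_{a'} Q\!\left(s', a'; \theta^{-}\right)
  - Q(s, a; \theta) \right)^{2} \right]
\tag{4}
```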

Network Structure.
As mentioned in Section 2, two separate neural networks are introduced in this model. Target_net is used to predict the Q_target value, and its parameters are not updated at every step. Eval_net is used to predict Q_eval, and it holds the latest neural network parameters.
These two neural networks have completely identical structures but contain different parameters.
Each neural network receives four pictures (301 × 301) as input in the multi-intersection case; after processing them through six layers (four convolutional layers and two fully connected layers), it outputs a list (2 × 1) representing the value of each action. The structure of the entire network, including the processing method in each layer and the picture size before and after each layer, is shown in Figure 8. The network structure of the single-intersection case is not presented here.

Algorithm 1 (excerpt): DQN with experience replay
(16) With probability ε select a random action a_t
(17) Otherwise select a_t = argmax_a Q(s_t, a; θ)
(18) Execute action a_t in SUMO and observe reward r_t and environment state s_{t+1}
(19) Store experience e_t = (s_t, a_t, r_t, s_{t+1}) in D
(20) Sample a random batch of batch_size experiences e_j = (s_j, a_j, r_j, s_{j+1}) from D
(21) Set y_j = r_j if the episode terminates at step j + 1; otherwise y_j = r_j + γ max_{a′} Q̂(s_{j+1}, a′; θ⁻)
(22) Update network parameters θ by performing a gradient descent step on (y_j − Q(s_j, a_j; θ))²
(23) Every C steps reset Q̂ = Q
(24) Set s_t = s_{t+1}
(25) End for
(26) End for
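As a sanity check on the stated dimensions, the spatial size after a square convolution follows ⌊(W − K + 2P)/S⌋ + 1. The kernel sizes and strides below are assumptions for illustration, not the actual values from Figure 8:

```python
def conv_out(w, kernel, stride, pad=0):
    """Output spatial size of a square convolution: floor((W - K + 2P)/S) + 1."""
    return (w - kernel + 2 * pad) // stride + 1

# a hypothetical four-convolution stack reducing the 301 x 301 input
size = 301
for kernel, stride in [(8, 4), (4, 2), (3, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
# the resulting feature map would then be flattened into the two
# fully connected layers, ending in the 2 x 1 action-value output
```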

Experiment and Results
In this section, six simulation tests are performed to show the performance of the system: three different traffic conditions under the single-intersection case and the multi-intersection case, respectively.

Experiment Settings.
SUMO is a free and open traffic simulation suite, available since 2001, that allows intermodal traffic systems, including road vehicles, public transport, and pedestrians, to be modelled [42]. The "Traffic Control Interface" (TraCI) is an interface of SUMO that provides access to a running road traffic simulation, retrieves values of simulated objects, and manipulates their behaviour "on-line". The simulation network environments of the single-intersection case and the multi-intersection case are shown in Figure 9, where the numbers in parentheses are the coordinates of each node in meters. Each intersection is connected to four road segments (Figure 6).

Parameter Settings.
The parameter settings of our method are listed in Table 1.

Data Settings.
As discussed in Section 1, three data sets are designed to cover a variety of traffic environments. The three data sets for the single-intersection case pertain to three different traffic conditions: No. 1, evenly distributed steady traffic; No. 2, a sudden change in traffic direction; and No. 3, unevenly distributed steady traffic. The data sets are shown in Table 2. The data sets for the multi-intersection case are similar to those shown in Table 2.

Single-Intersection Case.
As for the LQF method, it performs smartly in SL No. 1 but fails when a short queue cannot accumulate enough length to be scheduled, as mentioned in Section 1. For example, when the vehicle flow direction suddenly changes in SL No. 2, the small number of vehicles accumulated in the north-south direction never meets a green signal, which also leads to a low reward according to equation (3). Figure 10 shows the episode rewards over 200 training episodes (40,000 steps) under the three simulations. Our model converges within 90 episodes and remains steady afterwards.

Multi-Intersection Case.
As Table 5 shows, the performances differ in the multi-intersection case. Our model is more efficient than the Webster method under all conditions, but the Q-learning model does not show a considerable improvement in this case. The failure of Q-learning is evident: when the state space becomes too complex, the number of rows in the Q-table increases exponentially. For example, over 40,000 steps, the number of rows in the Q-table exceeds 20,000, meaning that the agent spends most of its time randomly selecting actions in previously unseen states rather than exploiting learned values. The base version of the DQN model is always the second-best method, but it still performs poorly compared with our method when the environment becomes more complex. That is because the state space and rewards of our proposed method are global: the agent can learn a policy that attains the best overall performance, rather than only improving the traffic condition at its own intersection. This demonstrates the importance of information sharing and global environment observation, which guarantee overall optimization. As shown in Table 6, our model still achieves the best performance under SL No. 2, where the travel time is reduced by 35.1% and the average speed is increased by 63.7% compared with the Webster method. Figure 11 shows the episode rewards over 200 training episodes (40,000 steps) under the three simulations. Due to the complexity of the environment, the convergence speed is lower than in the single-intersection case; all three simulations converge and perform steadily after 170 episodes. The training time and space usage of our method for the whole 200 episodes are listed in Table 7. Our experiment platform is a personal computer with a Core (TM) M-5Y71 CPU @ 1.20 GHz 1.40 GHz and 8.00 GB of RAM; Python 3.6 and TensorFlow 1.0.0 are used to realize the models.

Influence of Action Time Interval Δt.
The action time interval Δt is another important parameter of the model. It should be kept within reasonable limits, as either too long or too short an interval can degrade model performance. We study the performance of our model using different values of Δt under SL No. 1 in the single-agent case. The result is shown in Figure 12, where 10 values are taken nonequidistantly between 3 s and 40 s. The travel time is satisfactory (lower than 230 s) when Δt is in the range of [8 s, 27 s] and reaches its minimum at 15 s. Values of Δt below 8 s are unrealistic, because very few vehicles can pass through in such a short interval, since people need time to react and start the vehicle. The model also fails if Δt is too long. In practical deployment, the more frequent the decision-making, the higher the operating expenses (e.g., the cost of switching the light and observing the environment). According to the above analysis, Δt is set to 15 s in our system. However, it must be acknowledged that the influence of Δt varies under different traffic conditions; due to time constraints, its influence under the other sets of simulations is not studied here.

Conclusions
In this paper, an intelligent and adaptive traffic signal control model based on a deep RL method is proposed. Using the advantages of DQN, agents learn how to determine an optimal signal phase and duration in reaction to a specific environment, in order to reduce waiting time and travel time and increase vehicle speed. The multiagent model observes the global state and achieves information sharing between agents, so as to improve overall performance in a large road network. Various traffic conditions are considered to make our model suitable for many kinds of scenarios. Simulation results show that our model performs better than three existing popular methods (Q-learning, LQF, and Webster), as well as a base version of the DQN method, under all cases. The more complex the environment, the greater the advantage of our model.
Our study demonstrates the reliability and efficiency of using RL for traffic signal control. With regard to future work, we acknowledge that this project is not perfect and that many aspects can still be improved upon and researched. First, the experiment can be extended to use more complicated real map information. Second, real-world data and even real-world experiments should be used to further validate the performance of our method. Lastly, strengthening communication and cooperation between agents in the multi-intersection case may lead to better overall performance.

Data Availability
All data and program files included in this study are available upon request to the corresponding author.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.