The Study of Reinforcement Learning for Traffic Self-Adaptive Control under Multiagent Markov Game Environment

The urban traffic self-adaptive control problem is dynamic and uncertain, so the states of the traffic environment are hard to observe. An efficient agent that controls a single intersection can be discovered automatically via multiagent reinforcement learning. However, in the majority of previous works on this approach, each agent needed perfectly observed information when interacting with the environment and learned individually, with little efficient coordination. This study casts traffic self-adaptive control as a multiagent Markov game problem. The design employs a traffic signal control agent (TSCA) for each signalized intersection that coordinates with neighboring TSCAs. A mathematical model for the TSCAs' interaction is built on a nonzero-sum Markov game, which is applied to let TSCAs learn how to cooperate. A multiagent Markov game reinforcement learning approach is constructed on the basis of single-agent Q-learning. This method lets each TSCA learn to update its Q-values under joint actions and imperfect information. The convergence of the proposed algorithm is analyzed theoretically. The simulation results show that the proposed method is convergent and effective in a realistic traffic self-adaptive control setting.


Introduction
As car ownership rates and traffic volume have steadily increased over the last decades, existing road infrastructure today is often strained nearly to its limits. Continuous expansion of this infrastructure, however, is not possible or even desirable for spatial, economic, and environmental reasons. It is therefore of paramount importance to optimize the flow of traffic in a given infrastructure. Traffic self-adaptive control of multiple intersections is synergetic and has the potential to significantly alleviate traffic congestion in urban transportation networks, as opposed to the commonly used fixed timing and actuated control systems. Existing traditional traffic adaptive control systems such as TRANSYT, SCOOT, and SCATS, as well as sophisticated dynamic programming approaches [1], do not have a mechanism for learning from feedback on the quality of their model, which may lead to systematic errors.
Several researchers have applied classical control methods such as fuzzy logic [2], neural networks [3], and evolutionary algorithms [4] to traffic self-adaptive control. These methods perform well but cannot adapt to the changing characteristics of traffic flow. Reinforcement learning (RL) [5] is able to learn continually and improve the service over time. Multiagent reinforcement learning is an extension of RL to multiple agents in a stochastic environment. The decentralized traffic control problem is an excellent test for multiagent reinforcement learning because of the inherent dynamics and stochastic nature of the traffic system [6][7][8][9].
There are two shortcomings in the application of multiagent reinforcement learning to the traffic self-adaptive control problem, as discussed below.
(1) Less Efficient Coordination. The majority of the previous studies consider independent learning agents that do not include any explicit mechanism for coordination [6][7][8][9][10][11][12][13][14]. Only a few previous studies consider a coordination mechanism between the learning agents. Kuyer et al. [15] consider an explicit two-level coordination mechanism between the learning agents that extends Wiering [6] using coordination graphs. The max-plus algorithm is used to estimate the optimal joint action by sending locally optimized messages among connected agents. However, the max-plus algorithm is computationally demanding, and therefore the agents report their current best action at any time even if the action found so far may be suboptimal. El-Tantawy and Abdulhai [16] presented multiagent reinforcement learning for an integrated network of adaptive traffic signal controllers (MARLIN-ATSC), which maintains a coordination mechanism (indirect coordination and direct coordination) between agents without compromising the dimensionality of the problem. Indirect coordination is realized by best-response multiagent learning in nonstationary environments, and direct coordination is typically based on communication.
(2) The Assumption of Complete Knowledge. Since the traffic environment changes over time and an intersection cannot fully know other intersections' information, such as traffic arrival rate, vehicle queue, or delays, it is overly idealized to assume that the utility matrix of each agent is public, that is, that perfectly observed information is available when agents interact with the environment. When this assumption fails, agents may select individual actions that are locally optimal but that together result in global inefficiencies. The assumption is therefore not realistic.
It is argued that the use of a model-based RL approach adds unnecessary complexity compared with model-free Q-learning. To overcome the above deficiencies of the previous approaches, this paper develops a multiagent Markov game reinforcement learning method for optimizing traffic self-adaptive control. We define a TSCA for each signalized intersection that coordinates with neighboring agents. A mathematical model for the TSCAs' interaction is built on a nonzero-sum Markov game, which is applied to let TSCAs learn how to cooperate. A multiagent Markov game reinforcement learning approach is constructed on the basis of single-agent Q-learning. This method lets each TSCA learn to update its Q-values under joint actions and imperfect information. The convergence and effectiveness of the improved algorithm are verified.

Structure Model for TSCA
We define a TSCA for each urban signalized intersection, which takes charge of controlling all signal phases. Its main function is to establish a control strategy, implemented by the signal lights, according to the current traffic state of both its own intersection and its neighbors [14]. In this way, the intersection's traffic flow conditions are improved. Figure 1 shows the structure model for the TSCA. The reinforcement signal is a reward function, which will be defined in Section 3.
As Figure 1 shows, the TSCA is mainly composed of a learning module, an action decision making module, a communication module, and a coordination module. The learning module infers whether there are reasonable rules in the real-time observational data. If such rules exist, the learning module executes them and determines the signal control plan. The coordination module analyzes the present traffic state of the TSCA to decide whether it is necessary to send messages to the adjacent TSCAs, and it handles the coordination among TSCAs. The communication module is mainly responsible for communication with the adjacent TSCAs. The action decision making module carries out the reasoning and decision making of the TSCA. In most cases, average delay time is sufficient to determine an intersection's relative traffic performance; that is, a lower average delay time implies a lower ratio of stopped vehicles and a shorter total queue length. As at isolated intersections, vehicle delay at a coordinated intersection is composed of normal delay, random delay, and oversaturation delay. Computing the normal delay requires the vehicle arrival and departure graph. Since every incoming traffic flow of an intersection depends on the green time and release ratio of the upstream intersections, the arrival rate is not a constant but a time-varying function. As for random and oversaturation delay, the random fluctuation of traffic flow within each cycle is far smaller under coordinated control than under isolated control, so the delay value decreases. The intersection's delay time can be described by the transient functional model (1) of [17], which involves the following quantities: the average delay time per vehicle (s/pcu); the signal cycle length (s); the split; the flow ratio; the average length of the oversaturated stopped queue (pcu); the intersection's traffic capacity (pcu/h); the time interval (h); the intersection's degree of saturation; the vehicle arrival rate (pcu/h); and the length of the red time.

Mathematical Model for TSCAs' Interaction Based on Nonzero-Sum Markov Game
Game theory is a natural mathematical tool for studying interaction among decision makers. The interaction between TSCAs can be reduced to a game model [18]. In the dynamic traffic signal control system of multiple intersections, each intersection's signal timing scheme not only directly affects its neighbors' traffic but also indirectly affects the traffic of nonneighboring intersections. Although a TSCA in a traffic network is incapable of observing the conditions of the entire network, it can observe the conditions of the neighboring TSCAs. To avoid the excessive complexity and frequency of interaction that come with an increasing number of controlled intersections, we restrict every TSCA to interact only with its adjacent TSCAs. Since two adjacent TSCAs can form a coalition to achieve both overall and local optimal performance, given the openness of bilateral information, the interaction between two adjacent TSCAs conforms to a two-matrix nonzero-sum cooperative game [19]. The Markov decision process (MDP), which has been widely researched, represents the problem of a single agent in multiple states. By contrast, two-matrix games are used to solve the problem of multiple agents in a single state. A Markov game can be regarded as the combination of an MDP and a two-matrix game, defining a framework of multiple agents and multiple environment states [18][19][20]. Since the traffic environments that a TSCA is confronted with are dynamic, complex, uncertain, and open, an n-player nonzero-sum Markov game is suitable for establishing the interaction model for TSCAs.
An n-player nonzero-sum Markov game can be defined by a tuple $G = \langle N, S, A, P, R \rangle$, where the parameters $N$, $S$, $A$, $P$, and $R$ are explained in detail as follows for traffic self-adaptive control.
$N$ is a finite set of interacting TSCAs. $S = S_1 \times S_2 \times \cdots \times S_i \times \cdots \times S_n$ is a finite set of states. Let $s_i$ be a local state of TSCA $i$; $s = (s_1, s_2, \ldots, s_n)$ is a global state, which is decomposed into local states. $s_i$ is defined by a vector of two components. The first component is the position of the first vehicle approaching TSCA $i$ from each of the west, north, east, and south directions. Since the size of the state space grows rapidly if the exact distance from the vehicle to the TSCA is used as the local state, all lanes connected to the TSCA are divided into a certain number of equal sections. The sections are encoded sequentially, starting from the one nearest to the TSCA, so the section's code can be used to define the local state $s_i$. The second component is the maximum queue length associated with each direction, defined as in [16] in terms of $q_l^k$, the number of queued vehicles in lane $l$ at time $k$. A vehicle is considered to be in a queue if its speed is below a certain speed threshold ($Sp^{thr}$). $q_l^k$ is computed over $V_l^k$, the set of vehicles traveling on lane $l$ at time $k$.
$A$ is the set of joint signal timing actions for multiple TSCAs, where $A_i$ is the finite subset of signal timing actions for TSCA $i$. $a_i \in A_i$ represents the current phase duration, which depends on the intersection's phase and the adjustment of green time for the current green phase. The adjustment of green time is determined by the relation between the section length and the vehicle velocity, for example, {green time plus 1 s, green time plus 2 s, green time minus 1 s, green time minus 2 s, unchanged}. The phase timing signal scheme generically consists of east-west straight and right turn, south-north straight and right turn, east-west left turn, and south-north left turn. $a \in A$ represents a joint signal timing action for the $n$ TSCAs.
$P: S \times A \to [0,1]$ is a state transition function mapping a present state $s$ and a joint signal timing action $a$ to a probability over states. $R = \{R_1, R_2, \ldots, R_i, \ldots, R_n\}$ is a set of reward functions for all TSCAs, where $R_i: S \times A \to \mathbb{R}$ is a reward function for TSCA $i$ mapping state-action tuples to immediate scalar rewards. $R_i(s, a)$ represents the reward of TSCA $i$ when the TSCAs take the joint action $a$ in the state $s$. $R_i(s, a)$ can be expressed as the ratio of the total volume of passing vehicles to the accumulated waiting time.
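The local state described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the one-list-of-lanes-per-direction input layout, and the default speed threshold of 1.0 m/s are hypothetical, while the 20 m section length and 60 m lane length are borrowed from the simulation settings in Section 6.1.

```python
import math


def section_code(distance_m, section_len_m=20.0, lane_len_m=60.0):
    """Return the 1-based code of the section holding the first vehicle.

    Section 1 is the section nearest to the intersection.
    """
    if not 0.0 <= distance_m <= lane_len_m:
        raise ValueError("vehicle is outside the observed lane")
    n_sections = math.ceil(lane_len_m / section_len_m)
    code = int(distance_m // section_len_m) + 1
    # A vehicle exactly at the far end of the lane belongs to the last section.
    return min(code, n_sections)


def queue_length(lane_speeds_mps, speed_threshold_mps):
    """Count vehicles in one lane whose speed is below the queue threshold."""
    return sum(1 for v in lane_speeds_mps if v < speed_threshold_mps)


def local_state(first_vehicle_distances_m, speeds_by_direction,
                speed_threshold_mps=1.0):
    """Build (section codes per direction, max queue length per direction).

    speeds_by_direction holds, for each direction, a list of lanes, and each
    lane is a list of vehicle speeds in m/s (hypothetical input layout).
    """
    codes = tuple(section_code(d) for d in first_vehicle_distances_m)
    queues = tuple(
        max(queue_length(lane, speed_threshold_mps) for lane in lanes)
        for lanes in speeds_by_direction
    )
    return codes, queues


# Usage: first vehicles at 10 m, 30 m, 50 m, and 5 m (west, north, east, south).
distances = (10.0, 30.0, 50.0, 5.0)
speeds = ([[0.0, 0.5, 3.0]], [[2.0]], [[0.2], [0.1, 0.3]], [[4.0]])
codes, queues = local_state(distances, speeds)
```

With these inputs the position component is `(1, 2, 3, 1)`, the same kind of four-element code vector used for each TSCA in Section 6.1.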
Suppose that the traffic flow is random and follows a Poisson distribution. $R_i(s, a)$ then takes one of two forms, depending on whether the green light of TSCA $i$ is on in the south-north direction or in the east-west direction; in both forms, $a_i$ is a signal timing action for TSCA $i$. Let $\pi_i$ be the probability distribution over the action set $A_i$ of TSCA $i$. With each $s \in S$, there is an n-player game $\Gamma = \{(R_1, R_2, \ldots, R_n)\}$. $v_i(s, \pi_1, \pi_2, \ldots, \pi_n)$ is TSCA $i$'s total discounted reward in state $s$ under the joint strategy $(\pi_1, \pi_2, \ldots, \pi_n)$. Suppose that $\Pi = \{\pi_i, i \in N\}$ is a joint strategy in which each TSCA selects action $a_i$ with probability $\pi_i$. We define $\Pi_{-i} = \{\pi_j, j \neq i, i, j \in N\}$; for each given $\Pi_{-i}$, TSCA $i$ chooses a corresponding optimal strategy $\pi_i^* = \arg\max\{\Pi_{-i} \cup \pi_i\}$.
In the nonzero-sum Markov game, a TSCA can gain more reward via cooperation than by acting independently. A Nash equilibrium point is reached when no TSCA can obtain a better policy by deviating on its own.
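To make the equilibrium condition concrete, the following minimal sketch checks a two-matrix (bimatrix) game between two adjacent TSCAs for pure-strategy Nash equilibria. The function name and the payoff numbers are hypothetical and chosen only for illustration; the paper's method works with mixed strategies, which this sketch does not cover.

```python
def pure_nash_equilibria(payoff_1, payoff_2):
    """Return all (row, col) cells where neither player gains by deviating alone.

    payoff_1[r][c] is the reward of the row player (say, TSCA 1) and
    payoff_2[r][c] the reward of the column player (say, TSCA 2).
    """
    n_rows, n_cols = len(payoff_1), len(payoff_1[0])
    equilibria = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Row player cannot improve by switching rows in column c.
            row_best = all(payoff_1[r][c] >= payoff_1[k][c] for k in range(n_rows))
            # Column player cannot improve by switching columns in row r.
            col_best = all(payoff_2[r][c] >= payoff_2[r][k] for k in range(n_cols))
            if row_best and col_best:
                equilibria.append((r, c))
    return equilibria


# Hypothetical rewards for two timing actions per TSCA (rows/columns).
rewards_tsca_1 = [[4, 1],
                  [2, 3]]
rewards_tsca_2 = [[4, 2],
                  [1, 3]]
equilibria = pure_nash_equilibria(rewards_tsca_1, rewards_tsca_2)
```

In this toy game both `(0, 0)` and `(1, 1)` are equilibria, but only `(0, 0)` gives both agents their highest reward, which illustrates why an equilibrium need not be a global optimum.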

Multiagent Markov Game Reinforcement for Traffic Self-Adaptive Control
4.1. Single-Agent Q-Learning. Q-learning, which was presented by Watkins, defines a learning method within a Markov decision process [21]. The basic idea of Q-learning is that we can define a function $Q$ such that
$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s', \pi^*),$$
where $v(s', \pi^*) = \max_{a'} Q^*(s', a')$ is the optimal value of state $s'$. In this definition, $\gamma \in [0, 1)$ is a discount factor used to discount future rewards, and $P(s' \mid s, a)$ is the probability of transiting to state $s'$ after taking action $a$ in state $s$. A solution $\pi^*$ that satisfies (9) is guaranteed to be an optimal policy. $Q^*(s, a)$ is the total discounted reward of taking action $a$ in state $s$ and then following the optimal policy thereafter.
If $Q^*(s, a)$ is given, then the optimal policy $\pi^*$ can be found by simply identifying the action that maximizes $Q^*(s, a)$ in the state $s$. The problem is then reduced to finding the function $Q^*(s, a)$ instead of searching for the optimal value $v(s, \pi^*)$.
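Q-learning provides us with a simple updating procedure, in which the TSCA starts with arbitrary initial values of $Q(s, a)$ for all $s \in S$, $a \in A$ and updates the Q-values as follows:
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left[ r_t + \gamma \max_{a} Q_t(s_{t+1}, a) \right],$$
where $\alpha_t \in [0, 1]$ is the learning rate sequence.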
Watkins and Dayan proved that sequence (8) converges to $Q^*(s, a)$ under the assumption that all states and actions are visited infinitely often and the learning rate satisfies certain constraints [21].
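The single-agent update can be sketched as a tabular routine. This is a generic Q-learning sketch rather than the authors' implementation; the function name, the dictionary-based table, the state labels, and the values `alpha=0.5` and `gamma=0.9` are assumptions picked to keep the arithmetic easy to follow.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning update and return the new Q-value.

    q maps (state, action) pairs to values; unseen pairs default to 0.0.
    """
    # Greedy estimate of the value of the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * target
    return q[(state, action)]


# Usage with hypothetical green-time actions (seconds) and abstract states.
actions = (30, 50)
q = defaultdict(float)
first = q_update(q, "s0", 30, reward=1.0, next_state="s1",
                 actions=actions, alpha=0.5, gamma=0.9)
second = q_update(q, "s0", 30, reward=1.0, next_state="s1",
                  actions=actions, alpha=0.5, gamma=0.9)
```

Starting from a zero table, the first update moves the value halfway to the target of 1.0, and the second moves it halfway again, giving 0.5 and then 0.75.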
Even though the single-agent Q-learning approach is proven to converge to the optimal joint policy, each TSCA has to keep a set of tables whose size is exponential in the number of agents: $|S_1| \times \cdots \times |S_n| \times |A_1| \times \cdots \times |A_n|$. In addition to the dimensionality issue, the method requires each TSCA to observe the state of the whole system, which is infeasible in the case of transportation networks. Single-agent Q-learning treats other TSCAs as part of the environment and updates future rewards based merely on the TSCA's own maximum payoff, regardless of other TSCAs' actions. In the multiagent system (MAS), we adopt a multiagent Markov game reinforcement method in which each TSCA updates its $Q(s, a)$ according to the immediate reward obtained by interacting with other TSCAs and by observing the actions taken by all other TSCAs and the others' rewards.

Multiagent Markov Game
TSCA $i$ updates its own Q-values under the joint actions. Information about other TSCAs' Q-values is not given, so TSCA $i$ must learn about them too. TSCA $i$ updates its beliefs about TSCA $j$'s Q-function according to the same rule (12) that it applies to its own. Note that $(\pi_1, \ldots, \pi_i, \ldots, \pi_n)$ is a joint mixed strategy at a Nash equilibrium point. The quantities observed at time $t$ and the new state $s_{t+1}$ constitute TSCA $i$'s observed information. In order to update its Q-values, TSCA $i$ would need prior knowledge of the other TSCAs' strategy values $\pi_j$ ($j \neq i$). That is to say, we would have to solve the set composed of (13) for $j = 1, \ldots, i-1, i+1, \ldots, n$. This is an n-order nonlinear problem that has no practical solution. In addition, since the TSCA's observed information is imperfect, and in order to avoid the curse of dimensionality, we use probability statistics and the Bayes method to estimate beliefs about other TSCAs' mixed strategies. Under such a coordination mechanism, the TSCAs can reach a unique equilibrium.
Let TSCA $i$ conjecture that TSCA $j$ takes the action $a_j$ with probability $\pi_j(a_j)$. The probability $\pi_j(a_j)$ can be calculated via the Boltzmann formula, where $T$ is a temperature parameter that reflects the degree of exploration and decreases with time.
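A Boltzmann (softmax) distribution over a neighbor's actions can be sketched as follows. This is the generic textbook form and is an assumption here: the exact arguments of the Q-function in the paper's formula are not recoverable from the text, and the function name and Q-values below are hypothetical.

```python
import math


def boltzmann_probabilities(q_values, temperature):
    """Map action -> Q-value estimates to action -> selection probability.

    A higher temperature gives a flatter (more exploratory) distribution;
    a lower temperature concentrates probability on the best action.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / temperature)
               for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical Q-value estimates for a neighbor's two green-time actions.
estimates = {30: 1.0, 50: 2.0}
warm = boltzmann_probabilities(estimates, temperature=10.0)
cold = boltzmann_probabilities(estimates, temperature=0.1)
```

As the temperature decreases, the conjectured probability of the higher-valued action (50 here) approaches 1, which matches the role of a temperature that decreases with time.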
Each TSCA takes its own action in state $s_t$; TSCA $i$ then observes the actions taken by the other TSCAs and the new state $s_{t+1}$, and after that it updates its belief about the other TSCAs' actions. The belief about TSCA $j$'s actions is computed with the Bayes formula. According to the analysis above, the multiagent Markov game reinforcement algorithm is summarized as follows, taking TSCA $i$ as an example.
(i) If the Nash equilibrium does not reach a global optimum, then the TSCA which takes the policy of Nash equilibrium will get more payoffs when the other TSCAs' policies deviate from Nash equilibrium: (ii) If the Nash equilibrium reaches a global optimum, then Assumption 2. Learning sequence {  } satisfies the following: , and the latter two hold uniformly and with probability 1, where  is constant, (ii) if (,   ,  − ) ̸ = (  ,    ,  −  ), then   (,   ,  − ) = 0, where  ∈ .Under Assumption (i), given a finite MDP, the Q-learning algorithm proposed by [21] converges with probability 1 to the optimal Q-function.The second item in Assumption 2 states that the agent updates only the Q-function element corresponding to current state   and actions  1  , . . .,    .

Simulation
6.1. Analysis of the Method's Convergence. We consider the traffic network shown in Figure 2 as the scenario used to test the proposed approach described in Section 4. Paramics, a microscopic traffic simulator, is used to build the testbed network. The multiagent Markov game reinforcement learning algorithm is written in Matlab as a stand-alone application. The interaction between the multiagent Markov game reinforcement learning algorithm and the Paramics environment is implemented through the application programming interface (API) functions in Paramics. The intensity of the Poisson flow observed entering the traffic network obeys a uniform distribution on the interval [5, 18]. The traffic network includes 5 intersections, in which each direction has four lanes. Each intersection sets two phases. The length of the lanes is 60 m. The average velocity of the traffic flow is 2.5 m/s. Let  = 0.1,  = 0.98,  = 0.25, and  = 40. Since some general constraints posed by safety rules should be respected when designing the signal plan, we let the minimum and maximum green time for each phase be 20 s and 90 s, respectively. The subsets of actions for TSCA 1, TSCA 2, TSCA 3, TSCA 4, and TSCA 5 are given by $A_1 = \{15, 30, 40\}$, $A_2 = \{30, 50\}$, $A_3 = \{35, 45\}$, $A_4 = \{25, 50\}$, and $A_5 = \{30, 35, 40, 55\}$, respectively. Each lane is equally divided into segments of 20 m. As a result, the values of the joint state and joint action are set as $s = \langle \langle 1, 2, 3, 1 \rangle, \langle 2, 3, 1, 2 \rangle, \langle 2, 1, 2, 1 \rangle, \langle 1, 1, 3, 2 \rangle, \langle 1, 2, 3, 2 \rangle \rangle$ and $a = \langle 30, 30, 45, 25, 35 \rangle$, respectively. Figure 3 shows how the Q-values of TSCA 1, TSCA 2, and TSCA 5 vary with learning time under the state $s$ and the joint action $a$ given above.
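As a rough check on the size of this setup, the joint action space implied by the five action subsets can be enumerated directly. The sketch below is illustrative only (the experiments themselves use Matlab and Paramics), and the variable names are hypothetical; the numbers are taken from the paragraph above.

```python
from itertools import product

# Action subsets (candidate green times in seconds) for TSCA 1 to TSCA 5.
ACTION_SETS = {
    1: (15, 30, 40),
    2: (30, 50),
    3: (35, 45),
    4: (25, 50),
    5: (30, 35, 40, 55),
}

# Every joint action is one choice per TSCA, ordered by TSCA index.
joint_actions = list(product(*(ACTION_SETS[i] for i in sorted(ACTION_SETS))))
n_joint_actions = len(joint_actions)

# Each 60 m lane split into 20 m segments yields this many position codes.
sections_per_lane = 60 // 20

# The joint action studied in the convergence analysis.
studied_action = (30, 30, 45, 25, 35)
is_valid = studied_action in joint_actions
```

The product 3 × 2 × 2 × 2 × 4 gives 96 joint actions for one global state, which hints at how quickly a table over joint states and joint actions grows with the number of intersections.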
As can be seen from Figure 3, the multiagent Markov game reinforcement learning approach presented in this paper is convergent; that is, it can reach a Nash equilibrium. Under general conditions, an urban traffic network has a relatively stable flow of vehicles, so the time needed to solve such a problem is acceptable. In the traffic network shown in Figure 2, suppose that at some time TSCA 5 is about to choose a red light timing action in the north-south direction, and that the local states of TSCAs 1, 2, 3, and 4 are $s_1 = s_2 = s_3 = s_4 = \langle 1, 1, 1, 1 \rangle$. The finally learned Q-values of TSCA 5 are then shown in Table 1.

Analysis of the Method's Effectiveness. We use the average delay time per vehicle, which was defined in (1) of Section 2, as the performance index of each method.

In Comparison with LQF.
Local traffic consists of vehicles that cross a single intersection and then exit the network, thereby interacting with just one learning agent. According to [24], when the saturation is greater than 0.90, the intersection's level of service is unacceptable. Accordingly, a highly saturated condition in this paper refers to one in which the saturation greatly exceeds 0.9, so that the road is congested and the service level is relatively poor. In this section, we compare the novel approach described in Section 4.2 with the LQF traffic signal scheduling algorithm proposed in [25] for an isolated intersection. The LQF algorithm was designed for a signal control problem using concepts drawn from the field of packet switching in computer networks. It uses a maximal weight matching algorithm to minimize the queue sizes at each approach, yielding significantly lower average vehicle delay through the intersection. The primary limitation of LQF is that every agent considers only its own local traffic volume and thus controls its traffic signals in isolation. Consequently, agents may select individual actions that are locally optimal but that together result in global inefficiencies. Therefore, we focus our experiments on comparisons between the novel method and LQF in highly saturated conditions.
In particular, we again consider the scenario shown in Figure 2. In this traffic network all routes contain at least two intersections, and destinations are selected uniformly, thereby eliminating local traffic. The scenario is challenging and realistic, as it requires the methods to cope with an abundance of nonlocal traffic. The experiment is designed to test the hypothesis that, under highly saturated conditions, coordinated learning is beneficial when the amount of local traffic is small. If this hypothesis is correct, coordinated learning with multiagent Markov game reinforcement learning should substantially outperform LQF when most vehicles pass through multiple intersections. We use the average delay time per vehicle, which is defined in (1) of Section 2, as the performance index of each method.
The results are averaged over 10 independent runs. Figure 4 shows the results from the nonuniform destinations and nonlocal traffic scenario. As can be seen from the figure, multiagent Markov game reinforcement learning substantially outperforms the noncoordinated method. This result is not surprising, since the lack of uniform destinations and local traffic creates a clear incentive for the TSCAs to learn to coordinate their actions. This approach allows the TSCAs to learn different state transition probabilities and value functions when the outbound lanes are congested. For example, the lane from intersection 1 to intersection 5 is likely to become saturated, as all traffic from edge nodes connected to intersection 1 must travel through it. When such saturation occurs, it is important for the two TSCAs to learn to coordinate, since allowing incoming traffic to cross intersection 1 is pointless unless intersection 5 allows that same traffic to cross in a "green wave". The cost of including such congestion information is a larger state space and potentially slower learning. The simulation results thus show that multiagent Markov game reinforcement learning is effective.

In Comparison with Fixed Timing Control Approach and Independent Reinforcement Learning Approach. In this section, we compare the novel approach described in Section 4.2 with fixed timing control and independent reinforcement learning. We consider the scenario shown in Figure 2 again. The vehicle arrival rates in the north-south and east-west directions are specified separately. The yellow light time is 4 s, and the TSCA schedules the traffic signal every two seconds. The learning time is 5400 s. During independent reinforcement learning, a TSCA does not consider other TSCAs' actions and states; the reinforcement signals from the belief allocation modules are associated only with its own states and actions, and the goal is to maximize the local rewards. In fixed timing control, the signal cycle length is 120 s, and the effective green time in each of the north-south and east-west directions is 56 s. Figure 5 shows the results. In low traffic flow, the differences in effectiveness among the three control methods are not particularly evident. As the traffic flow increases gradually, the differences in performance among the three methods become more and more apparent. Since the independent Q-learning process does not take other TSCAs' states and actions into consideration, it cannot easily achieve a global optimum, which may lower the performance of the system. When the vehicle arrival rate is more than 1000 vehicles/h, independent Q-learning leads to heavy congestion. In our method, by contrast, each TSCA must consider the influence of other TSCAs' states and actions, so the results have a certain degree of global optimality.
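As a quick consistency check on the fixed timing plan, and assuming (this is not stated explicitly in the text) that each of the two phases consists of its effective green time followed by the yellow interval, the phase times add up to the stated cycle length. The symbols $C$, $g$, and $y$ below are introduced here only for this check.

```latex
\[
C = 2 \times (g + y) = 2 \times (56\,\mathrm{s} + 4\,\mathrm{s}) = 120\,\mathrm{s}
\]
```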
Next, we analyze why the multiagent Markov game reinforcement learning approach outperforms the other two approaches. The fixed timing control method cannot adapt to changes in the traffic environment. In the independent Q-learning algorithm, each TSCA learns and decides at the local level (i.e., using its local state and local action) by using (10). The multiagent Markov game reinforcement learning method biases action selection toward actions that are likely to result in good rewards. The likelihood of good values is evaluated using models of the other agents, which the learner estimates by observing their past behavior. The efficiency of the multiagent Markov game reinforcement learning approach is more pronounced under traffic fluctuations, which confirms the adaptability of the approach, as the highly saturated condition triggers the TSCAs to coordinate their actions.

Conclusion
Previous work on urban traffic control used multiagent reinforcement learning, but the TSCAs selected only locally optimal actions without coordinating their behavior and needed perfectly observed information when interacting with the environment. In this paper, a multiagent Markov game reinforcement learning approach based on an n-player nonzero-sum Markov game is designed for optimizing urban traffic, building on the analysis of the TSCA's structure model and single-agent Q-learning. Theoretical analysis and experimental results show that the proposed method is convergent and effective. Multiagent Markov game reinforcement learning substantially outperforms the noncoordinated methods, namely LQF, fixed timing control, and independent reinforcement learning, especially under highly saturated conditions when the amount of local traffic is small. The results indicate that the novel method can provide distributed control as needed for scheduling multiple intersections.
In future work, it would be interesting to incorporate the effects of driver behavior and transit signal priority into our framework. Moreover, the basic five-intersection network considered here will be expanded to larger traffic networks with more extensive collaboration among TSCAs.

Figure 1: The structure model of TSCA.

Figure 2: Traffic network used in the simulations.

Figure 5: Comparison of the results with fixed timing control and independent reinforcement learning.
TSCA  and   ,   ,   , and   represent the time to reach TSCA  for the first vehicle incoming but not passing TSCA  from the directions of west, north, east, and south, respectively, beginning with the alternation of red and green light.If   <  ( represents   ,   ,   , or   ), we set   − = 0.   ,   ,   , and   represent the intensity for Poisson flow of vehicles incoming TSCA  from the directions of west, north, east, and south, respectively. depends on the state and timing signal action of TSCAs adjacent to TSCA .It can generalize from the average value of statistical historical data. is a positive constant which represents the degree of punishment.
( +1 |   ,   ) is the probability of transiting to state  +1 after TSCA  and TSCA  take joint actions in state   and ( +1 |   ) is the probability of transiting to state  +1 after TSCA  takes its own action independently in state   .The probability of ( +1 |   ) and ( +1 |   ,   ) can be acquired from environment knowledge.(  ) represents the probability in which TSCA  will take the action   .  (  ) can similarly replace the estimate of   (  ).

Table 1: The learned Q-values of TSCA 5 in the specified state.