Intelligent Ramp Control for Incident Response Using Dyna-Q Architecture

Reinforcement learning (RL) has shown great potential for motorway ramp control, especially under the congestion caused by incidents. However, existing applications are limited to single-agent tasks and are based on Q-learning, which has inherent drawbacks for dealing with coordinated ramp control problems. To solve these problems, a Dyna-Q based multiagent reinforcement learning (MARL) system named Dyna-MARL has been developed in this paper. Dyna-Q is an extension of Q-learning that combines model-free and model-based methods to obtain the benefits of both. The performance of Dyna-MARL is tested on a simulated motorway segment in the UK with real traffic data collected from AM peak hours. The test results, compared with Isolated RL and noncontrolled situations, show that Dyna-MARL achieves superior performance in improving traffic operation with respect to increasing total throughput and reducing total travel time and CO2 emission. Moreover, with a suitable coordination strategy, Dyna-MARL can maintain a highly equitable motorway system by balancing the travel time of road users from different on-ramps.


Introduction
Traffic congestion occurs when the traffic demand for a road network approaches or exceeds its available road capacity. Even slight losses of the balance between demand and capacity on motorways can lead to long travel delays, high energy consumption, and severe environmental problems. Therefore, how to alleviate traffic congestion and maintain the demand-capacity balance has become one of the main concerns of the transport community. To this end, a number of traffic control devices, such as variable speed limit (VSL), variable message sign (VMS), and ramp control systems, have been developed under the umbrella of intelligent transportation systems (ITS). Among these advanced systems, ramp control (also known as ramp metering) has been widely used and proved to be an effective control method for different kinds of congestion on motorways [1].
Generally, traffic congestion can be classified into two categories: recurrent congestion and nonrecurrent congestion. Recurrent congestion is caused by the daily traffic operation with temporarily increased traffic demand in peak hours [2]. Considering the daily peak traffic on motorways, recurrent congestion is the main concern of many existing ramp control systems. For instance, fixed-time systems (also known as pretimed systems) use historical data collected from daily peak hours to generate control strategies offline and trigger these strategies at fixed times (e.g., morning or evening peak hours) of each day [1]. Local traffic-responsive systems such as the demand-capacity method, ALINEA [3], and its variations [4] can respond to the real-time traffic and keep the outflow or road density of the motorway mainline close to some target value (e.g., road capacity or critical density). Usually, these target values should be defined in advance according to the so-called fundamental diagram, which is derived from the daily traffic data. To deal with network-wide problems, traffic-responsive systems have been extended to coordinated ramp control systems, such as Flow [5], System Wide Adaptive Ramp Metering (SWARM) [6], and Zone algorithms [7]. Similar to local traffic-responsive systems, these coordinated systems also attempt to make the outflow of the motorway mainline approach a predetermined target value, which is usually the road capacity. Another group of systems focuses on formulating different control scenarios as optimisation problems and using optimal control techniques (e.g., model predictive control) to solve them. The purpose of these systems is to maximise or minimise an objective function, not to achieve some predefined target value. Examples of these systems can be found in [8][9][10][11][12], where macroscopic traffic flow models were combined with control systems to formulate optimal control problems.
Although the aforementioned systems have shown their effectiveness in different scenarios, recurrent congestion is still their main focus, and they do not include a component that can deal with nonrecurrent congestion. Unlike recurrent congestion caused by the increased traffic demand in peak hours, nonrecurrent congestion is mainly induced by incidents, and thus, it is usually referred to as incident-induced congestion [2,13]. Traffic incidents are nonrecurrent events such as road accidents, vehicle breakdowns, and unexpected obstacles that may block one or more lanes of the motorway mainline. The temporary lane blockage will interrupt the normal operation of traffic flow and lead to a rapid reduction of road capacity [14]. In this case, fixed-time and simple traffic-responsive systems, which are dependent on the information collected from daily traffic operation or a predefined target value, are not applicable. Therefore, more sophisticated systems that can respond to incidents are required. During the last decades, a series of such ramp control systems have been designed, most of which are based on optimisation techniques. For example, an optimal control structure using a simple macroscopic traffic flow model was proposed in [15] to deal with incident-induced congestion. A more complex system with consideration of dynamic incident duration was developed in [16], which can be solved by the linear programming technique. In the research presented in [17,18], both lane-changing and queuing behaviour during the incident were incorporated into a modelling structure and solved by a stochastic optimal control system. Although these systems are based on different technologies, they all need a model to predict traffic conditions and use these predictions to accomplish the control process.
Model-based methods usually have poor adaptability when a mismatch between simulation models and the real controlled environment emerges [19][20][21]. To overcome this limitation, another optimisation-based method, reinforcement learning (RL), was introduced to the ramp control area. This method is based on the Markov decision process (MDP) and dynamic programming (DP), and it can approximately solve the optimisation problem through continuous learning without any models. The first ramp control system using RL to solve incident-induced problems was developed in [19,22]. The basic RL algorithm named Q-learning was adopted by this system to alleviate traffic congestion caused by incidents. After this work, several Q-learning systems considering both local (e.g., [23,24]) and coordinated (e.g., [25,26]) control problems were proposed. However, Q-learning can only learn from real interactions with the traffic operation and cannot make full use of historical data (or models). Because of this limitation, Q-learning usually has a low learning speed and needs a great number of trials to obtain the best control strategy in some complex scenarios, such as incident-induced congestion [27]. This problem is even worse in coordinated ramp control problems with exponentially increased state and action spaces, which leads to the so-called "curse of dimensionality" [28]. One solution to speed up the learning process and deal with incidents efficiently has been proposed in our previous work [27,29]. This system used the Dyna-Q architecture to combine model-free learning with a model-based method and can be used to accomplish single-agent tasks.
In this paper, the previous single-agent system is extended to a multiagent case that can deal with a network-wide problem with multiple ramp controllers. We refer to this system as Dyna-MARL, which adopts a multiagent RL (MARL) strategy based on the Dyna-Q architecture. The rest of this paper is organised as follows. Section 2 briefly introduces the basic knowledge of RL including single-agent and multiagent cases. The architecture of Dyna-MARL is described in Section 3. After that, Sections 4 and 5 give the detailed description of the models, elements, and related algorithm of Dyna-MARL. The simulation experiments and relevant results are discussed in Section 6. Section 7 finally gives some conclusions and introduces the future work.

Reinforcement Learning
RL is a subclass of machine learning. In the following subsections, two kinds of RL problems, namely, single-agent and multiagent RL, will be briefly introduced.

Single-Agent RL.
The problem of single-agent RL is usually defined as an MDP that can be represented by a tuple $(S, A, P, r)$ [30]. $S$ is the state space used to describe the external environment. $A$ is the control action set containing the executable actions of the agent. $P$ is the state transition probability. For a state pair $(s, s' \in S)$, $P_a(s, s')$ represents the probability of reaching state $s'$ after executing action $a$ at state $s$. $r: S \times A \to \mathbb{R}$ is the reward function, and $r(s, a)$ denotes the immediate reward after taking action $a$ at state $s$. Based on these definitions, the $Q$ value is defined for each state-action pair $(s, a)$ as

$$Q^{\pi}(s, a) = E\left\{ \sum_{k=0}^{K} \gamma^{k}\, r(s_k, a_k) \;\middle|\; s_0 = s,\ a_0 = a,\ \pi \right\}, \tag{1}$$

where $k$ is the time index and $K$ is the number of time steps. $s_k \in S$ and $a_k \in A$ are the environment state and executed control action at time step $k$, respectively. $\gamma \in [0, 1]$ is the discount factor, which indicates the importance of the following predicted rewards. For $\gamma^k$, $k$ is the power. $\pi$ is the policy corresponding to a sequence of actions. The optimal policy can be obtained by maximising the $Q$ value. The most widely used algorithm in the literature for estimating the maximum $Q$ value is Q-learning [31]. By using the updating equation given below, Q-learning can maximise the $Q$ value for each state-action pair:

$$Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha \left[ r(s_k, a_k) + \gamma \max_{a_{k+1}} Q_k(s_{k+1}, a_{k+1}) - Q_k(s_k, a_k) \right], \tag{2}$$

where $Q_{k+1}(s_k, a_k)$ and $Q_k(s_k, a_k)$ are the $Q$ values for state-action pair $(s_k, a_k)$ at the $(k+1)$th step and $k$th step, respectively, and $Q_k(s_{k+1}, a_{k+1})$ is the $Q$ value for the state-action pair $(s_{k+1}, a_{k+1})$ at the $k$th step. $\alpha \in [0, 1]$ is the learning rate. $\alpha$ and $\gamma$ can be regulated according to different problems.
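The Q-learning update described above can be illustrated with a minimal tabular sketch. The function name, the dictionary-based table, and the numeric values of the learning rate and discount factor are illustrative assumptions, not the implementation used in this paper.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    alpha (learning rate) and gamma (discount factor) are illustrative values.
    """
    # Best Q value reachable from the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    # Move the old estimate toward the temporal-difference target.
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical example: two traffic states (0, 1) and three flow-rate actions.
q_table = defaultdict(float)          # unseen pairs start at 0.0
ACTIONS = (4, 6, 8)
new_value = q_learning_update(q_table, state=0, action=4, reward=-1.0,
                              next_state=1, actions=ACTIONS)
```

With an all-zero table, the first update simply moves the estimate a fraction `alpha` of the way toward the observed reward.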

Multiagent Scenarios.
In multiagent scenarios, an MDP for single-agent case can be extended to a stochastic game (SG) or Markov game, in which a group of agents try to obtain some equilibrium solutions through coordination or competition [28].
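In the absence of competition, all agents involved in a game have a common goal to maximise the global $Q$ value, which forms a coordinated MARL problem. In this case, the policy optimisation is determined by the actions executed by all agents.
For solving a coordinated MARL problem, the update equation (2) for Q-learning can be easily extended to represent the global $Q$ value update [28]:

$$Q_{k+1}(s_k, \mathbf{a}_k) = Q_k(s_k, \mathbf{a}_k) + \alpha \left[ r(s_k, \mathbf{a}_k) + \gamma \max_{\mathbf{a}_{k+1}} Q_k(s_{k+1}, \mathbf{a}_{k+1}) - Q_k(s_k, \mathbf{a}_k) \right], \quad \mathbf{a}_k = (a_k^1, \ldots, a_k^n). \tag{3}$$

The only difference from (2) is that $Q$ and $r$ in (3) relate to the $n$ actions $a^1, \ldots, a^n$ executed by the $n$ agents rather than to a single action $a$.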

Solutions for Coordinated MARL.
It can be seen from (3) that as the number of agents grows, the combinations of actions and the resultant computational complexity increase exponentially, which may make the problem unsolvable within a required time limit [28]. Therefore, a commonly used method is to decompose the global $Q$ value into several local $Q$ values, each of which can be maximised by a few relevant agents rather than all agents [32]. Based on this distributed method, several strategies have been proposed. In [28], these strategies fall into three categories: coordination-based, coordination-free, and indirect coordination strategies.
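To make the exponential growth concrete, the short sketch below counts joint actions for a few agent counts. It assumes every agent has the same number of actions; the value 9 matches the size of the flow-rate action set defined later in this paper, and the function name is hypothetical.

```python
def joint_action_count(actions_per_agent: int, num_agents: int) -> int:
    """Number of joint actions when each agent chooses independently."""
    return actions_per_agent ** num_agents


# Joint action counts for one, two, and three ramp controllers with 9 actions each.
counts = [joint_action_count(9, n) for n in (1, 2, 3)]
print(counts)
```

A global Q table needs one entry per state and joint action, so its size grows by the same factor with every added agent, which is the motivation for decomposing the global value into local ones.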
Coordination-based strategies need local $Q$ values to be updated according to the actions executed by all relevant agents (named joint actions) at each time step [28]. The decision making process of each agent is based on the information received from all other related agents with sufficient communication. This will complicate the problem. On the other hand, coordination-free (or independent) strategies, such as the distributed Q-learning algorithm, make each agent update the corresponding local $Q$ values based on its own actions [33]. Therefore, each agent makes its decisions independently without increasing computational complexity. However, this computational efficiency is at the expense of nonguaranteed convergence [32]. Indirect coordination strategies try to find a balance between the above two methods. By applying indirect strategies, each agent can maintain models for its cooperative partners and update local $Q$ values without knowing all the information of other agents at each step [28]. Based on high-quality models, this method can reduce the problem complexity and guarantee convergence with limited coordination.

Dyna-𝑄 Based Indirect Coordination Strategy
Because of the benefits introduced in the above section, the indirect coordination strategy has been applied in [34] for solving urban traffic control problems. In their work, each agent maintains a model for estimating the action selection probability of its neighbours and uses this information to optimise control strategies. In this paper, we extend this method to motorway systems by applying the Dyna-Q architecture.
Under the Dyna-Q architecture, a modified macroscopic flow model named asymmetric cell transmission model (ACTM) and the Q-learning algorithm are combined to deal with coordinated MARL problems. In this section, the application of Dyna-Q will be introduced.

Dyna-Q Architecture.
Dyna-Q architecture is an extension of standard Q-learning that integrates planning, acting, and learning together [30]. Unlike Q-learning, which learns from the real experience without a model, Dyna-Q learns a model and uses this model to guide the agent [35]. After capturing the real experience, two loops run to learn optimal policies that can obtain the maximum $Q$ value in the Dyna-Q architecture (see Figure 1).
In loop I, direct RL is the standard Q-learning process that can be used to interact with the real external environment. Loop II contains two main tasks: (1) model learning is used to improve the model accuracy by obtaining new knowledge from real experience; (2) planning is the same process as direct RL except that it uses the experience generated by a model. Acting is the action execution process.
By applying a model, the agent can predict the reactions of its external environment and other agents before executing a specific action, which provides an opportunity for the agent to update the $Q$ value before receiving the real feedback. Simultaneously, direct RL is running to update the $Q$ value through the real interaction. Therefore, the optimal policy is learned through both real experience and predictions. By using this strategy, Dyna-Q can learn faster than Q-learning in many situations [30].
Figure 2: System architecture.
Although a model is maintained in the Dyna-Q architecture, the whole system is different from model-based control methods such as model predictive control (MPC). The model in the Dyna-Q architecture is a complementary component, which is used to speed up the learning process and simplify the coordination of agents. The optimal control actions are learnt from both real and simulated experience. Without models, the Dyna-Q architecture is equivalent to the Q-learning technique and can still work as a model-free system. MPC, on the other hand, is dependent on the model, which means it cannot work without models. Therefore, Dyna-Q can be considered as a combination of model-free and model-based methods [27].
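The two loops described above can be sketched for a generic tabular agent. This is a minimal illustration, not the Dyna-MARL algorithm: it assumes a learned lookup-table model (as in textbook tabular Dyna-Q) instead of a traffic flow model, and all names and parameter values are hypothetical.

```python
import random
from collections import defaultdict


def td_update(q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning backup, shared by direct RL and planning."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


def dyna_q_step(q, model, transition, actions, alpha, gamma,
                planning_steps, rng):
    """Process one real transition, then run simulated planning backups."""
    s, a, r, s_next = transition
    # Loop I (direct RL): learn from the real experience.
    td_update(q, s, a, r, s_next, actions, alpha, gamma)
    # Loop II (model learning): remember what the environment did.
    # Assumption: a deterministic lookup table stands in for the model.
    model[(s, a)] = (r, s_next)
    # Loop II (planning): replay experience generated by the model.
    for _ in range(planning_steps):
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        td_update(q, ps, pa, pr, ps_next, actions, alpha, gamma)


q = defaultdict(float)
model = {}
dyna_q_step(q, model, transition=(0, 4, -1.0, 1), actions=(4, 6),
            alpha=0.5, gamma=0.9, planning_steps=10,
            rng=random.Random(0))
```

Setting `planning_steps=0` reduces the step to plain Q-learning, which mirrors the observation that the architecture still works as a model-free system without a model.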

System Architecture.
Each agent in the motorway control system is designed on the basis of the Dyna-Q architecture and controls one prespecified motorway section.
A simplified motorway segment is shown in Figure 2 for analysis. This segment contains three motorway sections ($i$, $i+1$, $i+2$) with detectors located at the boundaries. Each motorway section is divided into a number of cells ($j$, $j+1$, ..., $j+8$) according to its layout and geometric features. Generally, three kinds of cells exist in the motorway: on-ramp cells that are linked with on-ramps ($j+2$, $j+5$, $j+8$), off-ramp cells linked with off-ramps ($j$, $j+3$, $j+6$), and normal cells ($j+1$, $j+4$, $j+7$). In this paper, we define that each motorway section can have at most one on-ramp cell.
The typical Dyna-Q architecture presented in Figure 2 is detailed for each agent here. Take agent $i+1$ as an example. Its experience consists of the traffic arrival and departure rates observed from the detectors of motorway section $i+1$, as well as the information received from agent $i$, and this experience is applied to improve the models. In the model component, two models are maintained. An asymmetric cell transmission model (ACTM) with estimated traffic arrival and departure rates is used to simulate the traffic flow dynamics in the relevant motorway sections. A probability model of the action selection of agent $i$ at the current state is updated for the further planning process.
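One simple way to maintain a probability model of a neighbour's action selection is to count how often each action was observed in each state. The sketch below assumes such a frequency-count estimator; the text above does not specify the update rule, so the class, its method names, and the uniform fallback for unseen states are all hypothetical.

```python
from collections import Counter, defaultdict


class NeighbourActionModel:
    """Empirical estimate of P(neighbour action | state) from observed counts."""

    def __init__(self, actions):
        self.actions = tuple(actions)
        self.counts = defaultdict(Counter)   # state -> Counter of actions

    def observe(self, state, neighbour_action):
        """Record one action reported by the neighbouring agent."""
        self.counts[state][neighbour_action] += 1

    def probabilities(self, state):
        """Return relative frequencies; uniform if the state is unseen."""
        total = sum(self.counts[state].values())
        if total == 0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: self.counts[state][a] / total for a in self.actions}


# Hypothetical usage: the neighbour chose 4, 4, then 6 veh/min in state "s0".
neighbour_model = NeighbourActionModel(actions=(4, 6, 8))
for observed in (4, 4, 6):
    neighbour_model.observe("s0", observed)
probs = neighbour_model.probabilities("s0")
```

During planning, an agent could use such estimates in place of the neighbour's real decision, which avoids a message exchange at every simulated step.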
To reduce the complexity of MARL, like many real applications, some conventions are used to restrict the action selection of an agent [28]. Specifically, in our design, each agent only communicates with its spatial neighbours. For instance, agent $i+1$ receives the control action and traffic information from agent $i$ and sends its own information to agent $i+2$. For the case shown in Figure 2, we assume motorway section $i$ is the critical section where an incident occurs. In this situation, agent $i$ plays a more important role than the other agents in dealing with incidents. Agent $i$ can be considered as the chief controller that makes decisions according to its own knowledge about the traffic and incident situations. The other agents should regulate their control policies based on the reaction of agent $i$.
Therefore, two $Q$ values are defined for two kinds of agents. If motorway section $i$ is the critical section, the $Q$ value of agent $i$ is only related to its own state and action space and can be updated by the same equation denoted by (2).
If motorway section $i$ is a normal section without incidents, the $Q$ value of agent $i$ also depends on its neighbour, agent $i-1$: its update uses the estimated probability of agent $i-1$ selecting each action at state $s_i^{k+1}$. Models and the related symbols shown in Figure 2 will be specified in the following section.

Modified Asymmetric Cell Transmission Model
A first-order macroscopic traffic flow model named asymmetric cell transmission model (ACTM) is applied as one of the models in the Dyna-Q architecture. This model is derived from the widely used cell transmission model (CTM) [36] and has been used for ramp control problems [11,37]. In this paper, we modify ACTM to incorporate the traffic dynamics under incident conditions.

Traffic Dynamics during the Incident.
As shown in Figure 3(a), when an incident happens in the critical section, one or more lanes of the motorway will be blocked according to the incident extent. Because of the lane blockage, the incident may reduce the normal road capacity and spatial storage space, which produces a new relationship between traffic flow and road density, that is, the fundamental diagram presented in Figure 3(b). As suggested by [38], additional parameters can be used to regulate the fundamental diagram for incident situations. We introduce three parameters ($\theta_1, \theta_2, \theta_3 \in [0, 1]$) to reflect these new dynamics.
The modified ACTM consists of two groups of relations: the departure rates of the mainline and on-ramp (with a separate case if $j$ is an unmetered on-ramp cell) and the conservation of the mainline and on-ramp. In these relations, $q^k_{j,\mathrm{main}}$ and $f^k_{j,\mathrm{main}}$ are the mainline arrival and departure rates for cell $j$ at step $k$. $q^k_{j,\mathrm{on}}$ and $f^k_{j,\mathrm{on}}$ are the on-ramp arrival and departure rates in cell $j$ at step $k$. $f^k_{j,\mathrm{off}}$ is the off-ramp departure rate for cell $j$ at step $k$ (if cell $j$ is not an off-ramp cell, $f^k_{j,\mathrm{off}} = 0$). $N^k_{j,\mathrm{main}}$ represents the number of vehicles on the mainline of cell $j$ at step $k$, and $N^{\max}_{j,\mathrm{main}}$ is the maximum of this value, limited by the mainline space of cell $j$. Similarly, $N^k_{j,\mathrm{on}}$ and $N^{\max}_{j,\mathrm{on}}$ denote the current (at step $k$) and maximum number of vehicles in the on-ramp of cell $j$, respectively. $\Delta t$ (min) is the time duration between two consecutive time steps. $a^k_i$ is the metering rate for the on-ramp cell of the $i$th motorway section at step $k$. $\xi_j \in [0, 1]$ is the flow allocation parameter of cell $j$. $\eta_j \in [0, 1]$ is the flow blending parameter of traffic flow from the on-ramp to the mainline of cell $j$. The unit of all the arrival and departure rates is modified to veh/min in this study.
For motorway section $i$ with $m$ cells, the number of vehicles in the mainline can be calculated by $N^k_{i,\mathrm{main}} = \sum_{j=1}^{m} N^k_{j,\mathrm{main}}$, while the number of vehicles in the on-ramp of motorway section $i$ is given by $N^k_{i,\mathrm{on}} = N^k_{j,\mathrm{on}}$ for its on-ramp cell $j$. In this way, the maximum numbers of vehicles in the mainline and on-ramp of motorway section $i$ are given by $N^{\max}_{i,\mathrm{main}} = \sum_{j=1}^{m} N^{\max}_{j,\mathrm{main}}$ and $N^{\max}_{i,\mathrm{on}} = N^{\max}_{j,\mathrm{on}}$.

Estimation of Arrival and Departure Rates.
Arrival rates of the boundary cells in each motorway section (such as $j+2$, $j+5$, and $j+8$) and all the on-ramps, as well as the departure rates of off-ramps, are inputs of the ACTM for each planning step between two real control steps. Considering the short duration of the planning process (10 steps), we assume these rates remain stable during the planning and estimate them directly from the recent flow data collected from detectors. The method described by Wang [16] is used here to do the estimation, which simply averages the most recently observed data to get the predicted flow rates. In our model, we use the flow data collected from the last $h$ time steps ($h = 5$).
Therefore, these three rates can be calculated by

$$\hat{q}^{k,k+1}_{j,\mathrm{main}} = \frac{1}{h} \sum_{l=0}^{h-1} q^{k-l}_{j,\mathrm{main}}, \qquad \hat{q}^{k,k+1}_{j,\mathrm{on}} = \frac{1}{h} \sum_{l=0}^{h-1} q^{k-l}_{j,\mathrm{on}}, \qquad \hat{f}^{k,k+1}_{j,\mathrm{off}} = \frac{1}{h} \sum_{l=0}^{h-1} f^{k-l}_{j,\mathrm{off}},$$

where $\hat{q}^{k,k+1}_{j,\mathrm{main}}$ and $\hat{q}^{k,k+1}_{j,\mathrm{on}}$ are the estimated arrival rates of the mainline and on-ramp of cell $j$ for the planning steps between real steps $k$ and $k+1$. $\hat{f}^{k,k+1}_{j,\mathrm{off}}$ is the estimated off-ramp departure rate of cell $j$. If cell $j$ is the boundary cell of motorway section $i$, the arrival or departure rate of this cell is also the arrival or departure rate of motorway section $i$.
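The averaging of recently observed flow data can be sketched with a fixed-length window. The class and method names are hypothetical; the window length of 5 follows the value stated above, and the sample rates are made up for illustration.

```python
from collections import deque


class RateEstimator:
    """Average of the most recent detector observations (veh/min)."""

    def __init__(self, window=5):
        # deque with maxlen silently drops the oldest sample when full.
        self.samples = deque(maxlen=window)

    def observe(self, rate):
        """Store the rate measured at the latest real control step."""
        self.samples.append(rate)

    def estimate(self):
        """Return the rate assumed constant during the planning steps."""
        if not self.samples:
            raise ValueError("no observations yet")
        return sum(self.samples) / len(self.samples)


# Hypothetical mainline arrival rates from six consecutive control steps.
estimator = RateEstimator(window=5)
for rate in (10, 20, 30, 40, 50, 60):
    estimator.observe(rate)
estimated_rate = estimator.estimate()   # averages the last five samples
```

One estimator per input (mainline arrival, on-ramp arrival, off-ramp departure) would be kept for each boundary cell.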

Definition of RL Elements
In addition to the architecture and models defined in Section 3, three basic elements, namely environment state, control action, and reward function, should be specified to form an RL problem. This section details these three elements and the relevant algorithm.

Environment State.
Environment states of a motorway section are composed of mainline states and on-ramp states. The same method mentioned in [27,29] is used here to obtain the state space. Generally, for the mainline of motorway section $i$, the number of vehicles ranges from 0 to the maximum number $N^{\max}_{i,\mathrm{main}}$, and this range is uniformly divided into $u_i$ intervals. Each interval represents a state of the mainline. Therefore, each mainline section can be represented by a state set $S_{i,\mathrm{main}}$ with $u_i$ states. Similarly, on-ramp traffic is represented by a state set $S_{i,\mathrm{on}}$ with $v_i$ states according to the maximum number of vehicles $N^{\max}_{i,\mathrm{on}}$. $u_i$ and $v_i$ should be adjusted for different motorway sections according to the section length. In this way, if motorway section $i$ is the critical section, the external traffic environment is represented by

$$S_i = S_{i,\mathrm{main}} \times S_{i,\mathrm{on}},$$

which contains $u_i \cdot v_i$ states. At each time step, a state $s^k_i$ will be selected from $S_i$ as the environment state. If motorway section $i$ is a normal section, the state sets of its neighbour agent should be incorporated. Thus, the traffic state is represented by

$$S_i = S_{i,\mathrm{main}} \times S_{i,\mathrm{on}} \times S_{i-1,\mathrm{main}} \times S_{i-1,\mathrm{on}},$$

which contains $u_i \cdot v_i \cdot u_{i-1} \cdot v_{i-1}$ states.
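The uniform division of vehicle counts into state intervals can be sketched as follows. The interval counts (10 for the mainline, 6 for the on-ramp) are hypothetical, the maxima 1860 and 108 are the values reported for motorway section 1 in the case study, and the function names are illustrative.

```python
def interval_index(vehicles, max_vehicles, intervals):
    """Map a vehicle count in [0, max_vehicles] to one of `intervals` bins."""
    if not 0 <= vehicles <= max_vehicles:
        raise ValueError("vehicle count outside the valid range")
    width = max_vehicles / intervals
    # The upper bound itself belongs to the last interval.
    return min(int(vehicles // width), intervals - 1)


def section_state(n_main, n_on, n_main_max, n_on_max,
                  main_intervals, on_intervals):
    """Combine mainline and on-ramp bins into a single state index."""
    main_bin = interval_index(n_main, n_main_max, main_intervals)
    on_bin = interval_index(n_on, n_on_max, on_intervals)
    return main_bin * on_intervals + on_bin


# Critical-section example: 930 mainline vehicles and 54 queued vehicles.
state = section_state(n_main=930, n_on=54, n_main_max=1860, n_on_max=108,
                      main_intervals=10, on_intervals=6)
total_states = 10 * 6
```

For a normal section, the same indexing could be repeated with the neighbour's two bins, which multiplies the number of states accordingly.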

Control Action.
In a ramp control problem, the aim of the control action is to regulate the number of vehicles entering the mainline in each control step. Similar to [29], we adopt flow control as the control action, which can be represented by an action set $A = \{4, 6, 8, 10, 12, 14, 16, 18, 20\}$ with 9 flow rates between the minimum (4 veh/min) and maximum (20 veh/min) values. Exploitation and exploration are two basic behaviours of the RL agent. Exploitation means the agent takes the control action that, according to its previous experience, yields the most reward. Exploration instead means the agent tries new actions that currently appear less rewarding. In order to balance these two behaviours, we use the $\varepsilon$-greedy policy to select control actions [30]. Specifically, this policy takes a random action with probability $\varepsilon$ and chooses the greedy action (with the maximum $Q$ value) with probability $1 - \varepsilon$ for each control step.
The action selection probability is formally expressed by (10), with one case for the greedy action and another for all other actions.
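A minimal sketch of the policy just described is given below. It assumes that the random action is drawn uniformly from the whole action set, including the greedy action; other variants exist, and the exact form used in (10) may differ. The function names and the example Q values are hypothetical.

```python
import random


def epsilon_greedy_probabilities(q_values, epsilon):
    """Selection probability of each action under an epsilon-greedy policy.

    q_values maps each action to its Q value in the current state.
    Assumption: exploration picks uniformly among all actions.
    """
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    probs = {a: epsilon / n for a in q_values}
    probs[greedy] += 1.0 - epsilon
    return probs


def select_action(q_values, epsilon, rng):
    """Sample one action according to the same policy."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit


# Hypothetical Q values for three metering rates (veh/min).
q_values = {4: -3.0, 6: -1.0, 8: -2.0}
probs = epsilon_greedy_probabilities(q_values, epsilon=0.1)
chosen = select_action(q_values, epsilon=0.1, rng=random.Random(0))
```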

Reward Function.
Reward function is used to calculate the immediate reward after executing a specific action at each time step, which guides the agent to achieve its objective.Considering a common objective of traffic control system (i.e., minimising total travel time), we define our reward to guide the agent to minimise total time spent (TTS) through learning process.
TTS is defined as the total time spent by vehicles in the network during a period of time. For our case, TTS can be obtained from the following equation:

$$\mathrm{TTS} = \Delta t \sum_{k=0}^{K} \left( N^k_{i,\mathrm{main}} + N^k_{i,\mathrm{on}} \right).$$

In the above equation, $\Delta t$ is a fixed value; therefore, minimising TTS is equivalent to minimising the number of vehicles on the network, $\sum_{k=0}^{K} (N^k_{i,\mathrm{main}} + N^k_{i,\mathrm{on}})$. To minimise this value, the reward function defined here is composed of two negative rewards used to indicate penalties for vehicles on the mainline and on-ramp. The formal reward function at step $k$ is defined according to two situations.
(1) Motorway Section $i$ Is the Critical Section. The reward is composed only of the two penalties above, as given in (13).
(2) Motorway Section $i$ Is Not the Critical Section. Here a new negative reward is introduced to maintain the system equity, that is, to make sure that the on-ramp queues and related travel times at different on-ramps are close to each other. This term is added to (13) as a penalty for the on-ramp queue difference between motorway sections $i$ and $i-1$. As two adjacent agents cooperate in this situation, the reward $r^k_i(s^k_i, a^k_{i-1}, a^k_i)$ is related to two control actions, $a^k_{i-1}$ and $a^k_i$. In the penalty term, $\max(\cdot, \cdot)$ returns the maximum value of two given parameters, which is used for normalisation.
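The TTS definition used in this subsection, the step duration multiplied by the summed vehicle counts, can be sketched directly. The function name and the sample counts are hypothetical, and the step length of 0.5 min is only an illustrative value.

```python
def total_time_spent(main_counts, on_counts, dt_min):
    """Total time spent (veh*min) over a sequence of time steps.

    main_counts and on_counts hold the number of vehicles on the mainline
    and on-ramp at each step; dt_min is the step duration in minutes.
    """
    if len(main_counts) != len(on_counts):
        raise ValueError("both sequences must cover the same steps")
    return dt_min * sum(m + o for m, o in zip(main_counts, on_counts))


# Three hypothetical steps of 0.5 min each.
tts = total_time_spent(main_counts=[100, 120, 110],
                       on_counts=[10, 20, 15], dt_min=0.5)
```

Because the step duration is constant, ranking control policies by this value is the same as ranking them by the summed vehicle counts alone.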

Description of the Algorithm.
Based on the Dyna-Q architecture and RL elements defined in previous subsections, an algorithm named Dyna-MARL is developed and described in this subsection. Two main loops corresponding to direct RL and planning shown in Figure 1 are detailed in Dyna-MARL.
Between two real control steps in loop I, 10 planning steps are run in loop II. The pseudocode of Dyna-MARL is given in Algorithm 1. An episode in Dyna-MARL represents a control cycle that starts from the incident occurrence and terminates when the traffic state returns to the initial state $s_{\mathrm{initial}}$, that is, the traffic state before the incident occurrence. Incident duration is assumed to be known in advance.

Case Study and Results
One of the metered motorway segments (southbound direction) of the M6 in the UK is chosen for the case study. This segment is between junction 21A (J21A) and junction 25 (J25) with an approximate length of 12.4 km (see Figure 4). Taking the noncontrolled (NC) situation as the baseline, we designed a series of experiments to compare the proposed Dyna-MARL algorithm with Isolated RL (Q-learning without coordination). Experiments and relevant results are described as follows.

Partitions of the Test Segment.
The test motorway segment, with a three-lane mainline, three metered on-ramps, and five off-ramps, is simulated by AIMSUN [39], which is a microscopic traffic simulation package. According to the detector locations and road layout, the whole segment is divided into three sections. Each section contains a metered on-ramp. Motorway section 1 is divided into 4 cells, and motorway sections 2 and 3 are both divided into 3 cells. The partitions of each section can be seen in Figure 5. According to the section length, the maximum number of vehicles in each mainline section and on-ramp is as follows: $N^{\max}_{1,\mathrm{main}} = 1860$, $N^{\max}_{2,\mathrm{main}} = 2880$, $N^{\max}_{3,\mathrm{main}} = 2880$, $N^{\max}_{1,\mathrm{on}} = 108$, $N^{\max}_{2,\mathrm{on}} = 90$, and $N^{\max}_{3,\mathrm{on}} = 120$.

Real Data Source.
Real detector data collected from 17 loop detectors located in the motorway segment (including both mainline and on-/off-ramps) are used for the case study. These data can be extracted from the Highways Agency Traffic Information System (HATRIS) [40]. The traffic count data are averaged from April 2012 to March 2013 with 15-minute intervals. Only working day data (from Monday to Friday) are used because of the dramatic reduction of traffic load at weekends. Some of the detector data collected from the mainline and three on-ramps are presented in Figure 6, from which we can see that two peak periods, the AM peak period (around 07:00:00-09:00:00) and the PM peak period (around 16:00:00-18:00:00), exist during the daily traffic operation.
In the test site, ramp metering only works at peak hours. Meanwhile, it is valuable to test the performance of the proposed algorithm in the high demand situation. If it can work under the high traffic load, it should also be useful for common situations. Therefore, the AM peak period with heavy traffic load is considered for the case study. Specifically, we use the averaged traffic data during the AM peak period collected from TRADS to estimate the O/D (origins and destinations) matrix for the simulation. A model proposed in [41] is adopted by AIMSUN to do the estimation, where the number of iterations is set to 1000 to reach convergence. Table 1 shows the O/D matrix estimated from real traffic data.

Incident Scenarios.
Considering the difficulty of capturing real incident data, we simulate some incident scenarios in AIMSUN. To make each ramp meter work during the incident, the incident is located near the most downstream motorway section, that is, motorway section 1. Therefore, three incident scenarios A, B, and C are designed corresponding to three different incident locations a, b, and c (as illustrated in Figure 5), respectively. The simulation experiment lasts for one and a half hours from 07:00:00 to 08:30:00 during the AM peak period. After 30 minutes of normal operation (for warm-up), the incident is triggered at 07:30:00 and lasts for 30 minutes. In the preliminary experiments designed in this paper, an incident with one lane blocked is considered. Parameters introduced here can also be regulated for multiple lane-blockage situations. The incident extent is 50 meters, which is assumed to be constant during the incident.
Learning-related parameters are set as typical values [30]; that is, $\alpha$ is 0.2, $\gamma$ is 0.8, and $\varepsilon$ is 0.1. Other parameters related to ACTM, including those shared by all the cells, are calibrated and summarised in Table 2.

Results.
The comparison of Dyna-MARL, Isolated RL, and NC is conducted from three aspects: density evolution, some general indicators, and the system equity.The experimental results are described as follows.
(1) Density Evolution. We can see from Figure 7 that four dense areas exist during the traffic operation. Three of them near on-ramp entrances (motorway length around 0.5 km, 5 km, and 10 km) are caused by heavy traffic loads from on-ramps. The dense area close to the segment end forms due to the incident.
In scenario A, the incident location is close to on-ramp 1 (O1). Without control, this incident leads to severe congestion which blocks on-ramp 1 and propagates to motorway section 2 (around 9 km in Figure 7(a)). Under this scenario, Isolated RL cannot alleviate incident-induced congestion effectively (see Figure 7(b)). At the beginning of congestion formation, without coordination, only the nearest ramp controller reacts to the congestion. Because of the space limit of the on-ramp, one ramp controller is insufficient to dissolve this congestion, which still propagates to motorway section 2. Dyna-MARL, on the other hand, coordinates all three ramp controllers and makes full use of the storage space of the three on-ramps to deal with incident-induced congestion. In this way, mainline congestion can be restricted to a smaller area and will not propagate to motorway section 2 (see Figure 7(c)).
For scenarios B and C, incidents are near the motorway end and far from on-ramp 1. Without blocking on-ramp 1, incidents do not lead to severe congestion. Under such circumstances, both Isolated RL and Dyna-MARL work well in easing congestion in the mainline. As shown in Figures 7(e)-7(i), compared with the NC situation, both Isolated RL and Dyna-MARL can restrict the congestion to a small range near the on-ramp entrances.
(2) General Indicators.In this comparison, some general indicators, including total travel time (should be reduced), total throughput (should be improved), and total CO 2 emission (should be reduced), are used to show how the proposed system can benefit road users.These indicators are widely used in the transport community to test the performance of newly developed traffic control systems.
As shown in Figure 8(a), compared with the NC situation, both Isolated RL and Dyna-MARL reduce the total travel time of road users in all three scenarios. Specifically, Isolated RL decreases total travel time by up to 6.2%, while Dyna-MARL achieves a maximum reduction of 12.2% (see Figure 8(d)). The comparison of total throughput is presented in Figure 8(b). Dyna-MARL improves total throughput by up to 2.3% (see Figure 8(d)) and outperforms Isolated RL in all three scenarios. In scenario B, Isolated RL even fails to improve the total throughput. For total CO2 emission (shown in Figure 8(c)), both Isolated RL and Dyna-MARL achieve their best performance in scenario B, with reductions of 4.7% and 4.6%, respectively. In scenarios A and C, Dyna-MARL performs much better than Isolated RL.
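The percentages quoted above are changes relative to the NC situation. A minimal sketch of that arithmetic follows; the function name and the sample numbers are hypothetical and are not values from the experiments.

```python
def relative_change(controlled: float, baseline: float) -> float:
    """Percentage change of an indicator relative to the no-control (NC) baseline.

    Negative values mean a reduction (desired for travel time and CO2),
    positive values mean an increase (desired for throughput).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (controlled - baseline) / baseline


# Hypothetical totals, for illustration only (not results from the paper).
ttt_nc = 1000.0    # total travel time without control
ttt_ctrl = 900.0   # total travel time with ramp control
print(f"{relative_change(ttt_ctrl, ttt_nc):.1f}%")
```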
Through the above comparison, we can see that Dyna-MARL outperforms Isolated RL for almost all the scenarios and indicators.
(3) System Equity. Although the general indicators presented in comparison (2) are effective for testing the performance of different systems, they cannot measure system equity, which is also an important aspect of system performance. In this paper, we only consider spatial equity, which is defined as a measurement of the equity of user delays on different on-ramps [42]. In this study, we assume that road users from all three on-ramps have the same importance. If users from different on-ramps experience similar travel times, the control system is defined as an equitable system. Under this definition, a large queue difference leads to a highly inequitable system. In [43], the variance of travel time on different on-ramps is used as an indicator to measure system equity. Similar to [43], for the sake of comparison, the standard deviation is considered in our case. This indicator is defined as

$$\mathrm{SD}(k) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(T_i^k - \bar{T}^k\right)^2},$$

where $\mathrm{SD}(k)$ is the standard deviation of travel time of different on-ramps at time step $k$, $T_i^k$ is the estimated total travel time of on-ramp $i$ at step $k$, and $\bar{T}^k$ is the averaged total travel time of the $N$ on-ramps at step $k$.
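The equity indicator can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the population form of the standard deviation (dividing by the number of on-ramps) is an assumption.

```python
import math
from typing import List, Sequence


def equity_sd(travel_times: Sequence[float]) -> float:
    """Standard deviation of per-on-ramp travel times at one time step.

    Assumption: population form (divide by the number of on-ramps).
    A value of 0 means all on-ramps experience the same travel time.
    """
    n = len(travel_times)
    if n == 0:
        raise ValueError("at least one on-ramp is required")
    mean = sum(travel_times) / n
    return math.sqrt(sum((t - mean) ** 2 for t in travel_times) / n)


def equity_series(tt_by_step: Sequence[Sequence[float]]) -> List[float]:
    """Apply equity_sd to every time step; each row holds one value per on-ramp."""
    return [equity_sd(row) for row in tt_by_step]


# Hypothetical travel times for three on-ramps over two time steps.
print(equity_series([[10.0, 10.0, 10.0], [5.0, 10.0, 15.0]]))
```

A lower value at a given step indicates that travel times are more evenly balanced across the on-ramps at that step.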
The results of the system equity comparison are shown in Figure 9. For the NC situation, good equity is maintained in scenarios B and C because entering vehicles are not restricted (as shown in Figures 9(b) and 9(c)). However, when one of the on-ramp entrances is blocked by congestion in scenario A, a long queue forms and leads to imbalance and resultant inequity for users on different on-ramps (see Figure 9(a)). For the controlled cases, Isolated RL performs poorly in all scenarios. This is because the ramp controller near the congestion imposes far more restrictive measures on the controlled traffic than the other controllers do. Because of the coordination strategy, Dyna-MARL outperforms Isolated RL in maintaining system equity in all scenarios, especially during the incident (from 07:30:00 to 08:00:00).

Conclusions and Future Work
A Dyna-Q based multiagent reinforcement learning method, referred to as Dyna-MARL, has been developed in this paper for motorway ramp control. Dyna-MARL is compared with Isolated RL (Q-learning without coordination) and the noncontrolled situation in a simulation environment. Real traffic data collected from a metered motorway segment in the UK are used to build the simulation.
Through a series of simulation-based experiments, we can conclude the following: (1) Isolated RL can improve the motorway performance in terms of increasing total throughput and reducing total travel time and CO2 emission, but this improvement comes at the expense of poor system equity across on-ramps; (2) with a suitable coordination strategy, Dyna-MARL achieves much higher system equity; (3) beyond system equity, Dyna-MARL outperforms Isolated RL in almost all scenarios and on almost all indicators, which suggests that Dyna-MARL can deal with network-wide problems effectively.
Although the simulation tests have shown positive results regarding the performance of Dyna-MARL, only a simplified incident scenario with fixed duration is considered in the current work. In practice, incident duration is highly variable and affected by a number of factors, such as weather conditions, road conditions, and the arrival time of the incident management team. Therefore, incident duration should be treated as an uncertainty, which will be investigated in our future work.

Figure 3: Fundamental diagram during the incident.

4.2. Modified ACTM. Given three incident-related parameters, the traffic dynamics in each cell can be derived from the fundamental diagram illustrated in Figure 3(b) and represented by the following equations.

Figure 8: Comparison of general measures for different scenarios.

Figure 9: Standard deviation for different scenarios: (a) scenario A, (b) scenario B, and (c) scenario C.
) is the immediate reward obtained by agent $i$ at time step $k$, when $a_{i-1}^k$ and $a_i^k$ are the actions executed by agents $i-1$ and $i$. Similarly, $Q_i^{k+1}(s_i^k, a_{i-1}^k, a_i^k)$ and $Q_i^k(s_i^k, a_{i-1}^k, a_i^k)$ are the Q values for agent $i$ at step $k+1$ and step $k$, respectively. $A_{i-1}$ is the action set of agent $i-1$.
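$w_i^{\mathrm{In}} / w_i$, and the third parameter is $q_{i,\mathrm{main}}^{\mathrm{In,max}} / q_{i,\mathrm{main}}^{\max}$. $v_i$ and $w_i$ are the free flow speed and congestion wave speed of cell $i$. $q_{i,\mathrm{main}}^{\max}$ is the maximum departure flow of cell $i$. $v_i^{\mathrm{In}}$, $w_i^{\mathrm{In}}$, and $q_{i,\mathrm{main}}^{\mathrm{In,max}}$ are these three variables during the incident. $\rho_{c,i}$ and $\rho_{c,i}^{\mathrm{In}}$ are the critical densities for normal and incident situations. $\rho_{J,i}$ and $\rho_{J,i}^{\mathrm{In}}$ are the jam densities for normal and incident situations.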
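As a rough illustration of how the three incident-related ratios reshape a cell's fundamental diagram, the sketch below scales the free flow speed, congestion wave speed, and maximum departure flow of a cell. It is not the modified ACTM itself: the class and function names are hypothetical, and the triangular diagram (critical density = maximum flow / free flow speed, jam density = critical density + maximum flow / wave speed) is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellFD:
    """Triangular fundamental diagram of one cell (assumed shape)."""
    v: float      # free flow speed (km/h)
    w: float      # congestion wave speed (km/h)
    q_max: float  # maximum departure flow (veh/h)

    @property
    def rho_crit(self) -> float:
        """Critical density (veh/km), where the free-flow branch reaches q_max."""
        return self.q_max / self.v

    @property
    def rho_jam(self) -> float:
        """Jam density (veh/km), where the congested branch reaches zero flow."""
        return self.rho_crit + self.q_max / self.w

    def equilibrium_flow(self, rho: float) -> float:
        """Flow (veh/h) on the diagram at density rho (veh/km)."""
        free_branch = min(self.v * rho, self.q_max)
        congested_branch = max(0.0, self.w * (self.rho_jam - rho))
        return min(free_branch, congested_branch)


def incident_fd(normal: CellFD, r_v: float, r_w: float, r_q: float) -> CellFD:
    """Scale the normal diagram by the three incident-to-normal ratios."""
    for r in (r_v, r_w, r_q):
        if r <= 0:
            raise ValueError("ratios must be positive")
    return CellFD(normal.v * r_v, normal.w * r_w, normal.q_max * r_q)


# Hypothetical parameter values, for illustration only.
normal = CellFD(v=100.0, w=20.0, q_max=2000.0)
incident = incident_fd(normal, r_v=0.5, r_w=1.0, r_q=0.5)
print(incident.rho_crit, incident.rho_jam, incident.equilibrium_flow(10.0))
```

Under these assumptions, halving the maximum flow while keeping the wave speed fixed lowers the jam density, so the same mainline density sits closer to breakdown during the incident.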