A Reinforcement Learning Based Traffic Control Strategy in a Macroscopic Fundamental Diagram Region

Urban traffic control systems (UTCSs) are deployed in a great number of cities despite lacking feedback when adjusting the traffic signals. The development of reinforcement learning (RL) makes it possible to apply feedback to UTCSs, and great efforts have been made on RL-based traffic control strategies. However, those studies disregard the traffic flow theory of the network and the road users' perspectives on the performance of traffic. This study proposes a multiagent reinforcement learning (MARL) based traffic control strategy, in which each intersection in a macroscopic fundamental diagram (MFD) region is controlled by one agent using level of service (LOS) and MFD-based parameters as rewards. The proposed MARL strategy was evaluated by simulation in a 3 × 3 grid network against pretimed, actuated, and MFD-based traffic control strategies. The evaluation results showed that, at different demand levels, the proposed MARL strategy outperforms the other three traffic control strategies, to varying extents, in terms of average intersection queue length and average intersection waiting time. Results also showed that the proposed MARL dissipated congestion faster than the other three control strategies. Results of the Friedman test indicated that the differences in performance between the proposed MARL and the other strategies were statistically significant regardless of the demand level. The MFD in the testbed network controlled by the proposed MARL differed from that controlled by the pretimed strategy, especially in the MFD scatter plot. These findings provide insights into considering the traffic flow theory of the network when applying MARL to traffic control strategies.


Introduction
Traffic congestion is considered one of the most significant problems in almost all cities worldwide. To control traffic congestion, a great number of urban traffic control systems (UTCSs) have been used, including SCOOT (Split, Cycle, and Offset Optimization Technique) [1,2], SCATS (Sydney Coordinated Adaptive Traffic System) [3,4], RHODES (Real-time, Hierarchical, Optimized, Distributed, and Effective System) [5,6], and OPAC (Optimized Policies for Adaptive Control Strategy) [7,8]. A previous study has pointed out that UTCSs can reduce traffic delay by nearly 10% compared with pretimed traffic control strategies [9], and they have thus been applied in a great many cities worldwide.
Despite widespread usage, these UTCSs all depend on the same principle: they adjust traffic signals only according to the current or predicted state of traffic [10]. The state of traffic is described or predicted by microscopic, mesoscopic, or macroscopic models using data from detectors. However, these UTCSs barely check whether the state of traffic has improved or reached the desired state, which makes them less efficient, especially in cases with traffic disturbances [11].
In order to compensate for the disturbances, researchers have investigated the application of feedback in traffic control strategies. Traffic control strategies with feedback use the relationship between the actual and desired states of traffic to adjust the traffic signals so that the difference between the two states is continually reduced [12]. One example of traffic control strategies with feedback is reinforcement learning (RL) based traffic control strategies. In an RL-based traffic control strategy, an agent learns a control policy by trial and error to maximize the long-term reward, which is related to the feedback on the states of traffic [13]. RL is used in different ways to control traffic. Some studies regard a macroscopic fundamental diagram (MFD) region as an agent and use the agent to control the transfer traffic flow between two MFD regions [14,15], while other studies regard a traffic signal of an intersection as an agent and use the agent to affect the state of traffic [13]. In this study, we focus on the latter situation, which is also the central issue in the literature review. In this kind of RL-based traffic control strategy, each agent of the system senses the state of traffic S_τ and adjusts the traffic signal A_τ according to S_τ at each time interval τ. One time interval later, the state of traffic changes into S_(τ+1), and the agent evaluates the performance of the state of traffic P_(τ+1) and receives a reward R_(τ+1) according to P_(τ+1) [13].
At first, RL-based traffic control strategies were applied to an isolated intersection, in which one agent adjusts the traffic signals of one intersection. Since these traffic control strategies showed potential in reducing the side effects of traffic disturbances [16], researchers tried to extend RL-based traffic control strategies by applying several agents to adjust the traffic signals of one intersection or of several intersections in an urban network. Several studies have shown that multiagent reinforcement learning (MARL) based traffic control strategies also outperform traditional traffic control strategies (i.e., pretimed and actuated traffic control strategies) in cases with traffic disturbances. For example, Wiering developed three MARL traffic control strategies and found that all of them perform better than traditional traffic control strategies in a 3 × 2 grid network [17]. Mannion et al. developed three MARL traffic control strategies with different rewards and found that these strategies perform worse than pretimed traffic control strategies in cases with fixed flows but better in cases with variable flows in a 3 × 3 grid network [18].
Despite great efforts on RL-based traffic control strategies, three major limitations remain: (i) Previous studies focus mainly on dealing with the problems caused by the scale of the RL-based traffic control strategy [19,20] or on strengthening the coordination and information sharing between agents [19,21]. When defining a reward at the network level, previous studies used network-level parameters (e.g., the throughput of traffic flow within the network) or the difference between the parameters at two successive decision points (e.g., the reduction in the total cumulative delay within the network). To the best of the authors' knowledge, those studies have hardly considered the traffic flow theory of the network, such as the macroscopic fundamental diagram (MFD). There is a key need to integrate the MFD into RL-based traffic control strategies to make them capable of capturing the traffic flow characteristics at the network level. (ii) The MFD is a reproducible, unimodal, low-scatter, and demand-insensitive relationship between network vehicle density (veh/km) and network space-mean flow or outflow (veh/h) in a homogeneous urban region [22] and has been used for traffic control strategies. One of those MFD-based traffic control strategies is the perimeter control strategy, which restricts the inflow of the urban region with an MFD to keep the vehicle density as close as possible to the critical vehicle density at which the maximum throughput of the MFD region is achieved [23]. Perimeter control strategies have great potential in reducing the side effects of traffic disturbances [23][24][25]. However, whether RL-based traffic control strategies can outperform perimeter control strategies is still unknown.
(iii) Since an appropriate long-term reward is important for RL-based traffic control strategies, many different long-term rewards related to the intersection or vehicles have been considered, including reducing the average intersection waiting time, average intersection signal delay, or fuel consumption [26]. However, the parameters used in long-term rewards are usually the same as, or similar to, the parameters used in state definitions, and they disregard the road users' perspectives [18]. Since urban networks serve the road users, it is important to describe performance from the road users' point of view.
To fill these gaps, this study proposes a MARL traffic control strategy in an MFD region. In the proposed model, an unweighted sum of two rewards is used. One reward is defined according to the performance of the intersection itself, evaluated using the level of service (LOS), which reflects both the road users' perspectives and network conditions [27]. The other is defined according to the characteristics of the MFD in the MFD region. The performance of the proposed MARL strategy is compared with three traffic control strategies: pretimed, actuated, and perimeter control strategies. The remainder of this study is organized as follows. The next section introduces RL and previous studies on RL-based traffic control strategies. Then, the experiment design is discussed. After that, results and discussion are presented. The study concludes with some comments and insights into the proposed MARL strategy.

Reinforcement Learning (RL)
RL models a system controlled by an agent or multiple agents, in which the environment can be in some state affected by the agents' actions at each time interval. The agents' goal is to make the environment stay in "good" states and avoid "bad" states [28]. Agents are not told which actions to take but must discover them by trying. At each time interval τ, each agent of the system senses the state of the environment S_τ and takes an action A_τ according to S_τ. One time interval later, the state of the environment changes into S_(τ+1), and the agent receives a reward R_(τ+1) [13].
Besides the agent and environment, there are four main subelements of RL: a policy, a reward, a value function, and, optionally, a model of the environment. A policy π defines the agent's way of choosing actions. It is usually given as P_sa(π) = P(A_τ = a | S_τ = s), the probability that the agent, faced with state s at time interval τ, chooses action a.
The reward defines whether the agent's action is immediately good given the specific state of the environment, while the value function defines whether the action is good in the long run. For example, assume the reward is defined by the throughput of the intersection: more throughput than in the previous time interval is a "good" state and less is a "bad" one. Intersection A decides on a longer red time than in the previous time interval and may receive a negative reward immediately, since this leads to lower throughput than in the previous interval. However, this action relieves congestion at the upstream intersection and increases the throughput of intersection A afterward, so it may receive a positive value. Equation (1) is usually used to compute the value function:
v_π(s) = E_π[R_(τ+1) + γR_(τ+2) + γ²R_(τ+3) + ··· | S_τ = s], (1)
where v_π(s) is the value function of taking actions according to policy π starting from state s, and E_π denotes the expectation under policy π. γ ∈ (0, 1] is the discount rate, which represents the idea that rewards achieved immediately are more important in the learning process than those achieved in the future. The fourth element is a model of the environment, which is not necessary for all RL. According to whether a model of the environment exists, RL can be classified into two types: model-free RL (e.g., Q-learning, SARSA) and model-based RL (e.g., Dyna, Prioritized Sweeping) [18]. Model-free RL uses the state transitions of the real world to evaluate actions, while model-based RL builds a model of the environment and uses that model to evaluate actions [29].
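The discounted return inside the expectation of equation (1) can be illustrated with a short sketch; the reward sequence and discount rate below are illustrative only:

```python
# Discounted return from equation (1): R_(tau+1) + g*R_(tau+2) + g^2*R_(tau+3) + ...
def discounted_return(rewards, gamma):
    """rewards[0] corresponds to R_(tau+1); gamma is the discount rate."""
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g

# With gamma = 0.8, later rewards contribute less to the value estimate.
print(discounted_return([1.0, 0.5, -0.2], 0.8))  # 1.0 + 0.8*0.5 + 0.64*(-0.2) = 1.272
```

Averaging such returns over many episodes starting from state s yields a Monte Carlo estimate of v_π(s).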

RL-Based Traffic Control Strategies.
Since RL makes it possible to use feedback to reduce the side effects of traffic disturbances, researchers have made great efforts on RL-based traffic control strategies. The simplest RL-based traffic control strategy is the single-agent reinforcement learning (SARL) traffic control strategy, in which one agent controls one intersection. Many SARL strategies have been proposed and shown to outperform traditional traffic control strategies in cases with traffic disturbances [30,31]. However, SARL is appropriate only for isolated intersections [16], which makes it unsuitable for urban traffic control systems (UTCSs).
Given this limitation of SARL, researchers have focused on multiagent reinforcement learning (MARL) traffic control strategies. Some researchers use several agents for one intersection to improve the performance of that intersection. For example, Jin proposed a MARL traffic control strategy in which each agent represents one signal group and several agents control one intersection. Results showed that their proposed model outperforms the pretimed group-based control strategy regardless of the demand level [32]. This type of MARL traffic control strategy is still not suitable for UTCSs, since its complexity increases vastly in urban networks.
To make MARL more suitable for urban networks, most MARL traffic control strategies assume that one agent controls one intersection and that several agents together control an urban network. According to the level of coordination and information sharing between agents, MARL traffic control strategies can be classified into three types: totally independent MARL, partially cooperative MARL, and joint-action MARL. Totally independent MARL is a simple extension of SARL: each agent takes actions only according to the state of traffic at the intersection it controls. The main disadvantage of totally independent MARL is that these models cannot handle the network under oversaturated conditions [18].
To improve MARL's efficiency under oversaturated conditions, more effort has been put into partially cooperative MARL and joint-action MARL. In partially cooperative MARL, it is assumed that each agent can share information with the agents controlling the upstream and downstream intersections. Partially cooperative MARL can be achieved by adding the state of upstream and downstream intersections to the state definition or by adding their rewards to the long-term reward. For example, Aziz et al. proposed an R-learning-based MARL that adds the serial numbers of the upstream and downstream intersections with maximum queue length to the state definition, and results showed that their model performs better than Q-learning-based and SARSA-based MARL at higher congestion levels [33]. Prabuchandran et al. proposed a decentralized MARL that adds the rewards of upstream and downstream intersections to the long-term reward and found that their model outperforms the pretimed control strategy significantly [34].
Although partially cooperative MARL can relieve queue overflows under oversaturated conditions, previous studies have pointed out that it may lead to a local optimum instead of the system optimum [20,35]. As a result, joint-action MARL was proposed, which achieves coordination between agents by applying game-theoretic approaches. The main difference between partially cooperative and joint-action MARL is that in joint-action MARL, each agent faces a changing long-term reward, since the best policy changes as other agents' policies change [19]. For example, Zhu et al. used a junction tree algorithm to guarantee convergence for urban networks and tested the algorithm with a network of 18 signalized intersections in VISSIM. Results showed that the proposed model outperforms the Q-learning-based model [36]. The main challenge of joint-action MARL is the exponential growth of the state space with the number of agents [20], which makes it less suitable for the real world than partially cooperative MARL.
In summary, while there is substantial literature on RL-based traffic control strategies, previous studies focus mainly on dealing with the problems caused by the scale of the problem [19,20] or on coordination and information sharing between agents [19,21]. Little attention is paid to the characteristics of urban networks when designing RL-based traffic control strategies. Since the macroscopic fundamental diagram (MFD) has been proven a powerful concept for understanding network-level traffic dynamics, there is a key need to integrate the MFD into RL-based traffic control strategies.

Experimental Design
In this section, the case study is presented. AnyLogic serves as the simulation platform; agent logic is defined in Java, and each agent controls one intersection. Details of the proposed MARL strategy, including the algorithm, state definition, action definition, reward definition, action selection strategy, and testbed network, are discussed below.

Q-Learning Setting.
Among various RL algorithms, Q-learning has been the most widely used for learning the optimal action-value function in a model-free way [13,37]. Equation (2) is used for updating the Q values:
Q_(τ+1)(s_τ, a_τ) = Q_τ(s_τ, a_τ) + α[R_(τ+1) + γ max_a Q_τ(s_(τ+1), a) − Q_τ(s_τ, a_τ)], (2)
where Q_τ(s_τ, a_τ) is the Q value of the agent at time interval τ faced with state s_τ choosing action a_τ, and α ∈ (0, 1] is the learning rate, which defines how much the Q values are updated at each time interval τ. In the proposed MARL strategy, the learning rate α for all agents is set to 0.08, and the discount rate γ is set to 0.8 [18].
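The update in equation (2) can be sketched as a minimal tabular routine; the state and action encodings below are placeholders for illustration, not the paper's implementation:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.08, 0.8  # learning and discount rates used in the paper

Q = defaultdict(float)  # Q[(state, action)] -> value, zero-initialised

def q_update(s, a, reward, s_next, actions_next):
    """One Q-learning step: Q(s,a) += alpha*(R + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions_next)
    Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])

# First update from an all-zero table: the increment is alpha * reward.
q_update("s0", 1, reward=0.5, s_next="s1", actions_next=[0, 1])
print(round(Q[("s0", 1)], 3))  # 0.08 * 0.5 = 0.04
```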

State Definition.
The state definition determines how an agent senses the environment around it. The state is defined by a vector of 2 + M components, where M is the number of approaches.
The first two components are the index and the elapsed time of the current green phase, respectively. The last M components are traffic parameters. Previous studies have used different traffic parameters to define the state of the environment, including queue length [19,38,39], delay [40], waiting time, or a combination of two or more parameters. In 2014, El-Tantawy et al. compared three state definitions and found that using the arrivals of vehicles to the current green phase and the queue lengths at the red phases performs better than models using queue length or cumulative delay alone [41]. Thus, the M components are defined by equation (3), where q_la^τ is the number of queued vehicles of approach app in lane la at time interval τ, and A_la^τ is the number of arrived vehicles of approach app in lane la at time interval τ.
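Under these definitions, assembling the 2 + M state vector might look like the sketch below; the approach names and the rule of reporting arrivals for green approaches and queue lengths for red approaches are assumptions drawn from the description above, not the paper's code:

```python
def state_vector(green_phase, elapsed, approaches, green_approaches):
    """Assemble the 2+M state vector: green phase index, elapsed green time,
    then one traffic component per approach: arrivals if the approach is
    served by the current green phase, otherwise its queue length.
    `approaches` maps name -> (queued_vehicles, arrived_vehicles)."""
    v = [green_phase, elapsed]
    for name, (queued, arrived) in approaches.items():
        v.append(arrived if name in green_approaches else queued)
    return tuple(v)  # hashable, so it can serve as a Q-table key

s = state_vector(2, 7,
                 {"N": (4, 1), "S": (2, 3), "E": (0, 2), "W": (5, 0)},
                 green_approaches={"N", "S"})
print(s)  # (2, 7, 1, 3, 0, 5)
```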

Action Definition.
In this study, the phasing sequence is variable, and the action is defined as a number representing the sequence number of the phase. The decision time Δ_τ depends on the previous and current phases. If the previous and current phases are the same, the decision time Δ_τ equals 1 second. If they are different, the current phase should be kept for several seconds to ensure the safety of vehicles and pedestrians, as shown in equation (4), where a_τ is the indication of the phase at the current time interval τ, Δ_τ is the decision time, G_min(a_τ) is the minimum green time, Y(a_τ) is the yellow time, and R(a_τ) is the all-red time. It is assumed that the maximum phase elapsed time is less than 30 seconds.
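This rule can be sketched as follows; since equation (4) is not reproduced above, the exact composition of the hold time on a phase change (minimum green plus yellow plus all-red) is an assumption consistent with the symbols listed:

```python
def decision_time(prev_phase, cur_phase, g_min, yellow=3, all_red=2):
    """Decision interval: 1 s when the phase is unchanged; otherwise hold the
    new phase for its minimum green plus clearance (yellow + all-red),
    an assumed combination of the terms G_min, Y, and R."""
    if cur_phase == prev_phase:
        return 1
    return g_min + yellow + all_red

print(decision_time(2, 2, g_min=10))  # 1
print(decision_time(2, 3, g_min=10))  # 15
```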

Reward Definition.
The reward is defined as the unweighted sum of two rewards: one for the LOS of the intersection and one for the weighted average density within the MFD region. Each reward ranges from −1 to 1, as equation (5) shows.
Here R(s_τ, a_τ, s_(τ+1)) is the total reward, R_1(s_τ, a_τ, s_(τ+1)) is the reward according to the LOS of the intersection, and R_2(s_τ, a_τ, s_(τ+1)) is the reward according to the property of the MFD region. LOS_τ is a number from 1 to 6, with 6 representing LOS A and 1 representing LOS F. The calculation of LOS is based on Table 1. R_1(s_τ, a_τ, s_(τ+1)) is positive when the intersection operates under primarily free-flow, reasonably unimpeded, or stable conditions; when the intersection operates under unstable or congested conditions, it receives a negative reward. D_acc^τ is the weighted average density in the MFD region at time interval τ. D_opmin and D_opmax are the weighted average densities in the MFD region when the stationary state begins and ends, and D_max is the maximum weighted average density in the MFD region.
According to the expression for R_2(s_τ, a_τ, s_(τ+1)) in equation (5), when D_acc^τ is lower than D_opmin, R_2 is the ratio of D_acc^τ to D_opmin; the closer D_acc^τ is to D_opmin, the closer R_2 is to 1. When D_acc^τ is between D_opmin and D_opmax, the reward is 1. When D_acc^τ is between D_opmax and D_max, the reward is between −1 and 0; the smaller D_acc^τ is, the smaller the punishment the agent receives. When D_acc^τ exceeds D_max, the reward is set to −1. Thus, when the traffic state of the network is stationary or better, each intersection within the network receives a positive reward; when it is worse than stationary, each intersection receives a negative reward.
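The MFD-based component R_2 described above can be sketched as a piecewise function. The linear penalty between D_opmax and D_max is an assumption consistent with the description; the thresholds in the example reuse the fitted values reported later (22.7 and 59.5 veh·km⁻¹·lane⁻¹) with an illustrative D_max:

```python
def mfd_reward(d, d_opmin, d_opmax, d_max):
    """R_2 as described in the text: the ratio d/d_opmin below the stationary
    band, 1 inside the band, a penalty growing with density between d_opmax
    and d_max (linear ramp assumed here), and -1 at or beyond d_max."""
    if d < d_opmin:
        return d / d_opmin
    if d <= d_opmax:
        return 1.0
    if d < d_max:
        return -(d - d_opmax) / (d_max - d_opmax)  # assumed linear penalty
    return -1.0

# D_max = 100.0 is illustrative only; the paper derives it from the MFD fit.
print(round(mfd_reward(11.35, 22.7, 59.5, 100.0), 2))  # 0.5
print(mfd_reward(40.0, 22.7, 59.5, 100.0))             # 1.0
print(mfd_reward(120.0, 22.7, 59.5, 100.0))            # -1.0
```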
Note that D_opmin, D_opmax, and D_max are all obtained from the MFD fitting curve, with the weighted average flow and weighted average density as the vertical and horizontal coordinates. According to MFD theory, the fitting curve intersects the horizontal axis at two points: the first is the origin, and the second is taken as the maximum weighted average density D_max.
Four steps are used to calculate D_opmin and D_opmax. First, take the derivative of the MFD fitting curve to obtain the critical weighted average density D_c. Second, calculate F_max from the MFD fitting curve at the weighted average density D_c. Third, determine the values of the weighted average flow at which the stationary state begins and ends; in this paper, the stationary state is assumed to be where the weighted average flow is higher than 80% of F_max. Finally, calculate D_opmin and D_opmax from the MFD fitting curve where the weighted average flow equals 0.8 F_max.
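The four steps can be sketched with a polynomial fit; the cubic coefficients below are illustrative placeholders, not the paper's equation (6):

```python
import numpy as np

# Hypothetical cubic MFD fit y(x) = 0.004x^3 - 0.7x^2 + 28x (flow vs. density).
coeffs = np.array([0.004, -0.7, 28.0, 0.0])  # highest power first

# Steps 1-2: critical density D_c where dy/dx = 0, then F_max = y(D_c).
d_c = min(r.real for r in np.roots(np.polyder(coeffs)) if r.real > 0)
f_max = np.polyval(coeffs, d_c)

# Steps 3-4: the densities where y(x) = 0.8*F_max bound the stationary state.
shifted = coeffs - np.array([0.0, 0.0, 0.0, 0.8 * f_max])
real = sorted(r.real for r in np.roots(shifted) if abs(r.imag) < 1e-9 and r.real > 0)
d_opmin, d_opmax = real[0], real[1]
print(d_opmin < d_c < d_opmax)  # True
```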

Action Selection Strategy.
Since the agent must discover the "best" action by trying, and no model of the environment is available to evaluate actions in the Q-learning algorithm, it is important to balance exploration and exploitation [13]. Exploration means the agent tries a new action, which may turn out to be a new "best" action, while exploitation means the agent takes the current "best" action. Two action selection strategies are used in previous studies: ε-greedy and SoftMax [13]. In the ε-greedy method, the agent mainly chooses the "best" action, except for a fraction ε of the time when it chooses an action uniformly at random.
A disadvantage of the ε-greedy method is that the agent explores all actions equally and ignores the value estimate of each action, which makes it possible for the agent to choose the worst action instead of the next-best action. To avoid this, SoftMax algorithms vary the action probabilities as a graded function of the estimated value, so that actions with higher estimated values have higher probabilities of being chosen. In this study, SoftMax is used as the action selection strategy.
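A minimal SoftMax (Boltzmann) selector might look like the following sketch; the Q values and temperature are illustrative, not the paper's parameters:

```python
import math
import random

def softmax_action(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q/tau);
    a smaller temperature tau concentrates choice on high-valued actions."""
    m = max(q_values)  # shift by the max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    r = random.random() * sum(weights)
    for action, w in enumerate(weights):
        r -= w
        if r <= 0:
            return action
    return len(q_values) - 1

random.seed(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[softmax_action([1.0, 0.5, -1.0], tau=0.5)] += 1
print(counts[0] > counts[1] > counts[2])  # True: better actions are chosen more often
```

Unlike ε-greedy, the worst action here is almost never sampled, while the second-best action still gets a graded share of exploration.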

Testbed Network.
The case study is performed on a macroscopic fundamental diagram (MFD) region of 9 intersections with 12 entrances and 12 exits, shown in Figure 1. Each link in Figure 1 is bidirectional, with one lane in each direction. The distance between two adjacent intersections is assumed to be 500 m.
In the proposed MARL strategy, the phasing sequence is defined in Figure 2. G_min(a_τ) was determined by the time for a pedestrian to cross the intersection; Y(a_τ) and R(a_τ) were set to 3 s and 2 s. An exponential SoftMax distribution was used for action selection (15). The performance of the proposed MARL strategy is compared with three strategies: pretimed, actuated, and perimeter control strategies. The pretimed traffic control strategy uses a cycle length of 60 seconds, with the available green time divided equally and the phasing sequence defined as V2, V3, V6, and V7. For commercial reasons, the operational details of urban traffic control systems (UTCSs) are not available; the actuated traffic control strategy used in this study is the one proposed by Zhang and Wang in 2011 [42]. The MFD-based traffic control strategy (i.e., perimeter control strategy) used for comparison is the one proposed by Xu et al. in 2017 [43]. Since Xu et al. did not consider a control strategy for the internal intersections, the same actuated control strategy proposed by Zhang and Wang was used to control them.
Two 24-hour simulations are presented before the comparison. The first 24-hour simulation is controlled by the pretimed traffic control strategy, with statistical data recorded every 5 min to obtain the MFD curve of the region.
Based on the results of the first 24-hour simulation and the MFD fitting curve, another 24-hour simulation is used to train the proposed MARL strategy. During that simulation, the demand level is set to ensure that the weighted average flow stays below 50% of the maximum weighted average flow. Then, the performance of the proposed MARL strategy is compared with the pretimed, actuated, and perimeter control strategies. Five scenarios are used, named scenarios 1 through 5, each with thirty 2-hour simulations to exclude randomness. Details of each scenario are given below. Note that all traffic demand data change every 5 min.
In the first three scenarios, the first one-hour OD data has the same demand level as the training 24-hour simulation, to ensure the demand level is low. In scenario 1, the second one-hour demand level is set so that the weighted average flow of the MFD region in each 5 min interval is less than 50% of F_max and the weighted average density in each 5 min interval is less than D_opmin. In scenario 2, the second one-hour demand level is set so that the weighted average flow in each 5 min interval is between 50% and 100% of F_max, while all weighted average densities remain below D_opmax. In other words, both scenarios 1 and 2 can be considered noncongested scenarios. In scenario 3, the second one-hour demand level is set so that the weighted average density of the MFD region in each 5 min interval is higher than D_opmax. In scenario 4, the first one-hour demand level is set so that the weighted average density in each 5 min interval is higher than D_opmax, while the second one-hour demand level is the same as in the training 24-hour simulation; this serves as a case of congestion dissipation. In the last scenario, the demand level for the full 2 hours is set so that the weighted average density of the MFD region in each 5 min interval is higher than D_opmax, which can be considered a congested scenario. Traffic demand was changed every 5 min for each entrance. In other words, 3,456 (288 × 12 entrances) pieces of data for the MFD fitting curve, 3,456 pieces for training the MARL, and 43,200 (24 × 12 entrances × 5 scenarios × 30 times) pieces for the comparisons were randomly generated.

Construction of MFD in the Testbed Network.
The least-squares method was used to fit the MFD curve in Figure 3, with the regression analysis performed in SPSS 19.0. The expression of the MFD fitting curve is given in equation (6). The correlation coefficient is 0.92, which shows that equation (6) achieves a high degree of fit; it is therefore used for the follow-up calculations.
Here x represents the weighted average density and y represents the weighted average flow.
Calculation of D_max, D_opmin, and D_opmax. According to equation (6), when dy/dx = 0, D_c = 41.1 veh·km⁻¹·lane⁻¹. At this weighted average density, F_max is calculated as 414.0 veh·h⁻¹·lane⁻¹. As assumed above, the stationary state's weighted average flow is higher than 80% of F_max, that is, 331.2 veh·h⁻¹·lane⁻¹. Thus, setting y = 331.2 veh·h⁻¹·lane⁻¹ in equation (6), D_opmin and D_opmax are calculated as 22.7 and 59.5 veh·km⁻¹·lane⁻¹, respectively.

Results of Five Scenarios.
Results of the average intersection queue length and average intersection waiting time for the five scenarios are compared, with all values aggregated over 5 min intervals in the MFD region. Box plots of the average intersection queue length and average intersection waiting time under the proposed MARL, pretimed, actuated, and perimeter control strategies are compared in Figure 4. Note that 2-hour data were collected thirty times in each scenario; thus, the sample size for each box plot is 3,600.
We can see in both Figures 4(a) and 4(b) that the average intersection queue length and intersection waiting time under the proposed MARL are lower than those under the other strategies. Across all scenarios, the average intersection queue length under the proposed MARL is 6.9 veh. It is 22.5% lower than that under the pretimed strategy (i.e., 8.9 veh) and 11.5% lower than that under the actuated or perimeter control strategies (i.e., 7.8 veh). In terms of the average intersection waiting time, the proposed MARL also outperforms the other three strategies.
The average value across the five scenarios is 30.9 s, which is 18.9% lower than that under the pretimed strategy (i.e., 38.1 s). It is also 11.2% and 12.0% lower than that under the actuated strategy and perimeter control strategy (i.e., 34.8 s and 35.1 s), respectively.
When comparing the interquartile ranges, both the average intersection queue length and intersection waiting time under the proposed MARL (i.e., 10.7 veh and 33.9 s) are smaller than those under the pretimed strategy (i.e., 13.6 veh and 37.3 s). This shows that across all five scenarios, the average intersection queue length and intersection waiting time under the proposed MARL are less dispersed than under the pretimed strategy. The interquartile range of the average intersection queue length under the proposed MARL is similar to that under the actuated strategy and perimeter control strategy (i.e., 10.8 veh and 10.7 veh). Moreover, the interquartile range of the average intersection waiting time under the proposed MARL is slightly larger than that under the actuated strategy and perimeter control strategy (i.e., 34.2 s and 33.9 s). The differences in performance between the proposed MARL and the other strategies were analyzed using the Friedman test.
The Friedman test is a nonparametric statistical test of the null hypothesis that multiple paired groups come from the same distribution, at a chosen level of significance [44]. The significance level is set to 0.05. The p values of all six cases are 0, smaller than 0.05, which means that the average intersection queue length and average intersection waiting time under the proposed MARL are statistically different from those under the other strategies.
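For reference, the Friedman statistic can be computed as follows (a sketch without tie handling; the four series are illustrative values, not the paper's measurements):

```python
def friedman_statistic(*groups):
    """Friedman chi-square statistic for k paired groups of n observations:
    chi2 = 12/(n*k*(k+1)) * sum(R_j^2) - 3*n*(k+1), with R_j the rank sum of
    group j across the n blocks. Ties are not handled in this sketch."""
    k, n = len(groups), len(groups[0])
    rank_sums = [0.0] * k
    for i in range(n):
        order = sorted(range(k), key=lambda j: groups[j][i])  # rank within block i
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

chi2 = friedman_statistic(
    [6.1, 6.8, 7.2, 6.5, 7.0, 6.3],    # MARL (illustrative values)
    [8.7, 9.1, 8.8, 9.4, 8.6, 9.0],    # pretimed
    [7.6, 8.0, 7.9, 7.7, 8.1, 7.5],    # actuated
    [7.7, 7.9, 8.2, 7.6, 8.0, 7.8])    # perimeter
print(round(chi2, 1))  # 16.2, above the chi-square critical value 7.81 (3 df, 0.05)
```

Comparing the statistic against the chi-square distribution with k − 1 degrees of freedom yields the p value; values above the critical threshold reject the null hypothesis, as the paper reports for all cases.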
To learn more about the five scenarios, box plots of the traffic parameters under the different strategies at different demand levels are also compared in Figure 5. The low demand level represents the situation where the weighted average flow within the testbed network is lower than 50% of the maximum weighted average flow. Among the five scenarios, the low demand level includes the full 2-hour simulation of scenario 1, the first-hour simulations of scenarios 2 and 3, and the second-hour simulation of scenario 4.
Thus, the sample size of the low demand level is 12 × 5 (hours) × 30 (times) = 1,800. The medium demand level is the situation where the weighted average flow in each 5 min interval is between 50% and 100% of the maximum weighted average flow. Only the second-hour simulation of scenario 2 belongs to that demand level, and the sample size is 12 × 1 (hour) × 30 (times) = 360. The high demand level means that the weighted average density of the MFD region in each 5 min interval is higher than D_opmax; the sample size of that situation is 12 × 4 (hours) × 30 (times) = 1,440. Note that, in Figure 5, the sequence of box plots at each demand level is the proposed MARL, pretimed strategy, actuated strategy, and perimeter control strategy. The Friedman test is also conducted for the different demand levels, and the results are shown in Table 2. The significance level in Table 2 is again 0.05. The p values of all 18 cases are lower than 0.05, which indicates that the differences in performance between the proposed MARL and the other strategies are statistically significant regardless of the demand level.
As shown in Figure 5, at all demand levels the proposed MARL outperforms the other three strategies in terms of the average values of both intersection queue length and waiting time. It can also be seen that, at the low, medium, and high demand levels, the performances of the actuated strategy and the perimeter control strategy are similar.
At the low demand level, outliers can be found in both Figures 5(a) and 5(b). These outliers are data from the second-hour simulation of scenario 4. The numbers of outliers in the box plots of average intersection queue length and waiting time under the proposed MARL are 17 and 14, respectively. Although these counts are not the smallest among the four strategies, the average value of the outliers is the smallest. This indicates that, during the process of congestion dissipation, the proposed MARL dissipates the congestion faster than the other control strategies.
In terms of the interquartile ranges, both the average intersection queue length and the average intersection waiting time under the proposed MARL are smaller than those under the other three strategies at all demand levels, especially at the medium demand level. One potential reason is that the proposed MARL model adapts best to that demand level. Another potential reason lies in the small sample size at that level.
Previous studies have pointed out that the performance of the perimeter control strategy is better than that of the actuated strategy [43], a no-control strategy [45][46][47], or coordination strategies such as MAXBAND [48]. However, the box plots, whether for all five scenarios or for the separate demand levels, do not support this. A potential reason lies in the small network size. Note that this may also influence the performance comparison between the proposed MARL and the perimeter strategy. Another potential reason is that previous studies focus on perimeter strategies among several MFD regions [45][46][47], while this study focuses on a single MFD region.

MFD Controlled by the Proposed MARL.
Data from the comparison simulations controlled by the proposed MARL were used to check the MFD and its fitting curve within the testbed network, and the results are shown in Figure 6. The expression of the MFD fitting curve is given in equation (7). The correlation coefficient is 0.98.
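Since equation (7) is not reproduced here, the following sketch shows only the general procedure: fitting a low-order polynomial MFD (weighted average flow versus weighted average density) and computing a correlation coefficient between observed and fitted flow. The cubic form and all numbers are assumptions for illustration, not the paper's fitted curve or data.

```python
# Illustrative sketch of fitting an MFD curve with a third-order
# polynomial, a common functional form in the MFD literature.
# All data below are synthetic; this is not equation (7).
import numpy as np

rng = np.random.default_rng(1)
density = np.linspace(0, 80, 200)                    # veh/km (synthetic)
true_flow = -0.004 * density**3 + 0.3 * density**2 + 5 * density
flow = true_flow + rng.normal(0, 50, density.size)   # noisy observations

coeffs = np.polyfit(density, flow, deg=3)            # fit a cubic MFD
fitted = np.polyval(coeffs, density)

# Correlation coefficient between observed and fitted flow,
# analogous to the 0.98 reported for the testbed network.
r = np.corrcoef(flow, fitted)[0, 1]
print(f"r = {r:.3f}")
```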
It seems that the MFD in the testbed network controlled by the proposed MARL is different from that controlled by the pretimed strategy, especially the MFD scatter plot.
This is in accordance with the idea that the MFD curve can be affected by different traffic control strategies [49][50][51]. However, how the proposed MARL affects the shape of the MFD needs further study, since the data samples at the high, medium, and low demand levels differ considerably in size. Moreover, the proposed MARL was trained on low-demand data, which means it was still operating by trial and error when controlling medium and high traffic demand.

Concluding Comments
This study proposes a multiagent reinforcement learning (MARL) traffic control strategy for a macroscopic fundamental diagram (MFD) region. In the proposed MARL strategy, a reinforcement learning (RL) agent adjusts the traffic signal at each intersection in the MFD region. The reward is defined as the unweighted sum of a reward for the level of service (LOS) of the intersection and a reward for the weighted average density within the MFD region. The proposed MARL strategy was trained and evaluated in a simulated 3 × 3 grid urban network in AnyLogic. A 24-hour simulation with low traffic demand was used to train the proposed MARL strategy, and another 24-hour simulation with random traffic demand, controlled by the pretimed traffic control strategy, was used to obtain the MFD curve. Then, five scenarios with different traffic demand levels were run under four traffic control strategies: the proposed MARL, pretimed, actuated, and perimeter control strategies. In each scenario, thirty 2-hour simulations were conducted to account for randomness. The evaluation results showed that the proposed MARL strategy adapts well to variable traffic demand levels. At different demand levels, the proposed MARL strategy outperforms the other three strategies to different extents. When the weighted average flow in each 5-min interval is less than 50% of the MFD's maximum weighted average flow, all four control strategies show satisfactory results. When the weighted average flow in each 5-min interval is over 50% of the MFD's maximum weighted average flow, the proposed MARL strategy shows a greater advantage over the other three strategies. In a congestion dissipation scenario, the proposed MARL appears to dissipate the congestion faster than the other three control strategies. The Friedman test is used to analyze the differences in performance between the proposed MARL and the other strategies.
The results indicate that the differences are statistically significant regardless of the demand level. Moreover, the MFD curve changed when the proposed MARL was applied to the testbed network.
This study is a first step toward considering the traffic flow theory of the network when applying RL to traffic control strategies. At the same time, it provides insights into the performance of RL-based and perimeter control strategies. Future work can include the following: (i) comparing the RL-based traffic control strategy with other MFD-based traffic control strategies; (ii) adding the LOS of non-motorized vehicles and/or pedestrians to the reward; (iii) applying the proposed MARL to a larger real-world network; (iv) using deep reinforcement learning methods to improve efficiency; (v) investigating how the proposed MARL affects the MFD curve.
Data Availability
The data are part of the first author's (Lingyu Zheng) doctoral dissertation and are available on request by e-mail: lyzheng@shmtu.edu.cn.