Autonomous Bus Fleet Control Using Multiagent Reinforcement Learning

Autonomous buses are becoming increasingly popular and have been widely developed in many countries. However, autonomous buses must learn to navigate the city efficiently to be integrated into public transport systems. Efficient operation of these buses can be achieved by intelligent agents through reinforcement learning. In this study, we investigate the autonomous bus fleet control problem, which appears noisy to the agents owing to random arrivals and incomplete observation of the environment. We propose a multi-agent reinforcement learning method combined with an advanced policy gradient algorithm for this large-scale dynamic optimization problem. An agent-based simulation platform was developed to model the dynamic system of a fixed stop/station loop route, autonomous bus fleet, and passengers. This platform was also applied to assess the performance of the proposed algorithm. The experimental results indicate that the developed algorithm outperforms other reinforcement learning methods in the multi-agent domain. The simulation results also reveal that the proposed algorithm outperforms the existing scheduled bus system in terms of the bus fleet size and passenger wait times for bus routes with comparatively few passengers.


Introduction
Autonomous vehicles (AVs) are bringing about a radical transformation in the public transportation sector. The fleet control problem associated with AVs has led to new challenges and topics of research. Fagnant and Kockelman [1] defined the current opportunities, barriers, and policy recommendations for AVs. Winter et al. [2] determined the optimal fleet size for a shuttle service of AVs considering the minimal total cost as an objective. Boesch et al. [3] explored the relationship between the served demand and the required AV fleet size. Yap et al. [4] studied the preferences of travelers in using AVs as a last-mile feeder system. Zhang et al. [5] analyzed the generalized cost for autonomous buses, which is one of the key elements for analyzing passenger preferences in using autonomous buses. Chen and Kockelman [6] explored the impact of pricing strategies on the market share of shared autonomous electric vehicles.
Scheltes and de Almeida Correia [7] compared the utility of AVs with that of other transportation modes as the last-mile connection. Montes et al. [8] developed an experimental platform for autonomous buses. Zhu and Kornhauser [9] developed strategies for the fleet management of a large-scale autonomous taxi system with the objective of a minimum fleet size. Hyland and Mahmassani [10] presented a taxonomy for AV fleets and developed a model for AV fleet management problems. Optimal assignment strategies for fully autonomous vehicles were then explored with the objectives of the minimum fleet miles and minimum traveler wait time [11]. Shen et al. [12] simulated an integrated AV and public transportation system and proposed repurposing low-demand bus routes using shared AVs as an alternative. Salonen and Haavisto [13] presented passenger experiences, perceptions, and feelings when using a driverless shuttle bus in Finland. Abe [14] provided an overview of the impacts of autonomous buses and taxis by quantifying the costs of travel in Japan.
While the above studies and assessments cover diverse research topics, most of them treat the AV fleet control problem as a centralized architecture. Moreover, none of these previous studies have considered that AVs are naturally decentralized and self-governing intelligent agents, which learn to make decisions in an uncertain environment without external intervention.
Fleet control problems are widely addressed using model-based dynamic programming to optimize a selected objective function with general heuristics by which agents abide.
These mathematical programming models require fixed rules and assumptions to facilitate model convergence on a solution. This is a difficult task in a dynamic environment, as a solution may become outdated when the demand distribution changes. Reinforcement learning is a model-free method in which agents learn the action policy by interacting with a complex environment; this method has been shown to be effective in stochastic and high-dimensional situations. The number of autonomous shuttle trials has increased rapidly over the last few years, and these trials together with research on autonomous bus operation schemes and mobility services [15][16][17] have begun to attract the interest of various cities, universities, and private companies aiming at improving the safety in urban areas, reducing the cost for last-mile transportation, decreasing congestion, and improving the network connectivity for the user [18].
Thus, this study aims to develop an optimal fleet control scheme for autonomous buses using the reinforcement learning method. The challenge in our work is to teach these agents to make optimal decisions for satisfying passenger trip demand using highly dynamic and stochastic unstructured data and incomplete observations of the environment. In this study, we adopt the multi-agent reinforcement learning (MARL) model and a temporal-difference (TD) learning algorithm to provide an effective method for solving optimal fleet control problems. The major contributions of this study are as follows:
(1) An MARL algorithm was developed to solve the autonomous bus fleet control problem, proving to be a promising approach in such systems.
(2) An agent-based simulation platform for autonomous bus fleet control was developed for training and evaluating the proposed algorithm.
(3) Experimental results revealed that the proposed algorithm outperformed other MARL methods and could decrease the fleet size and wait times for low-frequency routes in comparison with existing scheduled bus systems.
The remainder of this paper is organized as follows. In Section 2, related studies on reinforcement learning and multi-agent issues are reviewed. In Section 3, a new MARL algorithm is formulated and applied to an autonomous bus fleet control problem. In Section 4, the major components and a schematic of the autonomous bus fleet control simulator are depicted. In Section 5, an approach to the problem based on the MADDPG algorithm is described, and the test scheme is described in detail, including the operation environment and simulation process for evaluating the performance of the proposed algorithm. In Section 6, the performance of the proposed algorithm is compared with that of the deep Q-network algorithm and the existing scheduled bus system. Finally, concluding remarks are presented in Section 7.

Related Works
2.1. Reinforcement Learning. Reinforcement learning (RL) has garnered increasing attention recently. It has a natural application to the case of autonomous agents, which receive sensations as inputs and take actions that affect their environments to achieve their own goals [19]. One of the most important breakthroughs in RL was the development of Q-learning (Watkins, 1989). Q-learning produces a Q-table, which is used by an agent to determine the best action to take based on a given state. However, Q-tables may be ineffective in large state-space environments. The Google DeepMind team used a neural network as an alternative to the Q-table, resulting in the deep Q-network (DQN) method [20], which has proven to be successful in learning human-level performance for Atari games. To use Q-learning for a control problem, the values of state-action pairs must first be learned, and these action values are then used directly to implement the policy and select greedy actions. All methods of this form can be referred to as value-based methods. Policy gradient (PG) is another reinforcement learning approach [21].
This method approximates a stochastic policy. The policy is represented by a neural network whose input is a representation of the state and whose output is the action selection probabilities. The weights of this neural network are the policy parameters, thus producing a policy-based method. The merging of these two algorithmic families (Q-learning and PG) results in the actor-critic (AC) method, which has a separate memory structure to represent the policy explicitly, independent of the value function. The policy structure is known as the actor because it is used to select actions, whereas the estimated value function is referred to as the critic because it criticizes the actions made by the actor [19].
Silver et al. [22] introduced an off-policy AC algorithm that learns a deterministic target policy from an exploratory behavior policy with continuous actions, referred to as the deterministic policy gradient (DPG) algorithm. The deep deterministic policy gradient (DDPG) algorithm is a variant of the DPG in which the policy, $\mu$, and critic, $Q^{\mu}$, are approximated with deep neural networks [23]. Schulman et al. [24] proposed and analyzed trust region methods for optimizing stochastic control policies, referred to as trust region policy optimization (TRPO). In TRPO, an objective function (the "surrogate" objective) is maximized subject to a constraint on the size of the policy update. The problem can be solved approximately and efficiently using the conjugate gradient algorithm after making a linear approximation of the objective function and a quadratic approximation of the constraint.
This algorithm is effective in optimizing large nonlinear policies, such as neural networks. However, TRPO is relatively complicated and is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks).
Schulman et al. [25] introduced an algorithm that can realize the data efficiency and reliable performance of TRPO using only first-order optimization with clipped probability ratios.
This new method, proximal policy optimization (PPO), demonstrates superior performance in comparison with that of other online PG methods and has become the default RL algorithm at OpenAI. An overview of the RL methods described above is illustrated in Figure 1.
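The clipped-probability-ratio idea behind PPO can be illustrated with a minimal sketch. The function name `clipped_surrogate`, the default clip value `eps=0.2`, and the use of log-probabilities as inputs are illustrative assumptions and are not taken from this study.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the previous policy; advantages: advantage estimates.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # probability ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # keep ratio near one
        # Taking the minimum removes any gain from pushing the ratio
        # outside the interval [1 - eps, 1 + eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, the ratio is one and the objective reduces to the mean advantage; large ratios contribute no more than the clipped value allows.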

Multiagent Reinforcement Learning.
In MARL methods, autonomous agents learn to solve dynamic tasks online using algorithms that originate in TD RL [26]. Challenges in applying MARL arise from the curse of dimensionality in the number of agents, which incurs high computational costs. However, MARL approaches are becoming increasingly popular. For example, Foerster et al. [27] demonstrated multiple agents sensing and acting in complex environments having partial observability with the goal of maximizing their shared utility. Leibo et al. [28] analyzed the dynamics of policies learned by multiple self-interested independent learning agents using the DQN method, and the results indicated the effect of the sequential nature of real-world social dilemmas on cooperation. MARL has also been applied to transportation systems. For example, Sukhbaatar and Fergus [29] introduced CommNet, a backpropagation method for MARL that can learn continuous communication between a dynamically changing set of agents; this method was applied to a four-way traffic junction. Nguyen et al. [30] explored an AC-RL method with a particular decomposition of the approximate action-value function applied to a real-world taxi fleet optimization problem. Lin et al. [31] proposed a contextual MARL framework (DQN) to achieve explicit coordination among a large number of agents for online ride-sharing platforms. However, the above studies apply traditional RL methods, such as DQN or AC, which are poorly suited to multi-agent environments because the best policy of each agent is affected by changes in the policies of other agents, resulting in a nonstationary environment.
The OpenAI team proposed a counterfactual PG method by expanding DDPG for multi-agent domains in the MADDPG method, which uses the framework of centralized training with decentralized execution. Using this method, agent populations can discover complex physical and communicative coordination strategies that can coordinate agents in mixed cooperative-competitive environments [32].
The MADDPG method introduces a training regimen utilizing an ensemble of policies for each agent, resulting in more robust multi-agent policies. MADDPG has been used in many applications. For example, Wang et al. [33] proposed a data-driven multi-agent power grid control scheme using MADDPG for a large-scale energy system with more control options and operating conditions. Zhu et al. [34] applied MADDPG to solve the flocking control problem of multi-robot systems in complex environments with dynamic obstacles. Lei et al. [35] introduced edge computing between terminals and the cloud using MADDPG to address the drawbacks of the traditional power cloud paradigm. In practice, MADDPG operates efficiently on a variety of cooperative and competitive multi-agent environments. However, MADDPG is a deterministic policy gradient method intended for continuous actions only. Therefore, we replace the DPG with a state-of-the-art stochastic policy gradient method, PPO, to accommodate the discrete actions of autonomous bus operation.

Methodology
To formulate the autonomous bus fleet control problem (using the notation), we consider a standard RL setting in which an agent (i.e., an autonomous bus) interacts with an environment over a number of discrete time steps. At each time step, t, the agent observes a state, $s_t$, and selects an action, $a_t$. The goal of the agent is to maximize the expected reward, r, from each state, $s_t$, for policy π. This is also called a value-based RL method. One example of such an algorithm is Q-learning, which is defined in equation (1). When the action-value function is represented using a neural network with parameters θ, the DQN algorithm is obtained. DQN learns the action-value function, $Q^{*}$, corresponding to the optimal policy by minimizing the loss in equation (2). DQN sets a target function, $\bar{Q}$, whose parameters are periodically updated with the most recent θ. The learning process can be stabilized by using an experience replay buffer, D, containing tuples $(s, a, s', a')$.
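For reference, equations (1) and (2) are sketched below in the forms commonly used in the literature. The discount factor $\gamma$, the target value $y$, and the arrangement of the expectations are assumptions drawn from the standard formulation; the notation of this study may differ in detail. Here $\bar{Q}^{*}$ denotes the target function.

```latex
\begin{align}
Q^{\pi}(s, a) &= \mathbb{E}_{s'}\!\left[ r(s, a)
    + \gamma\, \mathbb{E}_{a' \sim \pi}\!\left[ Q^{\pi}(s', a') \right] \right] \tag{1} \\
\mathcal{L}(\theta) &= \mathbb{E}_{s, a, r, s'}\!\left[
    \left( Q^{*}(s, a \mid \theta) - y \right)^{2} \right],
    \qquad y = r + \gamma \max_{a'} \bar{Q}^{*}(s', a') \tag{2}
\end{align}
```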
In contrast to value-based methods, policy-based methods directly adjust the parameters θ of policy π to maximize the objective, $J(\theta)$, by taking steps in the direction of the gradient, $\nabla_{\theta} J(\theta)$. This results in the PG algorithm, which can be defined as follows:
If the PG acts as an actor while an approximation of the true action-value function, $Q^{\pi}(s, a)$, is learned through TD or Monte Carlo predictions, this $Q^{\pi}(s, a)$ is called the critic, which leads to a variety of AC algorithms.
It is possible to extend the PG framework to deterministic policies, $\mu_{\theta}: S \mapsto A$, with continuous actions, resulting in the DPG algorithm. Under certain conditions, we can formulate the objective function as follows:
When the policy, $\mu_{\theta}$, and critic, $Q^{\mu}$, are approximated with deep neural networks, the DDPG algorithm is obtained. The above algorithm families generally act on a single agent. For an environment with N agents and the set of deterministic policies $\mu = \{\mu_{\theta_1}, \ldots, \mu_{\theta_N}\}$, we can formulate the gradient of the expected return for agent i as
where $s = (o_1, \ldots, o_N)$ consists of the observations of all agents; this yields the MADDPG algorithm. The framework of the MADDPG method is represented by centralized training with decentralized execution, enabling the policies to use additional information to ease the training. The centralized action-value function, $Q_i^{\mu}$, can be updated as
To accommodate the discrete actions of autonomous bus operation, we replace the DPG with a state-of-the-art stochastic policy gradient (SPG) method, PPO. PPO is based on trust region methods (TRPO), in which the "surrogate" objective is maximized subject to a constraint on the size of the policy update. The theory justifying TRPO suggests using a penalty instead of a constraint, as follows:
Because it is difficult to select a single value of β, an additional modification is proposed as follows:
where $r_t(\theta)$ denotes the probability ratio, $r_t(\theta) = \pi_{\theta}(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$; $A_t$ is an estimator of the advantage function [36] at time step t; and ε is a clip parameter that limits how far $r_t(\theta)$ can move away from one, thereby avoiding excessively large policy updates. The modified MADDPG algorithm for autonomous bus fleet control is presented in Algorithm 1.
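The expressions referred to in the passage above can be sketched in their standard forms from the cited literature, written with the symbols defined in the text. The sampling distributions, the discount factor $\gamma$, the target policies $\mu'$, and the KL penalty form are assumptions taken from the usual formulations and may differ in detail from the notation of this study.

```latex
\begin{align*}
% Policy gradient (PG)
\nabla_{\theta} J(\theta) &= \mathbb{E}_{s,\, a \sim \pi_{\theta}}\!\left[
    \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a) \right] \\
% Deterministic policy gradient (DPG)
\nabla_{\theta} J(\theta) &= \mathbb{E}_{s \sim D}\!\left[
    \nabla_{\theta}\, \mu_{\theta}(s)\,
    \nabla_{a} Q^{\mu}(s, a) \big|_{a = \mu_{\theta}(s)} \right] \\
% MADDPG gradient for agent i
\nabla_{\theta_i} J(\mu_i) &= \mathbb{E}_{s,\, a \sim D}\!\left[
    \nabla_{\theta_i}\, \mu_i(o_i)\,
    \nabla_{a_i} Q_i^{\mu}(s, a_1, \ldots, a_N) \big|_{a_i = \mu_i(o_i)} \right] \\
% Centralized critic update
\mathcal{L}(\theta_i) &= \mathbb{E}_{s, a, r, s'}\!\left[
    \left( Q_i^{\mu}(s, a_1, \ldots, a_N) - y \right)^{2} \right],
    \quad y = r_i + \gamma\, Q_i^{\mu'}(s', a'_1, \ldots, a'_N)
    \big|_{a'_j = \mu'_j(o_j)} \\
% TRPO with a penalty instead of a constraint
\max_{\theta}\; & \mathbb{E}_{t}\!\left[ r_t(\theta)\, A_t
    - \beta\, \mathrm{KL}\!\left[ \pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\,
    \pi_{\theta}(\cdot \mid s_t) \right] \right] \\
% PPO clipped surrogate objective
L^{\mathrm{CLIP}}(\theta) &= \mathbb{E}_{t}\!\left[ \min\!\left(
    r_t(\theta)\, A_t,\;
    \operatorname{clip}\!\left( r_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon \right) A_t
    \right) \right]
\end{align*}
```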

Simulator for Autonomous Bus Fleet Control
Unlike mathematical programming problems in which the data are stationary relative to the algorithm and can be evaluated by paradigms, RL applies naturally to the case of autonomous agents, which introduce complex difficulties in training and evaluation. One solution in traffic studies is to create a simulator that is representative of the real world to define a specific problem domain. In this study, we generated large amounts of simulated experience to accelerate the learning process beyond what would be possible using actual experience. The simulator was coded in Python 3.6.1 in the PyCharm IDE. For the autonomous bus fleet control system, the simulated events were considered as discrete variables (e.g., agents make decisions); therefore, we used SimPy, which is released under the MIT License, as the simulation framework.
SimPy is a discrete-event simulation library. The behavior of the active components (such as vehicles, passengers, or messages) was modeled using processes. All processes live in an environment, which follows the structure of an OpenAI Gym environment, enabling the application of RL algorithms. NumPy, a fundamental package for scientific computing, was used for numerical operations, and Matplotlib was applied to plot the training and validation results.
In this study, we built a simulator as an environment for autonomous bus fleet control where the autonomous bus is the only "intelligent" agent. The major components and a schematic of the environment are depicted in Figure 2 and discussed as follows:

Transformation of the Problem
By interacting with the environment, each agent i ∈ V at each time step t receives the environment's state $s_t$ ∈ STATU, where STATU is the set of possible states, and selects an action $a_t$ ∈ ACTS, where ACTS is the set of actions available in state $s_t$; the goal of the agent is to maximize the expected reward r for policy π. One time step later, in part as a consequence of its action, agent i finds itself in a new state $s_{t+1}$. In the following, we provide the state space, action space, and rewards that are required by the algorithm. Note that $wt_k$ is the wait time of passenger k, and $\gamma$ is the reward discount based on the Bellman error.

Simulation Process.
The analyst first defines a scenario, including the fleet size (L), number of stops/stations (n), spatial-temporal demand rate of passenger requests (λ), length of an episode (τ, initial = 1 h), and maximum number of episodes (L, initial = 300). Based on the scenario data, the simulation creates a dictionary for the action set ACTS and the state set STATU, which serve as references for building the neural network. The simulation is process-driven; at each process step (Δs), the simulation updates the position and state of the agents and passengers in STATU, and the agent takes an action accordingly. The critic then assigns a value to the action according to the reward function, and the actor updates the action policy of the agent. The simulation then checks whether the Timer exceeds one simulation hour; if so, τ = τ + 1 and Timer = 0. This process is repeated until τ = L.
A random generator (class PASSENGER) is employed to generate the passenger origin stop/station ($o_k$), destination stop/station ($d_k$), and created time ($ct_k$) for each passenger k ∈ P. The class BUSES defines the operating behavior for each agent i ∈ V and the legal actions a ∈ ACTS by which agent i abides. Agent i makes decisions to select actions with the aim of minimizing passenger wait times. A decision epoch occurs upon agent arrival or completion of loading. Agent arrival is triggered when an agent reaches a stop/station or on completion of the previous idling event.
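A passenger generator of this kind can be sketched as follows. The sketch assumes Poisson arrivals (exponential inter-arrival times at rate λ) and uniformly chosen origin and destination stops; these distributional choices, the `Passenger` data class, and the function name `generate_passengers` are illustrative assumptions rather than the actual PASSENGER class of the simulator.

```python
import random
from dataclasses import dataclass

@dataclass
class Passenger:
    origin: int        # o_k: index of the origin stop/station
    destination: int   # d_k: index of the destination stop/station
    created: float     # ct_k: creation time in hours

def generate_passengers(rate_per_hour, n_stops, horizon_h=1.0, seed=0):
    """Draw passenger requests for one episode (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed gives a reproducible episode
    passengers = []
    t = rng.expovariate(rate_per_hour)
    while t < horizon_h:
        origin = rng.randrange(n_stops)
        # The destination must differ from the origin.
        destination = rng.choice([s for s in range(n_stops) if s != origin])
        passengers.append(Passenger(origin, destination, t))
        t += rng.expovariate(rate_per_hour)
    return passengers

# Example: 90 requests per hour on a five-stop route, one simulated hour.
requests = generate_passengers(rate_per_hour=90, n_stops=5)
```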
The loading completion event is triggered when the loading process is complete and the agent is ready to take the next action. Agent i cannot pass a stop/station if a passenger wants to disembark at that location and cannot perform a U-turn until all the passenger requirements in the present direction have been serviced. If agent i is moving, it must decide whether it will stop at the next stop/station. If agent i is idling, it must select an intended movement direction. If agent i is approaching the terminal station, it must stop at the next stop/station. Upon agent i making a decision, the simulation updates the reward for agent i. When agent i is called by passenger k, the system returns "False" if agent i is full; otherwise passenger k enters the list of passengers carried by agent i, and the simulation records the pick-up time ($pt_k$). When passenger k reaches the destination, they are removed from the list of passengers carried by agent i. The wait time of passenger k is calculated as $wt_k = pt_k - ct_k$. Here, $ct_k$ is the time at which passenger k appears at the stop/station, which is given by the random generator (class PASSENGER), and $pt_k$ is the time at which passenger k is picked up. Upon completion of a simulation episode τ, the simulator accounts for the total passenger waiting time PW(τ) and the number of passengers served PS(τ). The average wait time AW(τ) of episode τ is then calculated as follows:
$AW(\tau) = PW(\tau) / PS(\tau)$ (9)
Figure 3 depicts a flowchart of the simulation process. The process marked in pink represents scenario data that the analyst plans to test, whereas the processes marked in blue represent the interfaces with the MARL algorithm.
To train each agent's actor, we used a three-layer TensorFlow neural network. The input layer had the same number of units as the observation space, followed by 100 units in the first hidden layer, 50 units in the second hidden layer, and 10 units (the action space) in the output layer. For each agent's critic, we used a two-layer neural network with an input layer matching the observation space, 50 units in the first hidden layer, and one unit in the output layer. The network was trained using the Adam optimizer for faster training. We set each episode equal to one simulation hour, with a maximum of 300 episodes for implementing the training. The actor network's learning rate (α) was set as 1e-4, while the critic's (β) was set as 1e-3. This allowed the critic to learn slightly faster than the actor, as the learning of the actor network relies on the critic network.
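The wait-time bookkeeping described above amounts to a few lines of arithmetic. In the sketch below, the function name `average_wait` and the choice to return 0.0 when no passenger was served are assumptions made for illustration.

```python
def average_wait(created_times, pickup_times):
    """Average wait AW for one episode from creation and pick-up times.

    created_times: ct_k for each served passenger
    pickup_times:  pt_k for the same passengers, in the same order
    """
    waits = [pt - ct for ct, pt in zip(created_times, pickup_times)]  # wt_k
    total_wait = sum(waits)   # PW(tau): total passenger waiting time
    served = len(waits)       # PS(tau): number of passengers served
    # Assumption: report 0.0 when nobody was served, to avoid division by zero.
    return total_wait / served if served else 0.0

# Three passengers waiting 2, 2, and 5 minutes.
aw = average_wait([0.0, 1.0, 2.0], [2.0, 3.0, 7.0])
```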
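The layer sizes stated above can be summarized in a small framework-independent sketch. The observation size `obs_dim=20` used in the example is a hypothetical value, since the size of the observation space is not given here, and fully connected layers with bias terms are assumed.

```python
def network_shapes(obs_dim, n_actions=10):
    """Layer widths of the actor and critic as described in the text."""
    actor = [obs_dim, 100, 50, n_actions]   # input, two hidden layers, action output
    critic = [obs_dim, 50, 1]               # input, one hidden layer, scalar value
    return actor, critic

def dense_param_count(layer_sizes):
    """Number of weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical observation size of 20 units.
actor, critic = network_shapes(obs_dim=20)
actor_params = dense_param_count(actor)
critic_params = dense_param_count(critic)
```

Under these assumptions the critic is much smaller than the actor, which is consistent with giving it the larger learning rate.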

Experiments
Experiments were conducted using the simulator introduced in Section 4. Considering the importance of spatial and temporal resource dynamics, the passenger arrival times and their destinations were generated using random seeds. To test the generalization performance of our proposed algorithm, two comparisons were performed: (1) in the RL method comparison, we compared our proposed policy-based algorithm with the value-based Q-learning algorithm; (2) in the fleet control method comparison, we compared the proposed algorithm with the existing scheduled bus system. Furthermore, the effect of the dispatched fleet size on passenger wait times was also compared for the proposed algorithm and scheduled bus system. The simulations were run on a standard 64-bit desktop computer with 8 GB of RAM and a 2.40 GHz processor. A single simulation experiment (1 h) for scheduled bus replication takes 2-10 min for completion. Owing to the TensorFlow neural network technology and PPO algorithm, MARL only takes 0.5-4 s to complete a training episode. Experiments with larger fleet sizes, stops, and passenger volumes take longer to run.

Results
Algorithm Comparison. The performance of the proposed algorithm was evaluated against DQN in a single-agent environment and against MADQN in multi-agent environments. First, a single agent was used to test the effectiveness of the algorithms and fine-tune the parameters before further building on the simulation. The single-agent environment consisted of five stops/stations with a 15-minute headway. Two travel patterns with passenger request rates of 90 and 180 passenger requests per hour were tested. The performance was measured based on the average passenger wait time obtained on the platform over 300 episodes. The results indicate that both the MADDPG and DQN algorithms can learn the correct behavior in a single-agent environment. Moreover, MADDPG performs better than DQN, as shown in Figure 4. The multi-agent environment consisted of 5 buses and 15 stops/stations with a 5-minute headway. Two travel demand patterns of 810 and 1620 passenger requests per hour were tested. The results indicate that the MADDPG algorithm functions effectively; however, MADQN fails to learn the correct behavior, as shown in Figure 5.

Comparison of Fleet Control Methods.
In this study, we compared the proposed algorithm with a scheduled bus system in different scenarios, namely, a university campus, an industrial zone, and a downtown street. To closely approximate real-world conditions, each scenario was designed based on the bus planning guidelines [37] using the procedure shown in Figure 6. The passenger service volume figures were adjusted to reflect that the autonomous bus has fewer seats (12) than a standard bus (43). In addition, we used the reasonable peak-hour factor of 0.75 suggested by the guidelines instead of the value of 1.0 in the original settings.
In a scheduled conventional bus system, the required fleet size is defined as follows:
where R is the round-trip travel time. Vehicles stop at all of the n stops/stations such that
where $t_o$ is the average travel time between stops/stations. With these formulations, we can obtain the proper route distance and number of stops/stations. The parameter values and passenger service volume per hour for the three scenarios are summarized in Table 1.
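A common way to express the relations described above is to take the round-trip time as the number of stops multiplied by the average travel time between stops, and the fleet size as the round-trip time divided by the headway, rounded up. This is an assumed standard form and not necessarily the exact formulation of this study; the function names and the travel times in the example are hypothetical.

```python
import math

def round_trip_time(n_stops, avg_travel_min):
    """Round-trip time R (minutes), assuming R = n * t_o on a loop route."""
    return n_stops * avg_travel_min

def required_fleet_size(round_trip_min, headway_min):
    """Scheduled fleet size, assuming buses = ceil(R / headway)."""
    return math.ceil(round_trip_min / headway_min)

# Hypothetical travel times chosen to match the stated fleet sizes:
# 5 stops, 3 min between stops, 15-minute headway.
campus_buses = required_fleet_size(round_trip_time(5, 3.0), 15.0)
# 10 stops, 2 min between stops, 10-minute headway.
industrial_buses = required_fleet_size(round_trip_time(10, 2.0), 10.0)
```

With these assumed travel times, the sketch gives one bus for the five-stop route and two buses for the ten-stop route, matching the fleet sizes stated for the first two scenarios.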
Two metrics were evaluated in this study: the average passenger wait time and vehicle-kilometers per hour (VKH).
Average passenger wait time: this metric represents the customer service quality. The simulations calculate the results for the MADDPG algorithm and the scheduled bus model using equation (9).
In the VKH calculation, D is the average distance between stops/stations. The performance of the MADDPG was evaluated based on the learning process over 300 episodes. The results of the scheduled bus model were obtained from a 1 h simulation averaged over 10 replications. The comparison between the performance of the MADDPG and that of the scheduled bus methods for the three scenarios is presented as follows.

Scenario 1.
This scenario was set as a university campus with a short route (3.6 km) and low passenger demand (free flow). The low frequency was assigned as a 15-minute headway with 5 stops/stations and was operated by one bus, similar to the scale of recent autonomous bus pilot projects [18]. The MADDPG method was significantly more efficient than the scheduled bus method in terms of both the passenger wait time and VKH in this scenario. This is because MADDPG has a more flexible dispatching strategy; for example, MADDPG allows a bus to idle at a stop/station when no passenger demand exists, or assigns a bus a U-turn to pick up the nearest passengers. A comparison between the results of the two methods is depicted in Figure 7.

Scenario 2.
This scenario was set as an industrial zone with a medium-range route (4.5 km) and general passenger demand (stable flow; unconstrained). The headway was 10 min with 10 stops/stations, and the route was operated by two buses. The MADDPG method outperformed the scheduled bus method in terms of the passenger wait time but performed worse than the scheduled bus method in terms of the VKH in this scenario. This is because the objective of MADDPG is to minimize the average passenger wait time, which tends to consume more vehicle kilometers in picking up passengers as early as possible. A comparison between the results of the two methods is depicted in Figure 8.
Figure 6: Procedure for design scenarios.

Scenario 3.
This scenario was set as a downtown street with a longer route (5.6 km) and higher passenger demand (stable flow; interference). The high frequency was assigned as a 5-minute headway with 15 stops/stations, and the route was operated by five buses. The performance of the MADDPG method was similar to that of the scheduled bus method in terms of the passenger wait time but was worse than that of the scheduled bus method in terms of the VKH in this scenario. This is because the free-route advantage of MADDPG for reducing passenger waiting time is constrained by the high-frequency operation. A comparison between the results of the two methods is depicted in Figure 9.
The results indicate that the performance of the MADDPG algorithm is strongly related to the service frequency. The MADDPG method outperformed the scheduled bus system in terms of passenger wait times and VKH when the service frequency was comparatively low, as in Scenario 1. However, for higher service frequencies, such as in Scenarios 2 and 3, the VKH generated by the MADDPG method increased significantly to minimize the passenger wait time.

Effect of the Dispatched Fleet Size.
In comparison with the scheduled bus headway, which cannot be adjusted exactly according to the demand, the demand responsiveness of the MADDPG method allows agents to operate when there exists a requested service demand. A comparative analysis of the effect of the dispatched fleet size on the level of service was conducted by comparing the mean bus headway between the MADDPG and scheduled bus methods. In an online station system, the average passenger wait time was half the headway. We used a target service level of a 5 min average passenger wait time as the main policy evaluation criterion for a 10 stop/station loop route with a route distance of 4.5 km. Whenever the average passenger wait time exceeded 5 min, an additional bus joined the service. Various travel request rates were tested, as shown in Figure 10.
As expected, increasing the fleet size resulted in a decrease in passenger wait times. The results indicated that the MADDPG method was significantly better than the scheduled bus system in using a smaller fleet size to serve a greater travel demand. For example, in the MADDPG method, two buses could serve up to 405 passengers per hour; however, the scheduled bus system could serve only 180 passengers per hour. The scheduled bus system required five buses to serve the passenger volume of 2430 passengers during the peak hour, whereas the MADDPG method required only four buses for the same passenger volume.
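For the scheduled system, the dispatch rule described above can be sketched using the half-headway approximation. The sketch assumes evenly spaced buses, so that the headway equals the round-trip time divided by the fleet size, and it ignores vehicle capacity and demand, both of which matter in the reported experiments. The function names, the 20-minute round-trip time, and the `max_buses` limit are hypothetical.

```python
def expected_wait_min(round_trip_min, fleet_size):
    """Average wait under the half-headway approximation (evenly spaced buses)."""
    headway = round_trip_min / fleet_size
    return headway / 2.0

def buses_for_target(round_trip_min, target_wait_min=5.0, max_buses=20):
    """Add buses one at a time until the average wait no longer exceeds the target."""
    for fleet_size in range(1, max_buses + 1):
        if expected_wait_min(round_trip_min, fleet_size) <= target_wait_min:
            return fleet_size
    return None  # target not reachable within max_buses

# Hypothetical 20-minute round trip with the 5-minute target from the text.
needed = buses_for_target(round_trip_min=20.0)
```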

Summary.
Fully autonomous buses promise to increase the competitiveness of public mobility services by eliminating the costs and performance limitations of human drivers and by significantly reducing traffic accidents, and may thereby revolutionize existing public transportation systems. Multi-agent control systems are important owing to the spatial-temporal dynamics of such environments and in settings where centralized information is unavailable, which requires agents to collaborate with one another because no single agent may hold all the data or resources required to achieve the objective.
The study findings demonstrated that the proposed algorithm could successfully solve the multi-agent problem of autonomous bus fleet control.