Intelligent buses in a loop service: Emergence of no-boarding and holding strategies

We study how $N$ intelligent buses serving a loop of $M$ bus stops learn a \emph{no-boarding strategy} and a \emph{holding strategy} by reinforcement learning. The high level no-boarding and holding strategies emerge from the low level actions of \emph{stay} or \emph{leave} when a bus is at a bus stop and everyone who wishes to alight has done so. A reward that encourages the buses to strive towards a staggered phase difference amongst them whilst picking up people allows the reinforcement learning process to converge to an optimal Q-table within a reasonable amount of simulation time. It is remarkable that this emergent behaviour of intelligent buses turns out to minimise the average waiting time of commuters, in various setups where buses have identical natural frequency, or different natural frequencies during busy as well as lull periods. Cooperative actions are also observed, e.g. the buses learn to \emph{unbunch}.


Introduction
In the urban environment, a bus network system is a complex system. It is not possible to study a part of the system to understand its whole. In order to characterise the dynamics within the system, a holistic approach has to be taken. Buses move through space and pick up commuters who arrive at bus stops at random moments. This leads to a spatiotemporal complexity within the bus system which has been shown to behave chaotically [1,2].
Bus systems are notoriously prone to bunching, which leads to unnecessarily long waiting times for commuters. Buses may bunch because of uncertainty in travelling time, and because a leading bus tends to pick up most of the passengers at a bus stop whilst the bus behind it picks up relatively fewer passengers when it arrives.

Literature Review.
Many studies in the literature have focused on the holding strategy [3-21]: if a bus is too fast, it would exercise an extended stoppage duration to correct for the headway from the bus in front of it; otherwise, it would bunch with it. Holding back buses, however, may tend to slow down the system and would require that some slack in the schedule has been allocated beforehand. Some of these investigated how bus systems can improve their scheduling using some form of reinforcement learning [8, 9, 16-18, 20, 21]. For example, Chen et al. [17] implemented a multiagent reinforcement learning framework solely to optimise holding durations for each bus, where they run around a loop corridor of bus stops. This simple setup let all buses move at the same speed in one loop, and overtaking is not allowed. Furthermore, they implemented their reinforcement learning scheme on a relatively small discrete set of states and actions (about 900 states in total, with each bus having 4 actions to choose from corresponding to the duration of holding), which is not readily or directly scalable to larger and more complicated bus route networks without a change in architecture. Nevertheless, Menda et al. [21] introduced a deep reinforcement learning framework, whereby a deep neural network serves as a function approximator to extend the state space from discrete to continuous and illustrated that it works reasonably well in the holding strategy carried out in [17]. Even a recent work by Alesiani and Gkiotsalitis [20] that applied deep reinforcement learning methods to model a bus loop service around Chua Chu Kang-Yew Tee in Singapore only focused on "how long to hold a bus at a bus stop," with the aim of maintaining the headway close to some prescribed headway, but balanced by not excessively adding to total travel time due to holding.
Similar to the work in [17,21], Alesiani and Gkiotsalitis [20] also assumed that all buses are identical and do not overtake, which we have found to be not generally true in real human-driven buses [39].
In contrast to the majority of studies focusing on holding back buses, relatively fewer studies explored the opposite, viz. a no-boarding strategy [30-35]: a slow bus would always allow passengers to alight at a bus stop, but would disallow boarding and leave the bus stop if it is too slow in order to speed it up. Preliminary work by Delgado et al. [30] argued that this is a viable strategy to improve the bus system, mainly by carrying out extensive simulations. The bus system decides on when to implement no-boarding by numerically optimising an objective function subject to a number of constraints. All buses are assumed to have an identical speed, no overtaking is allowed, and they introduced a "dummy bus" as a form of boundary condition such that all remaining passengers must eventually be picked up. Further simulation work in [31] extended the work of Delgado et al. [30] to more complicated setups with more constraints and dropping the use of the "dummy bus," though as before they considered buses to move in a loop. Similar improvements due to the no-boarding strategy are also observed. In contrast to references [30,31], Zhao et al. [32] constructed a theoretical analysis that describes how this strategy is beneficial, but appeared not to consider alighting (passengers only board the bus). More comprehensive theoretical calculations are given by Saw and Chew [34], where commuters first alight at a bus stop before new passengers can board. They also considered simulations on buses that have different speeds and can overtake. Sun and Schmöcker [33] considered the scenarios where buses can overtake and cannot overtake, where they allow only a certain percentage of passengers to board the front bus of a platoon of bunched buses.
Apart from references [30-35], we are not aware of any other papers on a no-boarding strategy, and none of them involved a fully reinforcement learning approach. The relatively fewer papers on no-boarding compared to holding may be due to the perception that a bus which is not full but does not allow passengers to board is not going to go down well with the public. To account for commuters who have urgent needs to board the bus, Saw and Chew [35] considered a no-boarding strategy that allows the commuters to cooperate with or defect from the no-boarding instruction. Of course, there are pros and cons to cooperating or defecting. A defector may occasionally get away if the number of defectors is small, but may be penalised with a hefty fine otherwise. It turns out that such an arrangement is viable as the overall system of buses and commuters would settle down in a dynamical equilibrium whereby there would always be some proportion of cooperators and defectors, and the buses are able to implement no-boarding to prevent bus bunching.
Generally, the no-boarding strategy works well for a bus system with buses moving with the same average speed. This strategy maintains the buses' headways close to being ideally staggered, i.e., buses are reasonably spread out throughout the route. If the bus system is such that different buses move with different average speeds, the no-boarding strategy is also successful during the busy period since there is enough demand to slow down the "faster bus." Surprisingly, this strategy backfires during the lull period, as the slow bus has been sped up to the maximum by picking up nobody whilst there is insufficient demand to slow down the fast bus enough [31,34,40]. Consequently, the system is effectively operating with one less bus (since the slowest bus is almost always disallowing boarding).

Bus Bunching as a Synchronisation Phenomenon.
Bus systems which serve a loop of bus stops are common, where buses start from a terminal and go around a route of bus stops, before returning to the same terminal. Several previous works on studying how bus systems can be optimised also considered a loop service [17,20,21,30-32,39]. In particular, our work here is based on a real university loop shuttle bus service where buses continually serve a loop of 12 bus stops [39] (see Figure 1).
Work in [39] described buses serving a loop of bus stops as coupled oscillators which interact with one another due to picking up passengers from bus stops. Given a generic loop where buses travel around, we can map such a loop onto a unit circle and preserve distances between neighbouring points. Such a mathematical representation allows us to define angles on the unit circle to represent the locations of the buses. The distance between buses is then represented as the phase difference between them on the unit circle. Furthermore, the notion of the speed of a bus on the route becomes equivalent to angular frequency (or just frequency) on the unit circle. The notion of "frequency detuning" [41] denotes the fact that different human-driven buses generally move at their own natural speeds (one driver has a tendency to drive faster, and another prefers to drive a bit slower). Consider N buses serving M bus stops in a loop. These M bus stops are staggered around the loop, each having a people arrival rate of s people per second. The N buses have natural (angular) frequencies $\omega_1 > \omega_2 > \cdots > \omega_N$, respectively (excluding any time stopped at bus stops). When a bus arrives at a bus stop, it first allows passengers who wish to alight to do so and then allows passengers to board. The alighting and boarding rate is l people per second. Overall, the quantity k := s/l is a parameter that describes the level of demand for service. We assume that passengers from a bus stop would like to travel to the bus stop antipodally opposite to it, or the one just before travelling half a loop if M is odd. By an analytical calculation, work described in [39] has shown that there is a critical value $k_c(N)$ (equation (1)) such that all N buses would be completely bunched into a single unit (i.e., completely synchronised) if $k > k_c(N)$. Also, we have the relations $\omega_i = 2\pi f_i = 2\pi/T_i$ between angular frequency, frequency, and period.
This antipodal destination is by no means an unrealistic simplification. One may argue that in a realistic bus system, passengers from one bus stop have some probability distribution to travel to some other bus stops. On average, when a bus arrives at a bus stop, there would be some proportion of passengers who wish to alight. The antipodal destination effectively also leads to some proportion of the passengers on the bus who wish to alight; the only difference is that they originated from the same bus stop which is located antipodally away, instead of from several different bus stops.
This does not affect the overall dynamics of how long a bus generally stops at a bus stop. We have considered the situation where people want to alight at any bus stop (excluding the bus stop of origin) with uniform probability distribution. Our simulations show that the average waiting time is essentially identical to the setup where people want to head to an antipodal bus stop. The average time spent on the bus, on the other hand, would depend on the probability distribution of where people want to go, since if more people want to head to a further bus stop, then they would spend more time on the bus to travel further.
(Note: Saw et al. [39] assumed that boarding and alighting occur simultaneously via different doors. Here, we assume that these occur sequentially through one door, alighting followed by boarding. Thus, we should include an overall factor of 1/2 into equation (1) for this paper since the processes being sequential via one door instead of two different doors would double the time a bus dwells at a bus stop. Generally, if multiple doors are available such that people could board simultaneously, we also find via simulations that the dwell time is proportionately reduced by such a factor (i.e., if there are two doors, then the dwell time is halved). However, there are instances where this is not exactly the case, for example, when there is an odd number of people, then a bus still needs to spend an extra time step to allow for that extra person to board via one of the two doors.)
On top of that, numerical simulations with parameters based on values measured from a real university shuttle bus loop service showed that the buses are not persistently bunched (no phase-locking, completely unsynchronised) if $k < \bar{k}$, where $\bar{k} := k_c(2)$ is the critical value for the corresponding system with N = 2 buses having natural frequencies $\omega_1$ and $\omega_N$. The bus system would have some buses persistently bunched (partial synchronisation) if $\bar{k} < k < k_c(N)$. In the case where all buses have an identical natural frequency, all buses would typically end up bunching into a single unit unless k is sufficiently low such that each bus only spends the minimum amount of time stopping at each bus stop. Therefore, bus bunching is a perennial phenomenon, and it is of great interest to employ strategies such that the buses are able to maintain a regular headway between them, always remaining staggered.
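The unit-circle representation and the synchronisation regimes described in this section can be illustrated with a short sketch. The function names, the convention that buses travel in the direction of increasing angle, and the handling of the boundary cases where k equals a critical value are assumptions made here for illustration; the critical values themselves are taken as inputs rather than computed.

```python
import math


def phase_deg(position: float, loop_length: float) -> float:
    """Map a distance travelled along the loop to a phase angle in [0, 360)."""
    return (position % loop_length) / loop_length * 360.0


def phase_behind_deg(theta_self: float, theta_behind: float) -> float:
    """Phase difference from the bus immediately behind to this bus, in [0, 360).

    Assumes buses move in the direction of increasing angle.
    """
    return (theta_self - theta_behind) % 360.0


def angular_frequency(period_s: float) -> float:
    """omega = 2*pi/T, with T the natural period in seconds."""
    return 2.0 * math.pi / period_s


def bunching_regime(k: float, k_bar: float, k_c_n: float) -> str:
    """Classify the regime from the demand level k = s/l.

    k_bar (= k_c(2)) and k_c_n (= k_c(N)) must be supplied by the caller.
    The text does not specify the boundary cases; here k == k_bar is
    treated as partial and k == k_c_n as complete synchronisation.
    """
    if k < k_bar:
        return "unsynchronised"
    if k < k_c_n:
        return "partially synchronised"
    return "completely synchronised"
```

For example, a bus at phase 10° with the bus behind it at 350° has a backward phase difference of 20°, since the difference is taken modulo 360°.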

The Problem: A Loop Shuttle Bus Service Modelled after Our NTU Campus Loop Shuttle Bus Service.
The purpose of this paper is to explore if there are ways beyond what was analytically studied in [34], such that a no-boarding policy may actually be salutary, especially in the lull period for a bus system with frequency detuning (where no-boarding backfires). Investigating a bus system with frequency detuning is crucial because human-driven buses tend to move with different natural frequencies due to differing driving styles as measured in our Nanyang Technological University (NTU) campus loop shuttle bus service [39]. This is in fact the key result of bus bunching being viewed as a synchronisation phenomenon of coupled oscillators with different natural frequencies, as elaborated in the previous section. The opposite of bunching, viz. to remain staggered, is also a synchronisation of the oscillators such that they remain phase-locked with phase difference between adjacent oscillators being ∼ 2π/N, where N is the number of buses (oscillators) [39,40].
Instead of implementing a human-thought-out or human-defined idea, we let buses figure out an ideal strategy via reinforcement learning [42]: buses are given two actions whenever they are at a bus stop, viz. stay or leave, with no prejudice nor human input, other than a feedback on whether the average waiting time of the commuters is minimised. This feedback may be the actual waiting time of the commuters, or the headway between the buses where a stable staggered headway signifies stable minimal average waiting time. We will consider the latter here and the former in a separate paper elsewhere. A so-called normal bus would stay if there is somebody who wants to board and leave if there is nobody there. A no-boarding strategy would correspond to the bus deciding to leave even if there is somebody who wants to board. Apart from the no-boarding strategy, the framework that is developed here would also be well suited to study the holding strategy: if there is nobody at the bus stop, a normal bus would leave, but a holding strategy would correspond to stay. By reinforcement learning (we shall employ Q-learning in this paper), the buses are initialised with random choices to execute. They would then progressively explore and converge to an optimal strategy, as is implied by the theory of Q-learning based on a Markov decision process [42].
Our consideration here with a single loop of M = 12 bus stops is modelled after a university (NTU) campus shuttle bus service that serves tens of thousands of students, staff, and faculty members [43,44]. This is thus a realistic system which also exists in many bus systems worldwide with loop services, just like those studied in [17,20,21,30-32]. Note that unlike references [17,20,21] which applied reinforcement learning on the holding strategy, our approach does not endow the buses with a notion of holding (or no-boarding); rather, we will show that the holding (as well as no-boarding) strategy emerges from low-level actions of stay or leave.
Thus, this paper will show how buses with no knowledge of the no-boarding and holding strategies would turn out to collectively learn them via reinforcement learning, with only two actions to execute whenever they are at a bus stop. This will be achieved by the buses observing local distances of the bus immediately behind them and striving to keep such distances close to the prescribed idealisation whereby the buses would be staggered throughout the loop. In doing so, the bus system collectively turns out to learn cooperative actions such as unbunching themselves, if they end up being in unfavourable bunched configurations. Such an approach whereby no explicit human instructions are given to the system has the potential of uncovering novel solutions in more complicated situations to be further explored in subsequent work.

Reward for Actions of Stay or Leave Taken at Every Time Step at Bus Stop.
With the goal of minimising the average waiting time of commuters for a bus to arrive at a bus stop (i.e., this excludes any time spent on the bus), each time a bus is at a bus stop (and passengers who want to alight have done so), it executes either stay or leave and then receives the waiting time of the person ahead of the queue to board the bus (or who is supposed to board, but denied boarding if the bus leaves). However, this feedback is problematic: the waiting time of each person has high variance. For instance, the luckiest person who arrives at the bus stop when a bus is there has zero waiting time, whilst the unluckiest person who arrives when a bus has just left would have maximum waiting time, with the other people's waiting times distributed between these two extremes. This high variance is difficult for our low-level setup where a bus takes an action at every time step when it is at a bus stop and receives such a fluctuating feedback. Such a setup is low level in the sense that the decision-making process occurs at the level of microscopic time scale. Nevertheless, a different formulation where a bus only takes one single action when it arrives at a bus stop (viz. to decide to hold, implement no-boarding, or behave like a normal bus), instead of deciding whether to stay or leave at every time step, would be able to converge to a solution. The high-level formulation (making decision at a macroscopic time scale with the specified behaviour of holding, no-boarding, and normal bus) will be reported elsewhere in a separate paper. Alternatively, we note the following property for a bus loop system [34]: if the buses are staggered, then the average waiting time of commuters for a bus to arrive at the bus stop is minimised. The loop can be isometrically (i.e., preserving distance) mapped to a unit circle, which has well-defined phase angles from 0° to 360°.
Thus, we consider the feedback for a bus's actions as the phase difference between itself and the bus immediately behind it, Δθ. Recent work in [34] found that the backward-looking angle turns out to be superior to the forward-looking angle. Whilst the inclusion of both angles (or even more information like number of passengers waiting at the bus stops) may lead to further improved performance, the additional quantities would augment the state space and slow down the optimisation for the reinforcement learning process. Moreover, we find that the performance solely using the angle from behind turns out to be close to the ideal situation where buses are nearly perfectly staggered. This phase difference experiences a more gradual change, thereby eliminating the high variance of measuring individual commuters' waiting times. The goal would be to keep this phase difference close to the staggered value. In other words, buses would be rewarded for being close to the staggered configuration. For example, if there are two buses, then Δθ = 0° gives 0 reward, whilst Δθ = 180° gives 1 point, with values in between scaling linearly. If Δθ > 180°, then the reward linearly decreases to 0 at Δθ = 360°. In fact, striving to keep headway staggered is also applicable to nonloop services as it amounts to keeping buses spaced out from the start of the route to its end. Note, however, this alone would lead to the buses striving to achieve the perfectly staggered configuration without any regard to the commuters, possibly even at the expense of not boarding anybody just to keep Δθ = 180°. Therefore, we add a reward of 1 point for each passenger who is picked up. A weighting hyperparameter can be selected such that the system is in a balanced region between closeness to staggered configuration and picking up passengers, where a bus aims to both pick up passengers and maintain a configuration that is nearly staggered.
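The reward described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact reward: the two-bus values (0 at 0°, 1 at 180°, 0 at 360°, linear in between) come from the text, whereas placing the peak at 360°/N for more than two buses, and applying the weighting hyperparameter `weight` to the phase term only, are assumptions made here.

```python
def staggered_reward(delta_theta_deg: float, n_buses: int = 2) -> float:
    """Triangular reward in [0, 1] for the phase difference to the bus behind.

    Rises linearly from 0 at 0 deg to 1 at the staggered angle, then falls
    linearly back to 0 at 360 deg. The staggered angle 360/n_buses is an
    assumption for n_buses > 2.
    """
    target = 360.0 / n_buses
    d = delta_theta_deg % 360.0
    if d <= target:
        return d / target
    return (360.0 - d) / (360.0 - target)


def total_reward(
    delta_theta_deg: float,
    passengers_boarded: int,
    weight: float = 1.0,  # hypothetical weighting hyperparameter
    n_buses: int = 2,
) -> float:
    """Weighted phase reward plus 1 point per passenger picked up."""
    return weight * staggered_reward(delta_theta_deg, n_buses) + passengers_boarded
```

With two buses, a phase difference of 90° or 270° earns half a point from the phase term, so a bus that drifts from the staggered configuration in either direction is penalised symmetrically.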
Of course, the choice and structure of the reward are arbitrary, as long as it results in the intended minimisation of the average waiting time. Different reward functions can lead to the same eventual result of keeping the angle between buses staggered to minimise the average waiting time. For example, the shape of the reward function that we use in equation (4) is linear, but other shapes can lead to the same result. This will be discussed in more detail in Section 4.
Nevertheless, there are other kinds of rewards that aim to minimise other quantities such as the worst waiting time of a commuter for a bus to arrive at the bus stop or the total travelling time. For example, if the reward function aims to minimise time spent on bus or total travelling time, then the holding strategy in the lull period with frequency detuning would be different since it increases these quantities (see Section 4.2.3). Hence, the eventual outcome does depend on the intended reward functions and what the bus operator strives to optimise.

Situations of Interest.
The setup for the bus system undergoing Q-learning is as follows. Each bus has its own Q-table containing 72 states, which represent the phase difference as measured from the bus immediately behind it. This number of states is arbitrary, chosen to balance between not being too coarse and not taking too long for the simulations to run. Moreover, for subsequent future applications on real-time nonstationary environments, it would be desirable for these buses to respond fast enough to adapt appropriately. Independent Q-tables allow different buses to possibly learn different strategies, where one bus may occasionally perform a "sacrificial action" for the system as a whole to benefit. The 72 states would coarse-grain the phase difference into bins of 5°. In each of these states, the Q-table records the two Q-values representing the expected total rewards for the two actions stay or leave, respectively. A bus typically moves along the road, where it must keep moving forward. When it reaches a bus stop, it must allow passengers who wish to alight to do so, i.e., we do not allow the possibility of stop-skipping. This is because we find it to be not beneficial to speed up the bus at the expense of another round of time spent on the bus for these passengers, or asking them to alight one stop earlier and "walk their last mile" to their intended destinations. Furthermore, this allows the reinforcement learning process to have better chances of converging to an optimal Q-table for every bus within a reasonable amount of simulation time.
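The coarse-graining into 72 bins of 5° can be sketched as below. The names are illustrative, and initialising every Q-value to zero is an assumption, since the text does not state the initial values.

```python
N_STATES = 72
BIN_DEG = 360.0 / N_STATES  # 5 degrees per state
STAY, LEAVE = 0, 1          # column indices of the two actions


def state_index(delta_theta_deg: float) -> int:
    """Return the index (0..71) of the 5-degree bin containing the phase difference."""
    return int((delta_theta_deg % 360.0) // BIN_DEG)


def new_q_table() -> list[list[float]]:
    """One row per state, one column per action (stay, leave).

    Zero initialisation is an assumption made for this sketch.
    """
    return [[0.0, 0.0] for _ in range(N_STATES)]
```

Each bus would hold its own table created by `new_q_table()`, so that different buses can learn different strategies.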
The only time a bus is allowed to consider whether to execute stay or leave is when it is at a bus stop and there is nobody on the bus who wishes to alight. Here are the situations that we would explore. A bus is allowed to consider its action when it is at a bus stop, and: (1) There is nobody to alight, but there is somebody at the bus stop who wishes to board. (2) There is nobody to alight as well as nobody at the bus stop who wishes to board. (3) There is nobody to alight. The first situation is intended to create a possibility where the buses may learn to implement the no-boarding strategy, since a bus may learn to leave the bus stop even though there is somebody who wishes to board. The second situation is intended to create a possibility where the buses may learn to implement the holding strategy, since a bus may learn to stay at the bus stop even when there is nobody to pick up. Finally, the third situation allows the possibility for the buses to learn some combination of the no-boarding and holding strategies. In the first two situations, each bus has a Q-table with 72 states and each state contains two values, one for stay and one for leave. For the third situation, there are 144 states because 72 states are when there is somebody who wants to board and another 72 states are when there is nobody who wants to board.

Updating the Q-Table.
In Q-learning [42], when a bus is at a bus stop and has to pick an action A of either stay or leave, it has to first determine what state S it is presently in. To do so, it measures its phase difference Δθ with respect to the bus behind it. In this state, there are two actions and it chooses the one which has the highest Q-value, unless it is in the ε-greedy exploration phase where there is a probability ε of randomly selecting an action. According to the theory, after executing the action A and receiving a reward of R, it should then subsequently measure again its phase difference from the bus immediately behind it to determine its future state S′ for the purpose of updating its Q-table with a future expected reward:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \right].$$

The hyperparameter α is the learning rate and γ is the discount factor. The former determines how sensitively the Q-values would adjust due to new feedback, whilst the latter determines how seriously to believe an estimated future expected reward from its own Q-table.
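A minimal sketch of ε-greedy action selection together with the standard tabular Q-learning update from [42] is given here. The function names and the tie-breaking rule (the first action, stay, wins when both Q-values are equal) are assumptions for illustration.

```python
import random


def choose_action(q_row: list[float], epsilon: float, rng=random) -> int:
    """Pick an action index from one Q-table row, e.g. [Q(stay), Q(leave)].

    With probability epsilon a uniformly random action is taken; otherwise
    the action with the highest Q-value is chosen (ties go to the first).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(
    q: list[list[float]],
    s: int,
    a: int,
    r: float,
    s_next: int,
    alpha: float,
    gamma: float,
) -> None:
    """In-place update: Q(s,a) += alpha * (r + gamma * max Q(s_next, .) - Q(s,a))."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing independent runs.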
However, since the bus only gets to execute an action at a bus stop when nobody wants to alight, if its action is leave, then its future state would only occur quite some time later when it reaches a bus stop. We find that this hampers reliable convergence, as other things may happen between the time when this bus leaves the bus stop and the time when it reaches another bus stop: for example, other buses would have picked up passengers at other bus stops and affected the overall waiting time of the commuters, as well as leading to a widely different future Δθ′. To circumvent this issue in obtaining an estimated future expected reward for updating the Q-table, we impose that when a bus executes stay in a state S with phase difference Δθ, then its future state S′ is defined as the same state with phase difference Δθ′ := Δθ since it remained; whilst if instead it executes leave, then its future state S′ is defined as the state with phase difference Δθ′ := Δθ + 5°, i.e., the phase difference has increased, since it moves to increase the phase difference from the bus behind it. Recall that this 5° is the size of each state, since we use 72 states to discretise the angles from 0° to 360°.
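The imposed future state can be written as a small function of the current state index and the action. Wrapping from the last bin back to the first when leave is taken near 360° is an assumption of this sketch; the text does not say how that edge case is handled.

```python
N_STATES = 72        # 5-degree bins covering 0 to 360 degrees
STAY, LEAVE = 0, 1


def surrogate_next_state(state: int, action: int) -> int:
    """Future state used in the Q-table update.

    stay  -> same bin (phase difference unchanged)
    leave -> next bin (phase difference increased by 5 degrees),
             wrapping around at 360 degrees (an assumption).
    """
    if action == STAY:
        return state
    return (state + 1) % N_STATES
```

Because the future state depends only on the current state and action, the update can be applied immediately after each action instead of waiting until the bus reaches its next bus stop.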

Bus System Environment Simulation Parameters and Reinforcement Learning Hyperparameters.
In all our simulations for the bus system environment, the parameters used are based on values measured from a real university (NTU) shuttle bus loop service with M = 12 bus stops [39]. The value for the rate of passengers boarding/alighting is l = 1 person per second. In the lull period, a representative average value for the people arrival rate at each bus stop is about s = 0.020 people per second, whilst in the busy period it could be as high as s = 0.065 people per second. The natural frequencies of the buses are measured to be in the range of 0.93 mHz to 1.39 mHz, or a natural period of 12 minutes to 18 minutes excluding time stopped at bus stops. We adapt these values accordingly in our simulations for the bus system environment. Each simulation time step corresponds to 1 second.
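As a quick check of the quoted figures, the following sketch converts a natural period in minutes to a frequency in mHz using f = 1/T, and computes the demand level k = s/l. The function names are illustrative.

```python
def natural_frequency_mhz(period_min: float) -> float:
    """Frequency in millihertz for a natural period given in minutes (f = 1/T)."""
    return 1000.0 / (period_min * 60.0)


def demand_level(s: float, l: float = 1.0) -> float:
    """k = s / l, with s the arrival rate and l the boarding/alighting rate."""
    return s / l


if __name__ == "__main__":
    for period in (12.0, 18.0):
        print(f"T = {period:.0f} min -> f = {natural_frequency_mhz(period):.2f} mHz")
    for label, s in (("lull", 0.020), ("busy", 0.065)):
        print(f"{label}: k = {demand_level(s):.3f}")
```

A period of 12 minutes corresponds to about 1.39 mHz and 18 minutes to about 0.93 mHz, matching the measured range; with l = 1 person per second, k is numerically equal to s.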
For reinforcement learning, we carry out 1000 episodes, where each episode is 150 revolutions long. At the start of each new episode, the buses are randomly placed on the loop. The performance of the bus system in each episode is measured from the last 30 revolutions, where most of the transient part due to random initial conditions would have been weeded out. The system undergoes ε-greedy learning (i.e., there is a probability of ε that a random action is taken), where ε decays linearly from 1 to 0.1 in the first 200 episodes, after which it remains at 0.1 until the 700th episode. The learning rate α is kept at 0.2 for the first 700 episodes. In the last 300 episodes, we let the buses fully exploit what they have learnt, with ε = 0, and α toned down to 0.1. The discount factor is always fixed at γ = 0.9. The first 200 episodes represent an exploration phase, where the buses carry out many random actions due to the high value of ε. This is crucial to allow the buses to avoid getting stuck in near-sighted local minima which may lead to missing out on potentially better long-term strategies. The next 500 episodes form a mix of exploration and exploitation, where the buses take advantage of their learned Q-tables but maintain some degree of exploration just in case they get stuck in some local minima. Finally, the last 300 episodes denote a full exploitation phase. Here, the buses still fine-tune their Q-tables since α = 0.1. The difference from previous episodes is that they now always take their best-perceived action, never taking a random action anymore.
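The schedule of ε and α over the 1000 episodes can be sketched as below. Episodes are numbered from 1, and the exact interpolation (ε = 1 at episode 1 and ε = 0.1 at episode 200) is an assumption, since the text only states that ε decays linearly from 1 to 0.1 over the first 200 episodes.

```python
N_EPISODES = 1000
GAMMA = 0.9  # discount factor, fixed throughout


def epsilon_for(episode: int) -> float:
    """Exploration probability for a 1-indexed episode number."""
    if episode <= 200:
        # Linear decay from 1.0 (episode 1) to 0.1 (episode 200); endpoints assumed.
        return 1.0 - 0.9 * (episode - 1) / 199
    if episode <= 700:
        return 0.1
    return 0.0  # full exploitation in the last 300 episodes


def alpha_for(episode: int) -> float:
    """Learning rate for a 1-indexed episode number."""
    return 0.2 if episode <= 700 else 0.1
```

A training loop would call `epsilon_for(ep)` and `alpha_for(ep)` at the start of each episode `ep` in `range(1, N_EPISODES + 1)`.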
For each particular setup throughout this paper, we carry out (at least) five independent runs. Generally, we obtain essentially identical qualitative results when the same setup is repeated even though the learning process involves random initial conditions in each new episode and stochasticity in the ε-greedy exploration. This therefore gives assurance that our results are robust.
Before diving into these interesting situations involving N buses serving M bus stops, we first consider the simplest or trivial situation of N = 1 bus serving M = 12 bus stops in the next section. With only one bus, there is no nontrivial phase difference with respect to another bus. Therefore, this bus must eventually learn to be a normal bus, i.e., it stays to pick up passengers when there is somebody who wishes to board and leaves otherwise. In Section 4, we study the case of N = 2 buses serving M = 12 bus stops for each of the three situations described in Section 2.2, followed by more buses in Section 5.

N = 1 Bus Learns to Be a Bus
With N = 1 bus serving M = 12 staggered bus stops in a loop, we aim to let this bus learn to be a bus, i.e., learn to stay at the bus stop when somebody wants to board and leave when nobody is at the bus stop. Recall that by default, it must allow anybody who wishes to alight to do so. Figure 2 shows the average waiting time of commuters at the bus stop for a bus to arrive, average time spent on bus, average total travel time (which is the sum of waiting time and time spent on bus), and average number of passengers on the bus, all as functions of number of episodes. Since there is only one bus, its phase difference as measured from the bus behind it (itself) is always 360° (or 0°). Hence, the reward is purely 1 point for each person picked up.
This reinforcement learning scenario corresponds to the third one listed in Section 2.2 where the bus decides on an action when it is at a bus stop and nobody wants to alight. The bus has a Q-table with two states, one when somebody wants to board and the other when nobody wants to board. Each state has two Q-values, one for stay and one for leave. This Q-table therefore has only four numbers. For the bus system environment, we set the natural period of the bus (excluding time stopped at bus stops) to be T = 12 minutes, and the rate of people arriving at each bus stop s = 0.010 people per second. The unit of time for the top graph is T = 12 minutes (this is the unit of time for all graphs in this paper, unless otherwise stated).
As Figure 2 shows, the bus successfully learns to behave like a normal bus (stays when there is somebody to board and leaves when there is nobody to board), where it matches the performance of a hard-coded normal bus when it acts greedily in the last 300 episodes based on the Q-values that it has learnt. From the 201st to the 700th episode, since ε = 0.1, it takes a random action once in every ten times, on average. A wrong action has ramifications on the waiting time of the commuters, since the bus leaves and they have to wait for one additional revolution. Only when the bus acts greedily does the performance match that of a hard-coded normal bus. Nevertheless, the time spent on the bus is not much affected during the phase where ε = 0.1, since passengers who want to alight must be allowed to do so. Large variance in the average number of passengers on the bus is observed before the 701st episode due to the ε-greedy action selection. This variance vanishes in the last 300 episodes when the bus acts greedily.

N = 2 Buses Learn No-Boarding and Holding
Let us now study the interesting situations with N = 2 buses serving a loop of M = 12 staggered bus stops. We consider bus system environments with the following three setups throughout this section, which are based on the measured values from the NTU campus shuttle buses [39]: (a) Identical natural frequency, taking T = 12 minutes to complete a loop (excluding time stopped at bus stops). The rate of people arriving at each bus stop is set at s = 0.010 people per second. (b) Frequency detuning, with T_1 = 12 minutes and T_2 = 18 minutes to complete a loop (excluding time stopped at bus stops), respectively. The first bus is the faster one, whilst the second bus is the slower one.
We consider a busy period where s = 0.040 people per second. A busy period is defined by k > k̄ in equation (1), where at least a pair of buses is persistently bunched. With these T_1 and T_2, we have the critical k_c = k̄ = 0.014 (k̄ = k_c since N = 2). Note that we have included an overall factor of 1/2 in equation (1) since alighting and boarding occur sequentially.
(Strictly speaking, more buses must be employed to meet the higher demand during busy times since each bus has a finite capacity, but we will ignore that limit for the purpose of investigating how a simple two-bus system performs during a busy period. The situation during a busy period with more buses is dealt with in Section 5.) (c) Frequency detuning as in (b), during a lull period where s = 0.010 people per second. A lull period is defined by k < k̄ in equation (1), where no buses are permanently bunched.
Note that it suffices to consider one value of s = 0.010 in (a) where the buses have an identical natural frequency, since the behaviour of the bus system is the same for any fixed s. The different phases of lull and busy become distinct only when the system has frequency detuning [34].
For each of the three situations 1, 2, and 3 as stated in the Introduction (Section 2.2), we consider these setups (a), (b), and (c).
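As a small worked check of the busy and lull setups above, the sketch below compares a demand ratio with the critical value 0.014 quoted for these two buses. It assumes that k is the ratio of the arrival rate s to the loading rate l = 1 person per second; that identification, and the function names, are assumptions of this sketch rather than a restatement of equation (1).

```python
def demand_ratio(s, l=1.0):
    """Assumed definition: arrival rate s divided by loading rate l."""
    return s / l


def period(s, k_threshold, l=1.0):
    """Label a setup as 'busy' or 'lull' relative to a critical value."""
    return "busy" if demand_ratio(s, l) > k_threshold else "lull"
```

With the threshold 0.014, `period(0.040, 0.014)` gives the busy label for setup (b) and `period(0.010, 0.014)` gives the lull label for setup (c).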

No-Boarding.
The first situation is where buses are given the choices to stay or leave whenever they are at a bus stop, everybody who wishes to alight has done so, and there are passengers who would like to board. The reward R_NB for each action (applicable to a system with any number N of buses) is given by equation (3), where P is 1 if a person is picked up and 0 otherwise. Note that since the rate of people loading is l = 1 person per second, either somebody boards or nobody boards at any time step of the simulation, so this quantity is well defined. The phase difference of the bus from the bus immediately behind it, Δθ, gives a reward defined by a function f(Δθ). This function remains at 1 beyond 360°/N, which implies that the bus is doing fine and is not too slow, and gives a linearly diminishing reward if Δθ is smaller than 360°/N, which implies that the bus is too slow. The rationale for f(Δθ) staying flat instead of decreasing beyond 360°/N is that once there is nobody at the bus stop, the bus must leave, i.e., there is no option for it to lengthen its stay or try holding back. It is only when it is too slow (Δθ < 360°/N) that it gets a lower reward.
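One reward of this shape, written out as a sketch, is given below. It is an assumed form that matches the verbal description (P combined with f(Δθ) through a weight w, with f flat at 1 beyond 360°/N and linear below it); in particular, letting f fall all the way to 0 at Δθ = 0° is an assumption of this sketch.

```latex
R_{\mathrm{NB}} = P + w\, f(\Delta\theta),
\qquad
f(\Delta\theta) =
\begin{cases}
  \dfrac{\Delta\theta}{360^\circ/N}, & \Delta\theta < 360^\circ/N, \\[2ex]
  1, & \Delta\theta \ge 360^\circ/N.
\end{cases}
```

Under this form, with N = 2 a bus that is only 90° ahead of the bus behind it would receive f = 0.5, whilst any bus at or beyond 180° would receive f = 1.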
A weight w balances between P (which encourages stay) and f(Δθ) (which encourages leave when Δθ < 360°/N). Generally, a small w ≪ 1 leads to the buses eventually learning to behave like normal buses, where they would always stay since they always find somebody at the bus stop who wants to board. On the other hand, w ≫ 1 leads to the buses eventually learning to always leave and maintain their perfectly staggered configuration of Δθ ≈ 360°/N. There is a finite range of w ∼ 1 where the bus system eventually learns to both pick up passengers and attain a reasonably staggered configuration such that the average waiting time of the commuters at the bus stop is minimised. The precise range for w depends on the particular conditions of the simulation environment like s, l, T_i, N, and M. In each of our reinforcement learning runs, we set an appropriate w in the balanced range. It appears that as long as w is within this range, essentially identical qualitative results are obtained. In other words, the actual value of w is unimportant as long as it is within that balanced range (we will see later that the corresponding statement is not quite true for situation 2 on holding).
Figure 3 shows the results of these two buses with identical natural frequency undergoing reinforcement learning, comprising the average waiting time at the bus stop for a bus to arrive, the average time spent on the bus, the average total travel time (sum of the previous two quantities), and the average number of passengers on the bus. In addition, it also shows the average phase difference of each bus as measured from the bus behind it, as well as the Q-tables for each bus. The performance of the N = 2 system is comparable to the analytical results in [34] where no-boarding is hard-coded, with an average waiting time of ∼ 0.30 units of T during the last 300 episodes where the buses act greedily with respect to their learned Q-tables.
Normal buses would typically end up bunching and the average waiting time is ∼ 0.55 units of T, so we see a nearly 50% improvement.

Identical Natural Frequency.
Generally, a bus would implement no-boarding, i.e., leave if Δθ < 360°/N, and stay otherwise. Remarkably, the buses also discover the following known result from [34]: there is an upper bound on the angle to implement the no-boarding strategy, strictly below the perfectly staggered angle of 360°/N. This upper bound arises because if the angle is too close to the staggered configuration, then the buses would end up implementing no-boarding too frequently, at a rate where the number of passengers picked up is lower than the demand for service. The correspondence to the results from [34] is seen when the buses act greedily. On the other hand, in the ε-greedy phase before the 701st episode, the upper bound is ∼ 360°/N because there is a probability of staying which picks up passengers instead of stringently not boarding them. This is noted in the graph for the average phase difference and the Q-tables for the two buses: the buses are typically fluctuating around Δθ = 180° when exploration is involved, but this value is strictly less than 180° in the fully exploitative phase. The performance before the 701st episode is about that of normal buses, with an average waiting time of ∼ 0.55 units of T and sometimes even 1 unit of T, instead of ∼ 0.30 units of T in the last 300 episodes.
[Figure caption] Notably, the buses learn an upper bound, at some angle strictly less than 180°, for implementing no-boarding; if they exceed it, they would inadvertently implement no-boarding too frequently and fail to meet the level of demand for service. They also learn to unbunch: if their phase difference is 0°, one learns to stay whilst the other learns to leave.
Since the buses are randomly placed on the loop at the start of every episode, they may occasionally end up bunching. Can they unbunch? The answer is affirmative. Since the buses are endowed with independent Q-tables, they learn opposite actions if Δθ = 0°: one bus stays and the other leaves. The system as a whole discovers a cooperative mechanism to correct itself when bunched. These results are consistently obtained in all five independent runs. The effect of this no-boarding strategy is to reduce the average time a commuter has to wait at the bus stop for a bus to arrive (i.e., the average waiting time). It does not increase the average time spent on the bus, since a bus only leaves a bus stop earlier but never dwells at a bus stop longer than necessary. Therefore, a reduction in average waiting time with no increase in average time spent on the bus implies a corresponding reduction in average total travel time.
Figure 4 shows the corresponding results of these two buses with frequency detuning undergoing reinforcement learning during a busy period. The results here are essentially the same as the case with identical natural frequency, where the buses are able to learn the no-boarding policy. Bus 1 is the faster bus (in all frequency detuning cases for N = 2, bus 1 is always the faster bus) and tends to pick up more passengers, since the slower bus implements no-boarding and leaves, leaving more passengers to the faster bus and thereby slowing it down despite its higher natural frequency. The average waiting time is also comparable to the results found in [34]. Incidentally, the absence of data points before the 701st episode in the first graph is because the quantities are far too large, owing to the great number of passengers demanding service that is not quite met by these two buses, such that they are beyond the range of the graph shown here.

Frequency Detuning during Busy Period.
Similar to the case with identical natural frequency, the two buses learn opposite actions when Δθ = 0°, which enables them to unbunch, and they also discover some upper bound strictly less than 180° where no-boarding is implemented. The slower bus (bus 2) seems to find a lower value for the upper bound to implement no-boarding than the faster bus (bus 1), since it is the one which is usually slower and has to implement no-boarding. The slow bus should spend enough time at the bus stop to actually allow passengers to board, even though its phase difference Δθ may be less than 180°; otherwise, it would not be picking up passengers and loses some reward for that. Therefore, the upper bound for it to implement no-boarding is lower to let this happen.
Figure 5 shows the corresponding results of these two buses with frequency detuning undergoing reinforcement learning during a lull period. Here, the buses do not quite end up with the expected no-boarding strategy.

Frequency Detuning during Lull Period.
This is in accordance with the observation noted in [34] where the no-boarding strategy backfires during the lull period because the slow bus has been sped up to the maximum by not picking up anybody! A hard-coded no-boarding policy would lead to the system effectively serving with one less bus because the slow bus almost always implements the no-boarding policy. Here, the buses find that perhaps it is better to just behave (almost) like normal buses, with performance that eventually matches closely that of hard-coded normal buses. Incidentally, they do not necessarily need opposite actions when Δθ = 0° because their different natural frequencies allow them to unbunch.
Astonishingly, the optimal strategies for these two buses appear to defy what a human may intuitively conceive (at least initially), upon examining the Q-tables of the buses (bottom plots in Figure 5). When they begin to act greedily from the 701st episode onwards (whilst still maintaining a learning rate of α = 0.1 so that they continue to fine-tune their Q-tables), the slow bus (bus 2) quickly changes to always behaving like a normal bus, with the fast bus (bus 1) implementing the no-boarding policy when it is "too slow." Eventually, by the 1000th episode, the fast bus implements the no-boarding policy if Δθ ∼ 60° from the bus behind it.
Perhaps the slow bus realises that there is no point for it to implement no-boarding as it simply cannot be sped up enough to overcome its lower relative velocity, and if it keeps leaving, then it loses reward from not picking up passengers. On the other hand, the fast bus seems to think that when it is too slow, it should just leave so that it can quickly attain Δθ ∼ 180°, which offers greater reward compared to getting stuck near Δθ ∼ 60°. If Δθ ≪ 60°, it would probably lose too much from not picking up passengers if it leaves before it can make Δθ grow up to ∼ 180°, so it would rather behave normally and just stay. If Δθ ≫ 60°, then the deficit in reward is not too high compared to Δθ ∼ 180°, so it is fine with behaving normally and just staying to earn the reward from picking up passengers. Hence, in the lull period, we find that instead of trying in vain to keep the two buses staggered, they effectively increase the frequency detuning.
With only the no-boarding strategy being studied in [34], could the holding strategy or a holding + no-boarding strategy work to somehow provide some form of improvement for the bus system during the lull period? This is one primary aim of the framework in this paper, where we investigate reinforcement learning of the bus system to learn holding and holding + no-boarding strategies in the following sections with N = 2 buses serving a loop of bus stops.

Holding.
The second situation is where buses are given the choices to stay or leave whenever they are at a bus stop, everybody who wishes to alight has done so, and there is nobody at the bus stop. The reward R_H for each action (applicable to a system with any number N of buses) is given by equation (5) in terms of a function g(Δθ). This function g(Δθ) is discontinuous at Δθ = 360°/N. It is 0 at and below 360°/N, since the bus is then regarded as "slow" with respect to the bus behind it. On the other hand, it approaches 1 as Δθ approaches 360°/N from the right, since this is the ideal phase difference that the bus should strive for when it is "too fast." From a reward of 1 just over 360°/N, it then linearly decreases to 0 as Δθ grows towards 360°, since a larger phase difference is further from the ideal. This reward R_H does not say anything about how long a bus will repeatedly stay at a bus stop. One option is to include a negative reward so that buses do not simply remain at a bus stop indefinitely, i.e., −1 for each stay action or if a certain number of consecutive stay actions are executed (recall that in this situation, everyone who wishes to alight has done so and there is nobody at the bus stop, hence a negative reward discourages "time wasting"). Then, a weight w_H could be introduced between this negative reward and g(Δθ), analogous to the no-boarding reward R_NB in equation (3). The hope with this is that there is some balanced region for w_H such that the bus system does not excessively remain at a bus stop. It turns out that equation (5) works well and a bus does not indefinitely remain at a bus stop, because it will get 0 reward if Δθ ≤ 360°/N, which would prompt it to leave. Furthermore, unlike the no-boarding case where the actual value of w does not change the outcome as long as w is in the balanced region, here different values of w_H would lead to different durations a bus may hold at a bus stop. We find this to be equivalent to just imposing a limit on how long a bus can hold.
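The shape of g(Δθ) described above (zero at and below 360°/N, a jump to just under 1 immediately above it, then a linear fall to 0 at 360°) can be sketched as a short function. The function name and the use of degrees as plain floats are choices made for this sketch only.

```python
def holding_reward(dtheta, n_buses):
    """Sketch of g(dtheta) for a phase difference given in degrees.

    Returns 0 when the bus is at or below the staggered angle 360/N
    ("slow"), and otherwise a value that falls linearly from just
    under 1 (right above 360/N) to 0 at 360 degrees ("too fast").
    """
    staggered = 360.0 / n_buses
    if dtheta <= staggered:
        return 0.0
    return (360.0 - dtheta) / (360.0 - staggered)
```

For N = 2, a bus at Δθ = 185° is rewarded far more for staying than one at Δθ = 355°, and a bus at exactly 180° receives nothing for staying, which is the feature that prompts it to leave.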
Note also that this situation is different from the no-boarding situation in the following sense: in the no-boarding situation, if a bus chooses the "unconventional" action of leave when there is somebody to pick up, then that is the end for this round at this current bus stop. It leaves the bus stop and moves on. However, for the holding situation here, if a bus chooses the "unconventional" action of stay when there is nobody to pick up, then it gets to choose its action again at this current bus stop. This is why the nature of the rewards and their emergent behaviours (as we will see below for holding) are not directly analogous.
Figure 6 shows the results of these two buses with identical natural frequency undergoing reinforcement learning, corresponding to the graphs in the previous figures. The holding strategy looks impressively effective, where it achieves sub-0.3 units of T for the average waiting time of the commuters, even before the 201st episode where ε decays to 0.1 and α = 0.2. The way the holding strategy works is that if Δθ is not reasonably close to 360°/N, then the faster bus would remain at the bus stop until Δθ ∼ 360°/N. Since there is not much noise in the simulation environment (no traffic, and the rate of people arriving at bus stops is constant), the two buses would just remain fairly staggered thereafter. The buses learn that there is a lower bound to implement the holding strategy, which is strictly larger than 360°/N. This is the consequence of the discontinuity at 360°/N in the reward R_H in equation (5), where its value at 360°/N itself is 0, which discourages staying. The buses also learn to never stay for any phase difference Δθ ≤ 360°/N as that gives 0 reward.

Identical Natural Frequency.
This is important because the two buses may be exactly staggered with Δθ = 180°, and if they both learn to stay, then they would just stay forever.
Curiously, the buses ostensibly learn opposite actions when they bunch, i.e., Δθ = 0°, during some earlier episodes, but these opposite actions disappear in subsequent episodes and both end up learning to leave if Δθ = 0°. How do they unbunch then, if they both take the same action? Since their positions on the loop are randomised at the start of each episode, they would inevitably end up bunching at some point. Upon closer inspection, there is actually a natural mechanism for normal buses to momentarily unbunch: if there is an odd number of passengers at the bus stop (or the number of passengers modulo N is not zero, for N bunched buses in general), then at that instant one bus stays there to pick up that last person whilst the other bus sees nobody and leaves (note: in the simulation, a loop over all buses is carried out at that instant. The first bus sees somebody and stays to pick up, whilst the second bus then sees nobody and leaves). This is how normal buses can momentarily unbunch from Δθ = 0° to Δθ = 5° > 0° (recall that we discretise the angles by 72 bins of 5°). Of course, at the next bus stop, they would swiftly bunch again since they have to allow passengers to alight. So normal buses remain bunched. Even buses implementing the no-boarding strategy would remain bunched, which is why they learn opposite strategies via the process of reinforcement learning over the 1000 episodes.
However, the holding strategy differs in this unique manner: when a pair of bunched buses naturally unbunch due to one bus staying to pick up somebody and the other bus leaving as it has nobody to pick up, then the bus that stayed would see its phase difference as measured from the bus behind it (which is the bus that has just left, in front of it) to be Δθ = 355°, implying that it is way too fast! Therefore, it would implement stay all the way based on its Q-table, until Δθ = 185°.
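The phase bookkeeping behind this argument can be sketched in a few lines: the phase difference is measured from the bus behind, wrapped onto the loop, and then placed into one of 72 bins of 5°. The convention that angles increase in the direction of travel, the half-open bin edges, and the function names are assumptions of this sketch.

```python
BIN_WIDTH = 5                 # degrees per bin
N_BINS = 360 // BIN_WIDTH     # 72 bins around the loop


def phase_difference(theta_self, theta_behind):
    """Angle in degrees from the bus behind up to this bus, in [0, 360)."""
    return (theta_self - theta_behind) % 360.0


def phase_bin(dtheta):
    """Index of the 5-degree bin, assuming bins [0, 5), [5, 10), ..."""
    return int(dtheta % 360.0 // BIN_WIDTH)
```

With two buses, if the bus that stayed sits at 10° and the bus that left has moved on to 15°, then the bus "behind" the one that stayed is the one at 15°, so `phase_difference(10, 15)` wraps around to 355°, the last of the 72 bins.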
This is why there is no need for buses implementing the holding strategy to learn opposite actions when they bunch.
Figure 7 shows the results of these two buses with frequency detuning undergoing reinforcement learning, corresponding to the graphs in the previous figures. Here, demand for service is high in a busy period. The holding strategy slows down the fast bus, effectively slowing down the entire bus system. Since the reward for the system is purely to keep Δθ staggered, it does not care about staying too long at bus stops and the average waiting time suffers. This is indicated by the average number of passengers on the bus blowing up into thousands! Since there are only N = 2 buses trying to meet a high demand during the busy period, mistakes made when the buses explore other actions would lead to many of the other M = 12 bus stops rapidly accumulating passengers waiting for service. We will see in Section 5 that with N = 6 buses in the busy period, there are sufficient buses going around such that they are able to reasonably learn the holding strategy. With more buses, mistakes made by one bus when it explores are covered by other buses such that the number of passengers waiting at the M = 12 bus stops does not blow up. The overall performance here is generally worse than that of normal buses. The average time that commuters spend on the bus also suffers, since the buses expend more time at each bus stop before the commuters get off at their respective destinations.

Frequency Detuning during Busy Period.
This is one drawback of the holding strategy where the system gets slowed down, which is why a no-boarding strategy is arguably superior in a busy period.
Oddly enough for the holding strategy, this time in the busy period, the buses learn opposite actions when they bunch with Δθ = 0°. Here, they have to learn to unbunch deliberately because the busy period would otherwise keep them persistently bunched.
Figure 8 shows the results of these two buses with frequency detuning undergoing reinforcement learning, corresponding to the graphs in the previous figures. For the first time in a lull period, we find a way to improve the average waiting time of commuters, by means of a holding strategy. However, the cost involved is that commuters would spend more time on the bus, on average, since the way the holding strategy works in keeping the buses staggered is by delaying the fast bus to the extent of being as slow as the slow bus. The average number of passengers on the fast bus is closer to that on the slow bus when the holding strategy is implemented, as compared to the normal buses where one bus consistently picks up more passengers than the other.

Frequency Detuning during Lull Period.
In spite of increasing the average time spent on bus and the average total travel time, perhaps the holding strategy may be viewed as viable since it is arguably less of a pain point to be on the bus enjoying the air conditioner compared to being out at the open bus stop where it may be hot under the blazing sun, wet during a thunderstorm, or even chilly during winter (in countries with four seasons).
We summarise the qualitative performance of the no-boarding and holding strategies for each of the setups that we have discussed in Table 1. Quantitative percentage improvement (or worsening) will be given in the next section where we consider a system with N = 6 buses serving M = 12 bus stops in a loop.

Combined No-Boarding and Holding Strategies.
The third situation is where buses are given the choices to stay or leave whenever they are at a bus stop and everybody who wishes to alight has done so. Here, "somebody wants to board" and "nobody wants to board" are distinct. Therefore, we take this situation as a combination of the first two situations, where situation 1 occurs when there is somebody who wants to board and situation 2 occurs when nobody wants to board.
Note that since situation 2 can only occur after everybody at the bus stop has been picked up, if the bus leaves when somebody is still there, then that is the end for this round at the bus stop and situation 2 is completely sidestepped. After 1000 episodes of training, we find that the buses' Q-tables for situation 1 are trained but those for situation 2 are not. To allow for a fair amount of training for the latter Q-tables, we implement the following additional exploration possibility: in the first 200 episodes, if a bus chooses to leave when there is somebody to pick up, then there is a probability of Υ = 0.9 that it switches to stay. From the 201st to 500th episode, Υ is linearly decayed from 0.9 to 0. A reasonably high value of Υ is necessary to expose the buses to situation 2 for training, because, for example, if there are 10 passengers at the bus stop, then a bus needs to choose 10 consecutive stay actions before it has the chance to encounter and train for situation 2.
Figure 9 summarises the results for the three setups (a), (b), and (c) listed right at the beginning of this section (before the start of Section 4.1), corresponding to having an identical natural frequency, and frequency detuning in the busy as well as in the lull periods, respectively. We find that the buses are indeed able to learn both Q-tables such that each Q-table resembles that in the corresponding situations in Sections 4.1 and 4.2, with some minor differences that account for the fact that the buses can decide on stay or leave in two different situations 1 and 2.
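The additional exploration step with Υ can be sketched as follows. The schedule follows the description above (0.9 for the first 200 episodes, then a linear decay that reaches 0 at the 500th episode); the exact treatment of the endpoints, the string labels for the actions, and the function names are assumptions of this sketch.

```python
import random


def upsilon(episode):
    """Override probability for a 1-indexed episode number."""
    if episode <= 200:
        return 0.9
    if episode >= 500:
        return 0.0
    # Linear decay from 0.9 (episode 200) down to 0 (episode 500).
    return 0.9 * (500 - episode) / 300.0


def maybe_override_leave(action, somebody_waiting, episode, rng):
    """Switch a chosen 'leave' to 'stay' with probability upsilon(episode).

    The override applies only when somebody is waiting to board, so that
    the bus is more often carried through to the state where nobody is
    left at the stop and the holding decision can be trained.
    """
    if action == "leave" and somebody_waiting and rng.random() < upsilon(episode):
        return "stay"
    return action
```

A typical use would be to pass the action selected by the ε-greedy rule through `maybe_override_leave` before applying it, with `rng = random.Random(seed)` supplied by the training loop.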
In terms of the performance, the graphs near episode ∼200 are primarily dominated by the holding strategy (recall that here, the performance is as good as fully exploiting even though ε = 0.1), whilst those approaching episode ∼500 are primarily dominated by the no-boarding strategy (recall that here, the performance is not as good as fully exploiting since ε = 0.1 induces the bus to leave when it should not, leaving the passengers behind to unnecessarily wait for the next bus). The performance transitions from that of holding to that of no-boarding somewhere between episodes 200 and 500 as Υ decays from 0.9 to 0.
[Figure caption] Two buses with frequency detuning serving a loop of bus stops during a busy period learn the holding strategy by reinforcement learning. The entire system is slowed down greatly and generally performs worse than normal buses.

Finally, in the last 300 episodes, where there is no longer any exploration (ε = 0), the bus system settles into the best strategy that the buses have acquired from the possible combinations.
In the case with identical natural frequency, the buses harness both the no-boarding and holding strategies, where a bus would implement no-boarding if it is "too slow" (Δθ < 360°/N) and implement holding if it is "too fast" (Δθ > 360°/N). For the busy period with frequency detuning, however, since the buses are not able to train the Q-tables corresponding to holding, they do not actually execute the holding strategy properly.
This results in extended waiting times, similar to what happened with the holding strategy alone during a busy period in Section 4.2.2.
For the two buses with frequency detuning during the lull period, it turns out that the buses perform as well as with the holding strategy alone in terms of improving the average waiting time of commuters at the bus stop for a bus to arrive. The buses are able to learn that the no-boarding strategy, when applied by the slow bus (bus 2) in an attempt to speed it up, would result in it not picking up sufficient passengers and would nullify its whole purpose of serving the loop. Since the frequency detuning is too large compared to the demand level, the no-boarding policy alone cannot speed it up, as we have seen in Section 4.1.3. Here, with both the options to stay and leave when there is somebody as well as nobody who wants to board, the buses try to harness both no-boarding and holding strategies, such that a bus would implement no-boarding if it is "too slow" (Δθ < 360°/N) and implement holding if it is "too fast" (Δθ > 360°/N). However, the buses realise that the combination of no-boarding and holding is not the best way to go, since bus 2 would be leaving with few passengers when it implements no-boarding, with relatively little gain in speeding up. Eventually, somewhere close to the 800th episode, the slow bus decides that it is better to forget about no-boarding even if Δθ < 360°/N, and the bus system relies entirely on the fast bus (bus 1) to implement the holding strategy. The performance then matches that of the purely holding strategy presented in Section 4.2.3, with identical improvement in average waiting time, and identical increases in average time spent on the bus as well as average total travel time.
In summary, given both possibilities of situations 1 and 2, the bus system is able to find the relevant optimal strategy depending on the particular conditions. For example, with identical natural frequency, the buses harness both the no-boarding and holding strategies to the fullest. On the other hand, with frequency detuning in the lull period, they revert to the holding strategy and ditch the no-boarding policy.

Any N Buses Serving M Bus Stops in a Loop
This framework is directly generalisable to any N buses serving M bus stops in a loop. We have carried out more simulations with N = 3 buses and even N = 6 buses. A system with many buses generally produces qualitatively similar results to those already discussed for the case with N = 2 buses. The setups with identical natural frequency are given k = 0.010, busy with frequency detuning is given k = 0.063, and lull with frequency detuning is given k = 0.010. When there is frequency detuning, the six buses are prescribed different natural frequencies, with periods selected within the range of 12 minutes to 18 minutes (excluding time stopped at bus stops).
Here are some noteworthy points in the lull period where the buses have frequency detuning. With more buses, the ability to keep buses staggered over the loop significantly cuts down the average waiting time at the bus stop for a bus to arrive. The expense incurred with the increase in average time spent on the bus and also the average total travel time by implementing the holding strategy during the lull period seems to be well worth it, when described in terms of percentages. Apart from that, the combined strategies save as much time as holding on the average waiting time whilst incurring less cost on average time spent on the bus (and average total travel time). Hence, we see a further improvement thanks to a combination of no-boarding and holding strategies during the lull period for this system with N = 6 buses.

Discussion and Concluding Remarks
The use of reinforcement learning for a bus system serving a loop of bus stops has shown the potential of discovering strategies to optimise the performance of the system. The framework employed in this paper takes advantage of the phase difference between buses in a loop (also applicable to nonloop routes) where maintaining a staggered configuration translates to minimising the average waiting time of commuters at the bus stops for a bus to arrive. This provides a way to deal with the high variance of the individual waiting times that causes convergence of the Q-table to be essentially impossible within reasonable simulation time (or perhaps not even possible in some cases [45]). The system has learnt that no-boarding and holding strategies are indeed both useful strategies when the buses have an identical natural frequency. No-boarding speeds up the slower bus whilst holding slows down the faster bus. The former is also useful when buses have frequency detuning during the busy period, but the latter may slow down the system too much. Nevertheless, the holding strategy is salutary in the lull period where the fast bus is slowed down to match the slow bus in order to maintain a reasonably staggered configuration, at the expense of increasing time spent on the bus and total travel time. This offers a solution during the lull period, where the no-boarding strategy simply does not work at all. Incidentally, whilst the no-boarding strategy is arguably disruptive to passengers who need service urgently, Reference [35] has shown that allowing passengers the options of cooperating or defecting turns out to be a viable way of implementing such a no-boarding strategy in real bus systems.
It is interesting to note that although the buses are given low-level actions of stay or leave at the bus stop when nobody wants to alight, essentially knowing only the rules of the game, reinforcement learning leads to the discovery of high-level strategies of no-boarding [34,35] and holding [3-8, 10-15, 18, 19]. This illustrates the utility of a reinforcement learning framework where the system is able to arrive at high-level strategies without human presumptions and priors, like how the AlphaZero programme [46] can come up with and even validate known human strategies and tactics (e.g., the Berlin Defence against the Ruy Lopez in Chess), discredit some of them (e.g., the French Defence in Chess, apparently), and even reveal new possibilities (e.g., sacrificing multiple pawns and pieces in favour of long-term subtle activity in Chess, highly impressing many Chess Grandmasters, including a former World Champion [47,48]).
In particular, these intelligent buses are able to behave cooperatively to unbunch in unique and interesting ways like learning opposite actions in the case of no-boarding, as well as one bus just holding to allow the other to correct their phase difference. They also discover useful strategies with the appropriate bounds where no-boarding and holding are implemented.
These emergent behaviours arise from the ability of the buses to learn and improve from their interactions, eventually settling into some collectively optimal strategies. On top of that, the system also makes use of combining the options appropriately in various setups. This is important when we move on to nonstationary environments where the system must encounter various situations and be able to act with an optimal strategy.
Being low level, however, implies that the bus system does not actually "know" that it can "choose to implement a no-boarding strategy or a holding strategy" at will. All that it cares about is this: a bus is at a bus stop and nobody wants to alight. Is there anybody who wants to board? If yes, then should it stay or leave? If not, then should it stay or leave? Since it typically encounters somebody who wants to board, if it leaves, then it will not encounter the latter situation where nobody wants to board, thus sidestepping the holding option. In order to allow for a balanced combination between a no-boarding strategy and a holding strategy, we have augmented the exploration phase with a new hyperparameter Υ. Alternatively, perhaps a different, higher-level approach within this framework would allow for faster convergence of the Q-table. In other words, when a bus is at a bus stop, it is "conscious" of the options of (a) behaving like a normal bus; (b) leaving, i.e., implementing no-boarding; or (c) staying longer, i.e., implementing holding. This places the options for no-boarding and holding on an equal footing, alongside behaving normally, and alleviates the bias towards implementing no-boarding over holding. Nevertheless, we have shown here that these low-level actions do lead to the high-level no-boarding and holding strategies, in the situations where somebody wants to board and nobody wants to board, respectively. This establishes the mechanisms by which low-level actions lead to the emergence of high-level coordinated strategies of the buses. With higher-level actions, the faster convergence becomes a crucial utility for being adaptive in the nonstationary environments of the real world.
Thus far, this paper assumes that all $M = 12$ bus stops are perfectly staggered around the loop and all have the same rate of people arrival, $s$. Whilst seemingly simplified, this represents an important first step in a series of increasingly complex studies in our research on the bus system undergoing reinforcement learning. In particular, we have established and clarified the behaviour of the bus system with identical natural frequency, as well as with frequency detuning in the busy and lull periods. Each setup has distinct characteristics of its own, and the appropriate strategy should be applied, especially if there is frequency detuning, viz. no-boarding during the busy period and holding during the lull period.
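The correspondence between the low-level actions and the high-level strategies described above can be made concrete with a short sketch. It labels each combination of "does somebody want to board?" and stay or leave with the strategy it amounts to, and pairs this with the standard tabular Q-learning update, which may differ in detail from the scheme used in this paper. The names `Action`, `high_level_label` and `q_update`, the labels "normal boarding" and "normal departure", the opaque state keys, and the values of `alpha` and `gamma` are all assumptions made for illustration.

```python
from enum import Enum

class Action(Enum):
    STAY = 0
    LEAVE = 1

def high_level_label(someone_wants_to_board, action):
    """Name the high-level behaviour implied by a low-level choice.

    Evaluated only once everyone who wishes to alight has done so.
    """
    if someone_wants_to_board:
        # Leaving while people still wish to board is the no-boarding strategy.
        return "no-boarding" if action is Action.LEAVE else "normal boarding"
    # Staying although nobody wishes to board is the holding strategy.
    return "holding" if action is Action.STAY else "normal departure"

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One standard tabular Q-learning step on a dict-based Q-table.

    alpha (learning rate) and gamma (discount factor) are illustrative values.
    Unvisited (state, action) pairs default to a value of 0.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in Action)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]

# Hypothetical usage: states are opaque keys chosen only for this example.
q_table = {}
updated = q_update(q_table, "state_a", Action.LEAVE, 1.0, "state_b")
print(high_level_label(True, Action.LEAVE), high_level_label(False, Action.STAY))
```

In this picture a bus never selects "no-boarding" or "holding" directly. Each label is simply a reading of a stay or leave decision in the relevant situation, which is why the two strategies can emerge from a Q-table defined over the low-level actions alone.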
A step forward would be to generalise the environment based on real data that we have collected in [39], to investigate how the bus system may arrive at novel and even adaptive strategies to deal with nonstationary environments where passengers may wish to head towards some hubs at certain times of the day, with some bus stops having higher rates of people arrival, i.e., $s_i$ for each $i = 1, 2, \ldots, M$. The framework in this paper serves as a good platform upon which greater layers of complexity can be added to the environment. Eventually, we could then implement such strategies in our Nanyang Technological University campus shuttle bus service, after which this environment is modelled [39,44,45], and subsequently even adapt them to more complex bus routes.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.