Cooperative Multiagent Deep Deterministic Policy Gradient (CoMADDPG) for Intelligent Connected Transportation with Unsignalized Intersection

Unsignalized intersection control is one of the most critical issues in intelligent transportation systems, which requires connected and automated vehicles to support more frequent information interaction and on-board computing. It is very promising to introduce reinforcement learning into unsignalized intersection control. However, the existing multiagent reinforcement learning algorithms, such as multiagent deep deterministic policy gradient (MADDPG), can hardly handle a dynamic number of vehicles and therefore cannot meet the needs of real road conditions. Thus, this paper proposes a Cooperative MADDPG (CoMADDPG) for connected vehicles at unsignalized intersections to solve this problem. Firstly, the scenario of multiple vehicles passing through an unsignalized intersection is formulated as a multiagent reinforcement learning (RL) problem. Secondly, MADDPG is redefined to adapt to a dynamic number of agents, where each vehicle selects reference vehicles to construct a partially stationary environment, which is necessary for RL. Thirdly, this paper incorporates a novel vehicle selection method, which projects the reference vehicles onto a virtual lane and selects the largest impact vehicles to construct the environment. Finally, an intersection simulation platform is developed to evaluate the proposed method. According to the simulation results, CoMADDPG can reduce average travel time by up to 39.28% compared with an optimization-based method, which indicates that CoMADDPG has an excellent prospect in dealing with the scenario of unsignalized intersection control.


Introduction
In recent years, the development of connected vehicles [1] has prompted the innovation of transportation technology. By introducing the technologies of vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication, multivehicle coordination has become possible to improve traffic safety and efficiency. This makes cooperative intelligent transportation systems (C-ITS) a new research hotspot [2].
As a classic application scenario of Intelligent Connected Transportation, the intersection is more complicated and challenging for intervehicle cooperative control than the highway. At an intersection, vehicles enter from different directions and leave through different exits. The conflicts between vehicles greatly limit the efficiency and safety of intersection traffic. Therefore, a flexible and sophisticated intersection control system is necessary. There have been several related works focusing on traffic signal timing optimization. Ge et al. proposed a cooperative method for multi-intersection signal control based on Q-learning with Q-value transfer [3]. Kim and Jeong proposed a cooperative traffic signal control scheme using traffic flow prediction for multiple intersections [4]. Kamal et al. concentrated on the mixed manual-automated traffic scenario and developed an adaptive traffic signal control scheme for an isolated intersection [5]. Yu proposed a fuzzy programming-based approach to optimize the signal timing for an isolated intersection, which used operation efficiency, traffic capacity, and exhaust emission as the joint optimization goals [6]. Bai et al. proposed a scenario where trams crossed intersections without stopping and presented a coordinated control model along the tramline to optimize the control of prioritized traffic signals [7]. Xu et al. proposed a game-based policy for the signal control of an isolated intersection together with rerouting of the vehicles using vehicle-to-infrastructure communication [8]. Lu et al. presented a novel speed control method for successive signalized intersections in a connected vehicle environment. With the timing information and vehicle queues at the signalized intersection, vehicle speed was optimized to reduce fuel consumption and emissions [9]. Xu proposed a cooperative method to optimize roadside signal timing and on-board vehicle speed control at the same time [10].
In addition, with the development of autonomous driving, the mixed traffic of manual-automated driving coexistence is also attracting researchers' attention. Boudaakat et al. established a hydrodynamic model to calculate the total capacity of roundabouts and provided a method to control traffic congestion [11]. Kamal et al. presented a novel adaptive traffic signal control scheme, aiming at minimizing the total crossing time of all vehicles and ensuring comfortable crossing of manually driven vehicles [12]. Gupta et al. proposed a conceptual model for negotiation between self-driving vehicles and pedestrians, which shows an improvement in the overall travel time of the vehicles as compared with the current best practice behaviour, always stop, of autonomous vehicles [13]. In general, traffic signal control schemes should maximize traffic efficiency under the premise of ensuring traffic safety. To make the intersection traffic control more accurate and in real time, many researchers have begun to focus on the study of unsignalized intersection control.
Generally, the researchers divide unsignalized intersection control schemes into two categories, centralized and distributed. Centralized coordination approaches collect the global information of the entire intersection to regulate the vehicles at the intersection. Dai et al. solve the intersection control problem with convex optimization [14]. Guan et al. proposed a centralized conflict-free cooperation method for multiple connected vehicles at unsignalized intersection using model accelerated proximal policy optimization [15]. Qian et al. devised the AIDM algorithm to schedule a vehicle or a platoon to pass the unsignalized intersection [16]. Nevertheless, these centralized schemes are pressed for communication and computation because all vehicles need a central controller to dispatch them.
In decentralized coordination approaches, the centralized controller disappears, and the controller is configured separately in each vehicle to adjust its trajectories considering kinetic information and conflict relationship of adjacent vehicles. Xu et al. introduced a conflict-free geometry topology and designed a distributed controller to stabilize the involved vehicle at the intersection [17]. Bian et al. divided the intersection area into four areas according to the distance and present a process containing observation, optimization, and control [18]. Hadjigeorgiou and Timotheou studied the optimization of the travel time and fuel consumption balance of autonomous vehicles through unsignalized intersection [19]. Belkhouche proposed decentralized multiagent control laws, which are derived for conflict resolution between vehicles [20]. Hsu et al. studied the interaction between vehicles and pedestrians at unsignalized intersections and proposed a decision theory model to represent the interaction [21]. Wang et al. proposed a cooperative algorithm to transform the high-dimensional problem of cooperative driving for multiple vehicles at multiconflict points into the single-dimensional problem of searching the optimal time for vehicles to enter the intersection [22]. Yang and Oguchi developed an advanced vehicle control system with connected vehicles to reduce vehicles delays under a partially connected environment [23]. However, well-designed models and controllers can only show interpretable parts, and more hidden parts become bottlenecks in performance improvement.
One of the most critical goals in artificial intelligence is to obtain a new skill, especially in a multiagent environment. Reinforcement learning (RL) can improve the policy via trial-and-error interaction with the environment, which is analogous to how human beings learn. Recently, RL has taken an essential role in a variety of fields, for example, wireless communication [24] and autonomous driving [25]. Mnih et al. proposed the Deep Q-learning Network (DQN) and obtained superhuman performance on Atari video games [26]. Considering that DQN is only applicable to problems with discrete action spaces, Deep Deterministic Policy Gradient (DDPG) was proposed to solve continuous control problems [27]. When the scenario extends from a single agent to multiple agents, more information can be taken into consideration to improve algorithm performance. Multiagent DDPG (MADDPG) is a multiagent policy gradient algorithm where agents learn a centralized critic based on the observations and actions of all agents [28]. There have already been many applications in the field of intersection control. Liang et al. proposed a double-dueling deep Q-network to control the traffic light cycle [29]. Zhou et al. proposed a car-following model, based on reinforcement learning, to obtain an appropriate driving behaviour to improve travel efficiency, fuel consumption, and safety at signalized intersections [30]. Lee et al. employed reinforcement learning that recognizes an entire traffic state and jointly controls all the traffic signals of multiple intersections [31]. However, the uncertainty in the number of agents poses a further challenge for some problems, such as distributed coordination at unsignalized intersections.
This paper proposes a distributed conflict-free cooperation method, Cooperative MADDPG (CoMADDPG), for multiple connected vehicles at unsignalized intersections. The main contributions of this paper are as follows:
(i) This paper formulates the scenario of multiple vehicles passing through an unsignalized intersection as a multiagent reinforcement learning problem.
(ii) The Cooperative MADDPG (CoMADDPG) is proposed, which modifies the classic MADDPG algorithm to adapt to a dynamic number of agents. In CoMADDPG, each vehicle selects reference vehicles to construct a partially stationary environment, which is necessary for the introduction of the RL method.
(iii) This paper also proposes a novel vehicle selection method, which projects all reference vehicles onto a virtual lane and selects the largest impact vehicles on the virtual lane. It helps the CoMADDPG algorithm to converge quickly and avoid collisions effectively.
The rest of this paper is organized as follows. Section 2 illustrates our problem statement and presents the settings of states, actions, and rewards. Section 3 introduces the preliminaries of multiagent reinforcement learning and the workflow of the proposed CoMADDPG algorithm. Section 4 presents the experimental settings and results. Section 5 concludes this work.

Problem Statement.
This paper focuses on the 4-direction intersection shown in Figure 1. The four directions are named after their location in the figure, that is, up, down, left, and right. Only the road within a certain distance of the intersection is considered. There are 4 entrances and 4 exits in total, which are unsignalized, and each direction contains only one lane. The vehicles are only allowed to go straight.
As depicted in Figure 1, boxes in different colours represent vehicles in different lanes. The trajectories of the vehicles intersect at 4 conflict points in the merging zone of the intersection. Based on the collected information, each vehicle independently decides its acceleration or deceleration using a policy network. When the vehicles enter the intersection area, the centralized server distributes the newest model to them. While running, the vehicles produce so-called experience, which records the running data of themselves and of their reference vehicles at each time step. The reference vehicles are selected via the proposed vehicle selection method. The whole process is continuous, with vehicles entering and leaving the intersection area.
Several assumptions are adopted as follows. Firstly, the kinetic information of vehicles can be measured to support the decision made by each vehicle. Then, it is assumed that all approaching vehicles are connected and automated so that each vehicle can strictly obey the planned acceleration, adjust its velocity, and pass the intersection automatically. Additionally, all vehicles enter the intersection area at preset arrival times, which follow a Poisson process.

MARL Formulation.
The problem is formulated as a multiagent reinforcement learning problem, and each vehicle is treated as an agent by defining the state space, action space, and reward function.

State and Action Space.
According to the assumptions, each vehicle can obtain others' kinetic information via vehicle-to-vehicle communication. To achieve collaboration, the state of a vehicle needs to include the dynamics of its adjacent vehicles. However, the dimension of the state cannot be expanded arbitrarily, so only the several vehicles with the largest impact are chosen, as shown in equation (1). The subscript m indicates the maximum number of largest impact vehicles under consideration. In the definition of $s^i_{\mathrm{other},j}$, the subscript j denotes the j-th largest impact vehicle of vehicle i and has no relationship with the identity of any vehicle. In equation (2), $s^i_{\mathrm{own}}$ represents the state of vehicle i, including position, velocity, and acceleration. In equation (3), $s^i_{\mathrm{other},*}$ represents the state of a reference vehicle of vehicle i, including position, velocity, and acceleration. Commonly, the position is expressed in Cartesian coordinates, that is, (x, y). However, analysis of the task formulation shows that conflicting vehicles at the same distance from the intersection are highly correlated. Therefore, polar coordinates (μ, θ) are used instead of (x, y). Because there are only 4 directions in the problem, the position is denoted by (d, l). Herein, d is the distance from the vehicle to the conflict point, which is positive when approaching the intersection and negative when leaving, and l ∈ {1, 2, 3, 4} is the index of the lane.
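As an illustration of the state layout described above, the following minimal sketch concatenates the own-vehicle state with the states of up to m reference vehicles. The names `Kinematics` and `build_state`, the ordering of the fields, and the zero padding used when fewer than m reference vehicles are available are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Kinematics:
    """Hypothetical per-vehicle record: position (d, l), velocity, acceleration."""
    d: float  # signed distance to the conflict point (positive when approaching)
    l: int    # lane index in {1, 2, 3, 4}
    v: float  # velocity
    a: float  # acceleration

    def as_list(self) -> List[float]:
        return [self.d, float(self.l), self.v, self.a]


def build_state(own: Kinematics, references: Sequence[Kinematics], m: int) -> List[float]:
    """Concatenate the own state with the states of the m largest impact vehicles.

    `references` is assumed to be sorted already, largest impact first.
    Missing slots are zero padded (an assumption of this sketch).
    """
    state = own.as_list()
    for j in range(m):
        if j < len(references):
            state.extend(references[j].as_list())
        else:
            state.extend([0.0, 0.0, 0.0, 0.0])
    return state
```

Because slot j always holds the j-th largest impact vehicle, the input length stays fixed at 4(m + 1) values regardless of how many vehicles are in the intersection area.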

Reward Settings.
The reward function is designed as shown in Table 1. Compared with the reward settings in [15], the reward is not defined only by the final outcome, such as a vehicle passing or a collision, but is distributed over the running process. Here, the distance difference $\mathrm{Diff}_D$ and the time difference $\mathrm{Diff}_T$, defined in equations (4) and (5), are used to help each vehicle obtain its reward. It is noted that as the distance difference shrinks, the risk factor does not change linearly, so a logarithmic function is introduced to describe this nonlinear change. On the other hand, the sign of the time difference is a good indicator of whether the distance between the two vehicles is increasing or decreasing. If the distance between vehicles is small enough, an increasing distance generates positive rewards and vice versa. A transformation of the hyperbolic tangent function describes this change well. Moreover, to avoid reward expansion, which is an important issue in RL, the reward value is limited after each calculation so that it stays within [−20, 20].
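The shaping and clipping steps described above can be sketched as follows. Only the clipping range [−20, 20] and the use of a logarithm and a hyperbolic tangent come from the text; the function `shaped_reward`, its coefficients `k_d` and `k_t`, and the way the two terms are combined are hypothetical and do not reproduce Table 1.

```python
import math

REWARD_MIN, REWARD_MAX = -20.0, 20.0  # clipping range stated in the text


def clip_reward(r: float) -> float:
    """Limit the reward to [-20, 20] after each calculation."""
    return max(REWARD_MIN, min(REWARD_MAX, r))


def shaped_reward(diff_d: float, diff_t: float,
                  k_d: float = 1.0, k_t: float = 1.0, eps: float = 1e-3) -> float:
    """Hypothetical shaping: a log term in the distance difference plus a
    tanh term in the time difference. Not the reward of Table 1."""
    # The logarithm makes the penalty grow nonlinearly as vehicles get closer.
    distance_term = k_d * math.log(max(abs(diff_d), eps))
    # tanh keeps the contribution of the time difference bounded in (-1, 1).
    time_term = k_t * math.tanh(diff_t)
    return clip_reward(distance_term + time_term)
```

Clipping after every calculation guarantees that a single extreme sample cannot dominate the critic's targets, whatever the exact shaping terms are.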

Cooperative Multiagent Deep Deterministic Policy Gradient
In this paper, partially observable Markov games are considered, which constitute a multiagent extension of the Markov decision process. A Markov game for N agents is described by a set of states $S$, a set of actions $A_1, \dots, A_N$, and a set of observations $O_1, \dots, O_N$. To determine its action, each agent i uses a stochastic policy $\pi_{\theta_i}: O_i \times A_i \to [0, 1]$, and the next state is produced according to the state transition function $\mathcal{T}: S \times A_1 \times \dots \times A_N \to S$. After interacting with the environment, each agent i obtains a reward as a function of the state and its action, $r_i: S \times A_i \to \mathbb{R}$, and receives a private observation $o_i$. The initial state is drawn from a distribution ρ. Each agent aims to maximize its expected return $R_i = \sum_{t=0}^{T} \gamma^t r_i^t$, where T is the time horizon and γ is a discount factor.
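For concreteness, the return that each agent maximizes, the discounted sum of its rewards over the horizon, can be computed as in the sketch below. The function name `discounted_return` is chosen for this example only.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_{t=0}^{T} gamma^t * r_t by accumulating from the last step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps with reward 1 and discount 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```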

Multiagent Deep Deterministic Policy Gradient.
A significant problem faced by traditional RL algorithms in a multiagent setting is that every agent is continuously learning to improve its policy. Thus, from the perspective of each agent, the environment is dynamic rather than stationary, which violates the assumption of traditional RL algorithms. To a certain extent, it is impossible to adapt to the dynamic environment by merely changing the agent's own policy. Because the environment is nonstationary, key techniques of DQN, such as experience replay, cannot be used directly. For policy gradient methods, the problem of high variance is exacerbated as the number of agents increases.
MADDPG is an adaptation of actor-critic methods that considers the action policies of other agents and can learn policies that require complex multiagent coordination. The algorithm has three characteristics. Firstly, the policy obtained through learning can produce actions using only local information. Secondly, it requires neither a dynamics model of the environment nor any particular structure for interagent communication. Thirdly, the algorithm can be used in both cooperative and competitive environments.
In a game with N agents, the policies of the agents can be represented as $\mu = \{\mu_1, \dots, \mu_N\}$, and the gradient of the expected return for agent i, $J(\theta_i) = E[R_i]$, can be written as
$$\nabla_{\theta_i} J(\theta_i) = E_{x, a \sim D}\left[ \nabla_{\theta_i} \mu_i(a_i \mid o_i) \, \nabla_{a_i} Q_i^{\mu}(x, a_1, \dots, a_N) \big|_{a_i = \mu_i(o_i)} \right],$$
where D is the experience replay buffer, which contains the tuples $(o_i, o_i', a_i, r_i)$ to record the experience of each agent i, and $Q_i^{\mu}(x, a_1, \dots, a_N)$ denotes a centralized action-value function that takes the actions of all agents and some state information x as input and outputs the Q-value for agent i. In the simplest case, x consists of the observations of all agents: $x = (o_1, \dots, o_N)$.
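The centralized action-value function $Q_i^{\mu}$ is updated by minimizing the loss
$$L(\theta_i) = E_{x, a, r, x'}\left[ \left( Q_i^{\mu}(x, a_1, \dots, a_N) - y \right)^2 \right], \quad y = r_i + \gamma \, Q_i^{\mu'}(x', a_1', \dots, a_N') \big|_{a_j' = \mu_j'(o_j)},$$
where γ is the discount factor and $\mu' = \{\mu_{\theta_1'}, \dots, \mu_{\theta_i'}, \dots, \mu_{\theta_N'}\}$ is the set of target policies with delayed parameters $\theta_i'$. The main idea of MADDPG is that when the actions produced by all agents are known, the environment is stationary even if the policies vary. Additionally, the algorithm uses three techniques. Firstly, training is centralized while execution is decentralized: the critic uses the information of all agents during training, whereas the actor needs only local information during inference. Secondly, experience replay is improved so that it applies to a dynamic environment. Thirdly, a policy ensemble is utilized to enhance stability and robustness.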
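The critic update can be illustrated for a single transition with the one-step target commonly used for a centralized critic, y = r_i + γ Q'(x', a_1', ..., a_N'), where the next actions come from the target policies. The sketch below is framework-free; the callables passed as `target_policies` and `target_critic` are toy stand-ins for the target networks and are assumptions of this example.

```python
from typing import Callable, Sequence

Policy = Callable[[float], float]
Critic = Callable[[Sequence[float], Sequence[float]], float]


def td_target(r_i: float, gamma: float, next_obs: Sequence[float],
              target_policies: Sequence[Policy], target_critic: Critic) -> float:
    """y = r_i + gamma * Q'_i(x', a'_1..a'_N), with a'_j from the target policies."""
    next_actions = [mu(o) for mu, o in zip(target_policies, next_obs)]
    return r_i + gamma * target_critic(next_obs, next_actions)


def critic_loss(q_value: float, y: float) -> float:
    """Squared error between the critic estimate and the target, for one sample."""
    return (q_value - y) ** 2


# Toy stand-ins (hypothetical): linear target policies and an additive target critic.
policies = [lambda o: 2.0 * o, lambda o: 2.0 * o]
critic = lambda x, a: sum(x) + sum(a)
y = td_target(1.0, 0.5, [1.0, 2.0], policies, critic)
print(y, critic_loss(5.0, y))  # 5.5 0.25
```

In practice the loss is averaged over a minibatch sampled from the replay buffer and minimized by gradient descent on the critic parameters.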

Cooperative MADDPG (CoMADDPG).
One of the challenges for multiagent reinforcement learning is when policies of agents are updated, the environment changes, which contradicts existing assumptions of a stationary environment. Accordingly, a primary motivation behind MADDPG is that the actions taken by all agents are known, which makes the environment be considered stationary even as the policies change. However, by default, different observation variables in MADDPG will correspond to the agents one by one, which significantly limits the application scenarios of this algorithm. In this paper, the definition of a stationary environment is extended to suit the situation of more agents.
In the problem of distributed vehicle control at the intersection, the continual entry and exit of vehicles lead to an uncertain and large number of agents. From the perspective of a stationary environment, stationarity can exist not only globally but also partially. To construct such an environment, an agent selects several agents as reference agents. Therefore, the gradient of the expected return can be rewritten in terms of the agent itself and its reference agents. When these data enter the neural network as input, the position of each variable in the input plays an important role. The above operation decouples the identity of the vehicles from their running information, so that the proposed CoMADDPG can be adapted to scenarios with more agents.
Note that more agents will not make the decision process more complex. This is because, although more agents would expand the length of the input data, those data only need to pass through the same network structure. Moreover, we introduce the virtual lane to propose a vehicle selection method, which makes the current vehicle care only about the largest impact vehicles. To ensure the effectiveness of small-scale neural networks, we set an upper limit on the number of reference vehicles. All agents play the same role in cooperation with each other. In the decision-making stage, the running states of the current vehicle and the reference vehicles are combined into a set. Each action takes into account the states and actions of the reference vehicles. The vehicle selection method is explained in detail in the next section.

Largest Impact Vehicles Selection.
The proposed method is built on a distributed system, and all vehicles are regarded as equal agents in the system. The process of selecting the largest impact vehicles is a preliminary step for information gathering. Given the decoupling of the identity and the running information of the vehicles, how to select the vehicles from which running information is obtained becomes an issue. In Figure 2, the radius of the arc represents the distance to the conflict points. The dots in different colours represent the vehicles projected onto the virtual lane, which is drawn as a black arrow pointing forward. In the scenario of this paper, there are four entering lanes, so four virtual lanes take effect. The projected virtual platoon is shown in Figure 3, where different colours and arrows indicate different directions.
In this paper, the selection of the largest impact vehicles depends on spatial distance. When L4 performs the vehicle selection, a star structure is obtained to express the relationship between L4 and the other vehicles. With the largest impact vehicle selection, the proposed CoMADDPG can use a relatively simple network structure to handle a large number of agents.
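A minimal sketch of distance-based selection on a virtual lane is given below. It assumes that every candidate vehicle has already been projected onto the virtual lane and is therefore described by a single signed distance d to the conflict point; impact is then ranked by the gap to the current vehicle along that lane. The function name `select_largest_impact` and the tie-breaking by input order are assumptions of this example, not details confirmed by the paper.

```python
from typing import List, Sequence, Tuple

Projected = Tuple[str, float]  # (label used only for display, distance d on the virtual lane)


def select_largest_impact(own_d: float, others: Sequence[Projected], m: int) -> List[Projected]:
    """Return at most m projected vehicles, ordered from smallest to largest gap.

    The returned position j (not the label) is what identifies a reference
    vehicle to the network, which keeps identity and running data decoupled.
    """
    ranked = sorted(others, key=lambda item: abs(item[1] - own_d))
    return list(ranked[:m])


vehicles = [("a", 80.0), ("b", 28.0), ("c", 35.0), ("d", -5.0)]
print(select_largest_impact(30.0, vehicles, 2))  # [('b', 28.0), ('c', 35.0)]
```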

Algorithm Architecture.
This part illustrates how to apply the CoMADDPG algorithm to the distributed control problem at the intersection.
A learning algorithm for this distributed RL problem consists of two main parts: the CoMADDPG trainer and the executor. Figure 5 shows the overall architecture. The executor obtains the updated policy from the trainer and uses it to collect experience from the simulation environment. Then, the trainer uses the experience data from the executor to update the policy network, following the DDPG model update process. Finally, the trainer distributes the newest model to the executors.
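The trainer and executor cycle described above can be outlined structurally as follows. This is a toy sketch only: the classes `Trainer` and `Executor`, the integer `model_version` that stands in for the network weights, and the placeholder update step are all assumptions of this example and do not reflect the actual implementation.

```python
import random
from collections import deque


class Trainer:
    """Holds the replay buffer and the newest model (represented by a version number)."""

    def __init__(self, buffer_size: int = 1000) -> None:
        self.buffer = deque(maxlen=buffer_size)
        self.model_version = 0

    def receive(self, experiences) -> None:
        self.buffer.extend(experiences)

    def update(self, batch_size: int = 4) -> bool:
        if len(self.buffer) < batch_size:
            return False
        _batch = random.sample(list(self.buffer), batch_size)
        # Placeholder for the DDPG-style actor and critic update on `_batch`.
        self.model_version += 1
        return True


class Executor:
    """Runs the policy in the simulation and reports experience to the trainer."""

    def __init__(self) -> None:
        self.model_version = -1

    def pull_model(self, trainer: Trainer) -> None:
        self.model_version = trainer.model_version

    def collect(self, steps: int):
        # Each record would hold the states and actions of the vehicle and its references.
        return [{"step": t, "model_version": self.model_version} for t in range(steps)]


trainer, executor = Trainer(), Executor()
executor.pull_model(trainer)           # executor gets the current policy
trainer.receive(executor.collect(5))   # experience flows back to the trainer
trainer.update()                       # trainer updates the model
executor.pull_model(trainer)           # newest model is distributed again
```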

Experimental Settings.
In this section, CoMADDPG is trained and evaluated in the intersection scenario, which contains 4 different directions and allows vehicles to go straight without turning. Therefore, there are four conflict points in the intersection, and each conflict point corresponds to a virtual lane. Furthermore, there are 4 types of vehicles, and the vehicles of each type share the same entrance and exit. Vehicles appear at the beginning of each entering lane, and their arrivals follow a Poisson process with different vehicle densities. Here, with the predefined arrival time and initial velocity, there is no need to set the distance between vehicles. Geometry and vehicle dynamics parameters are listed in Table 2. For velocity and initial velocity, m/s is used in the experiment, but, to facilitate understanding, km/h is used in this paper. The central server at the intersection collects experience from the vehicles, updates the model, and distributes the newest model to the vehicles entering the lanes. After receiving the newest model, a vehicle determines its actions with the model and sends the produced experience to the central server. In the results, the training processes of CoMADDPG are shown to illustrate the improvement over MADDPG.
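Arrival times that follow a Poisson process can be generated by accumulating exponentially distributed inter-arrival gaps. The sketch below shows this for a single entering lane; the function name `poisson_arrival_times`, the rate value, and the horizon are illustrative assumptions, not the settings of Table 2.

```python
import random
from typing import List, Optional


def poisson_arrival_times(rate_per_s: float, horizon_s: float,
                          seed: Optional[int] = None) -> List[float]:
    """Arrival times of a Poisson process with the given rate on (0, horizon_s]."""
    rng = random.Random(seed)
    times: List[float] = []
    t = 0.0
    while True:
        # Inter-arrival gaps of a Poisson process are exponentially distributed.
        t += rng.expovariate(rate_per_s)
        if t > horizon_s:
            return times
        times.append(t)


# One vehicle every 2 s on average, over a 100 s episode (illustrative values).
arrivals = poisson_arrival_times(rate_per_s=0.5, horizon_s=100.0, seed=0)
```

A higher `rate_per_s` corresponds to a higher vehicle density on that lane.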

Implementation Details.
In CoMADDPG, there are two modules, actor and critic, for inference and training. Each module corresponds to a network structure without shared parameters, as shown in Figure 6. In Figure 6(a), the actor module takes the state as input and outputs the action; it contains three dense layers and two normalization layers and uses ReLU, $\mathrm{ReLU}(x) = \max(0, x)$, presented in equation (10), as the activation function. On the other hand, the critic module is used to evaluate the actions with a Q-value in a specific state. The structural difference between critic and actor lies in the action set, which is concatenated to the processed state as an additional input. In CoMADDPG, the action set contains the actions performed by the current vehicle and the reference vehicles. Furthermore, complete hyperparameters are listed in Table 3.

Results and Discussion.
This section presents the performance of our algorithm at the intersection and analyzes the empirical results. Firstly, a 3D view is used to visually show the changes in the position and velocity of vehicles as they pass through the intersection. Then, the process of training is observed to verify the convergence of the proposed CoMADDPG. Next, as one of the significant contributions, the virtual lane is compared with actual lanes under different parameters. Moreover, the average travel time is tested with different vehicle densities and different lane lengths. Finally, an optimization-based method is chosen for comparison, and the real-time performance is discussed.
To show the running state clearly, Figure 7 illustrates the position and velocity profiles of vehicles approaching from the 4 entrances of the intersection. In Figure 7(a), the approaching vehicles enter the merging zone of the intersection in an orderly manner. Figure 7(b) shows how the approaching vehicles adjust their velocities to pass without collision. The speed adjustment is too frequent, which will be optimized in our future work to improve the stability of the vehicle speed.
Firstly, to demonstrate convergence, the experiment is designed to compare the training process of CoMADDPG with that of DDPG. From the perspective of reinforcement learning model training, the loss function, the reward, and the number of collisions are measured. CoMADDPG is the adaptation of MADDPG to the case of a dynamic number of agents, and DDPG is employed as our baseline, which considers only its own actions, without the actions of others, in the stage of centralized training. In terms of the mean reward and the loss function of the actor, there is no apparent difference between the two algorithms. In Figure 8(a), the loss functions of the actor, which is the decision module in CoMADDPG and DDPG, decrease and then flatten, which means that both algorithms can converge. During most steps of the training, DDPG has a lower loss than the proposed CoMADDPG, which is due to its simpler state and action input. However, the loss function is only an auxiliary indicator, and the number of collisions in Figure 8(d) is more valuable. In Figure 8(b), DDPG shows a higher loss than CoMADDPG. This is due to the insufficient information obtained by DDPG, which cannot form a partially stationary environment in which to perform the reinforcement learning algorithm. In Figure 8(c), both algorithms show the same trend in the mean reward. Concerning the cumulative number of collisions during training, CoMADDPG is much lower than DDPG in Figure 8(d). [...] CoMADDPG will maintain an upward trend. In the evaluation stage, there is no collision with CoMADDPG, while DDPG still collides frequently.
Secondly, vehicle selection based on virtual lanes is an essential step in the transformation of reinforcement learning methods. To discuss the impact of different vehicle selection strategies on training, the experiment is designed to compare the virtual lane-based method with the actual lane-based method.
Figure 9 exhibits the collision performance of the virtual lane-based and actual lane-based methods with different numbers of vehicles considered. The virtual lane-based method is described in Section 3.3, and the actual lane-based method relies on the physical distance to select vehicles. In the experiment, the maximum number of largest impact vehicles is set to 1, 3, and 6, respectively. Because one vehicle has conflicts with three of the four lanes, three is selected as a parameter. The influence of the front and rear vehicles is also considered here, so 6 is also used as a parameter for the experiment. The value 1 is set as a control parameter. As displayed in Figure 9, the actual lane-based method hardly achieves collision-free operation, although the number of new collisions drops significantly as more vehicles are considered. The virtual lane-based method keeps the number of collisions lower. When the number of vehicles considered reaches 6, the curve flattens faster. This demonstrates the effectiveness of virtual lane-based vehicle selection in CoMADDPG for achieving collision-free operation at unsignalized intersections.
Thirdly, to observe the influence of lane length on average travel time, vehicles of different densities are run on lanes of different lengths, and the average travel time is evaluated. Intuitively, a longer lane would allow vehicles to adjust their velocity to pass the intersection quickly. In Figure 10, there is no noticeable difference among different vehicle densities. The average travel time is proportional to the length of the lane, which means the proposed method can effectively deal with different lane lengths and vehicle densities. Moreover, low-density traffic performs relatively poorly in long lanes. This is due to the sparse intervehicle spacing, which results in insufficient coordination among vehicles.
Finally, the trained CoMADDPG is compared with the optimization-based approach [18], and the results are given in Table 4. It is observed that the proposed method reduces the average travel time by 7.23% to 39.28% in medium and high traffic volumes. Only in the case of low traffic flow is the traffic efficiency slightly lower than that of the optimization-based method. In addition, at higher densities, the optimization-based method is no longer applicable, while the proposed CoMADDPG still works normally. Note that the results are compared only in the speed adjustment area, and the optimization-based method has an extra distance of 100 meters for observing and optimizing the speed. In brief, the results reveal that the proposed CoMADDPG achieves a shorter travel time than the baseline, which helps to avoid vehicle congestion in the entering lanes of the intersection. The decision process is evaluated on a laptop with an Intel CPU (i7-8565U @ 1.8 GHz, 1.9 GHz), 16 GB RAM, and an NVIDIA GeForce MX250. The average running time is 0.36 ms. According to the 3GPP standard [32], the TX rate for cooperative collision avoidance between UEs supporting V2X applications is 100 messages/s, which means the message sending interval is 10 ms. The proposed CoMADDPG therefore meets the real-time requirement, and there is ample time to support function expansion.
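The real-time argument can be checked with a short calculation using only the figures quoted above (100 messages/s and an average decision time of 0.36 ms). The variable names are chosen for this example.

```python
TX_RATE_PER_S = 100   # messages per second, as quoted from the 3GPP requirement [32]
INFERENCE_MS = 0.36   # average decision time reported above

interval_ms = 1000.0 / TX_RATE_PER_S       # time available between two messages
utilization = INFERENCE_MS / interval_ms   # fraction of the interval spent on the decision
headroom_ms = interval_ms - INFERENCE_MS   # time left for other functions

print(f"interval = {interval_ms:.1f} ms, "
      f"utilization = {utilization:.1%}, headroom = {headroom_ms:.2f} ms")
```

Under these figures the decision step uses about 3.6% of each 10 ms message interval, leaving roughly 9.64 ms for communication and other functions.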

Conclusion
In this paper, a multiagent reinforcement learning method is employed to solve distributed cooperation for connected and automated vehicles at unsignalized intersections, which has been regarded as a challenging problem of cooperation among a dynamic number of vehicles. Vehicle selection is incorporated into MADDPG to form CoMADDPG, which allows it to adapt to the dynamic number of vehicles at the unsignalized intersection. Moreover, the virtual lane-based method enhances intervehicle cooperation for collision avoidance. A typical 4-direction intersection containing four different types of vehicles is studied. The simulation results demonstrate that the proposed method is efficient. Compared with the existing optimization-based method, the improvement of up to 39.28% in average travel time implies that CoMADDPG is well suited to handling distributed vehicle control safely and efficiently at unsignalized intersections.
To simplify the problem, this paper studies only the case where each direction has a single lane and vehicles only go straight. The proposed CoMADDPG can also be extended to the situation of multiple lanes and multiple directions. For multiple lanes, projecting more actual lanes onto virtual lanes requires an appropriate increase in the number of vehicles selected. For multiple directions, the piecewise projection of the collision points on the curve needs to be addressed.
In this paper, the safety and efficiency of passing through an unsignalized intersection are optimized, but vehicle stability is not considered. Introducing vehicle stability would limit the exploration ability of the RL algorithm and may cause it to fall into a local optimum, which cannot achieve the highest vehicle passing efficiency. In future work, vehicle stability will guide our research as an essential topic.

Data Availability
No data were used to support this study. What we adopted is reinforcement learning, and data are generated from the environment we made in our simulation.

Conflicts of Interest
The authors declare that they have no conflicts of interest.