
Unsignalized intersection control is one of the most critical issues in intelligent transportation systems, requiring connected and automated vehicles to support more frequent information interaction and on-board computing. Introducing reinforcement learning into unsignalized intersection control is therefore very promising. However, existing multiagent reinforcement learning algorithms, such as multiagent deep deterministic policy gradient (MADDPG), can hardly handle a dynamic number of vehicles, which fails to meet the needs of real road conditions. Thus, this paper proposes Cooperative MADDPG (CoMADDPG) for connected vehicles at unsignalized intersections to solve this problem. Firstly, the scenario of multiple vehicles passing through an unsignalized intersection is formulated as a multiagent reinforcement learning (RL) problem. Secondly, MADDPG is redefined to adapt to a dynamic number of agents, where each vehicle selects reference vehicles to construct a partially stationary environment, which is necessary for RL. Thirdly, this paper incorporates a novel vehicle selection method, which projects the reference vehicles onto a virtual lane and selects the vehicles with the largest impact to construct the environment. Finally, an intersection simulation platform is developed to evaluate the proposed method. According to the simulation results, CoMADDPG can reduce average travel time by 39.28% compared with an optimization-based method, which indicates that CoMADDPG has an excellent prospect for handling unsignalized intersection control.

In recent years, the development of connected vehicles [

As a classic application scenario of Intelligent Connected Transportation, the intersection is more complicated and challenging for cooperative intervehicle control than the highway. At an intersection, vehicles enter from different directions and leave through different exits, and the conflicts between vehicles greatly limit the efficiency and safety of intersection traffic. Therefore, a flexible and sophisticated intersection control system is necessary. Several related works have focused on traffic signal timing optimization. Ge et al. proposed a cooperative method for multi-intersection signal control based on Q-learning with Q-value transfer [

Generally, researchers divide unsignalized intersection control schemes into two categories: centralized and distributed. Centralized coordination approaches collect global information about the entire intersection to regulate the vehicles within it. Dai et al. solved the intersection control problem with convex optimization [

In decentralized coordination approaches, there is no central controller; instead, a controller is deployed in each vehicle to adjust its trajectory based on the kinetic information and conflict relationships of adjacent vehicles. Xu et al. introduced a conflict-free geometry topology and designed a distributed controller to stabilize the involved vehicles at the intersection [

One of the most critical goals in artificial intelligence is acquiring new skills, especially in multiagent environments. Reinforcement learning (RL) improves a policy via trial-and-error interaction with the environment, which is analogous to how human beings learn. Recently, RL has taken an essential role in a variety of fields, for example, wireless communication [

This paper proposes a distributed conflict-free cooperation method, Cooperative MADDPG (CoMADDPG), for multiple connected vehicles at unsignalized intersections. The main contributions of this paper are as follows:

This paper formulates the scenario of multiple vehicles passing through an unsignalized intersection as a multiagent reinforcement learning problem.

The Cooperative MADDPG (CoMADDPG) is proposed, which modifies the classic MADDPG algorithm to adapt to a dynamic number of agents. In CoMADDPG, each vehicle selects reference vehicles to construct a partially stationary environment, which is necessary for introducing the RL method.

This paper also proposes a novel vehicle selection method, which projects all reference vehicles onto a virtual lane and selects the vehicles with the largest impact on the virtual lane. It assists the CoMADDPG algorithm in converging quickly and avoiding collisions effectively.

The rest of this paper is organized as follows. Section

This paper focuses on a 4-direction intersection shown in Figure

The overall flow diagram of cooperative reinforcement learning integrated with intersection scenario.

As depicted in Figure

Several assumptions are adopted as follows. Firstly, the kinetic information of vehicles can be measured to support decision-making by each vehicle. Secondly, all approaching vehicles are assumed to be connected and automated, so that each vehicle can strictly obey the planned acceleration, adjust its velocity, and pass through the intersection automatically. Additionally, all vehicles enter the intersection at preset times that follow a Poisson process.

The problem is formulated as a multiagent reinforcement learning problem, and each vehicle is treated as an agent by defining state space, action space, and reward function.

According to the assumptions, each vehicle can obtain others’ kinetic information via vehicle-to-vehicle communication. To achieve collaboration, the state of a vehicle needs to include the dynamics of its adjacent vehicles. However, the dimensions of the state cannot be expanded arbitrarily, so several vehicles with the largest impact are chosen, as shown in equation (

In equation (

The reward function is designed, as shown in Table

Reward settings.

Reward items | Reward |
---|---|
Distance difference | log |
Time difference | |

In

It is noted that as the distance difference shrinks, the risk factor does not change linearly, so a logarithmic function is introduced to describe this nonlinear change. On the other hand, the sign of the time difference is a good indicator of whether the distance between the two vehicles is increasing or decreasing. If the distance between vehicles is small enough, an increasing distance generates a positive reward and vice versa; a transformation of the hyperbolic tangent function describes this change well. Moreover, to avoid reward explosion, which is an important issue in RL, the reward is clipped after each calculation to the range [−20, 20].
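As an illustration of this shaping, a minimal Python sketch is given below. The exact coefficients and the safe-gap threshold `safe_gap` are assumptions, since the precise equations are elided here, but the sketch reproduces the three stated ingredients: a logarithmic distance term, a tanh-transformed time difference, and clipping to [−20, 20].

```python
import math

def shaped_reward(distance_gap, time_gap, safe_gap=10.0, r_min=-20.0, r_max=20.0):
    """Illustrative reward: logarithmic distance term plus tanh-transformed
    time difference, clipped to [r_min, r_max]. Coefficients are assumed."""
    # Risk grows sharply (nonlinearly) as the gap shrinks below safe_gap.
    r_dist = math.log(max(distance_gap, 1e-6) / safe_gap)
    # The sign of time_gap indicates whether the gap is opening or closing.
    r_time = math.tanh(time_gap)
    # Clip to avoid reward explosion.
    return max(r_min, min(r_max, r_dist + r_time))
```

With this form, a shrinking gap drives the reward sharply negative, while a gap that is opening (positive time difference) partially offsets the penalty.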

In this paper, partially observable Markov games are considered, constituting a multiagent Markov decision process. The possible state

A significant problem faced by traditional RL algorithms is that each agent is continuously learning to improve its policy. Thus, from the perspective of each agent, the environment is dynamic and non-stationary, which violates the assumptions of traditional RL algorithms. To a certain extent, it is impossible to adapt to such a dynamic environment by merely changing the agent’s policy. Due to this non-stationarity, key techniques such as DQN-style experience replay cannot be used directly, and policy gradient methods suffer from growing variance as the number of agents increases.

MADDPG is an adaptation of actor-critic methods that considers the action policies of other agents and can learn policies requiring complex multiagent coordination. The algorithm has three characteristics. Firstly, the optimal policy obtained through learning can produce optimal actions using only local information. Secondly, it requires neither a dynamic model of the environment nor interagent communication. Thirdly, the algorithm can be used in both cooperative and competitive environments.

In the game with

One of the challenges in multiagent reinforcement learning is that when agents’ policies are updated, the environment changes, which contradicts the standard assumption of a stationary environment. Accordingly, a primary motivation behind MADDPG is that if the actions taken by all agents are known, the environment can be considered stationary even as the policies change. However, by default, the observation variables in MADDPG correspond to the agents one by one, which significantly limits the application scenarios of the algorithm. In this paper, the definition of a stationary environment is extended to suit situations with more agents.

In the problem of distributed vehicle control at the intersection, the continual entry and exit of vehicles leads to a large and varying number of agents. From the perspective of a stationary environment, stationarity can hold not only globally but also partially. To construct the environment, an agent selects several agents as reference agents. Therefore, the gradient of the expected return can be rewritten as

In equation (

Note that more agents do not make the decision process more complex: although more agents expand the length of the input data, those data only need to pass through the same network structure. Moreover, we introduce the virtual lane to propose a vehicle selection method, so that the current vehicle only needs to consider the vehicles with the largest impact. To ensure the effectiveness of small-scale neural networks, we set an upper limit on the number of reference vehicles. All agents play the same role in cooperating with each other. In the decision-making stage, the running states of the current vehicle and the reference vehicles are combined into a set, and each action takes into account the states and actions of the reference vehicles. The vehicle selection method is explained in detail in the next section.
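A minimal sketch of combining these states follows. The function name and the zero-padding up to a fixed `max_refs` are illustrative assumptions, consistent with the stated upper limit on reference vehicles, so that the critic's input size stays constant as vehicles enter and leave.

```python
def critic_input(ego_state, ego_action, ref_states, ref_actions, max_refs=4):
    """Concatenate the ego vehicle's state-action with the reference
    vehicles' states and actions, zero-padded to max_refs entries so the
    critic sees a fixed-size input as vehicles enter and leave."""
    pad = max_refs - len(ref_states)
    padded_states = list(ref_states) + [[0.0] * len(ego_state)] * pad
    padded_actions = list(ref_actions) + [0.0] * pad
    flat = list(ego_state) + [ego_action]
    for s in padded_states:
        flat.extend(s)            # reference vehicles' states
    flat.extend(padded_actions)   # reference vehicles' actions
    return flat
```

Because every agent uses the same network structure, this flat vector can be fed to a shared critic regardless of which vehicles were selected.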

The proposed method is built on a distributed system, and all vehicles are regarded as equal agents in the system. Selecting the vehicles with the largest impact is a preliminary step for information gathering. Since vehicle identity is decoupled from running information, selecting which vehicles to obtain running information from becomes an issue. There are two main types of vehicle collisions near intersections: longitudinal collisions within a lane and lateral collisions at the merging zone. Longitudinal collisions can be resolved by selecting adjacent vehicles in the same lane. To address lateral collisions, this paper introduces the concept of the virtual lane.

For clear description, eight vehicles (L3, L4, R3, R4, U3, U4, D3, and D4) are chosen for illustration in Figure

Virtual platoon projection. L3, L4, R3, R4, U3, U4, D3, and D4 are chosen for illustration, and L4 is taken as an example to perform the vehicles selection.

A projected virtual platoon.

In this paper, the selection of the vehicles with the largest impact depends on spatial distance. When L4 performs vehicle selection, a star structure is obtained to express the relationship between L4 and the other vehicles in Figure

Star structure for L4. The shown dots represent candidate vehicles, and the red-dashed boxes select the vehicles to be considered.

With the largest impact vehicle selection, the proposed CoMADDPG can use a relatively simple network structure to handle a large number of agents.
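The selection step above can be sketched as follows. Representing each vehicle simply as a `(vehicle_id, distance_to_conflict_point)` pair is an illustrative simplification: once every vehicle is projected onto the virtual lane, choosing the largest-impact vehicles reduces to a one-dimensional nearest-neighbour ranking.

```python
def select_reference_vehicles(ego, others, k=2):
    """Select the k largest-impact vehicles. After projection onto the
    virtual lane, each vehicle is a (vehicle_id, distance_to_conflict_point)
    pair, and impact is ranked by the gap to the ego vehicle."""
    _, ego_pos = ego
    # Smaller gap on the virtual lane means larger impact on the ego vehicle.
    ranked = sorted(others, key=lambda v: abs(v[1] - ego_pos))
    return ranked[:k]
```

For the star structure of L4, this would pick the vehicles whose projected positions lie closest to L4 on the virtual lane, whether they come from the same actual lane or a conflicting one.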

This part illustrates how to apply the CoMADDPG algorithm to the distributed control problem at the intersection.

A learning algorithm for this distributed RL problem consists of two main parts: CoMADDPG trainer and executor. Figure

Overall architecture of the algorithm. The new executor, that is, vehicle, downloads the newest network from the centralized trainer. The centralized trainer updates the networks with the experience from executors.

In this section, CoMADDPG is trained and evaluated in an intersection scenario that contains 4 different directions and allows vehicles to go straight without turning. Therefore, there are four conflict points in the intersection, and each conflict point corresponds to a virtual lane. Furthermore, there are 4 types of vehicles, and vehicles of each type share the same entry and exit. Vehicles appear at the beginning of each entering lane following a Poisson process with different vehicle densities. With a predefined arrival time and initial velocity, there is no need to set the distance between vehicles. Geometry and vehicle dynamics parameters are listed in Table

Geometry and vehicle dynamics parameters.

Parameter | Value |
---|---|
Lane length (m) | 155 |
Vehicle size (m) | 2 |
Velocity (km/h) | [18, 50] |
Initial velocity (km/h) | 36 |
Acceleration (m/s^{2}) | [−3, 3] |
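Arrival times following a Poisson process can be generated by accumulating exponentially distributed inter-arrival gaps. The sketch below is illustrative (function name and seed handling are assumptions), with the rate given in vehicles/hour as in the experiments.

```python
import random

def poisson_arrival_times(rate_veh_per_hour, horizon_s, seed=0):
    """Sample vehicle entry times over horizon_s seconds from a Poisson
    process by accumulating exponential inter-arrival gaps."""
    rng = random.Random(seed)
    rate_per_s = rate_veh_per_hour / 3600.0
    times, t = [], 0.0
    while True:
        # Inter-arrival gaps of a Poisson process are exponentially distributed.
        t += rng.expovariate(rate_per_s)
        if t > horizon_s:
            break
        times.append(t)
    return times
```

Each entering lane can draw its own arrival sequence this way, so only the arrival rate and initial velocity need to be specified, not the spacing between vehicles.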

In CoMADDPG, there are two modules, an actor and a critic, used for inference and training, respectively. Each module corresponds to a network structure without shared parameters, as shown in Figure

The network structures of actor and critic used in the proposed CoMADDPG. (a) Actor module. (b) Critic module.

Furthermore, complete hyperparameters are listed in Table

Hyperparameters of experiment.

Parameter | Value |
---|---|
Discounted factor | 0.80 |
Minibatch size | 128 |
Soft update factor | 0.998 |
Epoch | 300 |
Learning rate-actor | |
Learning rate-critic | |
Hidden layers number | 2 |
Hidden units number | 64 |
Optimizer | Adam |
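The soft update factor of 0.998 in the table suggests DDPG-style Polyak averaging of target network parameters. A sketch is given below; representing parameters as flat lists of floats (rather than network tensors) is an illustrative simplification, and the convention target ← 0.998·target + 0.002·online is an assumption consistent with common DDPG practice.

```python
def soft_update(target_params, online_params, tau=0.998):
    """Polyak averaging: target <- tau * target + (1 - tau) * online.
    A tau close to 1 keeps the target network moving slowly, which
    stabilizes the bootstrapped critic targets."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```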

This section presents the performance of our algorithm at the intersection and analyzes the empirical results. Firstly, a 3D view is used to visually show the changes in the position and velocity of vehicles as they pass through the intersection. Then, the process of training is observed to verify the convergence of the proposed CoMADDPG. Next, as one of the significant contributions, the virtual lane is compared with actual lanes under different parameters. Moreover, the average travel time is tested with different vehicle densities and different lane lengths. Finally, an optimization-based method is chosen for comparison, and the real-time performance is discussed.

In order to clearly understand the running state, Figure

Vehicle status profiles in the area of intersection (different colours: vehicles from different entrances). (a) Position profiles. (b) Velocity profiles.

Firstly, to demonstrate convergence, an experiment is designed to compare the training processes of CoMADDPG and DDPG. From the perspective of model training, the loss functions, the reward, and the cumulative number of collisions are measured. CoMADDPG is the adaptation of MADDPG to the case of a dynamic number of agents, and DDPG, which considers only an agent’s own actions without others’ actions during centralized training, is employed as the baseline. In terms of the mean reward and the loss function of the actor, there is no apparent difference between the two algorithms. In Figure

Comparison between CoMADDPG and DDPG during the training process. (a) The loss function of the actor. (b) The loss function of the critic. (c) The mean reward. (d) The cumulative number of collisions.

Secondly, vehicle selection based on virtual lanes is an essential step in adapting the reinforcement learning method. To discuss the impact of different vehicle selection strategies on training, an experiment is designed to compare the virtual lane-based method with the actual lane-based method. Figure

Comparison of cumulative collisions number using different vehicle selection configurations. “Actual” means actual lane-based vehicle selection. “Virtual” means virtual lane-based vehicle selection. The number indicates the maximum number of the largest impact vehicles.

Thirdly, to observe the influence of various lane lengths on average travel time, an experiment is conducted in which vehicles of different densities run on lanes of different lengths, and the average travel time is evaluated. Intuitively, a longer lane would allow vehicles to adjust their velocity to pass through the intersection quickly. In Figure

Comparison among various vehicle densities with different lane lengths. 1200 to 8400 indicate different vehicle densities, whose unit is vehicles/hour.

Finally, the trained CoMADDPG is utilized for comparison with the optimization-based approach [

Comparison of average travel times.

Total traffic volume (vehicles/hour) | 1200 | 2400 | 3600 | 4800 | 6000 |
---|---|---|---|---|---|
Travel time (s), Ref. [ | 12.145 | 13.773 | 21.366 | — | — |
Travel time (s), proposed | 12.385 | 12.777 | 12.974 | 12.580 | 12.618 |
Improvement (%) | −1.98 | 7.23 | 39.28 | — | — |

The decision process is evaluated on a laptop with an Intel CPU (i7-8565U @ 1.8 GHz, 1.9 GHz), 16 GB RAM, and an NVIDIA GeForce MX250. The average running time is 0.36 ms. According to the 3GPP standard [

In this paper, a multiagent reinforcement learning method is employed to solve distributed cooperation for connected and automated vehicles at unsignalized intersections, which has been regarded as a challenging problem of cooperation among a dynamic number of vehicles. Vehicle selection is incorporated into MADDPG to form CoMADDPG, which adapts to the dynamic number of vehicles at the unsignalized intersection. Moreover, the virtual lane-based method enhances intervehicle cooperation for collision avoidance. A typical 4-direction intersection containing four different types of vehicles is studied. The simulation results demonstrate that the proposed method is efficient. Compared with an existing optimization-based method, the improvement of up to 39.28% implies that CoMADDPG can handle distributed vehicle control safely and efficiently at unsignalized intersections.

To simplify the problem, this paper only studies the case where each lane only allows going straight. The proposed CoMADDPG can also handle multiple lanes and multiple directions. For multiple lanes, projecting more actual lanes onto virtual lanes requires an appropriate increase in the number of selected vehicles. For multiple directions, the piecewise projection of the collision points onto curved trajectories needs to be addressed.

In this paper, the traffic safety and efficiency of passing through the unsignalized intersection are optimized, but vehicle stability is not considered. Introducing vehicle stability would limit the exploration ability of the RL algorithm and may lead to a local optimum that does not achieve the highest passing efficiency. In future work, vehicle stability will guide our research as an essential topic.

No external data were used to support this study. Reinforcement learning is adopted, and all data are generated from the environment built in our simulation.

The authors declare that they have no conflicts of interest.

This research was funded by the National Key R&D Program of China (2016YFB0100902).