A Decision-Making Model for Autonomous Vehicles at Urban Intersections Based on Conflict Resolution

Beijing Institute of Technology, School of Mechanical Engineering, Intelligent Vehicle Research Institute, 5 South Zhong Guan Cun Street, Haidian District, Beijing 100081, China
Advanced Technology Research Institute, Beijing Institute of Technology, Jinan 250001, Shandong, China
University of California, Berkeley, 1357 South 46 Street, Richmond, CA 94804, USA
Shandong Hi-Speed Construction Management Group Co., Ltd., Jinan 250001, Shandong, China


Introduction
Today's driver-assistance systems have made traffic safer and more efficient and represent considerable progress toward fully autonomous driving. Developing the next generation of driver-assistance systems, or even self-driving systems, requires algorithms capable of handling complex situations. Many researchers have proposed approaches for perception [1], path planning [2], and control [3]. However, decision-making for autonomous driving at intersections remains one of the major bottlenecks. The primary reason crossing behavior is difficult to analyze is that most models only work when given long-term, accurate predictions of the trajectories of the other participants. To address this problem, this paper focuses on developing a tactical decision-making model for autonomous vehicles in intersection-crossing scenarios. The problem of robust tactical decision-making for autonomous vehicles in complex, dynamic urban environments has been investigated extensively by many organizations and researchers, including Google [4], Carnegie Mellon University [5], Berkeley [6], and Baidu [7]. UC Berkeley utilized a minimal future distance and a two-level dynamic threshold to perform collision prediction at urban intersections [8]. BMW and the University of Munich proposed a decision-making model based on partially observable Markov decision processes [9]. NVIDIA used a deep convolutional neural network (DCNN) to build an end-to-end driving model [10].
In recent years, more and more researchers have studied decision-making behavior. Chen [11] established a vehicle decision model for urban environments using a hierarchical finite state machine, accounting for different driver and road-environment characteristics. Liu et al. [12] adopted control prediction theory and reinforcement learning theory to obtain a decision model. However, these models cannot be adapted to urban intersections. Ma et al. [13] proposed a decision-making framework titled "Plan-Decision-Action" for autonomous vehicles at complex urban intersections. Zhong et al. [14] proposed a model-learning-based actor-critic algorithm with a Gaussian process approximator to solve problems with continuous state and action spaces. Xiong et al. [15] used a hidden Markov model to predict other vehicles' intentions and built a decision-making model for vehicles at intersections. Lv et al. [16] combined offline and online machine learning methods to establish a personalized decision model that simulates the characteristics of driver behavior. Chen et al. [17] used rough-set theory to extract different drivers' decision rules. Chen et al. [18] used a novel rough-set artificial neural network (RSAN) method to learn decisions made by skilled human drivers. Chen et al. [19] proposed a merging strategy based on the least squares policy iteration (LSPI) algorithm, selecting a basis function comprising the reciprocal of the time to collision (TTC), the relative distance, and the relative velocity to represent the state space, and discretizing the action space. However, these studies did not take the overall interaction scenario into consideration and can only be adopted for short-term trajectory prediction.
This paper focuses on the decision-making process of autonomous vehicles in an urban environment and develops a vehicle trajectory prediction model based on Gaussian process regression (GPR) [20], which can generate long-term predictions for incoming vehicles. The problem of conflict resolution among vehicles at intersections is modeled as a multiobjective optimization problem (MOP), in which the acceleration, as the only decision variable, is used to control the vehicles. The main contributions of this work are two solutions to the intersection MOP. The first applies the nondominated sorting genetic algorithm (NSGA-II) to maximize the overall driving benefit of the system; the second uses the deep deterministic policy gradient (DDPG) reinforcement learning algorithm, which supports continuous actions. Because DDPG estimates the expected gradient of the action-value function, its gradient estimates are much more stable than those of the usual stochastic policy gradient. A simulation and verification platform based on MATLAB/Simulink and PreScan was built to validate the results, and the proposed MOP decision-making method and solution algorithms were verified in several typical scenarios. The remainder of this paper is organized as follows: Section 2 elaborates on the methodology used in this study, including introductions to Gaussian process regression, the nondominated sorting genetic algorithm (NSGA-II), and the deep deterministic policy gradient algorithm of reinforcement learning. Section 3 describes data acquisition and processing. Section 4 proposes the GPR models for trajectory prediction and the MOP decision-making model based on efficient conflict resolution at intersections, which is solved by NSGA-II and DDPG. Section 5 introduces the simulation platform used to evaluate the effectiveness and reliability of the proposed model and to compare the two algorithms. Section 6 presents conclusions and future work.

Gaussian Process Regression Model.
Gaussian process regression (GPR) is a statistical method that makes full use of raw data, considering its temporal trends and periodic changes, to establish a suitable predictive model. This model has been used to predict vehicle trajectories and has proven efficient. Compared with long short-term memory (LSTM) networks, its main advantage is greater robustness to noisy data, which makes it more suitable for urban intersections.
The log likelihood function of the sample data is

$\log p(\mathbf{y} \mid X) = -\frac{1}{2}\mathbf{y}^{T}K^{-1}\mathbf{y} - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi$. (1)

The joint distribution of the model's observations and the test output is

$\begin{bmatrix}\mathbf{y} \\ y_{*}\end{bmatrix} \sim N\left(\mathbf{0}, \begin{bmatrix} K & K_{*} \\ K_{*}^{T} & C(x_{*}, x_{*})\end{bmatrix}\right)$, (2)

where $K_{*} = [C(x_{*}, x_{1}), C(x_{*}, x_{2}), \ldots, C(x_{*}, x_{n})]^{T}$ is the covariance vector between the test input $x_{*}$ and the training data, and $C(x_{*}, x_{*})$ is the covariance of the test input itself. Conditioning (2) on the observations yields the output of the model in (3): the predicted mean $\bar{y}_{*}$ and the predictive variance $\sigma_{*}^{2}$ are obtained as

$\bar{y}_{*} = K_{*}^{T}K^{-1}\mathbf{y}, \qquad \sigma_{*}^{2} = C(x_{*}, x_{*}) - K_{*}^{T}K^{-1}K_{*}$. (3)
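As a concrete illustration, the predictive mean and variance above can be implemented directly with NumPy. The one-dimensional toy data and the kernel hyperparameters below are purely illustrative, not the trained values from this paper:

```python
import numpy as np

def se_kernel(a, b, length=0.3, sigma_f=1.0):
    # Squared-exponential covariance C(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2)).
    d = a[:, None] - b[None, :]
    return sigma_f**2 * np.exp(-d**2 / (2.0 * length**2))

def gpr_predict(x_train, y_train, x_star, noise=1e-6):
    # Predictive mean  y_* = K_*^T K^-1 y  and
    # variance  sigma_*^2 = C(x_*, x_*) - K_*^T K^-1 K_*, as in (3).
    K = se_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = se_kernel(x_train, x_star)
    K_inv = np.linalg.inv(K)
    mean = K_star.T @ K_inv @ y_train
    cov = se_kernel(x_star, x_star) - K_star.T @ K_inv @ K_star
    return mean, np.diag(cov)

x = np.linspace(0.0, 1.0, 10)
y = np.sin(2.0 * np.pi * x)
mean, var = gpr_predict(x, y, np.array([0.5]))
```

By the symmetry of the toy data, the predicted mean at x = 0.5 is close to zero, while the variance stays small between densely sampled training points.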

Nondominated Sorting Genetic Algorithm.
In 2000, the nondominated sorting genetic algorithm II (NSGA-II) was proposed by Deb et al. on the basis of the NSGA of Srinivas and Deb, a theory and method for finding Pareto optima in multiobjective optimization problems. It is one of the most popular multiobjective genetic algorithms (GAs) for analyzing complex systems and discovering diverse solutions. The structure of the algorithm is shown in Figure 1. The step-by-step procedure shows that the NSGA-II algorithm is simple and straightforward. First, a combined population Rt = Pt ∪ Qt of size 2N is formed. Then, Rt is sorted according to nondomination. Since all previous and current population members are included in Rt, elitism is ensured. Solutions belonging to the best nondominated set F1 are the best solutions in the combined population and must be emphasized more than any other solutions. If the size of F1 is smaller than N, all members of F1 are chosen for the new population Pt+1. The remaining members of Pt+1 are chosen from subsequent nondominated fronts in the order of their ranking: solutions from F2 are chosen next, followed by solutions from F3, and so on. This procedure continues until no more complete sets can be accommodated; the last front considered, beyond which no other set can be accommodated, is filled by crowding-distance sorting [21].
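The nondominated sorting step described above can be sketched as follows (minimization convention; the objective vectors are illustrative):

```python
def dominates(u, v):
    # u dominates v if u is no worse in every objective and strictly
    # better in at least one (minimization convention).
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def nondominated_sort(objectives):
    # Peel off successive nondominated fronts F1, F2, ... of the population.
    fronts, remaining = [], list(range(len(objectives)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objectives[j], objectives[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

fronts = nondominated_sort([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)])
```

Here the first three points are mutually nondominated and form F1, while (3, 3) and (4, 4) fall into F2 and F3, respectively.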

Deep Deterministic Policy Gradient.
The interactive learning process of reinforcement learning resembles human learning and can be represented as a Markov decision process consisting of (s, a, P, r). In 2013, DeepMind proposed the DQN algorithm [22], opening a new era of deep reinforcement learning. The algorithm's core improvements are experience replay and a second, target network [23], which eliminate the correlation between training samples and improve training stability. Algorithms evolved from DQN have made great progress on discrete action-control problems, but they struggle to learn continuous control policies. In 2015, DeepMind proposed the DDPG algorithm based on the DPG and DQN algorithms [24], importing the normalization mechanisms of deep learning [25]. Experiments show that the algorithm performs well on many kinds of continuous control problems.
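The target-network mechanism mentioned above can be sketched as a soft parameter copy, as used by DDPG. The dictionary-of-arrays parameterization and the τ value here are illustrative simplifications:

```python
import numpy as np

def soft_update(target_params, source_params, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta': the target network slowly
    # tracks the estimate network, which stabilizes the learning targets.
    return {name: tau * source_params[name] + (1.0 - tau) * target_params[name]
            for name in target_params}

target = {"w": np.array([0.0])}
source = {"w": np.array([1.0])}
target = soft_update(target, source, tau=0.1)
```

With τ = 0.1 the target weight moves one tenth of the way toward the estimate network's weight per update.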
The DDPG algorithm is an improved actor-critic method. In the actor-critic algorithm, the actor function π(s|ϕ) generates an action given the current state. The critic evaluates an action-value function Q(s, a|θ) based on the actor's output as well as the current state. The TD (temporal-difference) errors produced by the critic drive the learning of the critic network, and the actor network is then updated by the policy gradient. The DDPG algorithm combines the advantages of the actor-critic and DQN algorithms, making convergence easier. In other words, DDPG borrows from DQN the use of a target network and an estimate network for both the actor and the critic. Moreover, the policy of the DDPG algorithm is no longer stochastic but deterministic: the actor network outputs a single action instead of probabilities over different actions. The critic network is updated by minimizing the loss

$L = \frac{1}{N}\sum_{i}\left(y_{i} - Q(s_{i}, a_{i}\mid\theta^{Q})\right)^{2}$, (4)

where $y_{i} = r_{i} + \gamma Q'(s_{i+1}, \pi'(s_{i+1}\mid\phi^{\pi'})\mid\theta^{Q'})$ is the Q value estimated by the target networks and N is the minibatch size. The actor network is updated by means of the gradient term

$\nabla_{\phi}J \approx \frac{1}{N}\sum_{i}\nabla_{a}Q(s, a\mid\theta^{Q})\big|_{s=s_{i},\, a=\pi(s_{i})}\,\nabla_{\phi}\pi(s\mid\phi)\big|_{s=s_{i}}$, (5)

where $Q(s, a\mid\theta^{Q})$ is from the critic estimate network. Furthermore, the DDPG algorithm handles continuous action spaces by means of experience replay and soft target-network updates. The updates of the target critic and target actor networks are as follows:

$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1 - \tau)\theta^{Q'}, \qquad \phi^{\pi'} \leftarrow \tau\phi^{\pi} + (1 - \tau)\phi^{\pi'}$. (6)

Data

The data were collected from the intersection of Wei Gong Cun Road using subgrade sensors and a retrofit autonomous vehicle as the training and testing samples of the trajectory prediction model. The details are discussed in the following sections.

Subgrade Data Acquisition.
The camera for subgrade data acquisition was installed on the BIT Science and Technology Building. The vehicles' locations (x, y, z), velocities (v), and accelerations (a) were extracted. The symmetric exponential moving average (SEMA) method [26] was adopted to smooth the training data.
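The smoothing step can be sketched as follows; the half-window and decay constants are illustrative choices, not the values used in [26]:

```python
import numpy as np

def sema(signal, half_window=2, delta=1.0):
    # Symmetric exponential moving average: each sample is replaced by a
    # weighted mean of its neighbours, with weights exp(-|i - t| / delta)
    # decaying symmetrically on both sides of the sample.
    out = np.empty(len(signal), dtype=float)
    for t in range(len(signal)):
        lo, hi = max(0, t - half_window), min(len(signal), t + half_window + 1)
        w = np.exp(-np.abs(np.arange(lo, hi) - t) / delta)
        out[t] = float(np.sum(w * np.asarray(signal[lo:hi])) / np.sum(w))
    return out

smoothed = sema([0.0, 0.0, 10.0, 0.0, 0.0])
```

A constant signal passes through unchanged, while an isolated spike is attenuated, which is the desired behavior for noisy position and velocity measurements.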

Vehicle Data Acquisition.
The vehicle data were collected with a BYD drive-by-wire autonomous vehicle retrofitted by the BIT Intelligent Vehicle Research Institute. The retrofit autonomous vehicle "Surui" [27] was equipped with several kinds of sensors, as shown in Figure 2(b). The camera and LIDAR sensor were able to detect, track, and localize dynamic objects. The outputs of the fusion algorithm are the positions of the vehicles.

Trajectory Prediction Model.
A trajectory prediction model based on the GPR model was used to predict the trajectories of MVs. The training process of the GPR models [28] is shown in Figure 4(a).
In this paper, the data collected from the subgrade sensors were used to train the GPR models and optimize their hyperparameters. The inputs were (x(t), y(t), v_x(t), v_y(t)), while a_x(t) was the output. A squared exponential (SE) covariance function was adopted as the kernel because it can accurately describe the nonlinear relationships between the inputs and outputs. A conjugate gradient optimization algorithm was then adopted to search for the optimal parameters. When the error fell below 0.001, the results were regarded as convergent.
After training the prediction model, since this paper pays more attention to straight-driving MVs, the constant acceleration (CA) kinematic formula [29] is utilized to calculate the follow-up trajectories more accurately, as shown in Figure 4(b).
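A minimal sketch of the constant-acceleration rollout used for the follow-up trajectory (the sampling step and horizon are illustrative):

```python
import numpy as np

def ca_rollout(x0, v0, a, horizon=3.0, dt=0.1):
    # Constant-acceleration kinematics: x(t) = x0 + v0 t + 0.5 a t^2,
    # sampled every dt seconds over the prediction horizon.
    t = np.linspace(0.0, horizon, int(round(horizon / dt)) + 1)
    return x0 + v0 * t + 0.5 * a * t**2

traj = ca_rollout(x0=0.0, v0=10.0, a=2.0, horizon=1.0)
```

For example, a vehicle starting at rest position 0 m with speed 10 m/s and acceleration 2 m/s² covers 11 m in one second.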

A Decision-Making Model Based on Efficient Conflict Resolution.
An appropriate parameter should be selected to analyze traffic conflicts. TTC (time to collision) is widely used in traffic conflict research, but it is generally applied to scenes such as highways and is ill-suited to evaluating the collision risk between vehicles at intersections. We instead use the estimated post-encroachment time (EPET) as the safety indicator. EPET describes the difference between the times at which two vehicles pass through the center of the conflict zone and can effectively evaluate the collision risk between vehicles at any approach angle, as shown in Figure 3(b):

$\mathrm{EPET}_{i} = \left|T_{uv} - T_{mv}^{i}\right|$, (7)

where $T_{uv}$ and $T_{mv}^{i}$ are, respectively, the times at which the UV and MV(i) arrive at the conflict zone. A larger EPET is desirable, as it indicates a smaller risk of collision.
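Under a constant-speed approximation of the arrival times (an assumption for illustration; in the model the arrival times come from the predicted trajectories), EPET can be computed as:

```python
def epet(d_uv, v_uv, d_mv, v_mv):
    # |T_uv - T_mv(i)|: the gap between the arrival times of the two
    # vehicles at the centre of the conflict zone.
    # Distances in m, speeds in m/s, result in s.
    t_uv = d_uv / v_uv
    t_mv = d_mv / v_mv
    return abs(t_uv - t_mv)

gap = epet(d_uv=30.0, v_uv=10.0, d_mv=20.0, v_mv=10.0)
```

Here the UV arrives 1 s after the MV; a larger gap means a safer crossing.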
While ensuring safety, an appropriate speed is also expected, representing efficiency while crossing the intersection. Using these criteria, we define a profit function U that combines safety and efficiency; a larger U represents more ideal motion during the crossing. V_cri is the expected speed for an MV, set to 40 km/h according to the driving rules at the intersections. U is defined to be negative to ensure efficiency in the subsequent model, e.g., in deep reinforcement learning.
As the states and actions of the vehicles are continuous, we use the acceleration a as the control variable. A constrained multiobjective optimization problem (MOP) is proposed based on conflict resolution at the intersection, the goal of which is to maximize the profit of the system. The interaction between vehicles is quantified by importing a variable parameter P: when P is 0, the vehicles cross the intersection in pure competition, and when P is 1, the vehicles cooperate completely. The mathematical model of a MOP is usually expressed as

$\min f(X) = [f_{1}(X), f_{2}(X), \ldots, f_{k}(X)]^{T}$, subject to $h_{i}(X) = 0$, $i = 1, \ldots, p$, and $g_{j}(X) \geq 0$, $j = 1, \ldots, q$,

where f(X) is the objective function and $h_{i}(X) = 0$ and $g_{j}(X) \geq 0$ are the constraint conditions. To find the maximum of U, the problem is transformed into finding the minimum of the negated function −U. In the resulting model, v_max depends on the speed limit at the intersection, and a_max and a_min reflect the comfort requirement during driving, defined as ±2 m/s² in this paper.
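The max-to-min transformation and the inequality constraints can be handled, for example, with a simple penalty term. The penalty weight below is an illustrative choice, not a value from the paper:

```python
def penalized_objective(u, g_values, penalty=1e3):
    # Maximizing U is equivalent to minimizing -U. Constraint violations
    # (g_j(X) < 0) are penalized so that infeasible accelerations are
    # strongly disfavoured by the optimizer.
    violation = sum(max(0.0, -g) for g in g_values)
    return -u + penalty * violation

feasible = penalized_objective(-2.0, [0.5, 1.0])     # all g_j >= 0
infeasible = penalized_objective(-2.0, [-0.5, 1.0])  # g_1 violated by 0.5
```

A feasible candidate keeps the bare value −U, while an infeasible one is pushed far up the minimization landscape.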

Constraint Condition.
To ensure safety, a simplified circle model for vehicles is established, as shown in Figure 5.
We set a safety constraint requiring no overlap between the excircles of the vehicles:

$\sqrt{\left(x_{i} - x_{j}\right)^{2} + \left(y_{i} - y_{j}\right)^{2}} \geq \sqrt{L^{2} + W^{2}}$,

where L and W are, respectively, the length and width of the vehicles. The motion state of the vehicles is given by

$x_{i}(t) = x_{i}(0) + \left(v_{i}t + \tfrac{1}{2}a_{i}t^{2}\right)\cos\varphi_{i}, \qquad y_{i}(t) = y_{i}(0) + \left(v_{i}t + \tfrac{1}{2}a_{i}t^{2}\right)\sin\varphi_{i}$,

where $(x_{i}(0), y_{i}(0))$ is the initial position and $\varphi_{i}$ is the orientation.
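The circle constraint and the motion update can be sketched as follows; the vehicle dimensions follow the simulation setup of Section 5, while everything else is illustrative:

```python
import math

def excircle_radius(length, width):
    # Radius of the circle circumscribing the vehicle's bounding rectangle.
    return math.hypot(length, width) / 2.0

def no_overlap(p_i, p_j, length=4.8, width=2.178):
    # Safety constraint: the centre distance must be at least the sum of the
    # two excircle radii, i.e. sqrt(L^2 + W^2) for two identical vehicles.
    return math.dist(p_i, p_j) >= 2.0 * excircle_radius(length, width)

def position(p0, v, a, phi, t):
    # Constant-acceleration motion along the heading phi:
    # x(t) = x(0) + (v t + 0.5 a t^2) cos(phi), likewise for y.
    s = v * t + 0.5 * a * t**2
    return (p0[0] + s * math.cos(phi), p0[1] + s * math.sin(phi))
```

Two vehicles 10 m apart satisfy the constraint (the excircle radius here is about 2.64 m), whereas 3 m apart they would not.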

Process of Decision Making.
For the MOP model, we compute an optimal solution using NSGA-II; the process is shown in Figure 6.
There are two stages in the solution process: the first is decision-making at the initial moment, performing the action with the known information; the second is updating the positions and velocities of the vehicles with dynamic information and then regenerating the optimal motions.

The Calculation Method Based on Deep Reinforcement Learning.
If we assume that the process of crossing the intersection is a Markov decision process (MDP), it is practical to apply deep reinforcement learning with continuous action spaces. The input state comprises the speeds of the vehicles and the distances from the centers of the vehicles to the center of the conflict zone. In this study, the reward function is built in the same way as the profit function U, i.e., R = U, which accounts for both safety and efficiency. We seek a larger total reward, i.e., the sum of the rewards over all steps, and converge to it through training based on the policy gradient; this is why the reward is set negative. With a positive reward function, a larger total reward could result from taking more steps, i.e., an inefficient policy that takes more time to cross the intersection. With a negative reward function, however, a larger total reward corresponds to a safe and efficient policy.
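Since the extracted text does not reproduce the closed form of U, the sketch below uses a hypothetical negative per-step reward that penalizes small EPET values and deviation from the expected speed. The weights and the 1/EPET safety term are assumptions for illustration, not the paper's exact formula:

```python
def reward(epet_value, speed, v_cri=40.0 / 3.6, w_safe=1.0, w_eff=1.0):
    # Negative by construction: accumulating fewer, smaller penalties
    # (i.e. a short, safe, near-v_cri crossing) yields a larger total reward.
    safety = -w_safe / max(epet_value, 1e-3)          # small EPET -> big penalty
    efficiency = -w_eff * abs(speed - v_cri) / v_cri  # deviation from 40 km/h
    return safety + efficiency

safe_fast = reward(epet_value=5.0, speed=40.0 / 3.6)  # large gap, expected speed
risky_slow = reward(epet_value=0.5, speed=5.0)        # small gap, slow crossing
```

Every step contributes a negative amount, so a policy that crosses quickly and safely accumulates the least penalty, matching the argument above.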

Discussion and Evaluation
In this section, we trained the DDPG on OpenAI Gym and then tested the algorithms in PreScan for comparison. This allowed us to verify the effectiveness and reliability of the proposed algorithms.
The simulation parameters are set as follows. We test the algorithms in single- and multiple-vehicle scenes in which one or more MVs drive straight from north to south, and a UV controlled by the algorithms is expected to cross the intersection without collision. The length and width of the MVs and the UV are 4800 mm and 2178 mm, respectively, the communication range is 200 m, and the speed limit at the intersection is 60 km/h.

Simulation and Verification Platform.
PreScan is a simulation environment for developing advanced driving assistance systems (ADASs) and intelligent vehicle (IV) systems. The platform can be used to build 3D virtual traffic scenes and to generate vehicles, pedestrians, traffic lights, and other control modules, as shown in Figure 7(a). PreScan comes with a powerful graphics preprocessor, a high-end 3D visualization viewer, and a connection to standard MATLAB/Simulink. It is composed of several main modules, each representing a specific world; multiple sensor readings were simulated and captured in the Sensor World. We built a new intersection task with multiple vehicles on OpenAI Gym, as shown in Figure 7(b). The deterministic actor network and the critic network have the same architecture: multilayer perceptrons with two hidden layers (64-64). For the exploration policy, we implemented a stochastic Gaussian policy whose mean and variance networks are each represented by an MLP with two hidden layers (64-64).
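The 64-64 multilayer-perceptron architecture can be sketched in plain NumPy. The 4-dimensional state, the ReLU activations, and the tanh output squashing are illustrative assumptions; the paper only specifies the two 64-unit hidden layers:

```python
import numpy as np

def init_mlp(sizes, rng):
    # One (weights, bias) pair per layer; small random initial weights.
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, params, out_act=np.tanh):
    # Two ReLU hidden layers of 64 units, matching the 64-64 setup;
    # tanh keeps the actor's scalar acceleration output bounded.
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    return out_act(h @ W + b)

rng = np.random.default_rng(0)
actor = init_mlp([4, 64, 64, 1], rng)   # state (4-d) -> scalar action
action = mlp_forward(np.zeros(4), actor)
```

The bounded output can then be scaled to the comfort range ±2 m/s² used by the MOP model.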

Results of Prediction Model.
In this paper, the predictions of steering-vehicle and straight-vehicle trajectories are verified separately. The trajectories are divided into several groups to evaluate prediction performance. The prediction lengths for the straight vehicles are 3 s, 4 s, 5 s, and 6 s; for the steering vehicles, they are 3 s, 4 s, and 5 s. Each group contains 80 trajectories. Figure 8(a) shows the prediction error of the straight-vehicle trajectories: the GPR model performs better than the commonly used model in predicting straight-vehicle trajectories. Figure 8(b) shows the prediction error of the steering-vehicle trajectories: the GPR model is more accurate than the constant turn rate and velocity (CTRV) motion model.

Effect of the MOP Model.
Scenario 1: single-vehicle scenario. Figure 9(a) depicts the interaction between a UV and an incoming MV. Two experiments were carried out on the simulation platform; the difference between them was whether the UV was controlled by the tactical decision-making algorithm. In the first experiment, without the proposed algorithm, a collision between the MV and the UV occurred at t = 5.8 s. In the second experiment, the main vehicle was controlled by the proposed algorithm. When the two vehicles met at the intersection, the main vehicle predicted the trajectory of the other vehicle, as shown in Figure 9(b). In this experiment, deceleration was the optimal choice. The desired velocities given by the decision-making algorithm and the actual velocity changes are shown in Figure 9(c).
There was no collision because the algorithm chose to yield to the incoming vehicle. Figure 9(c) shows that with the decision-making algorithm, the main vehicle decelerates before entering the conflict zone, slowing down to give way to the incoming vehicle. Figures 9(d) and 9(e) show the distances and TTCs of the two vehicles. Before the algorithm is executed, both the distance and the TTC curves of the two vehicles pass through x = 0, indicating that a collision occurs. After the algorithm is executed, the distance and the TTC remain within the safe range, indicating that no collision occurs.

Comparison of the NSGA-II and DDPG Algorithms.
Scenario 2: multiple-vehicle scenario
To compare the performance of the DDPG and NSGA-II algorithms, we conducted two groups of experiments on the same scene, in which D_MV1 and D_MV2 were, respectively, 10 m and 32 m, and the initial position of the UV, D_UV, was 30 m. MV1 and MV2 were set to drive at a constant speed of 40 km/h. We then trained the DDPG algorithm on the MOP model, tested its performance in group B, and compared it with that of NSGA-II in group A, as shown in Figure 10.
For group A, the UV adopts a yield strategy: it slows down before t = 3 s to wait for MV1 and MV2 to cross the intersection and then accelerates after the MVs move away. As shown in Figure 10(a), as the speed of the UV falls increasingly below the expected speed, the reward declines until t = 3 s and increases thereafter. A longer crossing time means a larger accumulation of negative reward, which leads to a lower total reward of −44.184. Figure 10(b) shows that in group B the UV passes through the intersection between the two MVs with an efficient strategy; as shown in the bottom image in Figure 10(b), the UV reaches the conflict zone at t = 2 s, approximately 0.5 s earlier than MV2. In the image, the shaded area represents the conflict zone, accounting for the size of the vehicles. With the efficient strategy of the DDPG, the UV maintains an acceleration of 2 m/s² during the entire crossing, thus achieving a much higher total reward than in group A. The comparison data in Table 1 show that the crossing time of the UV in group B is approximately 1.5 s lower than in group A, which means that the DDPG algorithm reduces traffic delay and improves the efficiency with which the UV passes through the intersection. Moreover, the rate of change of the UV's acceleration is lower in group B, implying lower energy consumption. In general, the DDPG algorithm is more efficient than NSGA-II. The stability of the DDPG and NSGA-II algorithms was studied in a further task in which the initial speed of the UV was varied from 30 km/h to 55 km/h.
We built a single-vehicle scene containing only the UV and imported the trained actor policy of the DDPG to output the motions of the UV. We then ran the NSGA-II algorithm as a comparison group, observing the performance on the same task 10 times. As shown in Figure 11, because the NSGA-II solution is recalculated each time, its total reward varies considerably at the same initial UV speed. The DDPG, on the other hand, gives a more stable and efficient result, and its average total reward is higher than that of NSGA-II. Furthermore, the average total rewards of both algorithms decrease when the initial speed exceeds 50 km/h, indicating the possibility of a collision.

Conclusion and Future Work
To improve the safety and efficiency of autonomous vehicles, this paper proposed a MOP decision-making model based on efficient conflict resolution for autonomous vehicles at urban intersections, considering the complexity of urban intersections and the uncertainty of vehicle behavior. A prediction algorithm for incoming vehicles was studied, and the performance of the UV at intersections under the decision-making model was compared between NSGA-II and DDPG. The main conclusions are as follows: (1) The trajectory prediction model fits the predicted trajectory by learning the probability distribution of a large amount of trajectory data, and the accuracy of the model depends on the quantity and quality of the training data. The incoming-vehicle trajectory data collected in this paper were limited and could not cover all incoming-vehicle motion patterns. (2) The MOP decision-making model performs well and can prevent vehicle collisions at intersections. Compared with the traditional NSGA-II algorithm, the DDPG algorithm is more stable and effective at solving the MOP model at intersections, and UVs perform more appropriate and efficient motions under DDPG.
The decision-making of autonomous vehicles is influenced by human-vehicle-road (environmental) factors. Owing to the length limits of this article, the impacts of pedestrians, non-motorized vehicles, road structure types, and traffic flow density on decision-making were not considered in this study. In the future, the impacts of these factors will be studied and discussed. The interactions between people and vehicles will be considered to further improve the decision-making model of driving behavior under real road conditions.

Data Availability
The data used to support the findings of this study are provided in the Supplementary Materials. The data collected from the subgrade sensors and the retrofit autonomous vehicle were used as the training and testing samples of the trajectory prediction model, divided into three categories: left-turn vehicles, right-turn vehicles, and straight vehicles. Each category includes vehicle information such as location, speed, and acceleration. (Supplementary Materials)