Refined Path Planning for Emergency Rescue Vehicles on Congested Urban Arterial Roads via Reinforcement Learning Approach

School of Electronic and Control Engineering, Chang'an University, Xi'an 710064, China
School of Intelligent Systems Engineering, Sun Yat-Sen University, Guangzhou 510275, China
Guangdong Province Key Laboratory of Fire Science and Technology, Guangzhou 510006, China
School of Information Engineering, Chang'an University, Xi'an 710064, China
China Communications Information Technology Group Co., Ltd, Beijing 100088, China


Introduction
Urban arterial roads are important components of the urban traffic system; traffic accidents occurring on them have a major impact on the entire urban road network and result in heavy casualties and large economic losses [1][2][3]. Traffic accidents commonly lead to congestion because of the high traffic volume of urban road networks, which greatly changes the capacity of the road network and influences the optimal path for emergency rescue [4][5][6][7].
Thus, as congestion is a typical phenomenon in urban traffic accident rescue, it is of great significance to study path planning for emergency vehicles on congested urban arterial roads. The complex composition of urban arterial roads, their multiple intersections, and their large traffic flow bring many difficulties to urban road vehicle path planning, making it more complicated than route planning on freeways [8][9][10][11][12]. Furthermore, rescue path planning for emergency vehicles differs from the general route planning problem. First, emergency vehicles encounter various road conditions: because of the large traffic volume on urban roads, traffic accidents easily cause congestion, which greatly reduces the capacity of the road upstream of the accident site and makes arrival at the accident site time-consuming. Moreover, many urban arterial roads are composed of a main road and side roads and contain many intersections, which offers a number of options for path planning and deserves a refined plan.
A great number of studies have focused on vehicle path planning problems. Yang et al. proposed a path planning method for emergency vehicles in which the road network is divided into a weighted grid and the rescue path is planned by a vector grid map method [13]. The green ant method was proposed by Jabbarpour et al. to provide path planning with low power consumption for unmanned ground vehicles [14]. Wu et al. used an improved ant colony algorithm for dynamic path planning in congested urban areas, introducing a designed road evaluation factor instead of road distance and combining it with particle swarm optimization [15]. Zhang et al. established a system that plans paths in real time through efficient caching and conducted experiments on a real road network [16]. Karouri et al. proposed an efficient path planning method for large-scale traffic scenarios using the Dijkstra greedy algorithm and the Green Light Optimal Speed Advisory service [17]. Most of the studies mentioned above target large-scale road networks; path planning for refined urban roads still needs improvement, and more attention should be paid to the optimal rescue path under different traffic control schemes on congested roads.
With the development of artificial intelligence, algorithms like the deep Q-network have been widely used for decision making in various practical problems [18][19][20]. The use of reinforcement learning in path planning is increasing and has enabled goal-oriented path planning for various types of vehicles, owing to its strong performance and high applicability to path selection decisions. Liu et al. designed a best-path selection method for different types of intelligent driving vehicles based on a reinforcement learning strategy with prior knowledge [21]. Combining hierarchical reinforcement learning with neural networks, Yu proposed a path planning algorithm for mobile robots and tested it in different scenarios [22]. Chen proposed a path planning scheme using deep reinforcement learning for autonomous vehicles to reduce transport costs and increase traffic efficiency [23]. As a value-based deep reinforcement learning algorithm whose stability and high-quality experience sampling lead to strong performance in strategy optimization [24], the prioritized experience replay deep Q-network (PERDQN) can reliably provide path planning for emergency rescue vehicles in a complex road network. Therefore, we established a refined road network model for congested urban arterial roads and used PERDQN to plan paths for emergency vehicles during traffic accidents, aiming to reach the accident site in the shortest time and with the least loss of road network capacity.

The contents of this paper are organized as follows. Section 2 explains how to build a Markov decision process for urban arterial roads that considers traffic efficiency and the impact on the road network, and describes how rescue paths for emergency vehicles are planned based on the PERDQN algorithm under different traffic control schemes. Section 3 presents the results of the proposed method in a case study of a real urban arterial road.
Section 4 summarizes the contributions of this paper and gives an outlook on our future study.

Path Planning on Congested Urban Arterial Roads Based on PERDQN
The rescue path of an emergency vehicle refers to its driving track from departure to arrival at the scene of the traffic accident; different rescue paths lead to different arrival times and road network impacts. From the perspective of the emergency vehicle, the path planning problem can be regarded as the problem of driving to the accident site in the shortest time and with the least impact on the road network, where the driving decisions the emergency vehicle makes at each road node determine the rescue path it takes. Therefore, a rescue path planning environment for congested urban arterial roads based on the Markov decision process is constructed, and PERDQN is used to plan paths for emergency vehicles on the basis of this environment. Through interacting with the MDP environment, the emergency vehicle learns from experience, improves its path planning capability, and eventually finds the optimal rescue path under various control schemes. Figure 1 shows the framework of rescue path planning for emergency vehicles based on PERDQN.

Establishment of Congested Urban Arterial Road Environment

Urban arterial roads have diverse functions and complex compositions. Most of them consist of main roads and side roads, with many intersections interspersed among them. Moreover, the main roads in opposite directions on an arterial road are commonly separated by continuous road fences, while a number of joint roads connect the main road and side road in the same direction. This provides multiple options for emergency vehicle path planning during accident rescue, especially when the road is congested due to traffic accidents. Thus, it is necessary to establish a model that extracts the key issues of the problem while maintaining the characteristics of the arterial roads.
The node-segment model is a common method that simplifies the path planning problem while preserving the characteristics of the road network [13][14][15]. The model is composed of nodes and road segments and focuses on the joint points of roads where a vehicle can change path: the connections between the main road and side road in the same direction, as well as the intersections that connect opposite roads, are regarded as nodes in the model. The node model is N = {N_ij}, i = 1, 2, ..., m; j = 1, 2, ..., n, where N_ij denotes the j-th node on the i-th road, m is the number of main roads and side roads in the road network, and n denotes the number of joint points on separated roads. The segment model Seg = {Seg_ij}, i = 1, 2, ..., m; j = 1, 2, ..., n − 1, represents the road segments between nodes, where Seg_ij is the road segment between N_ij and N_ij+1, with length Seg_ij = |N_ij+1 − N_ij|. Moreover, different lanes on a single main road or side road are regarded as indistinguishable, since they share the same speed limit and traffic control requirements and therefore the same travel efficiency and road impact in path planning.
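As a minimal illustration, the node-segment model above can be sketched in Python; the road counts, segment lengths, and function names below are illustrative assumptions, not values from the studied road:

```python
# Hypothetical node-segment model: nodes N_ij are (road, joint point) pairs,
# and segments Seg_ij connect consecutive joint points on the same road.
m, n = 4, 6  # assumed: 4 parallel roads (main + side in both directions), 6 joint points each

# Segment lengths in metres; segments[(i, j)] is Seg_ij between N_ij and N_i(j+1).
segments = {(i, j): 300.0 for i in range(m) for j in range(n - 1)}  # placeholder lengths

def segment_length(i, j):
    """Return the length of Seg_ij, i.e., |N_i(j+1) - N_ij|."""
    return segments[(i, j)]
```

In a real model, each segment length would come from the measured road geometry rather than a placeholder.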
On the basis of the established node-segment model, the path planning problem needs to be further adapted into a decision-making problem. As a dynamic, stochastic mathematical framework for sequential decision making, the Markov decision process (MDP) is introduced to establish the decision-making environment for path planning. An MDP consists of a tuple ⟨S, A, T, R⟩. The state S = {s_k, k = 1, 2, ...} represents the position of the emergency vehicle, denoted as a road node in the network; the mapping from node N_ij to state is s_k = Σ_{l=1}^{i−1} n_l + j, where n_l is the number of nodes on the l-th road. The action A = {1, 2, 3, ...} indicates the next road node the emergency vehicle is going to. T is the transition probability of the vehicle moving from the current node to the next node, which also represents the mechanism of the MDP; the factors that affect driving between nodes, including the speed of the vehicle on the road section, the length of the road section, and the capacity of the road segment, are included in the reward shaping. The reward function R(S, A, S′) defines the reward obtained by driving from the current node to the next node. Traffic efficiency and road network impact receive great attention in traffic research [25][26][27]; with the traffic wave theory suggested in [28,29] and the Bureau of Public Roads (BPR) function proposed in [30], the vehicle queuing length (VQL) during traffic accidents and the arrival time in congested road networks can be calculated.
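The node-to-state mapping s_k = Σ_{l=1}^{i−1} n_l + j can be sketched as a small helper; the function name and the example node counts are assumptions for illustration:

```python
def node_to_state(i, j, nodes_per_road):
    """Flatten node N_ij (1-indexed road i, joint point j) into an MDP state
    index by summing the node counts n_l of the roads before road i."""
    return sum(nodes_per_road[:i - 1]) + j

# Example: four roads with six joint points each (assumed values).
nodes_per_road = [6, 6, 6, 6]
```

For instance, the third joint point on the second road maps to state 6 + 3 = 9.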
Further, to let emergency vehicles take the road network impact and travel efficiency into consideration, the reward function is designed as

R_ij = r_terminal, if s_j = s_terminal;
R_ij = −α(α_1 · t_a^ij + α_2 · t_ea^i) − β(β_1 · q_l + β_2 · l_dif^ij), otherwise,

where R_ij denotes the reward the emergency vehicle receives when driving from state s_i to state s_j; t_a^ij is the time taken from s_i to s_j, calculated by the BPR function (equation (5)); t_ea^i denotes the estimated arrival time to the accident site from s_i, which is the sum of the travel times of the sections on the shortest path (the shortest path from s_i to the accident site is generated by Dijkstra's algorithm based on the segment matrix mentioned in the modelling part; Dijkstra's algorithm is widely used for shortest route calculation [31]); and q_l refers to the sum of the expected queue lengths under different control schemes, calculated by traffic wave theory (equation (1)). l_dif^ij = l_j − l_i, where l_j indicates the shortest distance from s_j to the accident point, also calculated by Dijkstra's algorithm. t_a^ij and l_dif^ij capture the difference between nodes and provide local information for the emergency vehicle, while t_ea^i offers the estimated time to the accident site and q_l indicates the impact on the road network under accidents and traffic control schemes. Moreover, to make the time factors t_a^ij, t_ea^i and the distance factors q_l, l_dif^ij share the same magnitude, the time factors are rescaled to the range of the mean road segment length L̄. The coefficients α, α_1, α_2, β, β_1, β_2 control the ratios of the corresponding indicators and are set to empirical values of 0.9, 0.9, 0.1, 0.1, 0.2, and 0.8, respectively. s_terminal is the goal of the route and refers to the location of the accident site; r_terminal is a constant that motivates the vehicle to reach the accident point and is set to 150 in our experiment.
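The reward computation described above can be sketched as follows; the piecewise combination of the weighted time and distance factors is an assumption reconstructed from the text, and the BPR coefficients 0.15 and 4 are the classic defaults rather than values stated in this paper:

```python
def bpr_travel_time(t_free, volume, capacity, a=0.15, b=4.0):
    """BPR travel time t = t0 * (1 + a * (v/c)^b); a and b are assumed defaults."""
    return t_free * (1.0 + a * (volume / capacity) ** b)

def reward(t_a, t_ea, q_l, l_dif, at_accident_site,
           alpha=0.9, a1=0.9, a2=0.1, beta=0.1, b1=0.2, b2=0.8,
           r_terminal=150.0):
    """Hedged sketch of the reward: a weighted penalty over the (already
    rescaled) time factors t_a, t_ea and distance factors q_l, l_dif, plus
    a terminal bonus of 150 at the accident site."""
    if at_accident_site:
        return r_terminal
    return -(alpha * (a1 * t_a + a2 * t_ea) + beta * (b1 * q_l + b2 * l_dif))
```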
In the path planning problem, the emergency vehicle needs to choose its driving behaviour to reach the next location according to its current location; in the MDP, the emergency vehicle selects its driving action according to the driving strategy π = π(a|s) on the basis of the current state s, where π(a|s) indicates the probability of choosing action a in state s. Therefore, the goal of the MDP is to find an optimal strategy that drives the emergency vehicle to the accident site as soon as possible. During the decision-making process, each action should aim at maximizing the long-term return, that is, reaching the accident point in the shortest time and with the smallest road network impact, and the contribution of route selection decisions to reaching the accident point is quantitatively evaluated by the state-action value function

q_π(s, a) = E_π[G_t | s_t = s, a_t = a], with G_t = Σ_{k=0}^∞ γ^k · r_{t+k+1},

where q_π(s, a) is the value of the state-action pair (s, a), E_π denotes the expectation under policy π, G_t denotes the cumulative reward, r_{t+1} refers to the reward at step t + 1, and γ is the reward decay factor. The optimal strategy for emergency vehicles should have the largest state-action value function under any circumstances to ensure that the decisions it makes obtain the largest reward; supposing π* is the optimal policy, then

q_{π*}(s, a) = max_π q_π(s, a).    (3)

Based on the optimal policy, the optimal action at the current state according to π* is selected as

a* = argmax_a q_{π*}(s, a).    (4)
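Choosing the action with the largest state-action value, as described above, amounts to an argmax over estimated Q-values; the function below is an illustrative sketch:

```python
def greedy_action(q_values):
    """Return the index of the action with the largest estimated Q-value,
    i.e., the argmax selection used by the optimal policy."""
    return max(range(len(q_values)), key=lambda a: q_values[a])
```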

Path Planning Based on Prioritized Experience Replay Deep Q-Network

By constructing the MDP environment, we transformed the path planning problem into a decision-optimization problem; to be more specific, it is now the problem of making the emergency vehicle obtain the optimal policy for path planning on the established urban arterial road model. Based on the optimal policy, the emergency vehicle decides where to go according to its current location and reaches the selected position at the next step. Continuing in this way, the emergency vehicle plans its driving route from its current location until it reaches the accident site; by then, its driving trajectory forms the rescue path. In reinforcement learning, the optimal policy is learned from experience of interaction with the MDP and is achieved by estimating the optimal action-value function, because the action-value function is the basis on which the vehicle determines how to drive at the current node. Only when the values of different actions in different states are known can the vehicle choose the action with the largest value as the optimal one.
In PERDQN [24], a deep neural network is used as a nonlinear function approximator to obtain the estimated optimal action-value function, formulated as Q(s, a; θ) ≈ Q_{π*}(s, a), where θ refers to the parameters of the neural network. To be more specific, two neural networks with the same structure, a Q-network and a target network, are constructed. The Q-network produces Q(s, a; θ) and evaluates the current state-action pair; the target network generates the target value y = r + γ · max_{a′} Q(s′, a′; θ⁻), where θ⁻ is the parameter of the target network and s′ and a′ refer to the sampled next state-action pair. The loss of the Q-network is calculated by

L(θ) = E[(y − Q(s, a; θ))²],

which is the difference between the estimated value and the target value, and its gradient ∇_θ L(θ) is used to update the network parameters. Through iterative updates, the Q-network approximates the state-action value and gradually learns the optimal state-action value function Q_{π*}(s, a). By then, the decision made according to equation (4) is the optimal strategy, which yields the optimal path for emergency vehicles.
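The target value and TD error described above can be sketched without a neural network library; the scalar helpers below assume the Q-values are already available as plain lists:

```python
def td_target(r, gamma, next_q_values, done):
    """Target y = r + gamma * max_a' Q(s', a'; theta-) from the target
    network; terminal transitions use the bare reward."""
    if done:
        return r
    return r + gamma * max(next_q_values)

def td_error(q_sa, y):
    """TD error delta = y - Q(s, a; theta); the squared error is the loss
    minimized by the Q-network."""
    return y - q_sa
```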
Moreover, PERDQN introduces the prioritized replay method, a stochastic priority sampling scheme, to improve the performance of experience replay. Experiences with more information have a higher probability of being sampled, and the probability of the t-th experience being sampled is

P(t) = p_t^α / Σ_k p_k^α,

where p_t is the priority of the t-th experience and α is used to control the amplitude of the prioritization. The proportional prioritization variant is calculated by

p_t = |δ_t| + ε,

where δ_t denotes the TD error of the t-th experience and ε is a small positive constant that ensures that the edge case of experiences whose TD error is zero can still be sampled. Moreover, an ε-greedy policy is introduced into PERDQN to ensure exploration and avoid local optima during training. The pseudocode of PERDQN-based path planning is shown in Algorithm 1.
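The proportional prioritization above can be sketched as follows; the helper names and the default exponent are illustrative assumptions:

```python
import random

def sampling_probs(td_errors, alpha=0.6, eps=1e-6):
    """P(t) = p_t^alpha / sum_k p_k^alpha with p_t = |delta_t| + eps."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def sample_index(td_errors, alpha=0.6, eps=1e-6):
    """Draw one experience index according to its priority probability."""
    probs = sampling_probs(td_errors, alpha, eps)
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1
```

Experiences with larger TD error thus receive proportionally larger sampling probability, while eps keeps zero-error experiences sampleable.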

Experiment Setup
East Youyi Road is an east-west urban arterial road located in Xi'an, Shaanxi Province, China. It is a typical urban arterial road with main roads and side roads in both directions, fences between the opposite main roads, and intermittent joint points connecting the main road and side road in the same direction. Emergency vehicles can change between the main road and the side road in the same direction at a joint point, or change to any lane at an intersection. Figure 2 presents the architecture and surrounding environment of East Youyi Road in the section from West Wenyi Road to East Cehui Road.
As shown in Figure 2, two traffic accidents were assumed on the main road of East Youyi Road in the east-west direction. Accident case 1 is set near West Wenyi Road, on the left side of the figure, and accident case 2 is near East Cehui Road, on the right side of the figure; both accidents are placed on the main road in the east-west direction. The nearest fire station to the road section is the Weixing Fire Station, located 1.9 km to the northeast. Therefore, the starting point of the emergency rescue vehicles is set near the intersection of the east-west main road of East Youyi Road and East Cehui Road.
Further, to verify the performance of the proposed method, we assume that two traffic accidents occurred at accident point 1 and accident point 2, respectively, causing upstream congestion and vehicle queuing, and that the on-site emergency disposal durations are both 15 minutes. Additionally, four typical traffic control schemes conducted by traffic police were applied to the case study, so that the optimal route varies for different accident points under different schemes: scheme 1, reverse main road and side road control (RMSC); scheme 2, reverse main road control (RMC); scheme 3, prograde main road control (PMC); and scheme 4, prograde main road and side road control (PMSC), where prograde is regarded as driving from east to west because of the starting point of the emergency vehicles. Furthermore, the traffic police department is the first department informed after the occurrence of a traffic accident and can take responsibility earlier than any other emergency department, guaranteeing that the road segments at key points are controlled before emergency rescue vehicles from other emergency departments arrive at the accident area. Therefore, under traffic control at the junction of East Youyi Road and West Wenyi Road in the west-east direction, emergency rescue vehicles, including fire engines and ambulances, can drive retrograde in the west-east direction of East Youyi Road and reach the accident site faster than by driving through the congested section caused by the accident. The upstream section of the accident site is always under control under any scheme to prevent vehicles from entering the accident area and causing secondary accidents.

(i) Initialization: minibatch size k, step size η, replay period K, memory size N, exponents α and β, budget T
(ii) Initialize experience replay memory H = ∅, Δ = 0, p_1 = 1
(iii) Assign the starting position of the emergency vehicle to the initial state s_0
(iv) Observe s_0 and choose action a_0 ∼ π_θ(s_0)
(v) for t = 1 to T do
(vi) Observe s_t, r_t
(vii) Store transition (s_{t−1}, a_{t−1}, r_t, s_t) in H with maximal priority p_t = max_{i<t} p_i
(viii) if t ≡ 0 mod K then
(ix) for j = 1 to k do
(x) Sample transition j ∼ P(j) = p_j^α / Σ_i p_i^α
(xi) Compute importance-sampling weight w_j and TD error δ_j based on equation (9)
(xii) Update transition priority p_j ← |δ_j|
(xiii) Accumulate weight change Δ ← Δ + w_j · δ_j · ∇_θ Q(s_{j−1}, a_{j−1}; θ)
(xiv) end for
(xv) Update weights in Q-network θ ← θ + η · Δ according to equation (6) and then reset Δ = 0
(xvi) Every K steps copy weights into target network θ_target ← θ
(xvii) end if
(xviii) With probability ε, choose action a_t randomly
(xix) Otherwise, choose action a_t ∼ π_θ(s_t)
(xx) end for

ALGORITHM 1: Rescue path planning for emergency vehicles based on PERDQN.

The traffic flow parameters on urban arterial roads suggested in [32,33] are shown in Table 1. Table 2 demonstrates the traffic density on East Youyi Road from 11:00 to 13:00 under different control measures, obtained by collecting a week of data on the actual road sections combined with detailed consultation with the traffic police of the Beilin Brigade, Traffic Police Division, Xi'an Public Security Bureau.

Path Planning Using Reinforcement Learning Approach
The experiment is conducted on a computer with an Intel i5-8300H 2.30 GHz CPU, 8 GB of memory, and an NVIDIA GTX 1060 GPU. We use the same network architecture and hyperparameter settings in both deep reinforcement learning algorithms: the optimizer is RMSProp with batch size 64, ε is annealed linearly from 1 to 0.0001 with a decrease of 0.00005 at each step, both algorithms run for 150k steps, and a replay buffer of 10k capacity is used in each algorithm. The structure of the neural networks in DQN and PERDQN is also the same: the input layer is consistent with the dimension of the state, the first fully connected hidden layer has 30 units, the second fully connected hidden layer has 15 units, and the dimension of the output layer matches the shape of the action space.
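The network shape described above (input = state dimension, hidden layers of 30 and 15 units, output = action dimension) can be sketched as a plain NumPy forward pass; the random placeholder weights and the ReLU activation are illustrative assumptions, since the activation function is not stated here:

```python
import numpy as np

def build_qnet(state_dim, action_dim, seed=0):
    """Random placeholder weights for a 30-15 fully connected Q-network."""
    rng = np.random.default_rng(seed)
    shapes = [(state_dim, 30), (30, 15), (15, action_dim)]
    return [(rng.standard_normal(s) * 0.1, np.zeros(s[1])) for s in shapes]

def forward(params, x):
    """MLP forward pass producing one Q-value per action (ReLU hidden layers)."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:  # no activation on the output layer
            x = np.maximum(x, 0.0)
    return x
```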
Under the four typical traffic control schemes, PERDQN and DQN are used for rescue path planning; moreover, the rescue route based on the shortest path (SP) method, the most common method in actual rescue path planning, is used for comparison. Figure 2 demonstrates the results of path planning using the PERDQN and SP methods on the refined road schematic diagram of East Youyi Road. Figure 3 shows the node model diagram of the studied road section and the optimal paths planned by the corresponding algorithms. The joint road points, the traffic control places, and the location of the accident site are represented by circles, rectangles, and triangles, respectively. The yellow solid lines indicate the rescue paths planned by the corresponding method under different traffic control schemes. Panels (1a)-(1d) in Figure 3 show the optimal paths for emergency vehicles under the four typical traffic control schemes of case 1 decided by PERDQN, and panel (1e) shows the optimal path selected by the shortest path method in case 1. Panels (2a)-(2d) show the optimal paths determined by PERDQN under the traffic control schemes in case 2, and panel (2e) shows the rescue path of case 2 using the SP method. Figure 3 shows that the path planned by the PERDQN algorithm differs across traffic control schemes while the path planned by SP remains the same, because PERDQN takes into account the disparity of road traffic efficiency and queue length under different schemes. The SP method always chooses the shortest path to the accident site regardless of which traffic scheme is active, so only a single figure in each case is used to demonstrate the rescue path planned by the SP method.
Moreover, the optimal paths of the different methods under the four traffic control schemes are summarized in Table 3; note that L2 denotes P11 in case 1 and P15 in case 2, while L1 is P8, L3 denotes P17, and L4 represents P25 in both cases. The corresponding reinforcement learning training curves for case 1 are shown in Figure 4. Table 3 shows that the optimal paths planned by the PERDQN and SP methods under different control schemes are distinct; the trajectory diagrams of the optimal paths are drawn in Figure 3. Figures 4(a)-4(d) show the training curves of PERDQN and DQN under the four schemes; the horizontal axis is the number of training steps and the vertical axis represents the average reward. In the first 10k steps, the average rewards rarely change because neither algorithm updates its neural network parameters until the replay buffer is first filled. Moreover, prioritized experience replay enhances the sampling effectiveness of experience and enables PERDQN to achieve better decision-making than DQN; it also prevents PERDQN from falling into the local optima that DQN falls into, so the paths planned by PERDQN are better than those planned by DQN. Figure 4(e) compares the training curves of PERDQN under the different schemes. It shows that scheme 2 achieves the highest average reward, which indicates that the optimal rescue path for case 1 planned by PERDQN is P16-P24-P23-P22-P21-P20-P19-P18-P17-P9-P10 using the RMC scheme, that is, to drive retrograde along the opposite main road to reach the accident site.
Using the evaluation metrics suggested in [28], the arrival time and vehicle queue length of the optimal path under different traffic control schemes in case 1 are shown in Figure 5, where the arrival time, dissipation time, and queue length are calculated based on the parameters in Tables 1 and 2. As demonstrated in Figure 5, the dissipation time is far longer than the arrival time because the duration of on-site accident disposal is included in the dissipation. The queue length on the prograde main road is much larger than those at the other three traffic control points because traffic control at L1, L3, and L4 is released as soon as the emergency vehicles reach the accident site, while traffic control at L2 is sustained until on-site accident disposal is accomplished. According to Figure 5(a), the paths planned under schemes 1 and 2 using PERDQN have the same arrival time and dissipation time, so the queue lengths at the four traffic control points under schemes 1 and 2 are compared in Figure 5(b). Through this comparison, it can be found that the optimal path under scheme 2 has the shortest arrival time and, compared with scheme 1, a shorter vehicle queue; by driving the optimal path suggested under scheme 2 by PERDQN, the emergency vehicles can reach the accident site in the shortest time with the least queuing length. Figure 6 shows the training curves of PERDQN and DQN in case 2 under different traffic control schemes, which again show the superiority of PERDQN over DQN and indicate that scheme 4 leads to the optimal rescue path among the four schemes. Figure 7 shows the arrival time, dissipation time of the queue, and vehicle queue length at the four traffic control points. It is clear that following the rescue path planned by PERDQN under scheme 4 leads to the minimum arrival time and the least queue length; the trajectory of the emergency vehicle passing P16-P8-P7-P6-P14 is the optimal path for rescue in case 2.

Conclusions
This paper proposes a refined path planning method for emergency vehicles in congested urban arterial road networks based on a reinforcement learning algorithm. By abstracting the positions of road nodes and the lengths of road segments, an MDP model that describes decision making for path planning is established, and the PERDQN algorithm is introduced to make path planning decisions that account for travel efficiency and the impact on the road network under different traffic control schemes during accidents. Taking traffic efficiency and road network impact into account and paying special attention to the congestion caused by accidents and traffic control schemes, the proposed method is capable of providing an optimal path plan for emergency vehicles to reach the traffic accident site on urban arterial roads with the shortest time and the least road queuing length.
Based on the proposed method, our future work includes extending the current study to urban roads over longer distances, considering path planning with multiple rescue points, and improving the performance of the path planning algorithms.
Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.