A Cooperative Q-Learning Path Planning Algorithm for Origin-Destination Pairs in Urban Road Networks

As an important part of intelligent transportation systems, path planning algorithms have been studied extensively in the literature. Most existing studies focus on global optimization, that is, on finding the single optimal path between an Origin-Destination (OD) pair. In urban road networks, however, the optimal path may not always be available when unexpected events occur along it. A more practical approach is therefore to calculate several suboptimal paths instead of only one optimal path. In this paper, a cooperative Q-learning path planning algorithm is proposed to seek a suboptimal multipath set for OD pairs in urban road networks. The road network is first abstracted into a form to which Q-learning can be applied. Then a fuzzy-neural-network prediction algorithm is combined with Q-learning to find suboptimal paths under reliability constraints. Simulation results are provided to show the effectiveness of the proposed algorithm.


Introduction
Recent years have seen a growing interest in the study of route-guidance systems in intelligent transportation systems, due to their advantages in reducing traffic congestion and CO2 emissions, minimizing travel time, and conserving energy [1]. More and more vehicle manufacturers have installed route-guidance systems in their products to assist drivers' travel.
As an essential part of the route-guidance system, path planning is usually modeled as the shortest-path problem in graph theory [2][3][4][5][6][7][8]. When a vehicle departs from its origin and travels to its destination, the map it travels on can be abstracted as a graph by treating streets as edges and intersections as nodes. The weight of an edge represents the average travel time over the street, which may change dynamically as traffic flows fluctuate. For the graph of a static network, the most efficient one-to-one shortest path algorithm is Dijkstra's algorithm [2]. When a dynamic graph is considered, the A* algorithm may be a better choice for the Origin-Destination shortest path problem [3]. The A* algorithm estimates the minimum distance between the destination and a node to determine whether the node lies on the optimal route. However, even if the generated route is the shortest one, it may not always be available because of traffic emergencies such as sudden accidents. So it may be more practical to provide a number of candidate paths rather than just one optimal path. Lee revealed that finding multiple paths instead of one is a good way to avoid the path overload phenomenon [4]; a single recommended optimal path can even accelerate the deterioration of the road network once overload occurs.
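As a concrete reference point for the baseline just mentioned, a one-to-one Dijkstra search can be sketched as follows (a minimal Python sketch; the adjacency-dict encoding of the graph as `node -> [(neighbor, weight), ...]` is our own illustrative choice):

```python
import heapq

def dijkstra(graph, origin, destination):
    """One-to-one shortest path on a weighted digraph.
    graph: dict mapping node -> list of (neighbor, weight) pairs."""
    dist = {origin: 0.0}
    prev = {}
    heap = [(0.0, origin)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == destination:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Reconstruct the path by walking predecessors back from the destination.
    path, node = [], destination
    while node != origin:
        path.append(node)
        node = prev[node]
    path.append(origin)
    return dist[destination], path[::-1]
```

An A* variant would additionally add a distance-to-destination heuristic to the priority key, which is exactly the estimate described above.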
Traditionally, alternative paths can be calculated by two categories of algorithms in graph theory, namely, the k-shortest path algorithms proposed by Eppstein [5] and Jiménez and Marzal [6], and the totally disjoint path algorithms proposed by Dinic [7] and Torrieri [8]. These so-called alternative path planning methods typically find the optimal path using Dijkstra's algorithm first. Then the candidate path set is generated by applying link weight increment methods. These algorithms seek the next suboptimal path iteratively until the generated alternative path satisfies some given constraints.
However, the way these algorithms generate alternative paths unavoidably lengthens the response time, especially when the network is large and the traffic load is heavy and time varying. These algorithms need to adjust the link weights of the generated optimal path and then recalculate the suboptimal paths using Dijkstra's algorithm repeatedly, leading to a heavy computation burden. In addition, these algorithms generally concern the path planning of just one vehicle, while it is essential to consider all vehicles' paths simultaneously in practical city road networks.
With the development of intelligent science, some researchers have focused on path planning using reinforcement learning in guidance systems. Reinforcement learning is a category of machine learning algorithms in which a group of agents decide how to behave according to their interaction with the environment so as to achieve an optimal objective [9]. Recently, multiagent reinforcement learning has been proposed to find the best and shortest path between the origin and the destination. Some studies treat each intersection as one agent, which requires a large amount of information interaction between traffic intersections to find the optimal path [10], while more studies cast each intersection as a state and each link as an action, which handles the road network as a whole [11,12]. Our proposed Q-learning therefore adopts the latter approach, treating intersections as states in the model.
With Q-learning, the computational complexity of the path planning algorithm can be reduced significantly and the efficiency improved. While most existing learning algorithms in the literature are designed to solve optimal path planning for just one OD pair, the Q-learning algorithm proposed in this paper seeks multiple paths for different OD pairs simultaneously. By choosing the suboptimum Q-value at every intersection, it is convenient to provide several alternative paths rather than seeking every alternative route incrementally. This paper makes the following contributions in particular.
First, the multipath set is found for different OD pairs simultaneously using Q-learning. Compared with other multipath algorithms, the proposed algorithm significantly reduces the computational complexity.
Second, reliability constraints are introduced to choose suboptimal paths in Q-learning. It would not be appropriate to enlarge the multipath set without considering the overall reliability, which ensures that at least one alternative path is available at all times [13].
Third, FNN prediction is combined with Q-learning. To improve real-time capability, short-term traffic prediction is essential [14,15]. This paper adopts an FNN prediction mechanism in the Q-learning scheme to predict the traffic condition, with which the reward of an action can be computed in advance.
Fourth, a multiagent cooperative mechanism is applied to path planning. The cooperative mechanism introduced into Q-learning coordinates the actions and strategies among agents with different OD pairs for long-term benefits.
In this paper, we propose a new multiagent reinforcement learning (MARL) algorithm, using Q-learning with prediction, for multipath planning for OD pairs in road navigation systems. Compared with traditional multipath algorithms, it reduces computational complexity and improves the efficiency of vehicle guidance through traffic prediction. The scheme can improve the overall performance of urban traffic networks and balance the traffic flow.

Model of Road Networks
2.1. Graph Abstraction of Road Networks. For urban areas, two important elements of traffic guidance are intersections and roads. During modeling, an intersection is treated as a node and a road as an edge connecting two nodes. The weight on an edge stands for the traffic condition of the road, and the arrows indicate the allowed directions of travel for vehicles. With this abstraction, a graph G = (V, E), with a nonempty finite set of intersections (nodes) V = {v_1, v_2, ..., v_n} and a set of roads E ⊆ V × V, can be used to describe the road map. Once we have the model and the routing algorithm, we can find the required optimal route.
For instance, consider the Eastern Town of Changsha in China, whose map is shown in Figure 1. The abstract graph model of Figure 1 is shown in Figure 2. Each v_i stands for an intersection and is taken as one state in reinforcement learning. Each v_i has three or four directions to neighboring intersections, including a loop direction that returns to v_i itself. For example, if a vehicle at intersection v_1 drives west, it returns to v_1. This setting makes it convenient to model complex road networks.
The weight of each direction contains two elements: the traffic condition (t) and the distance from the destination (d). These two elements are illustrated in the next section.
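Under these conventions, a small fragment of such a graph could be encoded as follows (a hypothetical grid fragment; the node names, directions, and weight values are illustrative only):

```python
# Nodes are intersections; each outgoing direction maps to a
# (next_intersection, (mean_travel_time, distance_to_destination)) pair.
# A loop edge models a direction with no outgoing road, as described above.
road_network = {
    "v1": {"east": ("v2", (5.0, 2.0)),
           "south": ("v3", (3.0, 1.5)),
           "west": ("v1", (0.0, 0.0))},   # loop: driving west returns to v1
    "v2": {"west": ("v1", (5.0, 2.5)),
           "south": ("v4", (4.0, 0.8))},
    "v3": {"east": ("v4", (2.0, 0.6)),
           "north": ("v1", (3.0, 2.2))},
    "v4": {},  # destination: no outgoing actions needed
}
```

Each directed edge thus carries both weight elements used later as rewards.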

Model Using Reinforcement Learning.
To present the model more clearly, it is necessary to provide background on reinforcement learning (RL). Reinforcement learning is a class of machine learning algorithms in which agents select the best actions to maximize the cumulative reward by interacting with the environment. The RL agent interacts with its environment over a sequence of discrete time steps to pick out the optimal actions. The agent in this paper is a processing center that handles the path planning from one origin to one destination.
The underlying concept of RL is the finite Markov Decision Process (MDP), defined by the tuple ⟨S, A, P, R⟩, where S is a finite set of environment states, A is a finite set of agent actions, P : S × A × S → [0, 1] is the state transition probability function, and R : S × A × S → ℝ is the reward function. The MDP models an agent acting in an environment in which it learns (through prior experience and short-term rewards) the best control policy, a mapping of states to actions, that maximizes the expected discounted long-term reward. This mapping can be stochastic, π : S × A → [0, 1], or deterministic, π : S × A → {0, 1}.
For deterministic state transition models, the transition probability function reduces to P : S × A × S → {0, 1}, and, as a result, the reward is completely determined by the current state and the action; that is, R : S × A → ℝ. The value of a state-action pair is called the Q-value, and the function that determines it is called the Q-function. An agent can find the optimal control policy by iteratively approximating the Q-values using prior estimates, the short-term reward r = R(s, a) ∈ ℝ, and the discounted future reward. This model-free successive approximation technique is called Q-learning. One way to balance exploration and exploitation is the ε-greedy approach, in which a random action is performed with probability ε and the current knowledge is exploited with probability 1 − ε.
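The ε-greedy rule just described can be sketched as follows (Q stored as a dict keyed by (state, action); the names are illustrative):

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps explore (random action); otherwise exploit
    the action with the largest current Q-value at this state."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

With eps=0 the rule is purely greedy; with eps=1 it is purely exploratory.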
This paper denotes the link from intersection i to intersection j by the index pair ij. Accordingly, the reward of link ij is denoted r_ij, the mean travel time of link ij is t_ij, and the distance between intersection i and intersection j is d_ij. To cast the path planning problem of the road network as an RL problem, we identify an individual agent's states (s_i), available actions (a_ij), and rewards (r_ij).
First, the two weights of each direction should be described. The traffic condition can be described by the mean travel time t_ij [13, 16-19]. The distance from the destination, which indicates how close a state is to the destination, is required by the reinforcement learning formulation.
The mean travel time t_ij can be obtained from probe vehicles such as taxis. Equipped with GPS sensors, probe vehicles collect data on position, speed, and direction, store them, and send reports at regular intervals. By analyzing these data, the mean travel time of each link can be calculated. For instance, Hellinga and Fu advanced a method of probe-based arterial link travel time estimation [16]. Tomio et al. used probe vehicle data to identify routes and predict travel times [17]. The distance d_ij stands for the Euclidean distance between intersection i and intersection j.
States. The states form the basis for making choices. We model each intersection as one state s_i, which conveniently indicates the location of vehicles.
Actions. Actions are the choices made by the agent. In each state, a vehicle has four actions a_ij: turning left, turning right, going straight, and turning around; that is, a_ij ∈ {up, down, right, left}.
Rewards. Rewards are the basis for evaluating choices. We model the weight of each link as its reward, which contains two elements: the reward of the mean travel time (r^t_ij) and the reward of the distance from the destination (r^d_ij). For reinforcement learning, everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable and may or may not be completely known. A policy is a stochastic rule by which the agent selects actions as a function of states. The agent's objective is to maximize the amount of reward it receives over time. The return R_t is the function of future rewards that the agent seeks to maximize:

R_t = r_{t+1} + γ r_{t+2} + ⋯ + γ^{N−1} r_{t+N} = Σ_{k=0}^{N−1} γ^k r_{t+k+1},

where γ (0 < γ < 1) is called the discount rate and a whole path has N links.
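The return above can be computed directly from a path's per-link rewards (a small helper, assuming the rewards are listed in travel order):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum over k of gamma^k * r_{t+k+1}, for the N link rewards
    collected along one path."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```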

Cooperative Multipath Planning for OD Pairs
In this section, we propose a cooperative multipath planning method for OD pairs. In the proposed method, an agent is a processing center that handles the path planning from one origin to one destination. In practical applications, many paths are typically planned synchronously. By introducing the concept of multiagent systems, the path planning problem can be modeled in a form to which reinforcement learning is applicable. We then introduce a multiagent reinforcement learning mechanism to optimize the Q-values of the different paths.

Reward of Traffic Flow Using FNN Prediction

The travel time of each link ij is defined as t_ij. For a given OD pair (v_o, v_d), a set of binary variables x_ij is used to represent the selection of links on a path (i.e., a path solution). Thus, for a given path p, its travel time T_p can be calculated by [20]

T_p = Σ_{ij} x_ij t_ij.

After obtaining the historical travel time data t_ij of each link ij, we can use a prediction algorithm to compute future values and thereby improve the real-time characteristics of the guidance system. In this paper, a T-S FNN (fuzzy neural network) is introduced to predict the future travel time. The T-S FNN is a highly adaptive fuzzy system that can automatically update the membership functions of its fuzzy subsets [13, 21]. The FNN is defined by if-then rules of the following form. For an input x = [x_1, x_2, ..., x_k], the fuzzy reasoning is

R^i: if x_1 is A^i_1 and ... and x_k is A^i_k, then y_i = p^i_0 + p^i_1 x_1 + ⋯ + p^i_k x_k,

where A^i_j is a fuzzy set of the fuzzy system and p^i_j are its parameters (i = 1, 2, ..., m). y_i is the predictive output based on the ith fuzzy rule. The input part is fuzzy, while the output part is deterministic: a linear combination of the inputs.
Suppose that the input vector is x = [x_1, x_2, ..., x_k]. The membership degree of each input variable x_j can then be computed by the fuzzy rules

μ_{A^i_j}(x_j) = exp(−(x_j − c^i_j)² / b^i_j),  j = 1, 2, ..., k; i = 1, 2, ..., m, (4)

where c^i_j and b^i_j are the center and width of the membership function, respectively, k is the number of inputs, and m is the number of fuzzy subsets.
All membership degrees are combined by the fuzzy product operator

w^i = μ_{A^i_1}(x_1) μ_{A^i_2}(x_2) ⋯ μ_{A^i_k}(x_k),  i = 1, 2, ..., m. (5)

The output of the fuzzy model is then computed from these results:

y = Σ_{i=1}^{m} w^i y_i / Σ_{i=1}^{m} w^i. (6)

The FNN is divided into four layers: an input layer, a fuzzification layer, a fuzzy-rule calculation layer, and an output layer. The input layer is connected with the input vector x, so its number of nodes equals the dimension of the input vector. The fuzzification layer obtains the fuzzy membership values μ using the membership functions in (4). The fuzzy-rule calculation layer obtains w^i by the product in (5). The output layer uses (6) to calculate the network output.
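Equations (4)-(6) amount to the following forward pass (a sketch with Gaussian memberships and T-S linear consequents; the parameter layout, with the bias first in each consequent row, is our own convention):

```python
import math

def tsfnn_forward(x, centers, widths, p):
    """Forward pass of a T-S fuzzy neural network.
    x: input vector of length k; centers/widths: m-by-k Gaussian
    membership parameters; p: m-by-(k+1) consequent parameters
    (bias first).  Returns the firing-strength-weighted average of
    the per-rule linear outputs."""
    m = len(centers)
    firing = []
    for i in range(m):
        # Rule firing strength: product of Gaussian membership degrees.
        w = 1.0
        for j, xj in enumerate(x):
            w *= math.exp(-((xj - centers[i][j]) ** 2) / widths[i][j])
        firing.append(w)
    total = sum(firing)
    # Per-rule output: linear combination of the inputs (T-S consequent).
    y = sum(firing[i] * (p[i][0] + sum(p[i][j + 1] * x[j]
                                       for j in range(len(x))))
            for i in range(m))
    return y / total
```

With a single rule the normalization cancels and the output is just the rule's linear consequent evaluated at x.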
The parameters of the FNN are updated by the following equations. The error e between the desired output and the actual output is defined as

e = (y_d − y_c)² / 2, (7)

where y_d is the desired output and y_c is the actual output, that is, the predictive mean travel time.
The parameters p^i_j of the FNN are updated by

p^i_j ← p^i_j − α ∂e/∂p^i_j = p^i_j − α (y_c − y_d) x_j w^i / Σ_{i=1}^{m} w^i, (8)

where α is the learning rate, x_j is the jth input, and w^i is the firing strength computed by (5).
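The gradient step (8) can be sketched as follows (assuming the squared-error loss of (7) and normalized firing strengths; the variable names are ours):

```python
def update_consequents(p, firing, x, y_pred, y_true, lr=0.01):
    """One gradient-descent step on the T-S consequent parameters p
    under e = (y_true - y_pred)^2 / 2.  For the normalized output
    y = sum_i w_i y_i / sum_i w_i, the gradient w.r.t. p_j^i is
    (y_pred - y_true) * (w_i / sum w) * x_j, with x_0 = 1 for the bias."""
    total = sum(firing)
    err = y_pred - y_true
    for i in range(len(p)):
        g = err * firing[i] / total
        p[i][0] -= lr * g                 # bias term (x_0 = 1)
        for j, xj in enumerate(x):
            p[i][j + 1] -= lr * g * xj    # input-weight terms
    return p
```

The centers and widths of (9) and (10) would be updated by analogous gradient steps through the Gaussian memberships.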
The center c^i_j and the width b^i_j of the membership functions are updated analogously by (9) and (10), respectively.

When finding multiple paths using Q-learning, it is important to determine how every action is assessed. A link can be simply described as unblocked, normal, or busy. To simplify the learning process, the precise flow density is neglected, because finding a multipath set is the final goal; thus links with very small differences can be regarded as equally good choices. This paper uses discrete weights for the traffic condition and the distance from the destination to simplify the iterative calculation of the Q-learning reward. To discretize the mean travel time t_ij, we first take the maximum travel time at the current intersection as grade "−1"; then we compute the difference between the current maximum and minimum travel times. The travel time can thus be graded into "0," "1," "2," and "3" as the reward r^t_ij. The distance from the destination (r^d_ij) is graded into two levels, "1" and "0": the neighboring intersections nearest to the destination are graded "1," while the others are "0." The overall reward r_ij is then

r_ij = β r^t_ij + (1 − β) r^d_ij,

where β (0 ≤ β ≤ 1) is the scaling factor between r^t_ij and r^d_ij.
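The discretized reward can be sketched as follows (the even four-way split of the travel-time range into grades is our own reading of the grading scheme described above):

```python
def grade_travel_time(t, t_min, t_max):
    """Map a mean travel time onto grades 0..3 (shorter -> higher grade)
    by splitting [t_min, t_max] into four even bands."""
    if t_max == t_min:
        return 3
    frac = (t_max - t) / (t_max - t_min)   # 1.0 for the fastest link
    return min(3, int(frac * 4))

def link_reward(r_time, r_dist, beta=0.5):
    """Composite reward r_ij = beta * r_t + (1 - beta) * r_d."""
    return beta * r_time + (1.0 - beta) * r_dist
```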

Cooperative Multiagent Multipath Planning Algorithm.
If each agent acts independently, without cooperation, the learning procedure at node i can be written as

Q(s, a) ← Q(s, a) + α_k [r + γ max_{a'} Q(s', a') − Q(s, a)],

where α_k ∈ (0, 1] is the learning factor and γ ∈ [0, 1) is the discount factor. RL has been well developed for discrete-time systems to solve optimal problems online, using adaptive learning techniques to determine the optimal value function.
An iterative solution technique is given in Algorithm 1.
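A minimal tabular version of such an iteration (one agent, ε-greedy exploration; the tiny example graph used in the test is hypothetical) might look like:

```python
import random

def train_q(graph, destination, episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning for one agent.
    graph: dict state -> dict action -> (next_state, reward)."""
    Q = {}
    starts = [s for s in graph if graph[s]]   # start anywhere with actions
    for _ in range(episodes):
        s = random.choice(starts)
        while s != destination:
            actions = list(graph[s])
            if random.random() < eps:
                a = random.choice(actions)          # explore
            else:
                a = max(actions, key=lambda x: Q.get((s, x), 0.0))  # exploit
            s2, r = graph[s][a]
            # Bootstrap on the best next action (zero at the destination).
            best = max((Q.get((s2, x), 0.0) for x in graph.get(s2, {})),
                       default=0.0)
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * (r + gamma * best - old)
            s = s2
    return Q
```

After training, reading off the best action at each state yields the planned path, and the per-state Q-values are exactly the table used later for multipath selection.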
An agent is a processing center that handles the path planning from one origin to one destination. The algorithm above focuses on a single agent, while path planning agents with different destinations noticeably affect one another. Thus we propose a cooperative multiagent reinforcement learning (MARL) multipath planning method, in which the Q-values of the different path plans at every intersection are considered jointly, and the maximum value is chosen so that every path plan is optimized with the others taken into account. It is worth mentioning that, in the proposed algorithm, the decision-making process is assumed to be ideal, so the waiting time at intersections is ignored.
In this approach, the Q-value estimated at each autonomous agent is updated based on its individual rewards as well as on information obtained from other agents in its neighborhood. "The neighborhood" here refers to a group of agents with different destinations. Every agent exchanges the largest Q-value associated with its current state with every other agent in its neighborhood. The value iteration procedure at agent i for the state-action pair (s, a) can be summarized as

Q_i(s, a) ← Q_i(s, a) + α [r_i + γ Σ_{j ∈ N_i} w(i, j) max_{a'} Q_j(s', a') − Q_i(s, a)],

where w(i, j) is a weight reflecting the effect of agent j on agent i and N_i is the set of neighboring agents of i.
The simplest strategy for computing the weights w(i, j) is to consider only the total number of agents in the neighborhood; that is, w(i, j) = 1/|N_i|, in which case Σ_{j ∈ N_i} w(i, j) = 1.
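With uniform weights, one cooperative backup can be sketched as follows (Q_tables maps an agent id to that agent's Q dict; the names are illustrative):

```python
def cooperative_q_update(Q_tables, i, s, a, r, s_next, next_actions,
                         neighbors, alpha=0.5, gamma=0.9):
    """Cooperative backup for agent i: the bootstrap term is the
    neighborhood average, with uniform weights w(i, j) = 1/|N_i|,
    of each neighbor's best next-state Q-value."""
    w = 1.0 / len(neighbors)
    coop = sum(w * max((Q_tables[j].get((s_next, a2), 0.0)
                        for a2 in next_actions), default=0.0)
               for j in neighbors)
    old = Q_tables[i].get((s, a), 0.0)
    Q_tables[i][(s, a)] = old + alpha * (r + gamma * coop - old)
```

Non-uniform weights would simply replace `w` with a per-neighbor value summing to one.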
It is possible to adopt more complex strategies to take into account the different effects of the different neighbors.
When the additional information obtained from other agents is incorporated into the value iteration procedure, each agent can ensure that its strategies are decided based on all its neighbors' actions.

Constraint Conditions of Multipath Set.
Using the policy iteration algorithm in Section 3.2, we can derive the Q-value table for multipath planning. By comparing the four Q-values of an intersection, we can easily find which action is best.
The optimal path is then easily obtained. Although the obtained optimal path is the fastest one, it may encounter traffic emergencies such as a sudden accident, making the planned optimal path unavailable to the running vehicle. So it is essential to provide several candidate paths rather than just one path, thereby avoiding the deterioration of the road network environment when one guided path becomes overly popular.
In most cases, several actions may form a best-action set, and the multipath set can be found by choosing among these best actions. However, when the road network is large, it is difficult to find multiple paths this way, because in most cases only one optimal action exists at each intersection. So we should find suboptimum actions that satisfy the following constraints.
First, we introduce Q̄_i as the average Q-value of intersection i:

Q̄_i = (1/|A_i|) Σ_{a ∈ A_i} Q(s_i, a).

Then we compute the average difference between each Q(s, a) and the average Q-value of that intersection; averaging over all states gives the value D:

D = (1/|S|) Σ_{s_i ∈ S} (1/|A_i|) Σ_{a ∈ A_i} |Q(s_i, a) − Q̄_i|.

When a vehicle arrives at an intersection, it makes its choice of direction by computing the difference between each Q(s, a) and the average Q-value of that intersection:

ΔQ(s_i, a) = Q(s_i, a) − Q̄_i.

Once the Q-value table has been calculated, we can select a value η (η ∈ (0, 1)) and compute the corresponding threshold ηD; actions whose Q-values lie within this threshold of the best action are accepted as candidates. So we must solve the problem of how to choose η. When η is closer to 1, the threshold ηD is larger and more paths can be taken as candidates. When η is closer to 0, ηD is smaller, resulting in fewer candidate paths. Since a larger candidate set is sometimes unstable, the reliability of the path set P should be taken into account.
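One way to realize this selection rule is the following sketch (our own reading of the rule: keep every action whose Q-value lies within η·D of the best one at the intersection):

```python
def mean_abs_deviation(Q, state, actions):
    """Per-intersection mean absolute deviation of the Q-values
    (averaging these over all states gives the global D)."""
    vals = [Q.get((state, a), 0.0) for a in actions]
    avg = sum(vals) / len(vals)
    return sum(abs(v - avg) for v in vals) / len(vals)

def candidate_actions(Q, state, actions, eta, D):
    """Return the actions at `state` whose Q-values are within eta * D
    of the best action's Q-value (larger eta -> more candidates)."""
    vals = {a: Q.get((state, a), 0.0) for a in actions}
    best = max(vals.values())
    return [a for a in actions if vals[a] >= best - eta * D]
```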
The reliability of a path can be defined as the probability of not encountering an abnormal delay during a trip along the path, which can be estimated from the reliability of the series of links composing the path. Under dynamic conditions, some or all candidate paths may fail together, resulting in a joint failure. In this situation, the reliability of the path set defined below will be weakened. Thus, in the calculation of the candidate paths, it is important to reduce the chance of joint failure of candidate paths [18, 19]:

Φ = 1 − Π_{i=1}^{m} [1 − Π_{k=1}^{n_1} ρ_{k1} Π_{k=1}^{n_2} (1 − ρ_{k2})],

where Φ stands for the reliability of the path set P, m is the number of disjoint subpaths in the subpath set P, l_{k1} is the kth link in a normal state on the ith disjoint subpath, ρ_{k1} is the reliability of that link, n_1 is the number of links in a normal state on the ith disjoint subpath, l_{k2} is the kth failed link on the ith disjoint subpath, ρ_{k2} is the reliability of that failed link, and n_2 is the number of links in a failed state on the ith disjoint subpath. Generally, the higher the reliability of the candidate path set, the less the chance that all candidate routes will be unacceptable during one trip. Given a value of Φ (such as 0.9), we can choose η. We then obtain a stable set of multiple paths.
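For link-disjoint candidate paths, the at-least-one-path-survives reliability reduces to the following sketch (a simplified version that omits the per-configuration failed-link terms of the full formula):

```python
def path_reliability(link_reliabilities):
    """Reliability of a single path: product of its link reliabilities."""
    r = 1.0
    for x in link_reliabilities:
        r *= x
    return r

def path_set_reliability(paths):
    """Reliability Phi of a set of link-disjoint candidate paths: the
    probability that at least one path has no failed link."""
    fail_all = 1.0
    for links in paths:
        fail_all *= 1.0 - path_reliability(links)
    return 1.0 - fail_all
```

Starting from a small η and enlarging it until `path_set_reliability` exceeds the target (e.g., 0.9) matches the selection procedure described above.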

Simulation Results and Analysis
To test the proposed method in different traffic environments, simulations have been conducted on randomly generated grid road networks. The simulation scenarios contain 6 to 25 nodes with 10 to 50 edges. The general network discussed in this paper is shown in Figure 1. In the graph, a node is represented by a box with the node's number shown in it; the start and destination nodes are represented by rectangles with thicker outlines. The mean travel time and the distance from the destination of all links are set by a random matrix.

Path Planning for Single Agent.
The scenario is shown in Figure 1. The single agent computes the path towards only one destination; the results are shown in Figure 3. Given the matrix of Q-values, we can quickly obtain the optimal multipath set. Compared with other multipath algorithms, our proposed algorithm uses the information of the whole road network only once.

The Comparison of Several Multipath Algorithms

This simulation is made up of 30 nodes with 49 edges, as shown in Figure 8.
Each circle stands for one intersection v_i. Each edge has three parameters: the mean travel time (3 or 7), the grade of the mean travel time (−1 or −3), and the reliability of the edge (0.98 or 0.87).
Under the same simulation environment, we obtain the results in Table 6 for Dijkstra's algorithm, k-shortest path planning, and Q-learning multipath planning. From the results, we can draw the following conclusions. Dijkstra's algorithm gives the shortest path in the least time, but the reliability of this path is lower because the reliability of each edge is less than 1. The last two algorithms perform almost the same: the mean cost of the four paths found by k-shortest path planning is less than that of Q-learning multipath planning, while the reliability of Q-learning multipath planning is superior to that of k-shortest path planning.
Next, the last two algorithms are compared on planning time and path reliability. Figure 9 shows that the planning time of k-shortest path planning increases linearly with the network size, while that of Q-learning multipath planning increases logarithmically. In addition, the reliability of Q-learning multipath planning remains above 0.9, whereas the reliability of k-shortest path planning is below 0.8.

Conclusion
This paper proposes a new Q-learning algorithm to solve the multipath planning problem for OD pairs in urban road networks. Different from traditional multipath algorithms, this paper focuses on multipath planning via FNN-based Q-learning, an algorithm that makes it easier to choose alternative paths by resorting to suboptimum Q-values. Furthermore, the paper uses fuzzy logic to increase the response speed and imposes constraint conditions on path generation to ensure the reliability of the set of candidate paths, which has rarely been taken into account in existing works. Simulation results validate the efficiency and adaptability of the proposed algorithm. In future work, we will further consider the waiting time at intersections in the algorithm.

Figure 9 :
Figure 9: Planning time versus the number of network nodes for the different methods.

Table 1 :
Q-value table of the 4 × 4 road network with destination Intersection 11.

Table 2 :
Q-value table of the 5 × 5 road network with destination Intersection 17.

Table 3 :
Q-value table of the 5 × 5 road network with destination Intersection 14.

Table 4 :
Cooperative Q-value table of the 5 × 5 road network with destination Intersection 17.

Table 5 :
Cooperative Q-value table of the 5 × 5 road network with destination Intersection 14.

Table 6 :
The comparison results of different algorithms.