Hierarchical Sarsa Learning Based Route Guidance Algorithm

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In modern society, route guidance problems can be found everywhere. Reinforcement learning models are commonly used to solve such problems; in particular, Sarsa learning is suitable for tackling the dynamic route guidance problem. However, the large state space of a digital road network, which is very common given the scale of modern road networks, is a challenge for Sarsa learning. In this study, the hierarchical Sarsa learning based route guidance algorithm (HSLRG) is proposed to guide vehicles in large scale road networks; by decomposing the route guidance task, it reduces the state space of the route guidance system. In this method, the Multilevel Network method is introduced, and a Differential Evolution based clustering method is adopted to optimize the multilevel road network structure. The proposed algorithm was simulated on several road networks of different scales; the experimental results show that, in large scale road networks, the proposed method can greatly enhance the efficiency of the dynamic route guidance system.


Introduction
In recent decades, more and more people have come to own private vehicles, and traffic pressure in cities has increased rapidly. Citizens' quality of life is undermined by daily delays, which are one consequence of traffic congestion. Congestion also aggravates pollution and increases travel cost. The dynamic route guidance method, which can not only provide travel routes but also relieve traffic congestion, has attracted many scholars' attention [1][2][3]. The dynamic route guidance system (DRGS) is an important part of the Intelligent Transportation System (ITS), in which the centrally determined route guidance system (CDRGS) [4] is economically effective and efficient for drivers and can avoid Braess's paradox [5]. CDRGS guides all vehicles for all possible origin destination (OD) pairs with real-time information and considers guidance in terms of the whole traffic system. However, traditional route guidance methods, such as the Dijkstra algorithm [6] and the A* algorithm [7], are not suitable for the dynamic traffic environment [8], because these shortest path algorithms may cause traffic concentration and overreaction when they are adopted to guide large numbers of vehicles. The multiple paths routing algorithm [9] can relieve traffic jams by distributing traffic over different paths and does not depend heavily on real-time data, but its response time may lengthen when it needs to compute new solutions. Reinforcement learning has been widely used in dynamic environments [10][11][12][13], because it can reduce computational time and make full use of real-time information. With these characteristics, reinforcement learning has been used in dynamic route guidance systems. Shanqing et al. [14] applied Sarsa learning to guide vehicles in dynamic environments with the aim of minimizing the route computational time. In our earlier study [15], Sarsa learning was adopted to guide vehicles in CDRGS, with the Boltzmann distribution selected as the action selection method. The results show that, compared with traditional methods, the proposed Sarsa learning based route guidance algorithm (SLRGA) and Sarsa learning with Boltzmann distribution algorithm (SLWBD) can strongly reduce travelling time and relieve traffic congestion.
However, real-world road networks are usually large, so the state set of a reinforcement learning based route guidance system for these networks is huge. It is therefore difficult for such a system to converge in a large scale traffic environment, and solving the route guidance problem in a large scale road network with reinforcement learning remains a challenge. Hierarchical reinforcement learning (HRL) can reduce both the time and the searching space needed for learning and executing the whole task by recursively decomposing larger and more complex tasks into sequentially executed smaller and simpler subtasks [13]. The decomposition strategy is a key point in the hierarchical context [16]; when HRL is used to solve the route guidance problem in large scale road networks, avoiding congestion and reducing vehicles' traveling time can be achieved by an effective decomposition of the route guidance task.
Heng Ding et al. [17] proposed a macroscopic fundamental diagram (MFD) based traffic guidance perimeter control coupled (TGPCC) method to improve the performance of macroscopic traffic networks. They establish a programming function according to the network equilibrium rule of traffic flow amongst multiple MFD subregions, which reduces congestion by effectively assigning the traffic flow amongst different subregions. Accordingly, partitioning the original network and assigning traffic flows among the subnetworks are taken as the objectives of the decomposition strategy when HRL is adopted for solving route guidance problems.
The multilevel approach has been successfully employed in a variety of problems [18], and the Multilevel Network method [19] is introduced here to segment the original network into several subnetworks and generate a higher level network. S. Jung et al. [20] indicated that the optimal route between two nodes on the higher level network is equivalent to that on the original road network. Thus, the Multilevel Network method can be used to perform the route guidance task in a large scale road network: route guidance on the higher level network can be seen as the decomposition of the route guidance task, and this decomposition does not affect the precision of route guidance.
Therefore, Multilevel Network structure based HRL is adopted in this study. Considering the on-line learning characteristic of the Sarsa learning method and its effective performance in solving route guidance problems [15], the hierarchical Sarsa learning based route guidance algorithm (HSLRG) is proposed to guide vehicles along proper routes in large scale road networks. The route guidance task is divided into several smaller route guidance tasks, which are then performed on the corresponding subnetworks. To generate the Multilevel Network structure, traditional clustering methods such as K-means [21] and K-modes [22] have been considered. However, compared with conventional clustering methods, evolution based clustering methods can avoid being trapped in local optima [19]. In addition, evolutionary algorithms can deal with multiobjective problems effectively [23][24][25][26]. In this study, a Differential Evolution [27,28] based clustering method, which can be adopted in complex environments [29], is introduced, and multiobjective functions are designed to optimize the Multilevel Network structure.
The contributions of this work are as follows. Firstly, we propose a novel Multilevel Network structure based dynamic route guidance method. By reducing the state action space with the Multilevel Network structure, the route guidance method can greatly reduce congestion in the road network and notably improve the efficiency of the whole transportation system. Secondly, we provide a Differential Evolution based clustering method to construct the Multilevel Network with multiple objectives. These objectives optimize the structure from the perspectives of both the higher level network and the subnetworks.
This paper includes seven sections. Section 2 introduces the Multilevel Network based route guidance model (MNRGM). Section 3 introduces the Differential Evolution based clustering method. Section 4 proposes HSLRG and describes its main procedure and details. Section 5 introduces the experimental conditions and discusses and analyzes the results. The last parts of this paper are the conclusion and acknowledgement sections.

Multilevel Network Based Route Guidance Model
In this section, MNRGM is introduced. HRL can reduce the searching space, and in this study it is used to decompose vehicle guidance on the original network into guidance on subnetworks. Sarsa learning, which is well suited to dynamic environment problems [30,31], is adopted to guide vehicles in the Multilevel Network. The purposes of this model are as follows:
(i) Reduce the average travelling time of vehicles in the large scale road network.
(ii) Reduce the probability of congestion in the large scale road network.
(iii) Reduce the searching space of reinforcement learning in the large scale road network.
We assume that the real-time travelling information in the Multilevel Network can be collected.
2.1. Multilevel Network Model
The Multilevel Network is constructed by dividing the original network into several subnetworks. An example of a two-level network is shown in Figure 1. The boundary nodes of the subnetworks and the optimal routes between them are the nodes and links of the higher level network.
In this model, the topographical road map is seen as the directed network $G(V, E)$, where $V$ denotes the set of nodes of the road network and $E$ denotes the set of links; i.e., $e_{ij} \in E$ corresponds to the link from node $i$ to node $j$. The cost of a link in this model is measured by the traveling time. $G(V, E)$ is divided into $m$ subnetworks $G_1(V_1, E_1), G_2(V_2, E_2), \ldots, G_m(V_m, E_m)$. In a subnetwork, the nodes fall into two categories: interior nodes and boundary nodes. A node is a boundary node if it belongs to more than one subnetwork; otherwise it is an interior node.

The optimal path on the Multilevel Network is calculated by minimizing the total cost of the links included in the path, where constraints (4) and (5) ensure that the flow conservation rule is observed for the nodes in $V \setminus \{o, d\}$, with $o$ the origin node and $d$ the destination node.
The number of levels is set as 2 in the simulations of this study. We use $G_{high}(V_H, E_H)$ to represent the higher level network, where $V_H$ and $E_H$ are the sets of nodes and links of the higher level network, respectively.
The set of boundary nodes between any two subnetworks $G_i(V_i, E_i)$ and $G_j(V_j, E_j)$ is $V_i \cap V_j$, where $i \neq j$. We use $B(G_i)$ to represent the set of boundary nodes of subnetwork $G_i(V_i, E_i)$. Then
$$B(G_i) = \bigcup_{j \neq i} (V_i \cap V_j).$$
Let $V_H$ represent the set of all boundary nodes:
$$V_H = \bigcup_{i=1}^{m} B(G_i).$$
Links of the higher level network are calculated and generated based on $V_H$. In $G_i(V_i, E_i)$, we use $p_i(u, v)$ to represent the optimal route between any node pair $u$ and $v$ in $B(G_i)$; its cost function $c_i(u, v)$ is defined for routes without any other boundary node on the route. The links of the higher level network, $E_H$, are the optimal routes $p_i(u, v)$ collected over all subnetworks $G_i(V_i, E_i)$.
In order to guide vehicles in this structure, once the OD pairs are determined, the higher level network is extended. The extension of the higher level network is denoted as $G'_{high}(V'_H, E'_H)$, where $V'_H$ is the extension of $V_H$,
$$V'_H = V_H \cup O \cup D,$$
and $E'_H$ is the extension of $E_H$,
$$E'_H = E_H \cup P(O) \cup P(D).$$
Here $P(O)$ denotes the set of routes from origin nodes to the boundary nodes of the corresponding subnetworks, and $P(D)$ denotes the set of routes from boundary nodes to destination nodes in the corresponding subnetworks; $O$ is the set of origin nodes, $D$ is the set of destination nodes, and $G_o$ and $G_d$ are the subnetworks containing origin $o$ and destination $d$.
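The boundary-node definitions above can be illustrated with a minimal Python sketch. It assumes each subnetwork is given simply as a set of node identifiers; the function name `boundary_nodes` and the toy subnetworks `G1`, `G2`, `G3` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def boundary_nodes(subnetworks):
    """Return (boundary set per subnetwork, set of all boundary nodes).

    `subnetworks` maps a subnetwork name to its set of node ids (assumed input format).
    A node is a boundary node if it belongs to more than one subnetwork.
    """
    per_sub = {name: set() for name in subnetworks}
    for (a, nodes_a), (b, nodes_b) in combinations(subnetworks.items(), 2):
        shared = nodes_a & nodes_b  # V_i intersect V_j for i != j
        per_sub[a] |= shared
        per_sub[b] |= shared
    all_boundary = set()
    for nodes in per_sub.values():
        all_boundary |= nodes  # node set of the higher level network
    return per_sub, all_boundary


# Toy example: three subnetworks chained through nodes 4 and 7.
subnetworks = {
    "G1": {1, 2, 3, 4},
    "G2": {4, 5, 6, 7},
    "G3": {7, 8, 9},
}
per_sub, higher_level_nodes = boundary_nodes(subnetworks)
print(per_sub)
print(higher_level_nodes)
```

In this toy case the higher level network has only two nodes, which is the kind of reduction the Multilevel Network structure is meant to provide.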
2.2. Multilevel Network Based Hierarchical Reinforcement Learning
2.2.1. Hierarchical Sarsa Learning
Hierarchical reinforcement learning (HRL) [32] decomposes a reinforcement learning task into a hierarchy of subtasks so that lower-level child tasks can be invoked by higher-level parent tasks to reduce computing time and searching space. In this study, the route guidance tasks are decomposed according to the structure of the Multilevel Network. As shown in Figure 2, the guidance in the higher level network (the selected series of links in the higher level network) determines the subtasks in the subnetworks. Each subtask guides vehicles from a node in a subnetwork to a boundary node or a destination node in that subnetwork. For example, as shown in Figure 3, the vehicle guidance on the original network is decomposed into guidance on three subnetworks:
(i) The vehicle departs from origin node $o$ and arrives at boundary node $b_1$ in subnetwork $G_1$.
(ii) The vehicle departs from boundary node $b_1$ and arrives at boundary node $b_2$ in subnetwork $G_2$.
(iii) The vehicle departs from boundary node $b_2$ and arrives at destination node $d$ in subnetwork $G_3$.
In the hierarchical Sarsa learning model, the agent is the CDRGS in each road network (both the subnetworks and the higher level network), and the purpose of the CDRGS is to guide all the vehicles in the road network and to pursue the optimal travelling time. For each agent, the state is continuous: it consists of the positions and destinations of all the vehicles in the corresponding subnetwork (or higher level network). The continuous state space of any graph $G_i$ is
$$S_c(G_i) = ((p(v_1), d(v_1)), \ldots, (p(v_j), d(v_j)), \ldots) \tag{13}$$
where $G_i$ is the $i$th subnetwork, $v_j \in \mathcal{V}(G_i)$ are the vehicles in $G_i$, $p(v_j)$ is the position of vehicle $v_j$, and $d(v_j)$ is the destination of vehicle $v_j$.
In order to reduce the state space, discrete states, which consist of the node and destination of each vehicle, are adopted. In the original network, the state space is $S_d(G)$; with the Multilevel Network structure, the state space is reduced: each subnetwork has the state space $S_d(G_i)$, and the state space of the higher level network is $S_d(G_{high})$, where
$$S_d(G_i) = ((n(v_1), d(v_1)), \ldots, (n(v_j), d(v_j)), \ldots)$$
Here $G_i$ is the $i$th subnetwork, $v_j \in \mathcal{V}(G_i)$ are the vehicles in subnetwork $G_i$, $n(v_j) \in V_i$ is the nearest node in front of vehicle $v_j$, $V_i$ is the set of nodes in subnetwork $G_i$, and $d(v_j)$ is the destination node of vehicle $v_j$.
The action of each agent is an array composed of the selected next guided link of each vehicle:
$$\mathcal{A}(G_i) = (a(v_1), \ldots, a(v_j), \ldots)$$
where $a(\cdot) \in E_i$ is the guided next link of a vehicle, and $E_i$ is the set of links in subnetwork $G_i$.
According to the action $\mathcal{A}(G_i)$, as shown in Figure 4, vehicles in each network (both the higher level network and the subnetworks) receive their guidance information. The passing time, which is the time spent by each vehicle on the corresponding link, composes the penalty:
$$R(G_i) = (t(v_1), \ldots, t(v_j), \ldots)$$
where $t(v_j)$ is the passing time of vehicle $v_j$ for its guided link $a(v_j)$. A Q-value matrix is used to guide vehicles in each subnetwork and in the higher level network, in which each Q-value represents the estimated optimal traveling time from the corresponding link to the destination. The proposed vehicle guidance method on both network levels is based on Sarsa learning. The Q-values in the matrix are updated with the Sarsa learning rule
$$Q_d(i, j) \leftarrow Q_d(i, j) + \alpha \left( t_{ij} + \gamma\, Q_d(j, k) - Q_d(i, j) \right) \tag{17}$$
where $Q_d(i, j)$ is the estimated optimal traveling time to destination $d$ for a vehicle at node $i$ that selects moving to node $j$; $t_{ij}$ is the latest passing time of link $e_{ij}$; $k$ is the node belonging to $A(j)$ (the set of nodes connected from node $j$) through which vehicles travel to destination $d$ after they have passed link $e_{ij}$; $\alpha$ is the learning rate; and $\gamma$ is the discount rate.
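As a worked illustration of update rule (17), the following Python sketch applies one Sarsa update to a Q-value table stored as a dictionary. The storage layout (keys of the form `(destination, i, j)`), the function name `sarsa_update`, and the numeric values are assumptions made for this example only.

```python
def sarsa_update(q, d, i, j, k, t_ij, alpha=0.1, gamma=1.0):
    """Apply Q_d(i,j) <- Q_d(i,j) + alpha * (t_ij + gamma * Q_d(j,k) - Q_d(i,j)).

    q     : dict mapping (destination, from_node, to_node) -> estimated travel time
    t_ij  : latest observed passing time of link (i, j)
    k     : node actually chosen after j (Sarsa is on-policy)
    """
    old = q[(d, i, j)]
    target = t_ij + gamma * q[(d, j, k)]
    q[(d, i, j)] = old + alpha * (target - old)
    return q[(d, i, j)]


# Hypothetical numbers: link A->B took 30 s, and B->C is estimated at 60 s to reach D.
q = {("D", "A", "B"): 100.0, ("D", "B", "C"): 60.0}
new_value = sarsa_update(q, "D", "A", "B", "C", t_ij=30.0, alpha=0.5, gamma=1.0)
print(new_value)  # 95.0
```

With a learning rate of 0.5 the estimate moves halfway from the old value 100 toward the observed target 90.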
The Boltzmann distribution [33] is adopted as the probability distribution of action selection in this study; it balances the exploration and exploitation of action selection according to the Q-values. In the probability model of action selection (18), $\bar{Q}_d(i)$ is the average Q-value from node $i$ to destination $d$, and the temperature parameter is determined by constants and by $N$, the total number of vehicles in the road network.
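A minimal Python sketch of Boltzmann action selection over estimated travel times follows. It is not the paper's exact formula: it assumes a common form in which each Q-value is scaled by the mean Q-value of the candidate links and by a fixed temperature `tau`, and in which lower estimated travel time yields higher probability. The names `boltzmann_probabilities`, `choose_next_node`, and `tau` are hypothetical.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=0.5):
    """Map {next_node: estimated travel time} to selection probabilities.

    Assumed form: p(j) is proportional to exp(-Q(j) / (mean_Q * tau)).
    """
    mean_q = sum(q_values.values()) / len(q_values)
    scale = mean_q * tau if mean_q > 0 else tau
    q_min = min(q_values.values())  # shift by the minimum for numerical stability
    weights = {j: math.exp(-(q - q_min) / scale) for j, q in q_values.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def choose_next_node(q_values, tau=0.5, rng=random):
    """Sample the next node according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0]


probs = boltzmann_probabilities({"B": 80.0, "C": 100.0, "E": 120.0}, tau=0.2)
print(probs)
print(choose_next_node({"B": 80.0, "C": 100.0, "E": 120.0}, tau=0.2))
```

A smaller `tau` concentrates probability on the link with the lowest estimate (exploitation), while a larger `tau` spreads vehicles over more links (exploration), which is how this kind of selection can spread traffic across routes.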
2.2.2. Optimizing Multilevel Network Structure
In this study, in order to accelerate the convergence of reinforcement learning in the Multilevel Network, the structure of the Multilevel Network should be considered. The state action spaces of both the subnetworks and the higher level network can be optimized with a clustering method. Two objective functions are considered: minimizing the searching space of the subnetworks and minimizing the searching space of the higher level network. The searching space $S(\cdot)$ of a road network is calculated from $n(v)$, the number of links departing from node $v$, where $n(v)$ is taken as 1 if node $v$ has no departing links.

Differential Evolution Based Clustering Method
Ding et al. [17] divided heterogeneous networks into homogeneous subregions with small variances in link densities, such that each subregion has a well-defined MFD shape. The proposed method requires multiple homogeneous subnetworks of similar scale and a virtual higher level network that can effectively assign traffic flows among them. In this section, a Differential Evolution based clustering method is used to generate this Multilevel Network structure offline.
3.1. DE Based Clustering Method
DE [27,28] is a well-known direction based evolution method that can search for the optimal solution effectively in a large scale searching space. In order to construct a proper Multilevel Network structure, diverse individuals should be maintained in the population, and an effective evolution direction is necessary. Thus DE is selected as the clustering method.
In the proposed method, the decoding operator clusters the road network, and after decoding, each gene in the chromosome becomes a subnetwork. In other words, subnetwork $G_i(V_i, E_i)$ is cluster $i$ of the clustering result of the corresponding chromosome.
In order to accelerate the convergence of reinforcement learning in the Multilevel Network, two factors are considered when the Multilevel Networks are constructed. The first is the convergence efficiency of reinforcement learning on each subnetwork. The second is the convergence efficiency of reinforcement learning on the higher level network. Therefore, there are two objective functions: minimizing the state action space of all subnetworks in (23) and minimizing the state action space of the higher level network in (24).
In order to achieve these two objective functions simultaneously, a single fitness function that combines them is used.
3.2. Genetic Representation
When the Multilevel Network structure is constructed by the DE based clustering method, the number of clusters has a strong influence on the number of nodes and links of the higher level network [34], which affects the two objective functions. So, an appropriate number of clusters should be found to optimize the structure of the Multilevel Network.
In this study, in order to obtain a proper number of clusters, two vectors, the coordinate value vector and the available vector, are defined in the chromosome. Each element in the coordinate value vector corresponds to the element in the same position of the available vector. These vectors have a fixed maximum length. The coordinate value vector represents the cluster centroids, and each number in the available vector represents the validity of the corresponding centroid: if the number is bigger than a threshold, the corresponding centroid is valid; otherwise it is invalid.
The decoding procedure is the clustering procedure, in which the Multilevel Network structure is generated from the valid genes.
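The chromosome decoding described above can be sketched in Python as follows. The sketch assumes that nodes have planar coordinates and that each node is assigned to its nearest valid centroid (a K-means style assignment step); the paper's own decoding may differ, and it additionally produces boundary nodes shared between subnetworks, which this sketch does not model. The names `decode`, `availability`, and `threshold` are hypothetical.

```python
import math


def decode(centroids, availability, node_coords, threshold=0.5):
    """Cluster nodes around the centroids whose availability exceeds the threshold.

    centroids    : list of (x, y) candidate cluster centres (coordinate value vector)
    availability : list of numbers, same length as centroids (available vector)
    node_coords  : dict mapping node id -> (x, y)
    Returns a dict mapping cluster index -> list of node ids.
    """
    valid = [c for c, a in zip(centroids, availability) if a > threshold]
    if not valid:
        raise ValueError("chromosome has no valid centroid")
    clusters = {idx: [] for idx in range(len(valid))}
    for node, xy in node_coords.items():
        nearest = min(range(len(valid)), key=lambda idx: math.dist(xy, valid[idx]))
        clusters[nearest].append(node)
    return clusters


# Toy chromosome: the third centroid is switched off by its availability value.
centroids = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
availability = [0.9, 0.7, 0.1]
node_coords = {"a": (1.0, 1.0), "b": (9.0, 8.0), "c": (4.0, 4.0), "d": (6.0, 7.0)}
clusters = decode(centroids, availability, node_coords)
print(clusters)
```

Because validity is encoded per centroid, the number of clusters can change during evolution without changing the chromosome length.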
3.3. Differential Evolution
The DE operator for any individual $x_i$ is
$$u_i = x_{r_1} + F \cdot (x_{r_2} - x_{r_3})$$
where $x_{r_1}$, $x_{r_2}$, and $x_{r_3}$ are three different individuals randomly selected from the population, $u_i$ is the mutant of $x_i$, $(x_{r_2} - x_{r_3})$ forms a difference vector, and $F$, a positive real number, controls the length of that vector. The overall procedure of the DE based clustering method is given in Algorithm 1.
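The mutation operator can be sketched in Python as below. The sketch treats individuals as flat lists of real numbers and, following the usual DE convention (an assumption here, not stated in the text), also excludes the target individual itself when sampling the three donors. The function name `de_mutant` and the default `f=0.5` are hypothetical.

```python
import random


def de_mutant(population, i, f=0.5, rng=random):
    """Return the mutant x_r1 + f * (x_r2 - x_r3) for individual i.

    population : list of equal-length lists of floats
    f          : positive scale factor controlling the difference vector length
    """
    candidates = [idx for idx in range(len(population)) if idx != i]
    r1, r2, r3 = rng.sample(candidates, 3)  # three distinct donors
    x1, x2, x3 = population[r1], population[r2], population[r3]
    return [a + f * (b - c) for a, b, c in zip(x1, x2, x3)]


# With identical donors the difference vector is zero, so the mutant equals the donor.
population = [[0.0, 0.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
mutant = de_mutant(population, 0, f=0.8)
print(mutant)  # [1.0, 2.0]
```

The population needs at least four individuals so that three distinct donors other than the target can be drawn.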

Hierarchical Sarsa Learning Based Route Guidance Algorithm
4.1. Overall Procedure
After the optimized Multilevel Network structure has been generated, the proposed hierarchical Sarsa learning based route guidance algorithm (HSLRG) proceeds in 3 stages (initializing, route guidance, and updating):
(i) Initializing stage: initialize the Q-values of all the boundary nodes and destination nodes in the Multilevel Network.
(ii) Route guidance stage: guide vehicles in the higher level network and subnetworks.
In the initialization equation (27), $i, j \in V$, where $V$ is the set of nodes; $d \in D$, where $D$ is the set of destinations; $e_{ij}$ is the link departing from node $i$ to node $j$; $t^{h}_{ij}$ is the historical traveling time of link $e_{ij}$; and $A(i)$ is the set of nodes reachable by a link departing from node $i$.
In this study, the initialization procedure is given as Algorithm 3.
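The iterative, dynamic programming style initialization of Q-values from historical link times can be sketched in Python for a single destination. The sketch makes two assumptions that are not taken from the paper: a link that ends at the destination is initialized with its own historical time, and all other Q-values start at infinity and are refined by repeated sweeps of $Q^{(n)}(i,j) = t^{h}_{ij} + \min_{k} Q^{(n-1)}(j,k)$. The function name `init_q_values` and the toy link times are hypothetical.

```python
def init_q_values(links, destination, sweeps=None):
    """Initialize Q(i, j) for one destination from historical link travel times.

    links : dict mapping (i, j) -> historical travel time of link i -> j
    Returns a dict mapping (i, j) -> estimated travel time to the destination via j.
    """
    successors = {}
    for (i, j) in links:
        successors.setdefault(i, []).append(j)
    inf = float("inf")
    # Assumed base case: a link that ends at the destination costs its own time.
    q = {(i, j): (t if j == destination else inf) for (i, j), t in links.items()}
    nodes = {n for link in links for n in link}
    sweeps = len(nodes) if sweeps is None else sweeps
    for _ in range(sweeps):
        prev = dict(q)  # values from iteration n-1
        for (i, j), t in links.items():
            if j == destination or i == destination:
                continue
            best_next = min((prev[(j, k)] for k in successors.get(j, [])), default=inf)
            q[(i, j)] = t + best_next
    return q


links = {("A", "B"): 10.0, ("B", "D"): 20.0, ("A", "C"): 5.0,
         ("C", "D"): 40.0, ("B", "C"): 3.0}
q0 = init_q_values(links, "D")
print(q0[("A", "B")], q0[("A", "C")], q0[("B", "C")])  # 30.0 45.0 43.0
```

Running as many sweeps as there are nodes is enough for the estimates to stop changing on a network of this kind, since each sweep propagates information one link further from the destination.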
4.3. Route Guidance Procedure
In HSLRG, the guidance is based on Sarsa learning in the Multilevel Network. The guidance in the higher level network determines the actual destinations of vehicles in each subnetwork. The route guidance procedure of CDRGS for each vehicle is divided into 3 steps:
Step 1. Guide the vehicle in the higher level network with Algorithm 4 and get the selected link (the subtask on the subnetwork).
Step 2. According to the result of Step 1, guide the vehicle in the subnetwork with Algorithm 4 until the vehicle reaches the boundary node or destination.
Step 3. If the vehicle has not reached its destination, return to Step 1.
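The three steps can be sketched as a nested loop in Python. For determinism the sketch picks the link with the lowest Q-value (greedy selection) rather than sampling from the Boltzmann distribution used in the paper, and it fixes a single destination. The table layouts `high_q` and `sub_q`, the node names, and the function names are all hypothetical.

```python
def greedy(q_row):
    """Pick the key with the smallest estimated travel time."""
    return min(q_row, key=q_row.get)


def guide_vehicle(origin, destination, high_q, sub_q, max_hops=100):
    """Follow Steps 1-3 and return the list of visited nodes.

    high_q : {higher level node: {next boundary/destination node: Q}}
    sub_q  : {subgoal: {node: {next node: Q}}} Q-tables used inside subnetworks
    """
    route = [origin]
    node = origin
    hops = 0
    while node != destination:
        # Step 1: the higher level network selects the subtask (next subgoal).
        subgoal = greedy(high_q[node])
        if subgoal == node:
            raise ValueError("higher level table selected the current node")
        # Step 2: guide inside the subnetwork until the subgoal is reached.
        while node != subgoal:
            node = greedy(sub_q[subgoal][node])
            route.append(node)
            hops += 1
            if hops > max_hops:
                raise RuntimeError("hop limit exceeded")
        # Step 3: loop back to Step 1 unless the destination has been reached.
    return route


high_q = {"o": {"b1": 50.0, "b3": 70.0}, "b1": {"b2": 30.0}, "b2": {"d": 10.0}}
sub_q = {
    "b1": {"o": {"x": 20.0, "y": 35.0}, "x": {"b1": 10.0}, "y": {"b1": 5.0}},
    "b2": {"b1": {"z": 12.0}, "z": {"b2": 8.0}},
    "d": {"b2": {"d": 10.0}},
}
route = guide_vehicle("o", "d", high_q, sub_q)
print(route)  # ['o', 'x', 'b1', 'z', 'b2', 'd']
```

Each inner loop only consults the Q-table of one subnetwork, which is why the per-agent state space stays small even when the original network is large.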
4.4. Updating Procedure
During the updating stage, the Q-values are updated following the procedure presented as Algorithm 5.
The updates of Q-values for each subnetwork and for the higher level network are independent of each other, so the updating of the proposed method is designed to be computed in parallel. The time complexity of the updating stage is $O(|D_G| \cdot |E_G|)$, where $|D_G|$ and $|E_G|$ are the numbers of elements in the destination set and the link set of road network $G$, respectively.

Simulation
In this study, the SUMO [35] simulator is used to run the experiments with three different digital road networks, as shown in Table 1. All the algorithms were coded in Java, and a PC with an 8-core Xeon E5-2640 v3 2.60 GHz processor and 128 GB of RAM running Linux (CentOS 6.6) was used for all experiments. Our experiments are conducted using real networks, representing various roads of Japan (Experiment 1 and Experiment 2) and the US (Experiment 3). The Japan digital road maps are taken from the Japan Digital Road Map Association (JDRMA). The US digital network is provided by the Topologically Integrated Geographic Encoding and Referencing (TIGER)/Line collection, available at http://www.diag.uniroma1.it/challenge9/data/tiger/. In the simulation, a time step corresponds to one second, and the length of each simulation is set as 15000 time steps.
5.1. Multilevel Network
The DE based clustering method is used to generate the Multilevel Network of each experiment. The evolution process is shown in Figure 5, where the x-axis is the generation and the y-axis is the average fitness of the individuals in the population. The results of the DE are given in Table 2.
It can be seen that the DE based clustering method effectively reduces the fitness during the process of evolution, and that the Multilevel Network structure used in the proposed algorithm has been optimized substantially.
5.2. Comparing Method
In the experiments, the Dijkstra algorithm (DA) and Sarsa learning based route guidance on the original road network are adopted for comparison with the proposed method.
(2) Sarsa learning based route guidance on the original road network: In order to evaluate the efficiency of the Multilevel Network based route guidance method, the Sarsa learning with Boltzmann distribution algorithm (SLWBD), which only considers route guidance on the original road network, is adopted as a comparison method in the simulations. The Boltzmann distribution is selected as the action selection method. The Q-values are updated with (17) every 60 time steps.
5.3. Evaluation
Two criteria are adopted to evaluate the performance of the route guidance algorithms:
(1) The number of vehicles in the traffic system, $N$.
(2) The average traveling time $V$ of vehicles arriving at their destinations in a period of time, which is calculated as
$$V(t) = \frac{1}{N_a(t)} \sum_{j=1}^{N_a(t)} T(v_j)$$
where $t$ is the time step; $N_a(t)$ is the total number of vehicles arriving at their destinations in the period of time ending at $t$; $v_j$ is one of the vehicles that reached its destination in the time period; and $T(v_j)$ is the traveling time of vehicle $v_j$.
These figures are estimated every 100 time steps, and the time period is set as 100 time steps. The two criteria reflect the traffic condition in the road network: a lower $N$ means that less congestion occurred in the road network, and a lower $V$ reflects that vehicles were guided along better routes and spent less time waiting in the road network. These two criteria are therefore also used to evaluate whether HSLRG has converged.
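The second criterion can be computed with a short Python sketch. It assumes arrivals are logged as `(arrival_time_step, travel_time)` pairs and that the period covers the 100 time steps ending at the evaluation step; the function name `average_travel_time` and the sample log are hypothetical.

```python
def average_travel_time(arrivals, t, period=100):
    """Mean travel time of vehicles that arrived in the window (t - period, t].

    arrivals : list of (arrival_time_step, travel_time) tuples
    Returns None when no vehicle arrived in the window.
    """
    window = [travel for step, travel in arrivals if t - period < step <= t]
    return sum(window) / len(window) if window else None


# Hypothetical arrival log: two vehicles arrive in (100, 200], one later.
arrivals = [(120, 300.0), (180, 420.0), (250, 200.0)]
avg_200 = average_travel_time(arrivals, t=200)
avg_300 = average_travel_time(arrivals, t=300)
print(avg_200, avg_300)  # 360.0 200.0
```

Returning `None` for an empty window avoids a division by zero in periods during which no vehicle reaches its destination.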
5.4. Experiment
In this part, simulations are conducted to evaluate the performance of the proposed HSLRG. For this evaluation, the drivers' acceptance of guidance is assumed to be 100%. The updating interval of the higher level network is set as 30 time steps, and the updating interval of the subnetworks is 60 time steps. The data shown in the following tables are averages over multiple independent simulations. In order to accelerate the convergence of reinforcement learning at the early stage of the simulation and to keep the Q-values stable at the middle and final stages, the learning rate $\alpha$ of Sarsa learning is changed depending on the time step of the simulation. The concept of Simulated Annealing [36] is introduced for this purpose: the learning rate depends on the current simulation time $t$, the total simulation time MAXTIME, and constants, and it is bounded below by a minimum value. Figures 6(a)-6(f) show $N$ and $V$ of these experiments. Table 4 shows the mean and standard deviation (Std) of these experiments.
As shown in Figures 6(a)-6(f), HSLRG has lower evaluation values than SLWBD and DA during almost the entire simulations. These data indicate that HSLRG is suitable for guiding vehicles in large scale road networks; it can alleviate congestion and reduce the traveling time and traveling distance of vehicles in larger scale road networks. In Figures 6(a)-6(d), $N$ and $V$ of HSLRG and SLWBD begin to decrease after the early stage of the simulation (about 5000 time steps in Experiment 1 and about 2000 time steps in Experiment 2), while, as shown in Figures 6(e) and 6(f), the evaluation values of SLWBD increased dramatically over the total 15000 time steps. The data indicate that SLWBD performs reasonably in road networks of limited size but poorly in larger scale road networks. As Figures 6(a)-6(f) show, the measured values of DA increased continuously. This indicates that DA is not a proper method for route guidance in dynamic environments. The main reason is that DA only considers the static shortest routes, which may cause negative behavioral phenomena in a dynamic transportation system, including overreaction and concentration. As shown in Table 4, the mean and Std of $N$ and $V$ show that the proposed HSLRG dominates SLWBD and DA, which supports the effectiveness of the proposed HSLRG.
As shown in Table 3, HSLRG has the best performance in all the experiments and outperforms the other two methods; the statistical data indicate that vehicles guided by this algorithm have not only the largest number of arrivals and the lowest mean traveling time, but also the shortest traveling distance. SLWBD performs better than DA in Experiment 1 and Experiment 2 but worse in Experiment 3. This result indicates that Sarsa learning based route guidance on the original road network is not suitable for guiding vehicles in large scale road networks. The reason is that the speed of convergence of reinforcement learning depends on the scale of the searching space, which grows exponentially with the scale of the road network. The proposed HSLRG introduces an optimized Multilevel Network structure, in which route guidance on the subnetworks and route guidance on the higher level network are combined to compress the searching space of the traffic system. Thus, the proposed HSLRG can greatly enhance the efficiency of CDRGS.

Conclusion
In this paper, we have proposed the hierarchical Sarsa learning based route guidance algorithm (HSLRG) to solve the route guidance problem in large scale road networks. HSLRG applies the Multilevel Network method to reduce the state space of the traffic environment, which can greatly accelerate the convergence of the route guidance algorithm. The effectiveness and efficiency of HSLRG were studied in three road networks of different scales. The simulation results show that, in large scale road networks, HSLRG guides vehicles to their destinations more effectively than SLWBD and DA. Guiding vehicles with multiple objectives and taking drivers' individual preferences into account are worthwhile directions for future research.

Figure 1: An example of Multilevel Network.

Figure 2: An example of vehicle guidance in the higher level network.

Figure 3: An example of decomposition of route guidance.

Figure 4: Demonstration of vehicle guidance in the network.

(iii) Updating stage: update the Q-values of all the boundary nodes and destination nodes in the Multilevel Network. Before each updating stage, the CDRGS collects travelling information from the environment. Between updates, the CDRGS guides vehicles with the Q-values obtained in the last updating stage. The overall procedure of the proposed HSLRG is shown as Algorithm 2.
4.2. Initializing Q-Values
Q-value based Dynamic Programming is adopted to initialize the Q-values of Sarsa learning in the Multilevel Network, and the Q-values are iteratively calculated by the following equation:
$$Q^{(n)}_d(i, j) = t^{h}_{ij} + \min_{k \in A(j)} Q^{(n-1)}_d(j, k), \quad j \in A(i) \tag{27}$$
which applies to nodes $i$ in $V$ other than the destination $d$ and a further excluded set of nodes that depends on $d$.

Figure 5: The evolution process of road network in Experiment 1, Experiment 2, and Experiment 3.
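Notation of the Multilevel Network model: the cost of link $e_{ij}$ in level $l$ of the Multilevel Network; and an indicator that equals 1 if and only if link $e^{l}_{ij}$ is included in the route from origin to destination in level $l$, and 0 otherwise.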

Table 1: Data of experiments.

Table 2: Results of DE based clustering method.
Algorithm 3: Initialization.
begin
  // Initializing Q-values in each subnetwork
  for each destination d of the subnetwork do
    Initialize Q_d according to Eq. (27) in the corresponding subnetwork
  end for
  // Initializing Q-values in the higher level network
  for each destination d do
    Initialize Q_d according to Eq. (27) in the higher level network

Algorithm 4: Route guidance (output: next link).
begin
  Get the current link of vehicle v
  // Calculating the probabilities of next links according to Eq. (18).

Table 3: Results of Experiments.

Table 4: Mean and Std of Experiment results.