Reinforcement Learning Ramp Metering without Complete Information

This paper develops a model of reinforcement learning ramp metering (RLRM) without complete information, which is applied to alleviate traffic congestion on ramps. RLRM consists of prediction tools based on traffic flow simulation and an optimal choice model based on reinforcement learning theory. It is also a dynamic process with automaticity, memory, and performance feedback. Numerical cases demonstrate RLRM by calculating the outflow rate, density, average speed, and travel time and comparing them with no control and fixed-time control. The results indicate that the greater the inflow, the greater the effect of the control. In addition, RLRM is more stable than fixed-time control.


Introduction
Increasing dependence on car-based travel has led to the daily occurrence of recurrent and nonrecurrent freeway congestion, not only in China but also around the world. Congestion on highways forms when the demand exceeds capacity. Recurrent congestion substantially reduces the available infrastructure capacity at rush hour, that is, at the time this capacity is most urgently needed. Moreover, congestion causes delays, increases environmental pollution, and reduces traffic safety.
Ramp metering is essential to the efficient operation of highways, particularly when volumes are high. According to Papageorgiou and others, ramp metering is divided roughly into the reactive type and the proactive type [1]. DC (demand-capacity), OCC (occupancy), and ALINEA [2] are among the well-known local responsive ramp metering strategies [3]. In DC, the actual upstream volume is measured at regular short intervals and is then compared to the downstream capacity, which may be calculated from downstream traffic conditions. OCC uses a predetermined relationship between occupancy rate and lane volume, developed from data previously collected on the highway adjacent to the ramp being considered. ALINEA is a ramp metering strategy that sets the metering rate of an on-ramp based on measured values of main line traffic. ALINEA has been applied in several European countries and is more thoroughly validated than DC and OCC. Iwata, Tsubota, and Kawashima proposed a ramp metering technique that uses values predicted by a traffic simulator [4]. Reinforcement learning ramp metering based on a traffic simulation model with desired speed was proposed by Wang et al. [5]. The aim of this study is to propose reinforcement learning ramp metering without complete information.

Traffic Flow Simulation Model
Figure 1 describes car-following behaviors. In a microsimulation model, the fundamental modeled behavior is "car-following", in which the driver adjusts to the distance between two adjacent cars, the relative speed, and so forth.
In 1953, Pipes proposed a basic differential equation model for car-following behavior in which ẍ, ẋ, and x denote the acceleration, speed, and distance from the reference point of vehicle n, respectively, and a is a constant. In the model, the acceleration of the vehicle which follows a leading vehicle is proportional to the speed difference between the vehicles. It is assumed that the delay with which the vehicle responds to the speed difference is so small that it can be neglected. To remove this drawback, Chandler introduced a reaction delay time T. Based on the rationale that the acceleration of the following car is also influenced by its speed and the distance between the vehicles, Gazis, Herman, and Rothery proposed the general type of car-following model. Newell proposed a model, based on real data, in which the acceleration is proportional to an exponential function of the distance between the vehicles.

Figure 1: Car-following behavior.

Although the above modifications have improved the realism of the car-following model, they have the following two drawbacks. When the preceding vehicle does not exist, the models imply that a car will maintain its initial speed. On the other hand, when the speed difference is 0, the acceleration is 0. This implies the unrealistic phenomenon that the following car will not apply the brake even when the distance to the preceding car approaches 0 and will not accelerate even if the distance is very long.

To solve the above-mentioned problems, Treiber and Helbing introduced the intelligent driver model (IDM) [6], which introduces a desired speed and a shortest distance between cars. In the IDM, x is the distance; n denotes the nth car; v is the speed; l is the length of the car; s_0 is the desired minimum gap; a is the maximum acceleration; s* is the effective gap; b is the comfortable deceleration (a ≤ b); δ is the parameter; T is the time gap; v_0 is the desired speed.

Figure 2 presents lane change behaviors. To simulate the driver's behavior in the merging section on freeways, the merging behavior in the weave section, and so forth, a lane change model is needed [7]. We propose a new lane change model which describes the driver's behavior depending on judgment functions [8,9]. We focus on a vehicle approaching a confluence point and describe its behavior with several variables: the relative speed between the car and the cars in the current lane, the locations of both the main line cars and the on-ramp cars, the driver's judgment functions for changing lanes, and the driver's desired speed. The driver's judgment function for free merging is different from the judgment function for forced merging. Free merging implies that a car on the ramp can merge into the main line without being influenced and without interfering with the cars on the main line.
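As an illustration, the IDM acceleration can be sketched in Python as follows. The sketch assumes the standard published form of the IDM, dv/dt = a[1 − (v/v_0)^δ − (s*/s)^2] with s* = s_0 + vT + vΔv/(2√(ab)). The function name, the argument names, and the default parameter values are illustrative assumptions and are not taken from this study.

```python
import math


def idm_acceleration(v, gap, dv, v0=33.3, T=1.5, a=1.0, b=2.0, s0=2.0, delta=4.0):
    """Acceleration (m/s^2) of the following car under the standard IDM.

    v   : speed of the following car (m/s)
    gap : net distance to the preceding car, x_{n-1} - x_n - l (m);
          use math.inf when there is no preceding car
    dv  : approach rate v_n - v_{n-1} (m/s), positive when closing in
    All default parameter values are assumed for illustration only.
    """
    # Effective desired gap s*: minimum gap + time-gap term + braking term.
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a * b))
    # Free-road term (v / v0)^delta and interaction term (s* / gap)^2.
    return a * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


# No preceding car: the vehicle accelerates toward the desired speed.
free_road = idm_acceleration(v=0.0, gap=math.inf, dv=0.0)
# Zero speed difference but a very short gap: the vehicle brakes.
short_gap = idm_acceleration(v=20.0, gap=5.0, dv=0.0)
```

With no preceding car the sketch accelerates toward the desired speed, and with a zero speed difference but a small gap it brakes, which are the two cases that the earlier models handle unrealistically.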
When the psychological condition and the physical condition of the forced merging model are both satisfied, the driver conducts lane change behaviors. Otherwise, the driver continues the car-following behavior without lane change behaviors. The physical condition represents the ability to change lanes. The lane change model with the driver's judgment functions is expressed in terms of the following quantities: h and g are the judgment functions; x is the distance from the reference point; v is the speed; L is the length of a vehicle; t is the judgment time; v_0 is the desired speed, which follows a normal distribution; δ, ζ, θ, ξ (δ, ζ, θ, ξ ∈ [0, 1]) are the adjustment coefficients; A is the rapid acceleration with upper bound e; and B is the rapid deceleration with upper bound d. Parameters A and B are associated with vehicle c's judgment functions for lane change and decide between free merging and forced merging. Since vehicle c decides either to accelerate or to decelerate to merge into the main line, the two events are mutually exclusive. The function h judges whether vehicle c accelerates or decelerates to merge according to the given space and speed conditions between vehicles f and c. Similarly, the function g is applied to the relationship between vehicles c and b. If both A and B take 0, the distance between the two vehicles f and b is large enough for vehicle c to enter the main line, and free merging occurs (no acceleration or deceleration is required for vehicle c). Conversely, in the case of forced merging, we need to examine whether a solution of inequalities (8) to (11) exists. If A and B are mutually exclusive, then the following two conditions (1) and (2) are obtained.
(1) When a rapid brake event B does not exist, then B = 0, and only an event A can happen.
(2) When a rapid acceleration event A does not exist, then A = 0, and only an event B can happen.
The lane changing behavior of vehicle c can happen when a solution of (1) or (2) exists.
Psychological constraints describe the driver's motivation for a lane change: the present car has not reached the desired speed, and the predicted speed after a lane change is greater than that without the change, that is, the driver gains a speed advantage. a_1 and a_2 denote the predicted accelerations with and without a lane change, respectively, and are given by the IDM. The psychological constraints are then given by (12). If (12) has a solution, the driver changes from the current lane to the target lane. Otherwise, the driver does not conduct the lane changing maneuver.
Lane change behaviors can be characterized as a sequence of three stages: the ability of lane change (physical condition); the motivation of lane change (psychological constraints); the execution of lane change.When lane change models of psychological condition and physical condition are both satisfied, the driver conducts the above-mentioned three stages.Otherwise, the driver continues the car-following behavior without lane change behaviors.
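The three-stage decision can be sketched as below. This is a simplified illustration only: the physical condition of this study is expressed through the judgment functions h and g, whereas the sketch substitutes a plain gap threshold, and every function name and parameter here is hypothetical.

```python
def wants_lane_change(v, v0, a_change, a_stay):
    """Psychological condition: the car is below its desired speed and
    the predicted acceleration with a lane change exceeds that without."""
    return v < v0 and a_change > a_stay


def can_lane_change(gap_front, gap_back, min_gap):
    """Physical condition, simplified to a gap check (an assumption;
    the study uses the judgment functions h and g instead)."""
    return gap_front >= min_gap and gap_back >= min_gap


def lane_change_decision(v, v0, a_change, a_stay, gap_front, gap_back, min_gap):
    """Execute the lane change only if both conditions hold;
    otherwise the driver keeps following the car ahead."""
    return can_lane_change(gap_front, gap_back, min_gap) and wants_lane_change(
        v, v0, a_change, a_stay
    )
```

For example, `lane_change_decision(20.0, 30.0, 0.8, 0.2, 30.0, 25.0, 10.0)` returns `True`, while shrinking either gap below `min_gap` or setting `a_change` below `a_stay` makes it return `False`.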
We develop a traffic flow simulation model consisting of car-following model and lane change model [10][11][12].The basic concept of car-following theories is the relationship between stimuli and response.In the classic car-following theory, the stimuli are represented by the relative speed of following and leading vehicle, and the response is represented by the acceleration (or deceleration) rate of the following vehicle.The car-following model describes following behaviors that drivers follow each other in the traffic stream on only one lane.To reproduce the traffic flow in two or more lanes, lane change model which explores lane change behaviors is needed.By using the car-following model and lane change model, we express dynamic and complex traffic behaviors in two or more lanes.Moreover, traffic flow simulation models are applied to reproduce the traffic congestion represented by Helbing and Kerner [13][14][15][16].

The Reinforcement Learning Ramp Metering.
Reinforcement learning is a kind of machine learning that treats the problem of how an agent in a given environment determines its actions. The agent observes the present state and chooses an action accordingly, and it receives a reward from the environment for the chosen action. Reinforcement learning learns the policy that obtains the most reward through a series of actions [17]. Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience or simulations [18][19][20][21].
The model of reinforcement learning ramp metering (RLRM) is shown in Figure 3. qin is the inflow of the upstream of the main line; r is the metering rate; qout is the outflow of the downstream of main line; dm is the density of the main line in merging section; dr is the density of onramp; vm is the average speed of the main line; vr is the average speed of onramp.
According to the volume q in the merging section, the upstream traffic qin is updated at each step; qin, called the state variable, can be collected by the control variable detector. r is set as the action variable, and qout is the reward for the chosen action. ρ_L is the traffic density in the merging section of length L.
The framework of RLRM is explained briefly according to Figure 4. RLRM consists of a metering rate choice model, an outflow function, a value function, and an environmental model. The metering rate choice model is a rule for choosing the optimal metering rate. The outflow function describes the downstream traffic data, which can be collected and calculated by detectors. The value function represents the total volume of downstream traffic. The environmental model predicts the inflow and outflow in the next period of time depending on the optimal metering rate and the inflow.

RLRM with Complete Information.
The RLRM with complete information faces a Markov decision process (MDP). In addition, the set of inflows S and the set of metering rates A(qin) (qin_t ∈ S) are finite. We typically use a set of matrices $P^{r}_{qin\,qin'} = \Pr\{qin_{t+1} = qin' \mid qin_t = qin,\ r_t = r\}$ (16) to describe the transition structure. Traffic outflow at time t is obtained for all qin ∈ S, for all r ∈ A(qin), and for all qin' ∈ S+.

The maximum outflow V* or Q* is given by the Bellman formula. With the complete information of the MDP, we can obtain the transition probability $P^{r}_{qin\,qin'}$ and the next outflow V^π(qin). We assume that the traffic outflow is finite, so that it can also be computed.
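For reference, the Bellman optimality equations for this setting can be written in the standard textbook form below, with the inflow qin as the state, the metering rate r as the action, and the outflow qout as the reward. The discount factor γ is an assumption of this sketch, and the notation may differ from that of the numbered equations of this study.

```latex
\begin{align*}
V^{*}(qin) &= \max_{r \in A(qin)} \mathbb{E}\left[\, qout_{t+1} + \gamma\, V^{*}(qin_{t+1}) \;\middle|\; qin_t = qin,\ r_t = r \,\right] \\
Q^{*}(qin, r) &= \mathbb{E}\left[\, qout_{t+1} + \gamma \max_{r' \in A(qin_{t+1})} Q^{*}(qin_{t+1}, r') \;\middle|\; qin_t = qin,\ r_t = r \,\right]
\end{align*}
```

Evaluating either expectation requires the transition probabilities, which is why this formulation needs complete information.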

RLRM without Complete Information.
The Markov decision process of Section 2.3 assumes complete information, but this assumption is untenable in practice. Without complete information, the ramp metering rate can instead be determined by evaluating experience. Since the transition probability is then not needed, (18) can be rewritten in a form in which qout_t is the real-time outflow at time t and the constant a_t is the transition probability function of t. Equation (19) can be replaced in a similar way, and the expected value of the metering rate is also replaced when it is not given. We suppose that the probability of the on-ramp control policy π can be obtained from (24). Here, it is difficult to satisfy the initial condition. The values a π(qin_t, r_t) Q(qin_{t+1}, r_t) associated with an optimal on-ramp control policy are called the optimal ramp inflow and are often written as max Q(qin_{t+1}, r), which leads to (25). In (25), the action value function Q obtained by learning directly approximates Q* (the optimal action value function) by using the current policy. The state variable can be updated depending on the policy.
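A learning rule of this kind can be sketched with the standard one-step Q-learning update, in which the observed outflow replaces the expectation over transition probabilities. The sketch assumes a tabular Q indexed by discretized (qin, r) pairs; the step size `alpha`, the discount factor `gamma`, and all names are assumptions made for illustration, not values from this study.

```python
from collections import defaultdict


def q_update(Q, qin, r, qout, qin_next, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning update for the state-action pair (qin, r).

    Q        : table mapping (inflow, metering rate) -> estimated value
    qout     : observed outflow, used as the reward
    qin_next : inflow observed in the next control cycle
    actions  : candidate metering rates available in the next state
    """
    # Best value reachable from the next state under the current estimates.
    best_next = max(Q[(qin_next, r_next)] for r_next in actions)
    # Move the old estimate toward the observed target by a fraction alpha.
    Q[(qin, r)] += alpha * (qout + gamma * best_next - Q[(qin, r)])
    return Q[(qin, r)]


Q = defaultdict(float)  # unseen state-action pairs start at 0.0
first = q_update(Q, qin=1500, r=300, qout=1800.0, qin_next=1600, actions=[0, 300])
```

No transition probability appears in the update; only the sampled next inflow and the measured outflow are needed.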
When the traffic reaches the jam density, the ramp may be closed for a long period of time, which must be taken into consideration. A maximum waiting time (T_max) and its metering rate (r_T) are therefore given. When $\sum_{n=1}^{m} TS_n > T_{max}$, the control (qin_t, r_T) is selected. To avoid the curse of dimensionality, the continuous variable r_t is discretized. The interval between 0 and r_max is divided into steps of size r_n, where N_r is the number of metering rates and cell is the integer rounding function. The metering rate is min(k r_n, r_max) for k ∈ N.
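The discretization can be illustrated with the short sketch below. It assumes an evenly spaced grid with step r_max / (N_r − 1), capped at r_max; this spacing is an assumption chosen because it reproduces the metering rate set {0, 100, ..., 1100} used in the simulation, and the function name is hypothetical.

```python
def metering_rates(r_max, n_rates):
    """Return n_rates evenly spaced metering rates from 0 to r_max.

    The step size r_max / (n_rates - 1) is an assumed discretization;
    n_rates must be at least 2.
    """
    step = r_max / (n_rates - 1)
    # Cap each candidate at r_max so that no rate exceeds the maximum.
    return [min(k * step, r_max) for k in range(n_rates)]


rates = metering_rates(r_max=1100, n_rates=12)
```

A coarser grid shrinks the action space that the learner has to explore, at the cost of a less finely tuned metering rate.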
The algorithm of reinforcement learning on-ramp metering is shown in Figure 5.
(2) Determine the cycle time t of the traffic signal.
(4) Give the metering rate by r_t = k × r_n.
(5) Determine the traffic state (qin_t, r_t).
(6) Generate the density ρ_L by using traffic simulation and choose the metering rate.
(7) If r_t < r_max, then update k = k + 1 and go to (4); otherwise, generate the optimal control (qin_t, r*).
(8) If the ramp is closed, then update the waiting time T by T = T + t; otherwise, reset the waiting time to T = 0. If T > T_max, then update the metering rate by r_T → r*.
(9) Operate the optimal control (qin_t, r*) and update Q.
When the cycle time t is over, decide whether to continue the ramp metering. If yes, collect the inflow data qin_{t+1}, update qin_t (that is, qin_{t+1} → qin_t), and go to (3); otherwise, end the ramp metering.
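Two parts of the procedure above, the search over candidate metering rates and the waiting-time safeguard, can be sketched as follows. The predictor passed to `select_rate` stands in for the traffic simulation and is purely hypothetical, as are all names. A closed ramp is taken to mean r = 0, and the waiting time is not reset after the forced rate is applied because the text does not specify a reset.

```python
def select_rate(qin, rates, predict_outflow):
    """Try every candidate rate and keep the one with the largest
    predicted outflow (the role played by the simulation-based search)."""
    return max(rates, key=lambda r: predict_outflow(qin, r))


def wait_guard(r_star, waited, cycle, t_max, r_forced):
    """Waiting-time safeguard for a closed ramp.

    Returns the rate to apply and the updated waiting time.
    """
    # Accumulate waiting time while the ramp is closed, otherwise reset it.
    waited = waited + cycle if r_star == 0 else 0
    # After waiting longer than t_max, override the choice with r_forced.
    if waited > t_max:
        return r_forced, waited
    return r_star, waited


def toy_outflow(qin, r):
    """Toy predictor (assumption): outflow peaks when qin + r = 2000."""
    return -((qin + r - 2000) ** 2)


candidates = [100 * k for k in range(12)]
best = select_rate(1500, candidates, toy_outflow)
```

In a full loop, the selected rate would be applied for one cycle, the resulting outflow measured, and the Q table updated before the next inflow is collected.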

Data Combination and Reduction
Our aim is to design a reinforcement learning control law for the ramp metering controller without complete information. We need to control the inflow from the ramp into the main line, and the metering rate should be determined by the traffic state. Traffic flow simulation is conducted to demonstrate this control of the ramp metering. In our simulation, we set the main line length to 1000 m, the ramp length to 200 m, and the length of the merging section of the main line and ramp to 100 m. Parameters of RLRM are shown in Table 1, and the metering rate set is {0, 100, 200, 300, ..., 900, 1000, 1100}. Table 2 shows the inflow of cases A, B, C, D, E, and F. The inflow rate of the main line increases from 1200 pcu/hour in case A to 2500 pcu/hour in case F. Moreover, the inflow rate of the ramp rises from 300 pcu/hour in case A to 900 pcu/hour in case F. The cycle length of the fixed-time control is 20 s, which consists of 15 s of green time and 5 s of red time.
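The fixed-time baseline can be expressed in a few lines. The sketch assumes that the green phase starts at the beginning of each cycle, since the phase offset is not stated; the function names are illustrative.

```python
def fixed_time_is_green(t, cycle=20.0, green=15.0):
    """Return True if the ramp signal is green at time t (seconds).

    Assumes each cycle starts with the green phase.
    """
    return (t % cycle) < green


def green_share(cycle=20.0, green=15.0):
    """Fraction of each cycle during which ramp vehicles may enter."""
    return green / cycle
```

With a 20 s cycle and 15 s of green, the ramp is open for 75% of the time regardless of the traffic state, which is the main contrast with a state-dependent metering rate.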

Result and Discussion
The results of no control, fixed-time control, and RLRM are shown in Figures 6-9. Total inflow increases from 1500 pcu/h in case A to 3400 pcu/h in case F. Figure 6 presents the average speed and its rate compared to no control. In case A, the average speed with no control, about 108 km/h, is higher than with fixed-time control and RLRM. Similar results are obtained in case B. In case C, the average speed with no control, about 79 km/h, is higher than with fixed-time control and lower than with RLRM. In case F, the average speed with no control, about 51 km/h, is lower than with fixed-time control and RLRM. In terms of average speed, the congestion relief rates of fixed-time control from case A to case F are −7.80%, −6.65%, −3.77%, 0.26%, 2.70%, and 8.26%, respectively. The congestion relief rates of RLRM from case A to case F are −6.31%, −6.49%, 5.69%, 13.55%, 20.50%, and 18.18%, respectively.
Figure 7 describes the density and its rate compared to no control. In case A, the densities of fixed-time control and RLRM are about 38 pcu/km, an increase of about 60%. In case C, the densities of fixed-time control and RLRM are about 52 pcu/km and 45 pcu/km, decreases of 11.46% and 22.60%, respectively. Densities of fixed-time control, no control, and RLRM are about 120 pcu/km. According to densities, rates of congestion reliefs of fixed-time control from case A to case F

Figure 9 presents the travel time and its rate compared to no control. In case A, travel time increases by 6.25% with fixed-time control and by 9.38% with RLRM. In case C, travel time rises from 342 s without control to 370 s with fixed-time control and falls to 330 s with RLRM. In case F, travel time falls from 617 s without control to 469 s with fixed-time control and to 343 s with RLRM. The congestion relief rates of fixed-time control from case A to case F are −6.25%, −25.26%, −8.19%, 7.36%, 27.06%, and 23.99%, respectively. The congestion relief rates of RLRM from case A to case F are −9.38%, −5.26%, 3.51%, 38.17%, 40.32%, and 44.41%, respectively.
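The travel-time relief rates can be reproduced from the reported travel times. The formula below, the relative reduction with respect to no control, is inferred from the reported figures rather than stated explicitly, and the function name is illustrative.

```python
def relief_rate(no_control, controlled):
    """Congestion relief rate in percent for travel time:
    positive when the control shortens travel time."""
    return 100.0 * (no_control - controlled) / no_control


# Case C: 342 s without control, 370 s fixed-time, 330 s RLRM.
case_c_fixed = round(relief_rate(342, 370), 2)
case_c_rlrm = round(relief_rate(342, 330), 2)
# Case F: 617 s without control, 469 s fixed-time, 343 s RLRM.
case_f_fixed = round(relief_rate(617, 469), 2)
case_f_rlrm = round(relief_rate(617, 343), 2)
```

These evaluate to −8.19 and 3.51 for case C and to 23.99 and 44.41 for case F, matching the rates listed above.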
According to Figures 6-9, when the traffic inflows are low, the controls are not efficient. The controls become more efficient as the traffic inflows increase. When the traffic inflows are high, the controls are very efficient, and RLRM is the optimal control. Moreover,

Conclusion
On-ramp metering ensures that traffic moves at a speed approximately equal to the optimum speed, which results in maximum flow rates or minimum travel time. This study develops an RLRM model without complete information, which consists of prediction tools based on traffic flow simulation and an optimal choice model based on reinforcement learning theory. Numerical cases are given to demonstrate RLRM compared to no control and fixed-time control. Densities, outflow rates, average speeds, and travel times are calculated for cases A, B, C, D, E, and F, and fixed-time control and RLRM are discussed on that basis. When traffic inflow is low, the controls are not efficient, and there is little difference among no control, fixed-time control, and RLRM. On the other hand, when traffic inflow is high, the controls are very efficient, and RLRM is the optimal control. Moreover, the greater the inflow, the greater the effect. In addition, RLRM is more stable than fixed-time control.

Figure 4: Block diagram for reinforcement learning ramp metering.

Figure 5: Algorithm of reinforcement learning on-ramp metering (flowchart).

Figure 6: Average speed and its rate compared to no control.

Figure 7: Density and its rate compared to no control.

Figure 8: Outflow and its rate compared to no control.

Figure 9: Travel time and its rate compared to no control.