Ramp Metering for a Distant Downstream Bottleneck Using Reinforcement Learning with Value Function Approximation

Ramp metering for a bottleneck located far downstream of the ramp is more challenging than for a bottleneck near the ramp. Under a conventional linear feedback-type ramp metering strategy, by the time metered traffic from the ramp arrives at the distant downstream bottleneck, the state of the bottleneck may have changed significantly from the state that was sampled to compute the metering rate, because the traffic needs considerable time to traverse the long distance between the ramp and the bottleneck. Such time-delay effects can cause significant stability issues. Previous studies have mainly compensated for the time-delay effects by incorporating predictors of traffic flow evolution into the control systems. This paper presents an alternative approach. The problem of ramp metering for a distant downstream bottleneck is formulated as a Q-learning problem, in which an intelligent ramp meter agent learns a nonlinear optimal ramp metering policy such that the capacity of the distant downstream bottleneck is fully utilized but not exceeded, so that congestion is not caused. The learned policy is in pure feedback form in that only the current state of the environment is needed to determine the optimal metering rate for the current time. No prediction is needed, as anticipation of traffic flow evolution has been instilled into the nonlinear feedback policy via learning. To deal with the intimidating computational cost associated with the multidimensional continuous state space, the value function of actions is approximated by an artificial neural network rather than a lookup table. The mechanism and development of the approximate value function, and how the learning of its parameters is integrated into the Q-learning process, are explained in detail. In experiments, the learned ramp metering policy demonstrated effectiveness, benign stability, and some level of robustness to demand uncertainties.


Introduction
A genuine motivation behind ramp metering strategies is to reduce the total time spent within the freeway network of interest [1]. Minimization of the total time spent can be shown to be equivalent to maximizing the time-weighted discharging flow from the network, i.e., encouraging early discharge of flow [1]. This motivation, combined with the knowledge of traffic flow theory, implies that the objective of a ramp metering strategy is to keep the flow rate into the most restrictive bottleneck of the network close to the capacity of the bottleneck without exceeding it, so that congestion is not caused. This objective can be achieved by regulating the traffic density (or occupancy) of the bottleneck to stay close to the critical density (or critical occupancy) through metering the ramp flow.
This is the principle behind many conventional linear feedback-type ramp metering strategies, e.g., [2][3][4][5]. For this kind of ramp metering strategy, the control target bottleneck is usually near the ramp, and in most cases, the bottleneck is caused by the merging of the mainline and ramp traffic itself. In some other cases, however, the control target bottleneck is located far away from the metered ramp, for example, a lane-drop that is a few kilometers downstream. In these latter cases, conventional linear feedback-type ramp metering strategies can perform poorly in stability due to the long distance between the ramp and the bottleneck. Specifically, when metered traffic from the ramp arrives at the distant downstream bottleneck, the traffic density (or occupancy) of the bottleneck may have changed significantly from when it was sampled for computing the metering rate. To overcome this issue, many previous studies have compensated for the time-delay effects by incorporating predictors of traffic flow evolution into the control systems.
This study presents an alternative approach. The proposed approach formulates the problem of ramp metering for a distant downstream bottleneck as a Q-learning problem, in which an intelligent ramp meter agent learns an optimal ramp metering policy such that the capacity of the distant downstream bottleneck is fully utilized but not exceeded, so that congestion is not caused. To the best of our knowledge, this is the first such effort in the literature. The learned policy is in pure feedback form in that only the current state of the environment is needed to determine the optimal metering rate for the current time. No prediction is needed, as anticipation of traffic flow evolution has been instilled into the learned nonlinear feedback policy. To deal with the intimidating computational cost associated with the multidimensional continuous state space of the formulated Q-learning problem, the value function of ramp metering rates is approximated by an artificial neural network (ANN) rather than a lookup table.
In the remainder of this paper, Section 2 reviews previous studies in ramp metering for distant downstream bottlenecks and Q-learning applications in freeway control. Section 3 develops the proposed approach, including formulation of the Q-learning problem with value function approximation and the algorithm to solve the problem. Section 4 evaluates the proposed approach by experiments. Section 5 concludes this study.

Ramp Metering for a Distant Downstream Bottleneck.
Compared with the rich literature on ramp metering strategies for bottlenecks near ramps, studies on ramp metering for distant downstream bottlenecks are far fewer. These studies include [6][7][8][9][10][11][12][13]. In [6], the notable ALINEA strategy, which is a linear "proportional" control strategy, was extended by adding an "integral" term to it, resulting in the so-called PI-ALINEA strategy. The authors theoretically proved the stability of the PI-ALINEA strategy. Later, Kan et al. [7] evaluated by simulation the performance of PI-ALINEA in controlling a distant downstream bottleneck. The simulation model employed was METANET [14], a second-order discrete-time macroscopic model of traffic flow dynamics. The simulation evaluation showed that PI-ALINEA outperformed ALINEA in terms of stability. In [8], to deal with the time-delay effects of ramp metering for distant lane-drop bottlenecks, the authors incorporated a Smith predictor [15] into ALINEA and termed the resulting strategy SP-ALINEA. Through simulation, they showed that the stability region of SP-ALINEA is much broader than that of PI-ALINEA. The simulation model employed by Felipe de Souza and Jin [8] was the cell transmission model (CTM) [16], a first-order discrete-time macroscopic model of traffic flow dynamics. Similar to [8], Frejo and De Schutter [9] added a feedforward term to ALINEA to incorporate anticipated evolutions of the bottleneck density in order to improve the performance of ALINEA. The resulting strategy is termed FF-ALINEA. Similar to [8,9], Yu et al. [10] coupled a predictor to an extremum-seeking controller for controlling a distant downstream lane-drop bottleneck by metering upstream mainline flow. In [12,13], fuzzy theory was applied to a proportional-integral-derivative (PID) type ramp metering controller to learn the PID gains in real time. The resulting controller has the capability of anticipation and hence performs better in controlling a distant downstream bottleneck than a controller with fixed gains. Stylianopoulou et al. [11] proposed a linear-quadratic-integral (LQI) regulator-type ramp metering strategy for controlling a distant downstream bottleneck. Unlike the studies summarized above, which only take measurements near the bottleneck, the controller in [11] uses measurements spread along the whole stretch between the ramp and the downstream bottleneck. The controller therefore has a better sense of traffic flow evolution along the stretch and hence possesses better stability and robustness.

Q-Learning Applications in Freeway Control.
Application of Q-learning to freeway control has been widely studied. However, to the best of our knowledge, no effort has been made to apply Q-learning to ramp metering for distant downstream bottlenecks. Nevertheless, this section summarizes previous studies on Q-learning applications to ramp metering (RM) control for nearby bottlenecks and to variable speed limit (VSL) control. These studies are summarized in Table 1. Although this summary may not be exhaustive, it should cover most previous studies on freeway control by Q-learning approaches. Among these studies, [18-22, 27, 28, 32] were concerned with ramp metering, and [23,30,31,33] studied variable speed limits. Ramp metering and variable speed limits were jointly applied by [29]. Ramp metering and variable message signs (VMS) for dynamic routing were used simultaneously by [17, 24-26]. Most of these studies aimed to achieve one of the following three objectives: minimization of the total time spent by vehicles [17,19,27,28,31,33], maximization of early discharge of flow [24][25][26], and minimization of deviations of the traffic density of the control target section from the critical density [20,23,29,30]. As discussed in Section 1, these three objectives are equivalent.
It is important to note that state-space approximation is not primarily a tool for reducing the computational cost of reinforcement learning. For a multidimensional continuous state-space problem, the lookup table after state-space approximation can still be very large. Admittedly, if the state-space approximation is made very coarse, the table size, and hence the computational cost, can be decreased, but at the expense of undermining the effectiveness of the learned policy. Such a difficulty is inherent in lookup table methods because they aim at directly updating the value of each state-action pair and hence cannot avoid the curse of dimensionality of the state space [35]. The above difficulty can be circumvented by introducing value function approximation. A value function approximation-based reinforcement learning method uses a parameterized function in place of the lookup table to serve as the approximate value function [34]. Consequently, the reinforcement learning process entails learning the unknown parameters of the approximate value function instead of learning the values of state-action pairs. Compared with the number of state-action pairs of a lookup table for a (discretized) multidimensional continuous state-space problem, the number of unknown parameters of an approximate value function is usually far smaller, which makes the learning computationally affordable. Only three studies in Table 1, i.e., [31][32][33], applied value function approximation-based reinforcement learning methods. The approximate value functions used by these three studies were all artificial neural networks.
An outstanding feature of reinforcement learning that distinguishes it from supervised and unsupervised learning is that the data from which the intelligent agent learns an optimal policy are generated within the learning process itself. Specifically, the intelligent agent learns through a large number of interactions with the environment, which are enabled by simulation. Hence, simulation models play an important role in reinforcement learning. Among the studies summarized in this section, [19,22,24,30,33] employed microscopic traffic simulation models such as VISSIM, Paramics, and SUMO; [17,18,20,21,23,31,32] used macroscopic dynamic traffic flow models such as CTM [16] and METANET [14] as the simulation tools.

Multidimensional Continuous State Space.
Consider the freeway section depicted in Figure 1. A lane-drop bottleneck exists far downstream of the metered ramp. The ramp meter is supposed to regulate the traffic flow into the bottleneck by metering the ramp inflow so that the bottleneck capacity is fully utilized but not exceeded. To this end, the objective of the ramp metering policy is to keep the per-lane traffic density of the control target location close to a predetermined desired value, $(\lambda_2/\lambda_1)\rho_{\mathrm{cr}}$, where $\lambda_1$ and $\lambda_2$ denote the number of lanes before and after the lane-drop, respectively, and $\rho_{\mathrm{cr}}$ is the per-lane critical density. As discussed before, due to the long distance between the metered ramp and the downstream bottleneck, a conventional ramp metering strategy that only senses and utilizes the traffic condition near the bottleneck can perform poorly because it lacks anticipation capability. Therefore, one main requirement in designing our reinforcement learning approach is that it takes into account traffic densities measured along the long stretch between the metered ramp and the downstream bottleneck so that an anticipation capability can be built by learning. Since the computational cost of Q-learning grows exponentially with the dimension of the state space, it would not be computationally cost-effective to take into account measurements at too many places. As a result, three representative places are selected. They are located at the two ends and the middle of the stretch, respectively. Such a treatment, on the one hand, enables the intelligent ramp meter agent to learn to anticipate traffic flow evolution on the stretch and, on the other hand, limits the computational cost associated with learning. Note that the downstream end of the stretch is also the control target location, whose traffic density will be regulated to stay close to the desired value by ramp metering.
Therefore, the first three state variables of the proposed Q-learning problem are the traffic densities of the three representative places, denoted by $\rho_1$, $\rho_2$, and $\rho_3$, respectively. Note that when the distance between the metered ramp and the downstream bottleneck is sufficiently long and the traffic demand pattern has frequent and large fluctuations, the resulting temporal-spatial traffic flow pattern may be too complicated for the three mainline sampling locations to effectively represent the environment state for the purpose of learning. Under such a circumstance, more sampling locations may be needed. Which combinations of stretch length and traffic demand pattern yield temporal-spatial traffic flow patterns complicated enough to make the three representative mainline sampling locations result in suboptimal solutions and, accordingly, how many sampling locations should be taken under these circumstances are considered beyond the scope of this paper. The fourth and last state variable is the estimated traffic demand on the ramp, denoted by $D_{\mathrm{ramp}}$.
This state variable is needed because, to learn how much flow from the ramp should be released into the mainline, the intelligent ramp meter agent needs to know not only the traffic conditions of the representative mainline places but also the current (estimated) traffic demand on the ramp, so as to avoid picking a metering rate that is too high. The estimated traffic demand on the ramp over the current time step is computed by (1), where $D_{\mathrm{ramp}}(t)$ denotes the estimated traffic demand on the ramp (in vehicles per hour) for the current time step; $l_{\mathrm{ramp\_queue}}(t)$ represents the queue length on the ramp at the current time step; $\Delta t$ is the time step length (in seconds); and $q_{\mathrm{ramp\_arrival}}(t-1)$ represents the arrival flow rate at the ramp over the previous time step.
The arrival flow rate at the ramp over the previous time step is used, rather than that over the current time step, for the following practical reason. The ramp metering rate for the current time step needs to be computed at the end of the previous time step (or, equivalently, at the beginning of the current time step) so that it can be implemented over the current time step; however, by that time, the actual arrival flow rate over the current time step is unknown because it has not yet happened. Therefore, the arrival flow rate at the ramp over the previous time step is used as a proxy for the arrival flow rate at the ramp over the current time step. Such a treatment, which brings anticipation of the ramp condition into learning and thus may enhance the learning efficiency, appears to have been first used by Davarynejad et al. [18]. Note that the queue length on the ramp for the current time step does not need a proxy because it can be readily calculated at the end of the previous time step.
To summarize, the state vector contains four continuous variables, i.e., $s \doteq [\rho_1, \rho_2, \rho_3, D_{\mathrm{ramp}}]$, resulting in a four-dimensional continuous state space.
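A minimal sketch of the demand estimate, assuming that (1) adds the flow needed to discharge the current ramp queue within one time step to the arrival flow rate of the previous step, and that the queue length is counted in vehicles. The function and argument names are hypothetical.

```python
def estimated_ramp_demand(queue_veh: float, dt_s: float,
                          prev_arrival_vph: float) -> float:
    """Estimated ramp demand in veh/h for the current time step.

    Assumed form: queue discharged within one step plus the previous
    step's arrival flow rate, used as a proxy for the current arrivals.
    """
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    # Convert "queue_veh vehicles per dt_s seconds" into vehicles per hour.
    queue_discharge_vph = queue_veh * 3600.0 / dt_s
    return queue_discharge_vph + prev_arrival_vph


# Hypothetical numbers: 5 queued vehicles, 30 s step, 600 veh/h arrivals.
print(estimated_ramp_demand(5, 30, 600))
```

Because both inputs are available at the end of the previous time step, the estimate can be evaluated before the metering rate for the current step is chosen.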

State-Dependent Action Space.
The actions in the proposed approach are discrete ramp metering rates, as in [29], ranging from the lowest allowable metering rate, $a_{\min}$, to the highest allowable metering rate, $a_{\max}$. The values of $a_{\min}$ and $a_{\max}$ and the number of discrete metering rates are up to the user's specification. In Section 4.1, an example of such a specification is given which is consistent with the requirements of the so-called "full traffic cycle" signal policy for ramp metering [36] so that the results can be implemented by a traffic light. At any time step, the set of admissible actions does not necessarily consist of all the specified discrete metering rates; it is bounded from above by the estimated traffic demand on the ramp introduced in Section 3.1. Such a treatment prevents the agent from picking a metering rate that is higher than the ramp traffic demand and hence may enhance the learning efficiency. Thus, the action space at any time step is state-dependent. To emphasize this point, the action space in this paper is written as $A(s)$.
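The state-dependent action set can be sketched as a simple filter. The fallback to the lowest rate when the demand is below every specified rate is an assumption made here so that the set is never empty; the text does not say how that case is handled. The function name is hypothetical.

```python
def admissible_actions(rates_vph, demand_vph):
    """Return the metering rates that do not exceed the estimated ramp demand."""
    allowed = [a for a in rates_vph if a <= demand_vph]
    # Assumption: keep the lowest rate available when demand is below all rates.
    return allowed if allowed else [min(rates_vph)]


rates = list(range(200, 1300, 100))    # 200, 300, ..., 1200 veh/h
print(admissible_actions(rates, 650))  # [200, 300, 400, 500, 600]
```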

Reward.
The rewards earned by the intelligent ramp meter agent during learning should reflect the objective of the ramp metering policy to be learned. As introduced in Section 3.1, this objective is to keep the traffic density of the control target location, $\rho_3$, close to the desired value, $(\lambda_2/\lambda_1)\rho_{\mathrm{cr}}$. Therefore, the reward function is defined by (2), in which $r$ is the reward received by the agent for the resulting density $\rho_3$; $k$ is a user-defined negative constant, serving as a scaling factor; and the other notations have been defined earlier.
The implication of this reward is straightforward: it penalizes the traffic density of the control target location for deviating from the desired value. Similar reward designs have been applied by [20,23,29,30]. In our approach, the reward is a function of the state resulting from taking an action; in general, however, the reward can be a function of the states both before and after taking an action, as well as of the action itself [34].
Note that although the reward defined by (2) is based on the state of the current time step, reinforcement learning aims to maximize the total of these rewards over the entire control horizon. There also exist traffic flow optimization methods that optimize performance measures based solely on the current traffic state but repeat the optimization at every time step, e.g., [37,38]. These two approaches are different.
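A minimal sketch of a reward with the properties described above, assuming the deviation is measured by its absolute value (a squared deviation would fit the description equally well). The function name and the default value of `k` are hypothetical.

```python
def density_reward(rho3, lanes_before, lanes_after, rho_cr, k=-1.0):
    """Penalty proportional to the deviation of rho3 from the desired density."""
    if k >= 0:
        raise ValueError("k must be negative so that deviations are penalized")
    desired = lanes_after / lanes_before * rho_cr
    # Assumption: absolute deviation; the reward is at most zero.
    return k * abs(rho3 - desired)


# With 3 lanes before and 2 lanes after the lane-drop and rho_cr = 20,
# the desired density is about 13.33 veh/km/lane.
print(density_reward(20.0, 3, 2, 20.0))
```

The reward peaks at zero when $\rho_3$ equals the desired value and decreases linearly on either side, so both underutilization and overloading of the bottleneck are penalized.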

Value Function Approximation by an Artificial Neural Network.
If a lookup table method were to be used, the four-dimensional continuous state space would need to be approximated (discretized) first. For example, if the simple aggregation method is used for approximating the continuous state space, with the range of each traffic density aggregated into 40 intervals and the range of the estimated traffic demand on the ramp aggregated into 20 intervals, then there will be as many as $40^3 \times 20$, i.e., 1.28 million discrete states. If the action space then consists of 20 metering rates, the dimension of the resulting lookup table will be 1.28 million × 20. This means that there will be a total of 25.6 million action values (i.e., Q-values) to learn, which is computationally very demanding. This motivates the introduction of value function approximation.
We use an artificial neural network (ANN) to serve as the approximate value function. The role of this approximate value function in the Q-learning process is as follows: at each time step, it takes as inputs all the state variables, i.e., $\rho_1$, $\rho_2$, $\rho_3$, and $D_{\mathrm{ramp}}$, based on which it computes the values of all the available actions as outputs. That is, the approximate value function maps the state vector to another vector, each element of which is the value of the pair of that state and a candidate action. In general, a value function approximated by an ANN is a nonlinear mapping:

$$\mathrm{ANN}: \mathbb{R}^{|S|} \rightarrow \mathbb{R}^{|A|}. \quad (3)$$

In (3), ANN represents the value function approximated by an ANN, and $|S|$ and $|A|$ denote the dimensions of the state space and action space, respectively.

State Encoding.
In many cases, the state variables are not directly fed into ANNs; they are first transformed into some other variables called features [34,39], which are then taken by the ANNs. Such a transformation is known as state encoding or feature extraction [34,39]. As pointed out by Bertsekas [39], state encoding can be instrumental in the success of value function approximation, and with good state encoding, an ANN need not be very complicated.
The state encoding method used by this study is a simple tile coding method [34], which is described as follows. For each of the four continuous state variables, its value range is divided into equal intervals that do not overlap with each other; as a result, at any time step, the value of a state variable falls into exactly one of the intervals that collectively cover the value range of this state variable; the interval into which the value falls is given value 1, while all the others are given value 0. Such a state encoding treatment can give the ANN stronger stimuli than a treatment that normalizes state variables to continuous values between 0 and 1. To emphasize the fact that the feature vector is a function of the state vector, the feature vector is written as $x(s)$ in this paper.

Structure of the Value Function Approximated by the ANN.
The feature vector, $x(s)$, is then taken by the ANN. The ANN works in the following way. First, through a linear mapping specified by a weight matrix, $W$, it generates the so-called raw values [40]. Subsequently, each of these raw values is transformed by a nonlinear function, e.g., a sigmoid function, to obtain the so-called threshold values [40]. Such a nonlinear transformation is also known as activation [41]. Then, the threshold values are transformed again through a linear mapping specified by another weight matrix, $V$. Finally, a vector of coefficients, $c$, known as the bias coefficients [40], is added to the newly transformed values, yielding the outputs of the ANN, i.e., the vector of action values, $q$. Note that the dimension of $c$ is equal to the number of actions. The ANN is therefore characterized by three sets of parameters, i.e., $W$, $V$, and $c$. In other words, the value function approximated by the ANN is parameterized by $W$, $V$, and $c$. The mapping from the input state vector to the output action-value vector can thus be written in a compact form as

$$q = \mathrm{ANN}(x(s); W, V, c). \quad (4)$$

The structure of the ANN described above is presented in Figure 2. The three sets of parameters, $W$, $V$, and $c$, are unknown and need to be learned through the Q-learning process. The algorithm used for achieving this is presented in Section 3.5.
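The forward pass described above can be sketched in a few lines of NumPy. The function names, the random initialization scale, and the layer sizes (140 features, 420 hidden nodes, 20 actions) are illustrative choices, and the sigmoid is used because the text names it as one possible activation.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic activation."""
    return 1.0 / (1.0 + np.exp(-z))


def ann_forward(x, W, V, c):
    """Map a feature vector x(s) to the vector of action values q."""
    raw = x @ W                # linear mapping by W gives the raw values
    threshold = sigmoid(raw)   # activation gives the threshold values
    return threshold @ V + c   # linear mapping by V, then add the biases c


rng = np.random.default_rng(0)
n_features, n_hidden, n_actions = 140, 420, 20
W = rng.normal(scale=0.01, size=(n_features, n_hidden))
V = rng.normal(scale=0.01, size=(n_hidden, n_actions))
c = np.zeros(n_actions)

x = np.zeros(n_features)
x[[5, 48, 98, 139]] = 1.0      # one active feature per state variable
q = ann_forward(x, W, V, c)
print(q.shape)                 # (20,)
```

Note that, following the description, the bias is applied only at the output layer, so the network has exactly the three parameter sets $W$, $V$, and $c$.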

Benefit in Computational Cost.
It is worth demonstrating the benefit in computational cost brought by introducing the ANN approximate value function. Recall that we estimated the computational cost of the lookup table method at the beginning of Section 3.4. To enable a "fair" comparison with the lookup table method, for the ANN approximate value function, we also assume that the value range of each traffic density variable is divided into 40 intervals and the value range of the estimated traffic demand on the ramp is divided into 20 intervals. This implies that there will be a total of 40 × 3 + 20, i.e., 140 state features. We further assume that the number of hidden nodes is 3 times the number of features, which has been found to be sufficient to yield good learning outcomes in this study. This implies that the dimension of the weight matrix $W$ will be 140 × 420. We still assume that there are 20 available metering rates, as in the lookup table case. This implies that the dimension of the weight matrix $V$ will be 420 × 20, and the dimension of the bias coefficient vector $c$ will be 20. As a result, there will be a total of 67,220 unknown parameters to learn. Compared with the 25.6 million action values (i.e., Q-values) to learn for the lookup table method, the benefit in computational cost brought by the value function approximation is tremendous.
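The counts quoted in this comparison can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those stated in the text.

```python
n_density_bins, n_demand_bins, n_actions = 40, 20, 20

# Lookup table: one Q-value per (discrete state, action) pair.
lookup_states = n_density_bins ** 3 * n_demand_bins
lookup_q_values = lookup_states * n_actions

# ANN: weight matrices W and V plus the output bias vector c.
n_features = 3 * n_density_bins + n_demand_bins
n_hidden = 3 * n_features
ann_parameters = n_features * n_hidden + n_hidden * n_actions + n_actions

print(lookup_states, lookup_q_values)      # 1280000 25600000
print(n_features, n_hidden, ann_parameters)  # 140 420 67220
print(round(lookup_q_values / ann_parameters))  # 381
```

Under these assumptions, the lookup table holds roughly 381 times as many unknowns as the ANN.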
3.5. The Learning Algorithm.
As shown above, thanks to the approximate value function, the computational cost of learning can be profoundly reduced. The price is that the learning algorithm is no longer as straightforward as that of lookup table methods. For a lookup table method, for any encountered state-action pair, the new Q-value computed by the so-called temporal difference (TD) rule is directly used to replace the original Q-value in the lookup table. In general, the TD rule of Q-learning is defined as [34]

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{b \in A(s')} Q(s', b) - Q(s, a) \right]. \quad (5)$$

In (5), $s$ and $s'$ denote the states before and after taking the action, respectively; $a$ and $b$ denote actions; $A$ is the state-dependent action space; $r$ represents the reward received by the agent for moving from state $s$ to state $s'$ by taking action $a$; $\alpha$ is the learning rate; and $\gamma$ is the discount factor. In our approach, the reward $r$ depends only on the state after taking the action, as described in Section 3.3.
For a value function approximation-based method, however, replacing Q-values in a lookup table is no longer applicable because there is no lookup table at all; instead, at each time step, the original and new Q-values are jointly used to update the parameters of the approximate value function. In other words, unlike a lookup table method, for which a final lookup table filled with converged Q-values is the ultimate outcome of the learning process, a value function approximation-based method uses Q-values as training data to calibrate the parameters of the approximate value function, and the Q-values are not part of the ultimate outcome of the learning process. This is a distinct difference between the two kinds of methods. It is worth noting that the calibration of the parameters of the approximate value function is itself a learning problem. Specifically, it is an incremental supervised learning problem. It is incremental because the information encapsulated in the datum generated at each time step (i.e., the new Q-value) needs to be absorbed by the parameters as soon as it becomes available. It is supervised because the target output (i.e., the new Q-value) for the approximate value function (i.e., the ANN in this study) is specified at each time step. The ANN calibration method employed in this study is the so-called incremental backpropagation algorithm [40]. The above process is formally presented by Algorithm 1, the pseudocode of the algorithm of Q-learning with ANN value function approximation used for this study. There are two minor abuses of notation in Algorithm 1 for convenience of presentation. First, by $\arg\max_{a \in A(s)} \mathrm{ANN}(x(s); W, V, c)$, we mean the metering rate with the highest action value among all admissible metering rates under the current state $s$. Second, similarly, by $\max_{a \in A(s)} \mathrm{ANN}(x(s); W, V, c)$, we mean the highest admissible action value under the current state $s$.
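One learning step of this kind can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the old Q-value is read as the value of the action actually taken (the standard Q-learning reading), the backpropagation step minimizes half the squared difference between the new and old Q-values for that single output, and `step_size`, the function names, and the tiny network in the demo are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def q_values(x, W, V, c):
    """Return the action-value vector and the hidden (threshold) values."""
    h = sigmoid(x @ W)
    return h @ V + c, h


def td_backprop_step(x, a_idx, r, x_next, next_admissible, W, V, c,
                     alpha=0.05, gamma=0.95, step_size=0.01):
    """One TD update followed by one incremental backpropagation step.

    W, V, and c are float arrays and are modified in place.
    """
    q, h = q_values(x, W, V, c)
    q_next, _ = q_values(x_next, W, V, c)
    q_old = q[a_idx]
    # Temporal difference rule: the new Q-value is the training target.
    q_new = q_old + alpha * (r + gamma * q_next[next_admissible].max() - q_old)
    err = q_new - q_old
    # Gradient of 0.5 * err**2 with respect to the parameters feeding output
    # a_idx; the hidden-layer term is computed before V is changed.
    hidden_grad = err * V[:, a_idx] * h * (1.0 - h)
    V[:, a_idx] += step_size * err * h
    c[a_idx] += step_size * err
    W += step_size * np.outer(x, hidden_grad)
    return q_old, q_new


# Tiny demo with hypothetical sizes: 4 features, 6 hidden nodes, 3 actions.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 6))
V = rng.normal(scale=0.1, size=(6, 3))
c = np.zeros(3)
x = np.array([1.0, 0.0, 1.0, 0.0])
x_next = np.array([0.0, 1.0, 0.0, 1.0])
print(td_backprop_step(x, 1, -1.0, x_next, [0, 1], W, V, c))
```

Only the column of $V$ and the bias entry belonging to the chosen action are updated, whereas $W$ is shared by all actions and is therefore updated through the hidden layer.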

Experiment Settings.
This section evaluates the effects of the proposed reinforcement learning approach. The layout of the experiment freeway section is illustrated in Figure 3. As shown in Figure 3, a lane-drop is located as far as 3500 meters downstream of the metered ramp. Before the lane-drop, there are 3 lanes in the mainline, and after it, there are 2 lanes in the mainline. The ramp has only one lane.
The classical first-order discrete-time macroscopic model of traffic flow dynamics, the cell transmission model (CTM) [16], is employed as the simulation model. The free-flow speed is set as 120 km/h, the critical density is set as 20 veh/km/lane, and the jam density is set as 100 veh/km/lane. The flow-density fundamental diagram employed is triangular.
Thus, the capacity of one lane is 120 × 20 = 2400 veh/h. Since the number of lanes before and after the lane-drop is 3 and 2, respectively, and the critical density is 20 veh/km/lane, the desired traffic density for the control target cell is (2/3) × 20 = 13.33 veh/km/lane.
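The arithmetic above, and the reason the desired density uses the full bottleneck capacity, can be checked with a short sketch. The congested wave speed is not stated in the text; it is derived here from the triangular shape, and the check assumes that the control target cell has three lanes. The names are illustrative.

```python
v_free = 120.0    # free-flow speed, km/h
rho_cr = 20.0     # critical density, veh/km/lane
rho_jam = 100.0   # jam density, veh/km/lane
lanes_before, lanes_after = 3, 2

capacity_per_lane = v_free * rho_cr                    # 2400 veh/h/lane
desired_density = lanes_after / lanes_before * rho_cr  # about 13.33 veh/km/lane
# Derived, not stated in the text: backward wave speed of the triangular diagram.
wave_speed = capacity_per_lane / (rho_jam - rho_cr)    # 30 km/h


def triangular_flow(rho):
    """Per-lane flow (veh/h) of the triangular fundamental diagram."""
    return min(v_free * rho, wave_speed * (rho_jam - rho))


upstream_flow = lanes_before * triangular_flow(desired_density)
bottleneck_capacity = lanes_after * capacity_per_lane
print(round(desired_density, 2), round(upstream_flow), bottleneck_capacity)
```

At the desired density, the three upstream lanes together carry about 4800 veh/h, which equals the capacity of the two downstream lanes.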
In general, it may not be possible to quantify the threshold distance between the metered ramp and the downstream bottleneck beyond which a conventional linear feedback-type ramp metering controller fails, as this value may vary from case to case, depending on factors including the free-flow speed and the design of the linear feedback controller. For the specific experiment environment described above, we found that a proportional-integral (PI) controller, which is a conventional linear feedback-type controller and works well for nearby bottlenecks, is no longer stable if the distance between the metered ramp and the downstream lane-drop location exceeds 1000 meters.
Traffic demands of the mainline and ramp are given in Figure 4. This demand profile is similar to those used in [18, 23, 29–31]. It is assumed in this study that the traffic flow is composed of passenger cars only; multiclass traffic flow is not considered. Note that, for the problem to be meaningful, the mainline demand should not exceed the mainline capacity after the lane drop; otherwise, ramp metering cannot help in any way.

Figure 3: Layout of the freeway section used for assessment (distance labels in the figure: 3500 m, 2000 m, 500 m).

Data: mainline and ramp traffic demands
Result: calibrated parameters of the artificial neural network
Initialization: set W, V, and c to small random numbers [40].
while episode reward not yet converged do
    Set the freeway network of interest as empty
    Initialize the state s
    while not the end of this episode do
        (1) Determine ramp metering rate a according to the ϵ-greedy strategy: a ⟵ argmax_{a∈A(s)} ANN(x(s); W, V, c), or a ⟵ a random element in A(s)
        (2) Simulate to obtain the new state s′, with a implemented
        (3) Compute reward r based on s′
        (4) Compute Q_old by the ANN: Q_old ⟵ max_{a∈A(s)} ANN(x(s); W, V, c)
        (5) Compute Q_next by the ANN: Q_next ⟵ max_{a∈A(s′)} ANN(x′(s′); W, V, c)
        (6) Compute Q_new by updating Q_old using the temporal difference rule
        (7) Update the parameters of the ANN by the incremental backpropagation algorithm using Q_old as the input to the ANN and Q_new as the desired output [40]: W, V, c ⟵ Backpropagation(Q_old, Q_new, W, V, c)
        (8) Update the state: s ⟵ s′
    end
end
ALGORITHM 1: Pseudocode of the algorithm of Q-learning with value function approximated by an artificial neural network.

The method described in Section 3.4.1 is applied for state encoding. The value range of each of the three traffic density variables, [0, ρ_jam], is equally divided into 40 intervals. The value range of the estimated traffic demand on the ramp is divided into 20 intervals. Unlike a traffic density variable, whose value range has an explicit fixed upper bound (i.e., ρ_jam), the estimated traffic demand on the ramp has no obvious upper bound. We could specify a very large upper bound to ensure that any estimated ramp demand falls within the value range. However, the estimated demand would then lie far below that bound most of the time, leaving most intervals rarely used, which is inefficient. To handle this issue, it is worth recalling the purpose of state encoding: to improve learning efficiency by translating each state variable into other variable(s) that are better suited to the specific learning task. Here, the learning task is to determine the ramp metering rate, which is bounded by the highest allowable value, a_max, regardless of the traffic demand on the ramp. Therefore, a reasonable way to discretize the value range of the estimated traffic demand on the ramp is as follows: the range [0, a_max] is equally divided into 19 intervals, and the range (a_max, ∞) forms the last interval. This state encoding converts the four-dimensional state vector of continuous variables into a 140-dimensional (40 × 3 + 20 = 140) feature vector of binary variables.
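The encoding described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code. It assumes that the method of Section 3.4.1 amounts to marking, for each state variable, the single interval that contains its value. The jam density value `RHO_JAM` and the function names are hypothetical; only the interval counts and a_max = 1200 veh/h come from the text.

```python
import numpy as np

RHO_JAM = 180.0  # veh/km/lane; hypothetical value, not given in this section
A_MAX = 1200.0   # veh/h; highest allowable metering rate (from the text)


def encode_density(rho, rho_jam=RHO_JAM, n_bins=40):
    """Binary vector marking which of n_bins equal intervals of [0, rho_jam] holds rho."""
    idx = int(rho / rho_jam * n_bins)
    idx = min(max(idx, 0), n_bins - 1)  # clamp so rho == rho_jam lands in the last bin
    vec = np.zeros(n_bins, dtype=np.int8)
    vec[idx] = 1
    return vec


def encode_ramp_demand(demand, a_max=A_MAX, n_bins=20):
    """19 equal intervals on [0, a_max] plus one overflow interval for (a_max, inf)."""
    vec = np.zeros(n_bins, dtype=np.int8)
    if demand > a_max:
        vec[n_bins - 1] = 1  # overflow interval
    else:
        idx = int(demand / a_max * (n_bins - 1))
        idx = min(max(idx, 0), n_bins - 2)  # demand == a_max stays in interval 19 of 19
        vec[idx] = 1
    return vec


def encode_state(densities, ramp_demand):
    """Map (three densities, estimated ramp demand) to a 140-dimensional binary vector."""
    parts = [encode_density(rho) for rho in densities]
    parts.append(encode_ramp_demand(ramp_demand))
    return np.concatenate(parts)


x = encode_state([30.0, 45.0, 90.0], 800.0)
print(x.shape, int(x.sum()))
```

Exactly one entry is active per state variable, so every feature vector contains four ones regardless of how large the ramp demand becomes.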
In this experiment, the lowest allowable metering rate, a_min, is set to 200 veh/h, and the highest allowable metering rate, a_max, is set to 1200 veh/h. The range [a_min, a_max] is equally divided into 10 intervals, resulting in a total of 11 discrete metering rates: {200, 300, ..., 1100, 1200} veh/h. This specification of the action space follows the so-called "full traffic cycle" signal policy for ramp metering [36], to ensure that the optimal metering rates learned through the proposed method can be implemented by a traffic light. Note that {200, 300, ..., 1100, 1200} veh/h is the largest admissible action space. As introduced in Section 3.2, in the proposed approach, the admissible action space at any time step can be smaller than this largest set because it is constrained by the estimated traffic demand on the ramp.

The hyperparameters used in the experiments are specified as follows. The number of hidden neurons is set to 3 times the number of features, i.e., 3 × 140 = 420. This number was determined through a considerable amount of trial-and-error experimentation. If it is set too large, the training time becomes excessively long; if it is set too small, the approximate value function cannot effectively discriminate between state inputs. The learning rate, α, of TD updating rule (5) is set to 0.05 for the first 0.1 million episodes and to 0.01 afterwards. The discount factor, γ, of TD updating rule (5) is set to 0.95. The exploration rate, ε, of the ε-greedy policy in Algorithm 1 decays as the number of iterated episodes increases [34].
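The action space and hyperparameter schedule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. The rule in `admissible_rates` (drop rates above the estimated ramp demand, always keep the lowest rate) is only a guess at the constraint of Section 3.2. The update in `td_target` assumes that rule (5) has the standard Q-learning form. The exponential decay in `exploration_rate` and its constants are hypothetical, since the text only states that ε decays.

```python
import numpy as np

A_MIN, A_MAX, STEP = 200, 1200, 100               # veh/h, from the text
ALL_RATES = np.arange(A_MIN, A_MAX + STEP, STEP)  # 11 rates: 200, 300, ..., 1200
GAMMA = 0.95                                      # discount factor, from the text


def admissible_rates(est_ramp_demand):
    """ASSUMPTION: rates above the estimated ramp demand are not admissible."""
    allowed = ALL_RATES[ALL_RATES <= est_ramp_demand]
    return allowed if allowed.size > 0 else ALL_RATES[:1]  # always keep a_min


def learning_rate(episode):
    """0.05 for the first 0.1 million episodes, 0.01 afterwards (from the text)."""
    return 0.05 if episode < 100_000 else 0.01


def exploration_rate(episode, eps0=1.0, eps_min=0.01, decay=1e-5):
    """HYPOTHETICAL decay schedule for the epsilon-greedy exploration rate."""
    return max(eps_min, eps0 * float(np.exp(-decay * episode)))


def td_target(q_old, reward, q_next, episode):
    """ASSUMED standard temporal-difference update producing Q_new from Q_old."""
    alpha = learning_rate(episode)
    return q_old + alpha * (reward + GAMMA * q_next - q_old)


print(admissible_rates(450))
print(td_target(q_old=1.0, reward=0.0, q_next=1.0, episode=0))
```

In Algorithm 1, the value returned by `td_target` would play the role of Q_new, the desired output used for one incremental backpropagation step.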

Results.
The experiment was coded and executed in MATLAB R2019a. Learning converged after about 0.7 million episodes. The left column of Figure 5 presents the resulting traffic density time series of the control target cell for three cases: no control, a PI controller (a conventional linear feedback-type controller), and the proposed reinforcement learning approach. The right column of Figure 5 shows the traffic density contours of the entire freeway section for the same three cases. The black dashed line in each contour indicates the location of the lane drop; the origin of the y-axis corresponds to the beginning of the freeway section depicted in Figure 3. Figure 5 shows that, without any control measure, as traffic demands increase, the traffic density of the control target cell soon grows beyond the desired value, so congestion starts at the bottleneck and spreads upstream. Under PI ramp metering control, the traffic density of the control target cell stays around the desired value on the whole, but with severe oscillations that propagate upstream and affect the whole section. Under the ramp metering policy learned through the proposed reinforcement learning approach, the traffic density of the control target cell stays close to the desired value with almost no fluctuations, and accordingly, the traffic density contour of the entire section is much smoother than in the case of the PI controller. Figure 6 compares the ramp metering rates computed by the PI controller (Figure 6(a)) and by the policy learned through the proposed reinforcement learning approach (Figure 6(b)). The patterns of the two sets of metering rates are quite different.
Moreover, at a fine time scale, the metering rates given by the learned policy are very shredded, i.e., they change rapidly from one time step to the next, which counteracts the potential time-delay effects due to the long distance. This is possible because the learned policy is a highly nonlinear feedback policy and takes in traffic conditions at multiple locations along the stretch. It is these shredded metering rates that stabilize the traffic density of the control target cell around the desired value with almost no fluctuations, as shown in Figure 5. By contrast, the metering rates given by the PI controller lack such subtle variations and instead oscillate constantly with large amplitudes, which results in quite unstable traffic densities of the control target cell, as shown in Figure 5.
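For readers unfamiliar with the baseline, a linear feedback-type PI metering law of the kind referred to above can be sketched in Python as follows. The exact PI law and gains used in the experiment are not given in this section, so the sketch assumes the common PI-ALINEA form, and the gains `k_p` and `k_i` are hypothetical placeholders. Only the rate bounds of 200 and 1200 veh/h come from the text.

```python
def pi_metering_rate(prev_rate, rho, prev_rho, rho_target,
                     k_p=40.0, k_i=4.0, a_min=200.0, a_max=1200.0):
    """One step of an assumed PI-ALINEA-type law based on bottleneck density.

    prev_rate  : metering rate of the previous step (veh/h)
    rho        : current density of the control target cell
    prev_rho   : density of the control target cell at the previous step
    rho_target : desired density
    k_p, k_i   : HYPOTHETICAL proportional and integral gains
    """
    rate = prev_rate - k_p * (rho - prev_rho) + k_i * (rho_target - rho)
    return min(max(rate, a_min), a_max)  # keep the rate within the allowable range


print(pi_metering_rate(800.0, 35.0, 30.0, 30.0))
```

Because the law reacts only to the density sampled at the bottleneck, its correction reaches the bottleneck after the travel time from the ramp, which is the source of the oscillations discussed above.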

Robustness.
It is of interest to examine to what extent the learned ramp metering policy can tolerate uncertainties in traffic demands. To this end, the traffic demands are corrupted by white noise. Figure 7 presents the results for the cases in which the standard deviation of the white noise added to the traffic demands is 50, 100, 150, 200, and 250 veh/h, respectively. It can be seen that the metering policy learned through the proposed approach performs satisfactorily up to a noise level of 200 veh/h; its performance starts to degrade as the demand noise grows beyond that level.
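The demand perturbation used in this test can be sketched in Python as follows. This is a minimal illustration, assuming zero-mean Gaussian white noise drawn independently at each time step. Clipping negative demands to zero and the flat 1500 veh/h base profile are added assumptions, not details taken from the paper.

```python
import numpy as np


def corrupt_demand(demand, noise_std, rng):
    """Add zero-mean Gaussian white noise (veh/h) to a demand time series."""
    noisy = demand + rng.normal(0.0, noise_std, size=demand.shape)
    return np.clip(noisy, 0.0, None)  # ASSUMPTION: demand cannot be negative


rng = np.random.default_rng(seed=0)
base_demand = np.full(360, 1500.0)  # hypothetical flat profile, veh/h

for std in (50, 100, 150, 200, 250):  # noise levels from the text
    noisy = corrupt_demand(base_demand, std, rng)
    print(std, round(float(noisy.std()), 1))
```

Each noisy profile would then be fed to the simulation with the learned policy held fixed, so that any loss of performance can be attributed to the demand uncertainty alone.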

Conclusions
This paper proposes a reinforcement learning approach to learn an optimal ramp metering policy for controlling a downstream bottleneck that is far away from the metered ramp. An artificial neural network replaces the lookup table of the ordinary Q-learning approach and serves as the approximate value function. The state vector is chosen to achieve a tradeoff between the capability to anticipate traffic flow evolution and the computational cost. The action space is state-dependent to enhance the learning efficiency. A simple tile coding method is employed to convert the continuous state vector into a binary feature vector, which gives stronger stimuli to the artificial neural network.
The experiment results indicate that the ramp metering policy learned through the proposed approach yields clearly more stable results than a conventional linear feedback-type controller. Specifically, under the learned ramp metering policy, the traffic density of the control target cell stays close to the desired value with almost no fluctuations. As a result, traffic flow evolution over the entire freeway section is also smooth. In comparison, under a conventional linear feedback-type ramp metering strategy, the traffic density of the control target cell oscillates significantly around the desired value. Consequently, traffic flow evolution over the entire freeway section also suffers from significant instability. The metering policy learned through the proposed approach has also demonstrated some level of robustness, in that it yields satisfactory results under uncertain traffic demands.
For the next step, we plan to extend the proposed method so that it can manage the queue length on the ramp at the expense of some mainline efficiency. Another interesting direction is to replace the artificial neural network with a simpler linear approximate value function, combined with more sophisticated state encoding techniques that better capture the interactions among the state variables, so that a sophisticated approximate value function such as an ANN may be avoided. It will also be interesting to examine the impact of the number of representative mainline sampling locations, especially when the distance between the ramp and the downstream bottleneck is excessively long and the traffic demand patterns are complicated. Finally, we will also look into policy approximation as an alternative to the action-value approximation approach used in this paper.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.