To improve the convergence rate and the sample efficiency, two efficient learning methods, AC-HMLP and RAC-HMLP (AC-HMLP with
Reinforcement Learning (RL) [
DP aims at solving optimal control problems, but it is implemented backward in time, which makes it offline and computationally expensive for complex or real-time problems. To avoid the curse of dimensionality in DP, approximate dynamic programming (ADP) has received much attention; it obtains approximate solutions of the Hamilton-Jacobi-Bellman (HJB) equation by combining DP, RL, and function approximation [
The iterative nature of the ADP formulation makes it natural to design optimal discrete-time controllers. Al-Tamimi et al. [
Extensions of ADP to continuous-time systems face the challenge of proving stability and convergence while keeping the algorithm online and model-free. To approximate the value function and improve the policy for continuous-time systems, Doya [
All the aforementioned ADP variants utilize an NN as the function approximator; however, linearly parameterized approximators are usually preferred in RL because they make the theoretical properties of the resulting RL algorithms easier to understand and analyze [
Though the above works learn a model while learning the value function and the policy, they utilize only the local information of the samples. If the global information of the samples can also be exploited reasonably, the convergence performance can be improved further. Inspired by this idea, we establish two novel AC algorithms called AC-HMLP and RAC-HMLP (AC-HMLP with
The main contributions of our work on AC-HMLP and RAC-HMLP are as follows:

(1) Two novel AC algorithms based on hierarchical models are developed. Unlike previous works, AC-HMLP and RAC-HMLP learn a global model in which the reward function and the state transition function are approximated by LFA. Meanwhile, unlike the existing model learning methods [

(2) Like MLAC and Dyna-MLAC, AC-HMLP and RAC-HMLP also learn a local model by LLR. The difference is that we design a useful error threshold to decide whether to start the local planning process. At each time step, the real next state is computed according to the system dynamics, whereas the predicted next state is obtained from LLR; the error between them is defined as the state-prediction error. If this error does not exceed the error threshold, the local planning process is started. The local and the global models are used for planning in a uniform manner: both produce local and global samples to update the same value function and policy, so the number of real samples required decreases dramatically.

(3) The convergence performance and the sample efficiency, where sample efficiency is defined as the number of samples needed for convergence, are thoroughly analyzed in experiments. RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in both respects. The results demonstrate that RAC-HMLP performs best, AC-HMLP performs second best, and both of them outperform the other three methods.
This paper is organized as follows: Section
RL can solve problems modeled as MDPs. An MDP can be represented as the four-tuple
Policy
Under the policy
The optimal state-value function
Therefore, the optimal policy
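For reference, the standard discounted-return forms of these quantities are given below; this is a hedged restatement of the usual textbook definitions, and the paper's own notation may differ slightly.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s\right],
\qquad
V^{*}(s) = \max_{\pi} V^{\pi}(s),
\qquad
\pi^{*}(s) = \arg\max_{a}\, \mathbb{E}\!\left[r(s,a) + \gamma V^{*}(s')\right].
```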
An AC algorithm mainly contains two parts, the actor and the critic, which are stored separately. The actor and the critic are also called the policy and the value function, respectively. Actor-only methods approximate the policy and update its parameters along the direction of performance improvement, with the possible drawback of large variance resulting from policy estimation. Critic-only methods estimate the value function by approximating a solution to the Bellman equation; the optimal policy is then found by maximizing the value function. Unlike actor-only methods, critic-only methods do not search the policy space directly; they only estimate the critic to evaluate the performance of the actor, so the near-optimality of the resulting policy cannot be guaranteed. By combining the merits of the actor and the critic, AC algorithms were proposed, in which the value function is approximated and used to update the policy.
The value function and the policy are parameterized by
The eligibility trace is a mechanism to improve convergence by assigning credit to previously visited states. At each time step
By introducing the eligibility trace, the update for
The policy parameter
S-AC (standard AC algorithm) serves as a baseline against which to compare our methods; it is shown in Algorithm
(1) Initialize
(2)
(3)
(4)
(5) Choose
(6)
(7) Execute
(8) Update the eligibility of the value function:
(9) Compute the TD error:
(10) Update the parameter of the value function:
(11) Update the parameter of the policy:
(12)
(13)
(14)
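To make the S-AC baseline concrete, the sketch below implements one episode of a standard actor-critic with linear function approximation, a Gaussian exploration policy, and an eligibility trace on the critic. The feature map `phi`, the environment interface, and the step sizes are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def s_ac_episode(env, phi, theta, w, alpha_a=0.5, alpha_c=0.4,
                 gamma=0.9, lam=0.9, sigma=1.0):
    """One episode of a standard actor-critic (S-AC) baseline.

    phi(s): state feature vector; theta: actor (policy) weights;
    w: critic (value function) weights. All names are illustrative.
    """
    e = np.zeros_like(w)                 # eligibility trace of the critic
    s, done = env.reset(), False
    while not done:
        f = phi(s)
        a = theta @ f + sigma * np.random.randn()        # Gaussian exploration
        s_next, r, done = env.step(a)
        # TD error with a linearly approximated value function
        delta = r + gamma * (w @ phi(s_next)) * (not done) - w @ f
        e = gamma * lam * e + f                          # update eligibility
        w = w + alpha_c * delta * e                      # critic update
        theta = theta + alpha_a * delta * (a - theta @ f) * f   # actor update
        s = s_next
    return theta, w
```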
The model in RL refers to the state transition function and the reward function. Once the model is established, any model-based RL method, for example DP, can be used to find the optimal policy. Model-based methods can significantly decrease the number of required samples and improve the convergence performance. Inspired by this idea, we introduce hierarchical model learning into the AC algorithm so as to make it more sample-efficient. Establishing a relatively accurate model for continuous state and action spaces is still an open issue.
Preexisting works are mainly aimed at problems with continuous states but discrete actions. They approximate the transition function as a probability matrix that specifies the transition probability from the current feature to the next feature. Because the features are only indirectly observed, the resulting feature-based model is inaccurate. Using such an inaccurate model for planning, especially at every time step in the initial phase, slows the convergence rate significantly.
To solve these problems, we approximate a state-based model instead of an inaccurate feature-based one. Moreover, we introduce an additional global model for planning. The global model is applied only at the end of each episode so that the global information can be exploited as much as possible. Using only such a global model, however, would lose valuable local information. Thus, like Dyna-MLAC, we also approximate a local model by LLR and use it for planning at each time step. The difference is that a useful error threshold is designed for local planning in our method: if the state-prediction error between the real next state and the predicted one does not exceed the threshold, the local planning process is started at the current time step. Therefore, the convergence rate and the sample efficiency can be improved dramatically by combining local and global model learning and planning.
The global model establishes separate equations for the reward function and the state transition function of every state component by linear function approximation. Assume the agent is at state
Likewise, the reward
After
The parameter
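As an illustration of how such a linearly parameterized global model can be updated online, the sketch below performs one gradient step on a plain squared-error objective, with one weight vector per state component and one for the reward; the feature map, learning rate, and array shapes are assumptions, not the paper's exact equations.

```python
import numpy as np

def update_global_model(W_s, w_r, phi_sa, s_next, r, alpha_m=0.5):
    """One gradient step on an LFA global model (illustrative sketch).

    W_s: (n_state, n_feat) weights, one row per state component;
    w_r: (n_feat,) reward weights; phi_sa: feature vector of (s, a).
    """
    pred_s = W_s @ phi_sa                                     # predicted next state
    pred_r = w_r @ phi_sa                                     # predicted reward
    W_s = W_s + alpha_m * np.outer(s_next - pred_s, phi_sa)   # state-transition update
    w_r = w_r + alpha_m * (r - pred_r) * phi_sa               # reward update
    return W_s, w_r
```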
Though the local model, like the global model, approximates the state transition function and the reward function, LLR serves as the function approximator instead of LFA. In the local model, a memory stores the samples in the form of
The last row of
Let the current input vector be
Let the error threshold be
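A minimal sketch of the local model and the error-threshold test is shown below: the k nearest stored samples are fit by least squares to predict the next state and the reward, and the state-prediction error against the real next state decides whether local planning is triggered. The memory layout, the Euclidean metric, and the helper names are assumptions.

```python
import numpy as np

def llr_predict(memory, x, k=9):
    """Predict (next_state, reward) at input x = [s, a] by local linear regression.

    memory: list of (x_i, y_i) pairs with y_i = [s'_i, r_i]; names illustrative.
    """
    X = np.array([m[0] for m in memory])
    Y = np.array([m[1] for m in memory])
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]    # k nearest neighbours
    Xk = np.hstack([X[idx], np.ones((len(idx), 1))])       # append a bias column
    beta, *_ = np.linalg.lstsq(Xk, Y[idx], rcond=None)     # local least-squares fit
    y = np.append(x, 1.0) @ beta
    return y[:-1], y[-1]                                   # predicted next state, reward

def local_planning_allowed(pred_next, real_next, threshold=0.15):
    """Start local planning only when the state-prediction error is small enough."""
    return np.linalg.norm(pred_next - real_next) <= threshold
```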
The AC-HMLP algorithm consists of a main algorithm and two subalgorithms. The main algorithm is the learning algorithm (see Algorithm
(1) Initialize:
(2)
(3)
(4)
(5) Choose
(6) Execute the action:
(7) Observe the reward
(8) Predict the next state:
(9) Predict the reward:
(10) Update the parameters
(11) Update the parameter
(12)
(13) Insert the real sample
(14)
(15) Replace the oldest one in
(16)
(17) Select
(18) Predict the next state and the reward:
(19) Update the parameter
(20) Compute the local error:
(21)
(22) Call Local-model planning (
(23)
(24) Update the eligibility:
(25) Estimate the TD error:
(26) Update the value-function parameter:
(27) Update the policy parameter:
(28)
(29) Update the number of samples:
(30)
(31) Call Global-model planning (
(32)
(1)
(2)
(3) Choose
(4)
(5) Predict the next state and the reward:
(6) Update the eligibility:
(7) Compute the TD error:
(8) Update the value function parameter:
(9) Update the policy parameter:
(10)
(11)
(12)
(13) Update the number of samples:
(14)
(1)
(2)
(3)
(4) Choose
(5) Compute exploration term:
(6) Predict the next state:
(7) Predict the reward:
(8) Update the eligibility:
(9) Compute the TD error:
(10) Update the value function parameter:
(11) Update the policy parameter:
(12)
(13)
(14)
(15) Update the number of samples:
(16)
(17)
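Both planning subroutines reuse the actor-critic update on simulated transitions drawn from a learned model; the sketch below shows this shared structure for a generic model, which could be either the LLR local model or the LFA global model. The `model.predict` interface, the feature map, and the step sizes are assumptions.

```python
import numpy as np

def plan_with_model(model, phi, theta, w, e, start_state, n_steps,
                    alpha_a=0.5, alpha_c=0.4, gamma=0.9, lam=0.9, sigma=1.0):
    """Update the shared critic (w) and actor (theta) from simulated samples.

    model.predict(s, a) -> (s_next, r) may be the local or the global model;
    all interface names are illustrative.
    """
    s = start_state
    for _ in range(n_steps):
        f = phi(s)
        a = theta @ f + sigma * np.random.randn()   # exploratory simulated action
        s_next, r = model.predict(s, a)             # simulated transition
        delta = r + gamma * (w @ phi(s_next)) - w @ f
        e = gamma * lam * e + f
        w = w + alpha_c * delta * e
        theta = theta + alpha_a * delta * (a - theta @ f) * f
        s = s_next
    return theta, w, e
```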
Several parameters need to be determined in the three algorithms.
Notice that Algorithm
Regression approaches in machine learning are generally formulated as the minimization of a squared loss term plus a regularization term. The
The goal of learning the value function is to minimize the squared TD error, which is shown as
The update for the parameter
The update for the parameter
The update for the global model can be denoted as
After we replace the update equations of the parameters in Algorithms
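As a sketch only, the regularized critic and actor steps might look like the following, assuming an l2 penalty on the parameter vectors purely for illustration; the paper's actual regularization term and update equations may differ.

```python
def regularized_ac_step(w, theta, delta, f, e, a, alpha_c=0.4, alpha_a=0.5,
                        eta_c=0.01, eta_a=0.001):
    """One regularized critic/actor step (illustrative RAC-HMLP-style sketch).

    The squared TD-error loss is augmented with a penalty on the parameter
    norms; an l2 penalty and these step sizes are assumed for illustration.
    """
    w_new = w + alpha_c * (delta * e - eta_c * w)              # shrink critic weights
    theta_new = theta + alpha_a * (delta * (a - theta @ f) * f - eta_a * theta)
    return w_new, theta_new
```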
AC-HMLP and RAC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC on two problems with continuous state and action spaces: the pole balancing problem [
The pole balancing problem is a low-dimensional but challenging benchmark widely used in the RL literature, shown in Figure
Pole balancing problem.
A cart moves along the track with a pole hinged on its top. The goal is to find a policy that applies forces to keep the pole balanced. The system dynamics is modeled by the following equation:
The state
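The paper's own dynamics equation appears above; purely as an assumed stand-in for readers who want to reproduce a similar setup, a commonly used pole-balancing (inverted pendulum) model with a two-component state of angle and angular velocity can be integrated with the Euler method as below. All physical constants and the friction term are illustrative assumptions, not necessarily the experiment's actual values.

```python
import numpy as np

def pole_step(theta, theta_dot, force, dt=0.1, g=9.81, m=1.0, l=1.0, mu=0.01):
    """One Euler step of a generic pole-balancing model (assumed dynamics).

    theta: pole angle; theta_dot: angular velocity; force: applied control.
    """
    theta_acc = (-mu * theta_dot + m * g * l * np.sin(theta) + force) / (m * l ** 2)
    theta_dot_next = theta_dot + dt * theta_acc
    theta_next = theta + dt * theta_dot_next
    return theta_next, theta_dot_next
```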
RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in this experiment. The parameters of S-AC, MLAC, and Dyna-MLAC are set according to the values mentioned in their papers. The parameters of AC-HMLP and RAC-HMLP are set as shown in Table
Parameter settings of RAC-HMLP and AC-HMLP.
Parameter | Symbol | Value
---|---|---
Time step | | 0.1
Discount factor | | 0.9
Trace-decay rate | | 0.9
Exploration variance | | 1
Learning rate of the actor | | 0.5
Learning rate of the critic | | 0.4
Learning rate of the model | | 0.5
Error threshold | | 0.15
Capacity of the memory | | 100
Number of the nearest samples | | 9
Local planning times | | 30
Global planning times | | 300
Number of components of the state | | 2
Regularization parameter of the model | | 0.2
Regularization parameter of the critic | | 0.01
Regularization parameter of the actor | | 0.001
The planning times may have a significant effect on the convergence performance. Therefore, we have to determine the local planning times
Comparisons of different planning times and different algorithms.
Determination of different planning times
Comparisons of different algorithms in convergence performance
The convergence performances of the five methods, RAC-HMLP, AC-HMLP, S-AC, MLAC, and Dyna-MLAC, are shown in Figure
The comparisons of the five different algorithms in sample efficiency are shown in Figure
Comparisons of sample efficiency.
The optimal policy and optimal value function learned by AC-HMLP after the training ends are shown in Figures
Optimal policy and value function learned by AC-HMLP and RAC-HMLP.
Optimal policy of AC-HMLP learned after training
Optimal value function of AC-HMLP learned after training
Optimal policy of RAC-HMLP learned after training
Optimal value function of RAC-HMLP learned after training
As for the optimal policy (see Figures
In terms of the optimal value function (see Figures
After the training ends, the prediction of the next state and the reward for every state-action pair can be obtained through the learned model. The predictions for the next angle
Prediction of the next state and reward according to the global model.
Prediction of the angle at next state
Prediction of the angular velocity at next state
Prediction of the reward
The continuous maze problem is shown in Figure
Continuous maze problem.
RAC-HMLP and AC-HMLP are compared with S-AC, MLAC, and Dyna-MLAC in cumulative rewards. The results are shown in Figure
Comparisons of cumulative rewards and time steps for reaching goal.
Comparisons of cumulative rewards
Comparisons of time steps for reaching goal
The comparisons of the time steps needed to reach the goal are shown in Figure
The comparisons of sample efficiency are shown in Figure
Comparisons of different algorithms in sample efficiency.
After the training ends, the approximate optimal policy and value function are obtained, shown in Figures
Final optimal policy and value function after training.
Optimal policy learned by AC-HMLP
Optimal value function learned by AC-HMLP
This paper proposes two novel actor-critic algorithms, AC-HMLP and RAC-HMLP. Both of them use LLR and LFA to represent the local model and the global model, respectively. It has been shown that our new methods are able to learn the optimal value function and the optimal policy. In the pole balancing and continuous maze problems, RAC-HMLP and AC-HMLP are compared with three representative methods. The results show that RAC-HMLP and AC-HMLP not only converge fastest but also have the best sample efficiency.
RAC-HMLP performs slightly better than AC-HMLP in convergence rate and sample efficiency. By introducing
S-AC is the only algorithm without model learning. The poorest performance of S-AC demonstrates that combining model learning with AC can indeed improve the performance. Dyna-MLAC learns a model via LLR for local planning and policy updating. However, Dyna-MLAC utilizes the model directly without first verifying whether it is accurate; additionally, it does not utilize the global information about the samples. Therefore, it performs worse than AC-HMLP and RAC-HMLP. MLAC also approximates a local model via LLR as Dyna-MLAC does, but it only uses the model to update the policy gradient, surprisingly achieving slightly better performance than Dyna-MLAC.
Dyna-MLAC and MLAC approximate the value function, the policy, and the model through LLR. In the LLR approach, the samples collected during interaction with the environment have to be stored in the memory. Through KNN or
The planning times for the local model and the global model have to be determined according to the experimental performance; thus, we have to set the planning times separately for different domains. To address this problem, our future work will consider how to determine the planning times adaptively for different domains. Moreover, with the development of deep learning [
The authors declare that there are no competing interests regarding the publication of this article.
This research was partially supported by Innovation Center of Novel Software Technology and Industrialization, National Natural Science Foundation of China (61502323, 61502329, 61272005, 61303108, 61373094, and 61472262), Natural Science Foundation of Jiangsu (BK2012616), High School Natural Foundation of Jiangsu (13KJB520020), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and Suzhou Industrial Application of Basic Research Program Part (SYG201422).