
Adaptive Dynamic Programming (ADP) with a critic-actor architecture is an effective way to perform online learning control. To avoid subjectivity in the design of a neural network serving as the critic, kernel-based adaptive critic design (ACD) was developed recently. Two essential issues arise for a static kernel-based model: how to determine proper hyperparameters in advance and how to select the right samples to describe the value function. Both rely on assessing the values of the samples. Based on theoretical analysis, this paper presents a two-phase simultaneous learning method for a Gaussian-kernel-based critic network. It estimates the values of the samples without revisiting them infinitely often, while the hyperparameters of the kernel model are optimized simultaneously. Based on the estimated sample values, the sample set can be refined by adding alternatives or deleting redundant samples. Combining this critic design with an actor network, we present a Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP) approach. Simulations verify its feasibility, in particular the necessity of two-phase learning, the convergence characteristics, and the improvement of system performance obtained by using a varying sample set.

Reinforcement learning (RL) is an interactive machine learning method for solving sequential decision problems and is well known as an important learning method in unknown or dynamic environments. Unlike supervised and unsupervised learning, RL interacts with the environment through a trial-and-error mechanism and modifies its action policies to maximize the payoffs [

Traditional RL research focused on discrete state/action systems, in which states and actions take only a finite number of prescribed discrete values. The learning space grows exponentially as the numbers of states and allowed actions increase, which leads to the so-called curse of dimensionality (CoD) [

Research on RL in continuous environments, aimed at constructing learning systems for nonlinear optimal control, has attracted the attention of researchers in the control domain, because such a system can modify its policy based only on the value function, without knowing the model structure or parameters in advance. A family of new RL techniques known as Approximate or Adaptive Dynamic Programming (ADP) (also known as Neurodynamic Programming or Adaptive Critic Designs (ACDs)) has received more and more research interest [

ADP research typically adopts multilayer perceptron neural networks (MLPNNs) in the critic design. Vrabie and Lewis proposed an online approach to continuous-time direct adaptive optimal control which used neural networks to parametrically represent the control policy and the performance of the control system [

Besides the benefits brought by NNs, ADP methods often suffer from problems arising in the design of the NNs. On one hand, the learning control performance depends greatly on the empirical design of the critic network, especially the manual setting of the hidden layer or the basis functions. On the other hand, due to local minima in neural network training, how to improve the quality of the final policies is still an open problem [

As we can see, it is difficult to evaluate the effectiveness of a parametric model when knowledge of the model's order or of the nonlinear characteristics of the system is insufficient. Compared with parametric modeling methods, nonparametric modeling methods, especially kernel methods [

In addition to SVMs, Gaussian processes (GPs) have become an alternative generalization method. GP models are powerful nonparametric tools for approximate Bayesian inference and learning. In comparison with other popular nonlinear architectures, such as multilayer perceptrons, their behavior is conceptually simpler to understand, and model fitting can be achieved without resorting to nonconvex optimization routines [
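As a concrete illustration of the GP machinery used below, here is a minimal GP regression sketch; the squared-exponential kernel, the 1-D toy data, and all hyperparameter values are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def se_kernel(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays A and B."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, Xs, ell=1.0, sf2=1.0, sn2=1e-4):
    """Posterior mean and variance of a GP at test inputs Xs."""
    K = se_kernel(X, X, ell, sf2) + sn2 * np.eye(len(X))  # noisy Gram matrix
    Ks = se_kernel(Xs, X, ell, sf2)                       # test-train kernel
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha
    var = sf2 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, var

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)                      # toy targets
mean, var = gp_predict(X, y, np.array([1.0]))
# At a (nearly noise-free) training input, the posterior mean
# reproduces the target and the posterior variance is small.
```

In practice a Cholesky factorization replaces the direct solves for numerical stability, but the predictive equations are the same.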

An alternative approach employs GPs in model-based value iteration or policy iteration, in which a GP model is used to model the system dynamics and represent the value function [

Kernel-based methods have also been introduced to ADP. In [

We think GPDP and ACDs with sparse kernel machines are complementary. As indicated in [

The major problem here is how to realize value function learning and GP model updating simultaneously, especially when the samples of the state-action space can hardly be revisited infinitely often in order to approximate their values. To tackle this problem, a two-phase iteration is developed to obtain the optimal control policy for a system whose dynamics are unknown a priori.

In general, ADP is an actor-critic method which approximates the value functions and policies to enable generalization in MDPs with large or continuous spaces. The critic design plays the most important role, because it determines how the actor optimizes its action. Hence, we give a brief introduction to both kernel-based ACDs and GPs, in order to give a clear description of the theoretical problem.

Kernel-based ACDs mainly consist of a critic network, a kernel-based feature learning module, a reward function, an actor network/controller, and a model of the plant. The critic, constructed by a kernel machine, is used to approximate the value functions or their derivatives. The output of the critic is then used in the training process of the actor, so that policy gradients can be computed. Once the actor converges, it describes the optimal action policy mapping states to actions.

A traditional kernel-machine-based network over the samples serves as the model of the value function, as the following equation shows, and recursive algorithms such as KLSTD [
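In such a kernel model the value function is a weighted sum of kernels centered at the sample states, V(x) = Σ_i w_i k(x, x_i), and critic learning adjusts the weight vector from temporal differences. A minimal sketch, assuming a Gaussian kernel and a TD(0)-style update (an illustrative stand-in, not the KLSTD recursion itself):

```python
import numpy as np

# Sketch of a kernel-based critic: V(x) = sum_i w_i * k(x, x_i).
# Kernel width, learning rate, and the toy transition are assumptions.

def k(x, c, ell=0.5):
    """Gaussian kernel between a state x and an array of centers c."""
    return np.exp(-0.5 * (x - c) ** 2 / ell**2)

centers = np.linspace(-1.0, 1.0, 5)   # sample states acting as kernel centers
w = np.zeros(len(centers))            # critic weight vector

def V(x):
    return np.dot(w, k(x, centers))

# One TD(0)-style weight update for a transition (x, r, x_next).
x, r, x_next, gamma, alpha = 0.2, 1.0, 0.1, 0.9, 0.1
delta = r + gamma * V(x_next) - V(x)  # temporal-difference error
w += alpha * delta * k(x, centers)    # gradient step on the kernel features
```

The update pushes V(x) toward the bootstrapped target r + γV(x'), which is the driving quantity in the weight recursion discussed next.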

The key of critic learning is the update of the weight vector

If only ALD-based kernel sparsification is used, sample selection considers only the linear independence of the samples. It is then hard to evaluate how good the sample set is, because the selection does not consider the distribution of the value function, and the performance of the ALD analysis is seriously affected by the hyperparameters of the kernel function, which are predetermined empirically and fixed during learning.

If the hyperparameters can be optimized online and the value function w.r.t. the samples can be evaluated by an iterative algorithm, the critic network will be optimized not only by value approximation but also by hyperparameter optimization. Moreover, with approximated sample values there is a direct way to evaluate the validity of the sample set, in order to regulate the set online. Thus in this paper we turn to Gaussian processes to construct the criterion for samples and hyperparameters.

For an MDP, the data samples

Given a sample set collected from a continuous dynamic system,

Assuming additive independent identically distributed Gaussian noise

The parameters

For an arbitrary input

Comparing (

The hyperparameters

The values

With the Gaussian-kernel-based critic network, the sample state-action pairs and their corresponding values are known. Then criteria such as the comprehensive utility proposed in [

Consider Condition

Consider Condition

According to the analysis, the two conditions are interdependent: the update of the critic depends on known hyperparameters, and the optimization of the hyperparameters depends on accurate sample values.

Hence we need a comprehensive iteration method to realize value approximation and hyperparameter optimization simultaneously. A direct way would be to update them alternately. Unfortunately, this is not reasonable, because the two processes are tightly coupled. For example, temporal difference errors drive the value approximation, but the change of weights

To solve this problem, a kind of two-phase value iteration for critic network is presented in the next section, and the conditions of convergence are analyzed.

First a proposition is given to describe the relationship between hyperparameters and the sample value function.

The hyperparameters are optimized by evidence maximization according to the samples and their
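Evidence maximization can be sketched numerically as minimizing the negative log marginal likelihood of the GP over candidate hyperparameters; the grid search, toy data, and fixed signal/noise variances below are illustrative assumptions.

```python
import numpy as np

# Sketch of evidence maximization: pick the kernel length-scale that
# maximizes the GP log marginal likelihood of the observed values.

def nlml(X, y, ell, sf2=1.0, sn2=1e-2):
    """Negative log marginal likelihood of a GP with an SE kernel."""
    K = sf2 * np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell**2)
    K += sn2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha                  # data-fit term
            + np.log(np.diag(L)).sum()       # complexity term: 0.5 log|K|
            + 0.5 * len(X) * np.log(2 * np.pi))

X = np.linspace(0, 4, 20)
y = np.sin(X)                                # smooth toy targets
grid = [0.01, 0.1, 1.0, 10.0]
best = min(grid, key=lambda ell: nlml(X, y, ell))
# A moderate length-scale wins: too small overfits noise-free structure
# at high complexity cost, too large cannot follow the curve at all.
```

In practice a gradient-based optimizer over all hyperparameters replaces the grid, but the objective is the same.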

Then the two-phase value iteration for critic network is described as the following theorem.

Given the following conditions

the system is boundary input and boundary output (BIBO) stable,

the immediate reward

for

From (

The convergence of the iterative algorithm is proved by the stochastic approximation Lyapunov method.

Define that

Equation (

Define approximation errors as

Further define that

Let

Define

Obviously (

It is easy to compute the first-order Taylor expansion of (

Substituting (

Consider the last two items of (

If the system is BIBO stable,

According to Proposition

According to Lemma 5.4.1 in [

Now we focus on the first two items on the right of (

The first item

For the second item

This inequality holds because of positive

On the other hand,

Substituting (

Let us check the final convergence position. Obviously,

Clearly, the selection of the samples is one of the key issues. Since all samples now have values from the two-phase iteration, an information-based criterion makes it convenient to evaluate samples and refine the set by placing relatively more samples near the equilibrium or where the gradient of the value function is large, so that the sample set better describes the distribution of the value function.

Since the two-phase iteration is a value iteration, the initial policy of the algorithm does not need to be stable. To ensure BIBO stability in practice, we need a mechanism to clamp the output of the system, even though the system is then no longer smooth.

Theorem

Up to now we have built a critic-actor architecture which is named Gaussian-kernel-based Approximate Dynamic Programming (GK-ADP for short) and shown in Algorithm

Initialize:

Let

Loop:

Get the reward

Observe next state

Update

Update the policy

Update

Until the termination criterion is satisfied
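The control flow of the loop above can be sketched as follows; the dynamics, reward, and both phase updates are trivial stand-ins (the paper's actual rules are the two-phase iteration equations), so only the loop structure carries over.

```python
import numpy as np

# Control-flow skeleton of the GK-ADP loop; all update rules are stand-ins.
rng = np.random.default_rng(0)
centers = np.linspace(-1, 1, 5)       # sample states (kernel centers)
w = np.zeros(len(centers))            # critic weights (sample values)
ell = 1.0                             # kernel hyperparameter
x = 0.0
alpha_w, alpha_h, gamma = 0.1, 0.01, 0.9

def k(x, c):                          # Gaussian kernel features
    return np.exp(-0.5 * (x - c) ** 2 / ell**2)

def V(x):
    return np.dot(w, k(x, centers))

for t in range(200):
    u = rng.uniform(-1, 1)                        # random exploratory action
    x_next = np.clip(0.9 * x + 0.1 * u, -1, 1)    # stand-in dynamics, clamped (BIBO)
    r = -x_next**2                                # stand-in reward
    # Phase 1: value update of the critic weights (TD-style stand-in).
    delta = r + gamma * V(x_next) - V(x)
    w += alpha_w * delta * k(x, centers)
    # Phase 2: hyperparameter step (stand-in: relax ell toward a target).
    ell += alpha_h * (0.5 - ell)
    x = x_next
```

Note how the two phases share each iteration: the weight update uses the current hyperparameters, and the hyperparameter step uses the current values, which is exactly the coupling the convergence analysis must handle.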

In this section, we present numerical simulations of continuous control to illustrate the special properties and the feasibility of the algorithm, including the necessity of two-phase learning, the specific properties compared with traditional kernel-based ACDs, and the performance enhancement resulting from online refinement of the sample set.

Before further discussion, we firstly give common setup in all simulations:

The span

To make output bounded, once a state is out of the boundary, the system is reset randomly.

The exploration-exploitation tradeoff is not considered here. During the learning process, all actions are selected randomly within a limited action space; thus the behavior of the system is totally unordered during learning.

The same ALD-based sparsification in [

The sampling time and control interval are set to 0.02 s.
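The ALD-based sparsification mentioned in the setup can be sketched as follows: a candidate state is added to the dictionary only if its feature vector is not approximately linearly dependent on those of the current dictionary. The kernel width and the threshold ν below are illustrative assumptions.

```python
import numpy as np

# Sketch of ALD-based sparsification (after Engel et al.).

def kern(a, b, ell=0.5):
    return np.exp(-0.5 * (a - b) ** 2 / ell**2)

def ald_sparsify(stream, nu=0.1):
    """Keep only states whose ALD residual exceeds the threshold nu."""
    dictionary = [stream[0]]
    for x in stream[1:]:
        kd = np.array([kern(x, d) for d in dictionary])
        K = np.array([[kern(a, b) for b in dictionary] for a in dictionary])
        c = np.linalg.solve(K + 1e-9 * np.eye(len(K)), kd)
        delta = kern(x, x) - kd @ c       # ALD residual in feature space
        if delta > nu:
            dictionary.append(x)
    return dictionary

# A densely sampled interval collapses to a sparse dictionary.
D = ald_sparsify(list(np.linspace(-1, 1, 201)))
```

As the text notes, both the residual δ and hence the resulting dictionary depend strongly on the kernel hyperparameters, which is precisely why fixing them in advance is problematic.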

The proof of Theorem

Consider a simple single homogeneous inverted pendulum system:

We test the success rate under different learning rates. Since the action is always selected randomly during learning, we have to verify the optimal policy after learning; that is, an independent policy test is carried out on the actor network. Success in one run of learning thus means that, in the independent test, the pole can be swung up and maintained within

In the simulation, 60 state-action pairs are collected to serve as the sample set, and the learning rates are set to

The learning is repeated 50 times in order to get the average performance, where the initial states of each run are randomly selected within the bound

Figure

Success times over 50 runs under different learning rates

Hence both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates of the two phases need to be regulated carefully in order to guarantee condition (

If the values of all samples at each iteration are summed up and the cumulative values are depicted in series over the iterations, we obtain Figure

The average evolution processes over 50 runs w.r.t. different

As mentioned in Algorithm

The controlled plant in this simulation is a one-dimensional inverted pendulum: a single-stage inverted pendulum mounted on a cart which is able to move linearly, as Figure

One-dimensional inverted pendulum.

The mathematic model is given as

Thus the state-action space is a 5D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows:

A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.

For both algorithms, the state-action pair is limited to

In GK-ADP, the learning rates are

The KHDP algorithm in [

All experiments are run 100 times to obtain the average performance, and each run consists of 10000 iterations of critic-network learning.

Before further discussion, it should be noted that it makes little sense to argue about which algorithm has better learning performance, because, besides the debate of policy iteration versus value iteration, too many parameters in the simulation configuration affect learning performance. The aim of this simulation is therefore to illustrate the learning characteristics of GK-ADP.

However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to obtain similar evolution processes. Figure

The comparison of the average evolution processes over 100 runs.

Then the success rates of all algorithms are depicted in Figure

The success rates w.r.t. all algorithms over 100 runs.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures

The outputs of the actor networks resulting from GK-ADP and KHDP.

The resulted control performance using GK-ADP

The resulted control performance using KHDP

Apparently the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides possibly well-regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning depends only on the exploration-exploitation balance, not on the convergence of actor learning. If the exploration-exploitation balance is designed without using the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and there are then alternatives to gradient learning for actor network optimization, for example, the direct optimum-seeking Gaussian regression actor network in GK-ADP.

Such direct optimum seeking may result in a nearly nonsmooth actor network, like the force output depicted in the second plot of Figure

The final best policies w.r.t. all samples.

Obviously GK-ADP achieves high efficiency but also carries potential risks, such as the impact on actuators. Hence how to design the exploration-exploitation balance to satisfy Theorem

Finally we check the sample values, which are depicted in Figure

The

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, as in Figure

The final cart displacement and linear speed in the test of successful policy.

To solve this problem and improve performance, besides optimizing learning parameters, Remark

Since the samples' values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [

Let

Sort all

Add the first
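The refinement steps above reduce to scoring candidates by a utility, sorting, and adding the top-k to the sample set. A minimal sketch, where the utility (distance to the nearest existing sample) is an illustrative stand-in for the expected utility cited in the text:

```python
# Sketch of sample-set refinement by utility ranking.

def refine(samples, candidates, utility, k=10):
    """Return the sample set extended with the k highest-utility candidates."""
    scored = sorted(candidates, key=utility, reverse=True)
    return samples + scored[:k]

samples = [0.0, 0.5, 1.0]
candidates = [0.1, 0.25, 0.75, 1.5, 2.0, 3.0]
# Illustrative utility: distance to the nearest existing sample,
# so far-away (poorly covered) candidates score highest.
utility = lambda c: min(abs(c - s) for s in samples)
new_set = refine(samples, candidates, utility, k=2)
```

Deleting redundant samples is the mirror image: rank the existing samples by the same utility and drop the lowest-scoring ones.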

Since in simulation 2 we carried out 100 runs of GK-ADP, we obtained 100 sets of sample values w.r.t. the same sample states. Now let us apply Algorithm

We repeat all 100 runs of GK-ADP again, with the same learning rates

Learning performance of GK-ADP if sample set is refined by adding 10 samples.

The average evolution process of

The enhancement of the control performance about cart displacement after adding samples

However, if we check the cart movement, we find the enhancement brought by the change of the sample set. For the

Obviously, the positive enhancement indicates that the actor network behaves better after adding samples. When all enhancements w.r.t. the 100 runs are illustrated together, as in Figure

Finally let us check which state-action pairs are added into sample set. We put all added samples over 100 runs together and depict their values in Figure

The

Projecting on the dimensions of pole angle and cart displacement

Projecting on the dimension of pole angle

ADP methods are among the most promising approaches to RL in continuous environments for constructing learning systems for nonlinear optimal control. This paper presents GK-ADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, the two-phase iteration makes the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters while approximating all samples' values.

A series of simulations verifies the necessity of two-phase learning and illustrates the properties of GK-ADP. Finally, the numerical tests support the viewpoint that assessing the samples' values provides a way to refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.

However, some issues need to be addressed in future work. The first is how to guarantee condition (

The authors declare that there is no conflict of interests regarding the publication of this paper.

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.