When combined with sparse kernel methods, least-squares temporal difference (LSTD) algorithms can construct their feature dictionaries automatically and achieve better generalization. However, previous kernel-based LSTD algorithms do not consider regularization, and their sparsification processes are batch or offline, which hinders their widespread application to online learning problems. In this paper, we combine the following five techniques and propose two novel kernel recursive LSTD algorithms: (i) online sparsification, which can cope with unknown state regions and supports online learning, (ii)

Least-squares temporal difference (LSTD) learning is perhaps the most popular approach to policy evaluation in reinforcement learning (RL) [

Over the last two decades, kernel methods have been studied intensively in supervised and unsupervised learning [

Intuitively, we can also bring the benefits of kernel machine learning to LSTD algorithms. Indeed, kernel-based RL algorithms have become increasingly popular in recent years [

In this paper, we propose two online SKRLSTD algorithms with

In this section, we introduce the basic definitions and notation used throughout the paper. We also review the LSTD algorithm, which is needed to establish our algorithms described in Section

In RL and dynamic programming (DP), an underlying sequential decision-making problem is often modeled as a Markov decision process (MDP). An MDP can be defined as a tuple

RL and DP often use the state-value function
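For reference, the standard definition of the state-value function and the Bellman equation it satisfies (on which the LSTD derivation below relies) are

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_{0}=s\right], \qquad V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \,\middle|\, s_{t}=s\right],$$

where $\gamma \in [0,1)$ is the discount factor and the expectation is taken over trajectories generated by the policy $\pi$.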

However, different from the case in DP,

The LSTD algorithm presents an efficient way to find
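As a minimal illustration of the batch form LSTD takes under linear function approximation: accumulate $A = \sum_t \phi_t(\phi_t - \gamma\phi_{t+1})^{\top}$ and $b = \sum_t \phi_t r_{t+1}$, then solve $A\theta = b$. The function and parameter names below are our own illustrative choices, and the small ridge term (to keep $A$ invertible) is an added assumption, not part of plain LSTD:

```python
import numpy as np

def lstd(transitions, phi, gamma=0.9, ridge=1e-6):
    """Batch LSTD: solve A theta = b from sampled transitions.

    transitions: list of (s, r, s_next); phi: feature map s -> ndarray.
    The ridge term is an illustrative regularizer to keep A invertible.
    """
    d = phi(transitions[0][0]).shape[0]
    A = ridge * np.eye(d)
    b = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)
```

With a single self-looping state, constant reward 1, and a one-dimensional constant feature, the solution reduces to the geometric series value $1/(1-\gamma)$.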

To overcome the weaknesses of the previous kernel-based LSTD algorithms, we propose two regularized OSKRLSTD algorithms in this section.

Now, we use

First, we use the kernel trick to kernelize (
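In the usual dual form produced by the kernel trick, the value estimate is a kernel expansion over a dictionary of centers, $V(s) = \sum_i \alpha_i k(c_i, s)$. The sketch below uses a Gaussian kernel, a common Mercer kernel chosen purely for illustration (the paper's exact kernel is not reproduced here):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # A Gaussian (RBF) kernel -- one common Mercer kernel;
    # the paper's exact kernel choice is an assumption here.
    return np.exp(-np.square(x - y).sum() / (2.0 * sigma ** 2))

def value(s, dictionary, alpha, kernel=gaussian_kernel):
    """Kernelized value estimate: V(s) = sum_i alpha_i * k(c_i, s)."""
    return sum(a * kernel(c, s) for c, a in zip(dictionary, alpha))
```

Only the dictionary centers and their coefficients need to be stored, which is what makes online sparsification of the dictionary worthwhile.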

Second, we try to derive the

Third, we derive the recursive formulas of
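Recursive formulas of this kind typically rest on the Sherman-Morrison identity, which updates the inverse of a rank-one-modified matrix in $O(d^2)$ per step instead of re-solving from scratch. A hedged sketch (the function and variable names are ours, not the paper's):

```python
import numpy as np

def rlstd_step(P, b, phi_s, phi_next, r, gamma=0.9):
    """One recursive LSTD update via the Sherman-Morrison identity.

    P is the running inverse of A, where A accumulates rank-one terms
    u v^T with u = phi(s) and v = phi(s) - gamma * phi(s').
    """
    u = phi_s
    v = phi_s - gamma * phi_next
    Pu = P @ u
    P = P - np.outer(Pu, v @ P) / (1.0 + v @ Pu)
    b = b + r * phi_s
    return P, b  # value weights are theta = P @ b
```

Because the identity is exact, the recursively maintained inverse matches a direct matrix inversion up to floating-point error.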

For the first case, (

For the second case, (

Finally, we summarize the whole algorithm in Algorithm


Here, we do not restrict the OSKRLSTD-

Although the OSKRLSTD-

Our simulation results show that a large sliding window does not improve the convergence performance of the OSKRLSTD-
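The role of a sliding window here can be sketched as simply retaining only the most recent transitions for the update; the minimal buffer below is illustrative and does not reproduce the paper's exact window mechanics:

```python
from collections import deque

class SlidingWindow:
    """Keep only the `size` most recent transitions (illustrative sketch)."""

    def __init__(self, size):
        self.buf = deque(maxlen=size)  # old items drop off automatically

    def add(self, transition):
        self.buf.append(transition)

    def samples(self):
        return list(self.buf)
```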

In this subsection, we use

First, we try to derive the

Second, we investigate how to find the fixed point of (
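When the regularizer is $\ell_1$ (as the reference to $\ell_1$-regularized temporal difference learning suggests), such a fixed point is commonly sought by iterating a soft-thresholding (proximal) map. The ISTA-style sketch below is purely illustrative: it assumes a symmetric positive definite system matrix, which LSTD matrices need not satisfy, and the names are ours:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the l1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def l1_fixed_point(A, b, lam, step=None, iters=500):
    """Seek alpha = soft_threshold(alpha - step*(A alpha - b), step*lam)
    by fixed-point iteration (an ISTA-style sketch; not the paper's
    exact subiteration scheme)."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2)  # safe step for symmetric PD A
    alpha = np.zeros(b.shape[0])
    for _ in range(iters):
        alpha = soft_threshold(alpha - step * (A @ alpha - b), step * lam)
    return alpha
```

Note how the thresholding zeroes out small coefficients, which is the mechanism that produces sparse solutions.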


Our simulation results show that Algorithm

Third, we derive the recursive formulas of

Finally, we summarize the whole algorithm in Algorithm


By pruning the weakly dependent features, the OSKRLSTD-
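Dictionary admission and pruning decisions of this kind are commonly based on an approximate linear dependence (ALD) test: a candidate state is kept only if it is not well approximated, in feature space, by the current dictionary. The sketch below is a generic ALD test, not necessarily the paper's exact criterion; the jitter term on the kernel matrix is an added numerical safeguard:

```python
import numpy as np

def ald_test(candidate, dictionary, kernel, nu=0.1):
    """Approximate linear dependence test for online sparsification.

    Returns (is_novel, residual): the candidate is novel if its
    feature-space projection residual onto the dictionary exceeds nu.
    """
    if not dictionary:
        return True, float('inf')
    K = np.array([[kernel(ci, cj) for cj in dictionary] for ci in dictionary])
    k_vec = np.array([kernel(c, candidate) for c in dictionary])
    # Small jitter keeps the kernel matrix numerically invertible.
    coeffs = np.linalg.solve(K + 1e-8 * np.eye(len(dictionary)), k_vec)
    residual = kernel(candidate, candidate) - coeffs @ k_vec
    return residual > nu, residual
```

A state identical to an existing center yields a near-zero residual and is rejected, while a distant state yields a residual near 1 (for a normalized kernel) and is admitted.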

In this section, we use a nonnoise chain and a noise chain [

As shown in Figure

The 50-state chain problem.
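A chain environment of this kind can be sketched as follows. The dynamics and rewards here are illustrative assumptions only (the paper's exact chain specification, slip probability, and reward placement are not reproduced): the agent moves right with high probability under a fixed policy, receives a reward at the right end, and in the "noise chain" the reward is corrupted by Gaussian noise:

```python
import random

def chain_step(s, n_states=50, noise_std=0.0, slip=0.1):
    """One step in an illustrative 50-state chain under a 'go right' policy.

    Moves right with probability 1 - slip, left otherwise; reward 1 at the
    right end, optionally corrupted by Gaussian noise (the noisy variant).
    """
    move = 1 if random.random() > slip else -1
    s_next = min(max(s + move, 0), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    if noise_std > 0.0:
        r += random.gauss(0.0, noise_std)
    return s_next, r
```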

In the implementations of all tested algorithms for both chain problems, the settings are summarized as follows: (i) For all OSKRLSTD algorithms, the Mercer kernel is defined as

We first report the comparison results of all tested algorithms with the simulation settings described in Section

Main simulation results on both chains at the final episode.

| Algorithm | Nonnoise chain | | | Noise chain | | |
|---|---|---|---|---|---|---|
| | RMSE | Dictionary size | Subiterations | RMSE | Dictionary size | Subiterations |
| RLSTD | 0.47 ± 0.03 | 20 | — | 0.50 ± 0.04 | 20 | — |
| SKRLSTD | 0.47 ± 0.05 | 15.36 ± 0.78 | — | 0.49 ± 0.06 | 15.32 ± 0.71 | — |
| OKRLSTD- | 0.45 ± 0.05 | 15.30 ± 0.81 | — | 0.47 ± 0.04 | 15.32 ± 0.84 | — |
| OKRLSTD- | 0.49 ± 0.08 | 11.52 ± 1.16 | 1.81 ± 1.82 | 0.53 ± 0.10 | 12.42 ± 1.13 | 2.60 ± 2.56 |
| OKRLSTD- | 2.21 ± 0.05 | 15.25 ± 0.87 | — | 32.92 ± 68.67 | 15.24 ± 0.77 | — |
| OKRLSTD- | 0.44 ± 0.05 | 15.40 ± 0.76 | 5.08 ± 3.24 | 0.47 ± 0.05 | 15.28 ± 0.88 | 4.90 ± 3.26 |

Learning curves of all tested algorithms.

In the nonnoise chain

In the noise chain

In the nonnoise chain

In the noise chain

Dictionary growth curves of all tested algorithms.

In the nonnoise chain

In the noise chain

Average subiterations in OSKRLSTD-

In the nonnoise chain

In the noise chain

Next, we evaluate the effect of the sliding-window size on our proposed algorithms and OSKRLSTD-

Effect of the sliding-window size

In the nonnoise chain

In the noise chain

Effect of the sliding-window size

In the nonnoise chain

In the noise chain

As an important approach to policy evaluation, LSTD algorithms use samples efficiently and eliminate all step-size parameters. However, they require users to design the feature vector manually and often need many features to approximate state-value functions well. Recently, several works have addressed these issues by combining LSTD with sparse kernel methods. However, those works do not consider regularization, and their sparsification processes are batch or offline. In this paper, we propose two online sparse kernel recursive least-squares TD algorithms with

Several interesting topics remain for future work: (i) how to select a proper regularization parameter should be investigated; (ii) a more thorough simulation analysis is needed, including an extension of our algorithms to learning control problems; (iii) eligibility traces could be incorporated to further improve the performance of our algorithms; (iv) the convergence and prediction-error bounds of our algorithms should be analyzed theoretically.

The authors declare that there are no competing interests regarding the publication of this paper.

This work is supported in part by the National Natural Science Foundation of China under Grant nos. 61300192 and 11261015, the Fundamental Research Funds for the Central Universities under Grant no. ZYGX2014J052, and the Natural Science Foundation of Hainan Province, China, under Grant no. 613153.

$\ell_1$-regularized linear temporal difference learning