Kernel Recursive Least-Squares Temporal Difference Algorithms with Sparsification and Regularization

By combining with sparse kernel methods, least-squares temporal difference (LSTD) algorithms can construct the feature dictionary automatically and achieve better generalization. However, the previous kernel-based LSTD algorithms do not consider regularization, and their sparsification processes are batch or offline, which hinders their widespread application in online learning problems. In this paper, we combine the following five techniques and propose two novel kernel recursive LSTD algorithms: (i) online sparsification, which can cope with unknown state regions and be used for online learning, (ii) ℓ2 and ℓ1 regularization, which can avoid overfitting and eliminate the influence of noise, (iii) recursive least squares, which can eliminate matrix-inversion operations and reduce computational complexity, (iv) a sliding-window approach, which can avoid caching all history samples and reduce the computational cost, and (v) fixed-point subiteration and online pruning, which make ℓ1 regularization easy to implement. Finally, simulation results on two 50-state chain problems demonstrate the effectiveness of our algorithms.


Introduction
Least-squares temporal difference (LSTD) learning may be the most popular approach for policy evaluation in reinforcement learning (RL) [1, 2]. Compared with standard temporal difference (TD) learning, LSTD uses samples more efficiently and eliminates all step-size parameters. However, LSTD also has some drawbacks. First, LSTD requires a matrix-inversion operation at each time step. To reduce computational complexity, Bradtke and Barto proposed a recursive LSTD (RLSTD) algorithm [1], and Xu et al. proposed an RLSTD(λ) algorithm [3]. But these two algorithms still require many features, especially for highly nonlinear RL problems, since the RLS approximator assumes a linear model [4]. Second, when the number of features is larger than the number of training samples, LSTD is prone to overfitting. To overcome this problem, Kolter and Ng proposed an ℓ1-regularized LSTD algorithm called LARS-TD for feature selection [5], but it is only applicable to batch learning and its implementation is complicated. On this basis, Chen et al. proposed an ℓ2-regularized RLSTD algorithm [6].
In contrast with LARS-TD, it has an analytical solution, but it cannot obtain a sparse solution. Third, LSTD requires users to design the feature vector manually, and poor design choices can result in estimates that diverge from the optimal value function [7].
In the last two decades, kernel methods have been intensively and extensively studied in supervised and unsupervised learning [8]. The basic idea behind kernel methods can be summarized as follows: by using a nonlinear transform, the original input data can be mapped into a high-dimensional feature space, and an inner product in this space can be interpreted as a Mercer kernel function. Thus, as long as a linear algorithm can be formulated in terms of inner products, there is no need to perform computations in the high-dimensional feature space [9]. Recently, there have also been many research works on kernelizing least-squares algorithms [9–13]. Here, we only review some works related to our proposed algorithms. One typical work is the sparse kernel recursive least-squares (SKRLS) algorithm with the approximate linear dependency (ALD) criterion [11]. Compared with traditional RLS algorithms, it not only has a good nonlinear approximation ability but also can construct the feature dictionary automatically. Similarly, Chen et al. proposed an ℓ2-regularized SKRLS algorithm with online vector quantization [12]. Besides having the good properties of SKRLS-ALD, it can avoid overfitting. In addition, Chen et al. proposed an ℓ1-regularized SKRLS algorithm with the fixed-point subiteration [13], which can yield a much sparser dictionary.
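To make the ALD criterion concrete, the following minimal sketch (our own illustration; the Gaussian kernel, its bandwidth, and the threshold ν are assumed values, not parameters from [11]) builds a dictionary online from a stream of one-dimensional samples. A candidate is admitted only if its squared projection residual onto the span of the stored kernel sections exceeds ν:

```python
import numpy as np

def gauss_kernel(x, y, sigma=2.0):
    """Gaussian (Mercer) kernel; sigma is an illustrative bandwidth."""
    return np.exp(-np.abs(x - y) ** 2 / (2 * sigma ** 2))

def ald_test(dictionary, x, nu=0.1, sigma=2.0):
    """Approximate linear dependency (ALD) test.

    Returns True if x is approximately linearly dependent on the
    current dictionary (i.e., it should NOT be added)."""
    if not dictionary:
        return False
    K = np.array([[gauss_kernel(di, dj, sigma) for dj in dictionary]
                  for di in dictionary])
    k_x = np.array([gauss_kernel(di, x, sigma) for di in dictionary])
    # delta = k(x, x) - k_x^T K^{-1} k_x : squared projection residual
    delta = gauss_kernel(x, x, sigma) - k_x @ np.linalg.solve(K, k_x)
    return delta <= nu

# Build a dictionary online from a stream of 1-D samples
dictionary = []
for x in [0.0, 0.1, 5.0, 5.05, 10.0]:
    if not ald_test(dictionary, x):
        dictionary.append(x)
print(dictionary)  # nearby samples are rejected, distant ones admitted
```

With these assumed parameters, the near-duplicates 0.1 and 5.05 are rejected while 0.0, 5.0, and 10.0 enter the dictionary, which is exactly the automatic dictionary construction described above.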
Intuitively, we can also bring the benefits of kernel machine learning to LSTD algorithms. In fact, kernel-based RL algorithms have become more and more popular in recent years [14–22], and several works have been done on kernelizing LSTD algorithms. In an earlier paper, Xu proposed a sparse kernel-based LSTD(λ) (SKLSTD(λ)) algorithm with the ALD criterion [19]. Although this algorithm can avoid selecting features manually, it is only applicable to batch learning and its derivation is complicated. After that, Xu et al. proposed an incremental version of the SKLSTD(λ) algorithm for policy iteration [20], but this algorithm still requires a matrix-inversion operation at each time step. Moreover, its feature dictionary must be constructed offline, so this algorithm can only approximate the value function correctly in the area of the state space that is covered by the training samples. Recently, Jakab and Csató proposed a sparse kernel RLSTD (SKRLSTD) algorithm by using a proximity-graph sparsification method [21]. Unfortunately, its sparsification process is also offline. In addition, none of these algorithms considers regularization, whereas many real problems exhibit noise and the high expressiveness of the kernel matrix can result in overfitting [22].
In this paper, we propose two online SKRLSTD algorithms with ℓ2 and ℓ1 regularization, called OSKRLSTD-ℓ2 and OSKRLSTD-ℓ1, respectively. Compared with the derivation of SKLSTD(λ), our derivation uses the Bellman operator along with the projection operator and thus is simpler. To cope with unknown state-space regions and avoid overfitting, our algorithms use online sparsification and regularization techniques. Besides, to reduce computational complexity and avoid caching all history samples, our algorithms also use recursive least squares and the sliding-window technique. Moreover, different from LARS-TD, OSKRLSTD-ℓ1 uses the fixed-point subiteration and online pruning to find the fixed point. These techniques make our algorithms more suitable for online RL problems with a large or continuous state space. The rest of this paper is organized as follows. In Section 2, we present preliminaries and review the LSTD algorithm. Section 3 contains the main contribution of this paper: we derive the OSKRLSTD-ℓ2 and OSKRLSTD-ℓ1 algorithms in detail. In Section 4, we demonstrate the effectiveness of our algorithms on two 50-state chain problems. Finally, we conclude the paper in Section 5.

Background
In this section, we introduce the basic definitions and notations, which will be used throughout the paper without any further mention. We also review the LSTD algorithm, which is needed to establish our algorithms described in Section 3.

Preliminaries.
In RL and dynamic programming (DP), an underlying sequential decision-making problem is often modeled as a Markov decision process (MDP). An MDP can be defined as a tuple M = ⟨S, A, P, R, γ, μ⟩ [5], where S is a set of states, A is a set of actions, P : S × A × S → [0, 1] is a state transition probability function, where P(s, a, s′) denotes the probability of transitioning to state s′ when taking action a in state s, R is a reward function, γ ∈ [0, 1] is the discount factor, and μ is an initial state distribution. For simplicity of presentation, we assume that S and A are finite. Given an MDP M and a policy π : S → A, the state-value function V^π(s) denotes the expected discounted return obtained when starting from state s and following π thereafter. It satisfies the Bellman equation, which can be expressed in vector form as V^π = R^π + γP^π V^π (1), where R^π and P^π denote the reward vector and the state transition matrix induced by π. If R^π and P^π are known, V^π can be solved analytically; that is, V^π = (I − γP^π)^{−1} R^π (2), where I is the |S| × |S| identity matrix. However, different from the case in DP, R^π and P^π are unknown in RL. The agent has to estimate V^π by exploring the environment. Furthermore, many real problems have a large or continuous state space, which makes V^π(s) hard to express explicitly. To overcome this problem, we often resort to linear function approximation; that is, V̂^π(s) = φ(s)^T w (3) and V̂^π = Φw (4), where w ∈ ℝ^k is a parameter vector, φ(s) ∈ ℝ^k is the feature vector of state s, and Φ = [φ(s_1), . . . , φ(s_{|S|})]^T is an |S| × k feature matrix. Unfortunately, when approximating V^π in this manner, there is usually no way to satisfy the Bellman equation exactly, because R^π + γP^π Φw may lie outside the span of Φ [5].
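To illustrate the analytic solution V^π = (I − γP^π)^{−1}R^π described above, the following sketch (our own toy example, not from the paper) evaluates a fixed policy on a three-state chain whose last state is absorbing:

```python
import numpy as np

# A toy 3-state MDP under a fixed policy: P is the induced state
# transition matrix, R the expected one-step reward, gamma the discount.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# V = (I - gamma * P)^{-1} R  -- the analytic Bellman solution
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)  # [8.1, 9.0, 10.0]
```

Here the absorbing state collects reward 1 forever, so its value is 1/(1 − γ) = 10, and the earlier states discount it by one factor of γ per step.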

LSTD Algorithm.
The LSTD algorithm presents an efficient way to find w such that V̂^π "approximately" satisfies the Bellman equation [5]. By solving the least-squares problem min_{u∈ℝ^k} ‖Φu − (R^π + γP^π Φw)‖²_D, we can find the closest approximation Φu* in the span of Φ to replace R^π + γP^π Φw. Then, from (2) and (4), we can use w = u* for approximating V^π. That means we can find w by solving the fixed-point equation Φw = Φ(Φ^T DΦ)^{−1} Φ^T D(R^π + γP^π Φw) (5), where D is a nonnegative diagonal matrix indicating a distribution over states. Nevertheless, since R^π and P^π are unknown and since Φ is too large to form anyway in a large or continuous state space, we cannot solve (5) exactly. Instead, given a trajectory Γ_t = {(s_i, s′_i, r_i) | i = 1, . . . , t} following π, we can use Φ_t = [φ(s_1), . . . , φ(s_t)]^T, Φ′_t = [φ(s′_1), . . . , φ(s′_t)]^T, and R_t = [r_1, . . . , r_t]^T to replace Φ, P^π Φ, and R^π, respectively. Then, (5) can be approximately rewritten as Φ_t w = Φ_t(Φ_t^T Φ_t)^{−1} Φ_t^T (R_t + γΦ′_t w) (6). Thus, the fixed point w = F̃(w) can be found by w = (Φ_t^T(Φ_t − γΦ′_t))^{−1} Φ_t^T R_t (7).
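The batch LSTD solution described above can be sketched in a few lines; the one-hot tabular features and the three deterministic sample transitions below are our own illustrative assumptions, chosen so that the result matches the analytic toy example:

```python
import numpy as np

def lstd(samples, phi, gamma=0.9):
    """Batch LSTD: solve A w = b with
    A = sum phi(s)(phi(s) - gamma*phi(s'))^T and b = sum phi(s)*r."""
    k = len(phi(samples[0][0]))
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, s_next, r in samples:
        A += np.outer(phi(s), phi(s) - gamma * phi(s_next))
        b += phi(s) * r
    return np.linalg.solve(A, b)

# Tabular one-hot features on a 3-state chain (illustrative)
phi = lambda s: np.eye(3)[s]
samples = [(0, 1, 0.0), (1, 2, 0.0), (2, 2, 1.0)]
w = lstd(samples, phi)
print(w)  # recovers [8.1, 9.0, 10.0] on this toy chain
```

With tabular features and every transition observed once, the LSTD fixed point coincides with the analytic value function, which is a useful sanity check for any implementation.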

Regularized OSKRLSTD Algorithms
To overcome the weaknesses of the previous kernel-based LSTD algorithms, we propose two regularized OSKRLSTD algorithms in this section.

OSKRLSTD-ℓ2
Algorithm. Now, we use ℓ2 regularization and online sparsification to derive the first OSKRLSTD algorithm, which is called OSKRLSTD-ℓ2. First, we use the kernel trick to kernelize (6). Suppose the feature dictionary is D_t = {d_j | d_j ∈ S, j = 1, . . . , n_t}, and let Φ_{D_t} = [φ(d_1), . . . , φ(d_{n_t})]^T denote the corresponding feature matrix. By the Representer Theorem [24], w and u can be expressed as follows: w = Φ_{D_t}^T α_t and u = Φ_{D_t}^T β_t (9), where α_t = [α_{t,1}, . . . , α_{t,n_t}]^T and β_t = [β_{t,1}, . . . , β_{t,n_t}]^T are the coefficient vectors of w and u, respectively. Then, from (6), we have (10). By the Mercer Theorem [24], the inner product of two feature vectors can be calculated by k(s_i, s_j) = φ(s_i)^T φ(s_j). Thus, we can define the kernel matrices K_t = Φ_t Φ_{D_t}^T and K′_t = Φ′_t Φ_{D_t}^T. On this basis, (10) can be rewritten as K_t^T(K_t − γK′_t)α_t = K_t^T R_t (11). Second, we try to derive the ℓ2-regularized solution of (11). Add an ℓ2-norm penalty into (11); that is, (12), where λ2 ∈ [0, ∞) is a regularization parameter. Setting the gradient with respect to β_t to zero yields (13). Since w = u*, we easily have α_t = β_t* from (9). Then, the above equation can be rewritten as (K_t^T(K_t − γK′_t) + λ2 I)α_t = K_t^T R_t (14), where I is the n_t × n_t identity matrix. Thus, α_t can be analytically solved as α_t = A_t^{−1} b_t (15), where A_t = K_t^T(K_t − γK′_t) + λ2 I and b_t = K_t^T R_t (16). Third, we derive the recursive formulas of A_t^{−1} and α_t. Under online sparsification, there are two cases: D_t = D_{t−1} and D_t = D_{t−1} ∪ {s′_t}. In the latter case, α_{t−1} is padded with a zero entry and I is expanded as [I, 0_{n_{t−1}}; 0_{n_{t−1}}^T, 1], where 0_{n_{t−1}} is the n_{t−1}-dimensional zero vector. For the first case, (16) can be rewritten as follows: A_t = A_{t−1} + k_t(k_t − γk′_t)^T and b_t = b_{t−1} + k_t r_t (19), where k_t = [k(d_1, s_t), . . . , k(d_{n_t}, s_t)]^T and k′_t is defined analogously for s′_t. Applying the matrix-inversion lemma [25] to A_t, we get A_t^{−1} = A_{t−1}^{−1} − A_{t−1}^{−1} k_t (k_t − γk′_t)^T A_{t−1}^{−1} / (1 + (k_t − γk′_t)^T A_{t−1}^{−1} k_t) (20). Thus, plugging (19) and (20) into (15), we obtain α_t = α_{t−1} + A_t^{−1} k_t (r_t − (k_t − γk′_t)^T α_{t−1}) (21). For the second case, (16) can be rewritten as a bordered matrix (22), where Ã_t and b̃_t are the same as the updated A_t and b_t when the feature dictionary keeps unchanged. However, updating the new blocks exactly requires caching all history samples, and the computational cost would become more and more expensive as t increases. Inspired by the work of Van Vaerenbergh et al. [26], we introduce a sliding window H_t = {(s_i, s′_i, r_i) | i = t − l + 1, . . . , t} to deal with these problems,
where l is the window size. We only use the samples in H_t to evaluate the border quantities h_t, g_t, Δ_t, and b_t; that is, (23). Then, similar to those in the first case, the recursive formulas (24) for A_t^{−1} and (25) for α_t can be derived, where Δ_t = c̃_t − g_t^T Ã_t^{−1} h_t and c̃_t is the same as the updated quantity when the dictionary keeps unchanged.
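The rank-one inverse update produced by the matrix-inversion lemma is an instance of the Sherman-Morrison identity. The following self-contained sketch (the diagonal matrix and the update vectors are arbitrary illustrative choices, not quantities from the paper) checks the recursive update against direct inversion:

```python
import numpy as np

def sm_update(A_inv, u, v):
    """Sherman-Morrison form of the matrix-inversion lemma:
    (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

A = np.diag([2.0, 3.0, 4.0, 5.0])
u = np.array([1.0, 0.0, 1.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 1.0])

direct = np.linalg.inv(A + np.outer(u, v))
recursive = sm_update(np.linalg.inv(A), u, v)
print(np.allclose(direct, recursive))  # True
```

This is why the recursive algorithms never need an O(n³) inversion per step: each new sample only contributes a rank-one correction, which costs O(n²).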
Finally, we summarize the whole algorithm in Algorithm 1.

Remark 1.
Here, we do not restrict the OSKRLSTD-2 algorithm to a specific online sparsification method. That means it can be combined with many popular sparsification methods such as the novelty criterion (NC) [27] and the ALD criterion.
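As one concrete instance, a distance-only variant of the novelty criterion can be sketched as follows (the threshold value is an illustrative assumption; this simplified test also matches the sparsification condition used later in our simulations):

```python
import numpy as np

def novelty_check(dictionary, s, delta=2.0):
    """Distance-based novelty criterion (NC) sparsification test:
    admit s only if it is far from every stored dictionary element."""
    if not dictionary:
        return True
    return min(abs(s - d) for d in dictionary) > delta

# Grow a dictionary online from a stream of (integer) states
dictionary = []
for s in [1, 2, 4, 7, 8, 30]:
    if novelty_check(dictionary, s):
        dictionary.append(s)
print(dictionary)  # [1, 4, 7, 30]
```

States closer than the threshold to an existing dictionary element are discarded, so the dictionary size is bounded by a packing of the visited state region rather than by the number of samples.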
Algorithm 1: OSKRLSTD-ℓ2.
(1) Input: policy π to be evaluated, kernel k(⋅, ⋅), γ, λ2, window size l
(2) Take action a_1 given by π, and observe s′_1, r_1
(3) Initialize H_1, D_1, α_1, and A_1^{−1}
(4) for t = 2, 3, . . . do
(5) Take action a_t given by π, and observe s′_t, r_t
(6) Update H_t
(7) if s′_t dissatisfies the sparsification condition then
(8) D_t = D_{t−1}
(9) Update α_t, A_t^{−1} by (21) and (20)
(10) else
(11) D_t = D_{t−1} ∪ {s′_t}
(12) Update α_t, A_t^{−1} by (25) and (24)
(13) end if
(14) s_{t+1} = s′_t
(15) end for

Remark 2. Although the OSKRLSTD-ℓ2 algorithm is designed for infinite-horizon tasks, it can be modified for episodic tasks. When s′_t is an absorbing state, it only requires setting γ = 0 temporarily and setting s_{t+1} as the start state of the next episode.

Remark 3.
Our simulation results show that a big sliding window does not help improve the convergence performance of the OSKRLSTD-ℓ2 algorithm. Thus, to save memory and reduce the computational cost, the window size l should be set to a small integer.

OSKRLSTD-ℓ1 Algorithm.
In this subsection, we use ℓ1 regularization and online sparsification to derive the second OSKRLSTD algorithm, which is called OSKRLSTD-ℓ1.
First, we try to derive the ℓ1-regularized solution of (11). Add an ℓ1-norm penalty into (11); that is, (26), where λ1 ∈ [0, ∞) is a regularization parameter. However, ‖β_t‖_1 is not differentiable. Similar to Painter-Wakefield and Parr in [28], we resort to the subdifferential of J(β_t) = ‖K_t β_t − (R_t + γK′_t α_t)‖²_2 + 2λ1‖β_t‖_1; that is, (27), where sgn(β_t) is the set-valued function defined componentwise as sgn(β_{t,i}) = {1} if β_{t,i} > 0, [−1, 1] if β_{t,i} = 0, and {−1} if β_{t,i} < 0 (28). Let ∇J(β_t) = 0, so that (29). Since w = u*, we also have α_t = β_t* from (9). Then, the above equation can be rewritten as K_t^T(K_t − γK′_t)α_t = K_t^T R_t − λ1 sgn(α_t) (30), where sgn(α_t) has the same meaning as sgn(β_t). To avoid the singularity of K_t^T(K_t − γK′_t) and further reduce the complexity of the subsequent derivation, we introduce λ2 α_t into both sides; that is, K_t^T(K_t − γK′_t)α_t + λ2 α_t = K_t^T R_t + λ2 α_t − λ1 sgn(α_t) (31), where λ2 ∈ [0, ∞) is a regularization parameter. Obviously, the left-hand side of (31) is the same as that of (14). Thus, from (16), the above equation can be rewritten as A_t α_t = b_t + λ2 α_t − λ1 sgn(α_t) (32). Then, we have the following fixed-point equation: α_t = f(α_t) (33), where f denotes f(α_t) = A_t^{−1}(b_t + λ2 α_t − λ1 sgn(α_t)) (34). Unfortunately, here, α_t cannot be solved analytically. Second, we investigate how to find the fixed point of (33). In ℓ1-regularized LSTD algorithms [5, 29], researchers often used the LASSO method to tackle this problem. However, the LASSO method is inherently a batch method and is unsuitable for online learning. Instead, we resort to the fixed-point subiteration method introduced in [13]. We first use the sign function sign(α_t) to replace sgn(α_t) in (33). Then, we can construct the following subiteration: α_t^{v+1} = A_t^{−1}(b_t + λ2 α_t^v − λ1 sign(α_t^v)) (35), where v ∈ ℕ+ denotes the vth subiteration and α_t^1 is initialized to α_{t−1}, since the fixed point will be close to α_{t−1} if λ1 and λ2 are small. If the subiteration number reaches a preset value N_max ∈ ℕ+ or ‖α_t^{v+1} − α_t^v‖ is less than or equal to a preset threshold ε ∈ ℝ+, the subiteration will stop. From (32) and (28), if |(b_t + λ2 α_t − A_t α_t)_i| < λ1, then α_{t,i} should be 0.
Obviously, the replacement of sgn(α_t) by sign(α_t) makes α_t lose the ability to select features. To remedy this situation, after the whole subiteration, we remove the weakly dependent elements from D_t according to the magnitudes of the entries of α_t; that is, we perform Ψ_{I_t}(D_t), where Ψ_{I_t}(⋅) denotes the operation of removing the elements indexed by the set I_t, which is determined by I_t = {i | |α_{t,i}| ≤ v, 1 ≤ i < n_t} (37), where v ∈ ℝ+ is a preset threshold. Note that we do not remove the last element d_{n_t} of D_t, since |α_{t,n_t}| is probably very small, especially when d_{n_t} has just been added to D_t. Similarly, we perform Ψ_{I_t}(α_t) and Ψ_{I_t}(b_t) to remove the weakly dependent coefficients. From (16), A_t^{−1} also requires removing some rows and columns. Unfortunately, we cannot use the method in [30] to do this like Chen et al. in [13], since A_t^{−1} is not a symmetric matrix. Considering that b_t removes the corresponding elements when D_t is pruned, we directly perform Ψ_{I_t}(A_t^{−1}) to remove the rows and columns indexed by I_t. Although this method may bring some bias into A_t^{−1}, our simulation results show that it is feasible and effective. The whole fixed-point subiteration and online pruning procedure is summarized in Algorithm 2:

Algorithm 2: Fixed-point subiteration and online pruning.
(1) for v = 1, 2, . . . , N_max do
(2) Update α_t^{v+1} by (35)
(3) if ‖α_t^{v+1} − α_t^v‖ ≤ ε then
(4) Break out of the loop
(5) end if
(6) end for
(7) α_t = α_t^{v+1}
(8) Determine the index set I_t by (37)
(9) Perform Ψ_{I_t}(D_t), Ψ_{I_t}(α_t), Ψ_{I_t}(b_t), and Ψ_{I_t}(A_t^{−1})
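One way to realize the subiteration and pruning steps above is sketched below. The sign convention of the iteration and the diagonal toy system are our own illustrative assumptions (a real A_t is a dense kernel matrix); the sketch only shows the mechanics of iterating and then dropping weakly dependent elements:

```python
import numpy as np

def l1_subiteration(A_inv, b, lam1, lam2, alpha0, n_max=10, eps=0.1):
    """Fixed-point subiteration sketch for the l1-regularized solution:
    alpha <- A^{-1}(b + lam2*alpha - lam1*sign(alpha)).
    Stops after n_max subiterations or when the update is below eps."""
    alpha = alpha0.copy()
    for _ in range(n_max):
        alpha_new = A_inv @ (b + lam2 * alpha - lam1 * np.sign(alpha))
        done = np.linalg.norm(alpha_new - alpha) <= eps
        alpha = alpha_new
        if done:
            break
    return alpha

def prune(alpha, dictionary, v):
    """Online pruning: drop weakly dependent elements (|alpha_i| <= v),
    but never the last (most recently added) element."""
    n = len(alpha)
    keep = [i for i in range(n) if abs(alpha[i]) > v or i == n - 1]
    return alpha[keep], [dictionary[i] for i in keep], keep

# Toy system: a diagonal "kernel part" plus lam2 * I (illustrative)
lam1, lam2 = 0.2, 0.2
A = np.diag([1.0, 2.0, 8.0, 4.0]) + lam2 * np.eye(4)
A_inv = np.linalg.inv(A)
b = np.ones(4)

alpha = l1_subiteration(A_inv, b, lam1, lam2, A_inv @ b)
alpha_p, dict_p, keep = prune(alpha, ["d1", "d2", "d3", "d4"], v=0.15)
print(keep)  # [0, 1, 3]: the weakly dependent third element is removed
```

On this toy system the subiteration settles after a couple of passes, and the element whose coefficient magnitude falls below the pruning threshold is removed while the most recently added element is kept, mirroring Algorithm 2.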

Remark 4.
Our simulation results show that Algorithm 2 converges in a few iterations. Thus, Algorithm 2 does not become the computational bottleneck of the OSKRLSTD-ℓ1 algorithm, and the maximum subiteration number N_max can be set to a small positive integer.
Third, we derive the recursive formulas of A_t^{−1} and α_t. Although the dictionary can be pruned by using Algorithm 2, it still risks growing rapidly if new samples are allowed to be added continually. Thus, a conventional sparsification method is also required here. Similar to Section 3.1, there are two cases under online sparsification. Since A_t and b_t have the same definitions as in the OSKRLSTD-ℓ2 algorithm, we can directly use (20) and (24) for updating A_t^{−1}, and we rewrite (21) and (25) accordingly.

Algorithm 3: OSKRLSTD-ℓ1.
(1) Input: policy π to be evaluated, kernel k(⋅, ⋅), γ, λ1, λ2, l, N_max, ε, v
(2) Take action a_1 given by π, and observe s′_1, r_1
(3) Initialize H_1, D_1, α_1, and A_1^{−1}
(4) Perform Algorithm 2
(5) for t = 2, 3, . . . do
(6) Take action a_t given by π, and observe s′_t, r_t
(7) Update H_t
(8) if s′_t dissatisfies the sparsification condition then
(9) D_t = D_{t−1}
(10) Update α_t, A_t^{−1} by (38) and (20)
(11) Perform Algorithm 2
(12) else
(13) D_t = D_{t−1} ∪ {s′_t}
(14) Update α_t, A_t^{−1} by (39) and (24)
(15) Perform Algorithm 2
(16) end if
(17) s_{t+1} = s′_t
(18) end for
Specifically, if s′_t dissatisfies the sparsification condition, α_t is updated by (38). Otherwise, α_t is updated by (39), where h_t, g_t, c̃_t, and b̃_t are also calculated by (23). Since D_t, A_t^{−1}, and α_t will be pruned by Algorithm 2 after the update, it is important to note that Ã_t^{−1} and α̃_t in (39) denote A_t^{−1} and α_t updated on D_{t−1} but not yet pruned by Ψ_{I_t}(⋅). Likewise, when (24) is used here, Ã_t^{−1} has the same meaning.
Finally, we summarize the whole algorithm in Algorithm 3. For episodic tasks, the modification is the same as in Remark 2. In addition, similar to Remark 3, the sliding-window size l should also be set to a small integer.
Remark 5. By pruning the weakly dependent features, the OSKRLSTD-ℓ1 algorithm can yield a much sparser solution than the OSKRLSTD-ℓ2 algorithm.

Simulations
In this section, we use a nonnoise chain and a noise chain [2, 20, 31] to demonstrate the effectiveness of our proposed algorithms. For comparison purposes, the RLSTD [1] and SKRLSTD [21] algorithms are also tested in the simulations. To analyze the effect of regularization and online pruning on the performance of our algorithms, the OSKRLSTD-ℓ2 algorithm with λ2 = 0 and the OSKRLSTD-ℓ1 algorithm with v = 0 (called OSKRLSTD-0 and OSKRLSTD-ℓ1′, resp.) are tested here, too. In addition, the effects of the sliding-window size l on the performance of our algorithms and OSKRLSTD-ℓ1′ are evaluated as well.

Simulation Settings.
As shown in Figure 1, in both chain problems, each chain consists of 50 states, numbered from 1 to 50. For each state, there are two actions available, that is, "left" (L) and "right" (R). Each action succeeds with probability 0.9, changing the state in the intended direction, and fails with probability 0.1, changing the state in the opposite direction. The two boundaries of each chain are dead-ends, and the discount factor of each chain is set to 0.9. For the nonnoise chain, the reward is 1 only in states 10 and 41, whereas, for the noise chain, the reward is corrupted by an additive Gaussian noise 0.3N(0, 1). Due to the symmetry, the optimal policy for both chains is to go right in states 1-9 and 26-41 and left in states 10-25 and 42-50. Here, we use it as the policy to be evaluated. Note that the state transition probabilities are available only for solving the true state-value functions V^π, and they are assumed to be unknown for all algorithms compared here.

In the implementations of all tested algorithms for both chain problems, the settings are summarized as follows: (i) For all OSKRLSTD algorithms, the Mercer kernel is defined as k(x, y) = exp(−‖x − y‖²/16), the sparsification condition is defined as min_{d_j∈D_{t−1}} ‖s′_t − d_j‖ > 2, and the sliding-window size l is set to 5. Besides, for the OSKRLSTD-ℓ1 algorithm, the regularization parameters λ1 and λ2 are set to 0.8 and 0.3, the maximum subiteration number N_max is set to 10, the precision threshold ε is set to 0.1, and the pruning threshold v is set to 0.4; for the OSKRLSTD-ℓ1′ algorithm, λ1, λ2, N_max, and ε are the same as those in the OSKRLSTD-ℓ1 algorithm; for the OSKRLSTD-ℓ2 algorithm, λ2 is set to 1. (ii) For the SKRLSTD algorithm, the Mercer kernel and the sparsification condition are the same as those in each OSKRLSTD algorithm. (iii) For the RLSTD algorithm, the feature vector φ(s) consists of 19 Gaussian radial basis functions (GRBFs) plus a constant term 1, resulting in a total of 20 basis functions. Each GRBF has the same definition as the Mercer kernel used in each OSKRLSTD algorithm, and the centers of the GRBFs are uniformly distributed over [1, 50]. In addition, the variance matrix P_0 of RLSTD is initialized to 0.4I, where I is the 20 × 20 identity matrix. (iv) In the simulations, each algorithm performs 50 runs, each run includes 100 episodes, and each episode is truncated after 100 time steps. In particular, the SKRLSTD algorithm requires an extra run for offline sparsification before each regular run.
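For reproducibility, the chain environment and its true value function can be sketched as follows. This is our own illustration, not the authors' code; in particular, we assume the reward is received upon visiting states 10 and 41 (the text does not pin down the exact reward timing):

```python
import numpy as np

n, gamma = 50, 0.9
# Evaluated policy: "right" in states 1-9 and 26-41 (1-based),
# "left" elsewhere; 0-based indices are used below.
right = {s for s in range(n) if s <= 8 or 25 <= s <= 40}

# Induced transition matrix: intended move with prob. 0.9,
# opposite move with prob. 0.1, dead-end boundaries clamp the state.
P = np.zeros((n, n))
for s in range(n):
    step = 1 if s in right else -1
    succ = min(max(s + step, 0), n - 1)
    fail = min(max(s - step, 0), n - 1)
    P[s, succ] += 0.9
    P[s, fail] += 0.1

R = np.zeros(n)
R[9] = R[40] = 1.0   # reward 1 in states 10 and 41 (1-based)

# True state values used for the RMSE learning curves
V_true = np.linalg.solve(np.eye(n) - gamma * P, R)
```

Under this assumed reward model, the value function peaks at the two reward states and decays with distance from them, which matches the symmetric shape expected from the chain's construction.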

Simulation Results.
We first report the comparison results of all tested algorithms with the simulation settings described in Section 4.1. Their learning curves are shown in Figure 2. At each episode, the root mean square error (RMSE) of each algorithm is calculated by RMSE = (1/50) Σ_{i=1}^{50} ((1/50) Σ_{s=1}^{50} (V̂_i(s) − V^π(s))²)^{0.5}, where V^π(s) is solved by (1) and V̂_i(s) is the approximate value of the ith run. From Figure 2, we can observe the following: (i) OSKRLSTD-ℓ2 and OSKRLSTD-ℓ1 obtain performance similar to RLSTD and converge much faster than SKRLSTD. (ii) Without regularization, the performance of OSKRLSTD-0 becomes very poor, especially in the noise chain. In contrast, OSKRLSTD-ℓ2 and OSKRLSTD-ℓ1 still perform well. (iii) The performance of OSKRLSTD-ℓ1 differs only slightly from that of OSKRLSTD-ℓ1′, which indicates that online pruning has little effect on the convergence performance. Figure 3 illustrates V̂^π(s) approximated by all tested algorithms at the final episode. Clearly, OSKRLSTD-0 has lost the ability to approximate V^π(s) of the noise chain. Figure 4 shows the dictionary growth curves of all tested algorithms. Compared with RLSTD and SKRLSTD, all OSKRLSTD algorithms can construct the dictionary automatically, and OSKRLSTD-ℓ1 yields a much sparser dictionary. Figure 5 shows the average number of subiterations per time step in OSKRLSTD-ℓ1 and OSKRLSTD-ℓ1′. As episodes increase, the number of subiterations declines gradually. In addition, online pruning reduces the subiterations significantly. Even in the noise chain, the subiteration counts are small. Finally, the main simulation results of all tested algorithms at the final episode are summarized in Table 1.
Next, we evaluate the effect of the sliding-window size l on our proposed algorithms and OSKRLSTD-ℓ1′ with l = 1, 5, 10, . . . , 45, 50. The logarithmic RMSEs of each algorithm at the final episode are illustrated in Figure 6. Note that the parameter settings of these algorithms are the same as those described in Section 4.1 except for l. From Figure 6, OSKRLSTD-ℓ1 and OSKRLSTD-ℓ1′ obviously become worse rather than better as the window size increases, whereas OSKRLSTD-ℓ2 shows a strong adaptability to different window sizes. The reason for this result is analyzed as follows: From the derivation of our algorithms, the influence of the window size is mainly manifested in A_t^{−1}. Since A_t^{−1} is calculated by a recursive update instead of matrix inversion and samples are used one by one, using too many history samples together may increase the calculation error. In OSKRLSTD-ℓ2, a moderate regularization parameter λ2 can relieve the influence of this error. In contrast, in OSKRLSTD-ℓ1 and OSKRLSTD-ℓ1′, the subiteration may amplify this influence. Especially for OSKRLSTD-ℓ1, online pruning can introduce new error, which further worsens the convergence performance. To verify the above analysis, we reset λ1 = 0.6, λ2 = 0.3, and ε = 1 for OSKRLSTD-ℓ1 and OSKRLSTD-ℓ1′ and reevaluate the effect of the window size. The new results are illustrated in Figure 7. As expected, OSKRLSTD-ℓ1 and OSKRLSTD-ℓ1′ can then also adapt to l. Nevertheless, there is still no evidence that a big window size helps improve the convergence performance of OSKRLSTD-ℓ2 and OSKRLSTD-ℓ1. Thus, as stated in Remark 3, l is suggested to be set to a small integer in practice.

Conclusion
As an important approach for policy evaluation, LSTD algorithms can use samples more efficiently and eliminate all step-size parameters. But they require users to design the feature vector manually and often require many features to approximate state-value functions. Recently, there have been some works on these issues that combine LSTD with sparse kernel methods. However, these works do not consider regularization, and their sparsification processes are batch or offline. In this paper, we propose two online sparse kernel recursive least-squares TD algorithms with ℓ2 and ℓ1 regularization, that is, OSKRLSTD-ℓ2 and OSKRLSTD-ℓ1. By using the Bellman operator along with the projection operator, our derivation is simpler. By combining online sparsification, ℓ2 and ℓ1 regularization, recursive least squares, a sliding window, and the fixed-point subiteration with online pruning, our algorithms can construct the feature dictionary online, avoid overfitting, and keep the computational cost low. These advantages make them more suitable for online RL problems with a large or continuous state space. In particular, compared with the OSKRLSTD-ℓ2 algorithm, the OSKRLSTD-ℓ1 algorithm can yield a much sparser dictionary. Finally, we illustrate the performance of our algorithms and compare them with the RLSTD and SKRLSTD algorithms in several simulations.
There are also some interesting topics to be studied in future work: (i) How to select proper regularization parameters should be investigated. (ii) A more thorough simulation analysis is needed, including an extension of our algorithms to learning control problems. (iii) Eligibility traces could be incorporated to further improve the performance of our algorithms. (iv) The convergence and prediction error bounds of our algorithms should be analyzed theoretically.