Kernel Temporal Differences for Neural Decoding

We study the feasibility and capability of the kernel temporal difference (KTD)(λ) algorithm for neural decoding. KTD(λ) is an online, kernel-based learning algorithm, which has been introduced to estimate value functions in reinforcement learning. This algorithm combines kernel-based representations with the temporal difference approach to learning. One of our key observations is that by using strictly positive definite kernels, algorithm's convergence can be guaranteed for policy evaluation. The algorithm's nonlinear functional approximation capabilities are shown in both simulations of policy evaluation and neural decoding problems (policy improvement). KTD can handle high-dimensional neural states containing spatial-temporal information at a reasonable computational complexity allowing real-time applications. When the algorithm seeks a proper mapping between a monkey's neural states and desired positions of a computer cursor or a robot arm, in both open-loop and closed-loop experiments, it can effectively learn the neural state to action mapping. Finally, a visualization of the coadaptation process between the decoder and the subject shows the algorithm's capabilities in reinforcement learning brain machine interfaces.


Introduction
Research in brain machine interfaces (BMIs) is a multidisciplinary effort involving fields such as neurophysiology and engineering. Developments in this area have a wide range of applications, especially for subjects with neuromuscular disabilities, for whom BMIs may become a significant aid. Neural decoding of motor signals is one of the main tasks that needs to be executed by the BMI.
Ideas from system theory can be used to frame the decoding problem. Bypassing the body can be achieved by modelling the transfer function from brain activity to limb movement and utilizing the output of the properly trained model to control a robotic device to implement the intention of movement. The design of neural decoding systems has been approached using machine learning methods. In order to choose the appropriate learning method, factors such as learning speed and stability help in determining the usefulness of a particular method.
Reinforcement learning brain machine interfaces (RLBMI) [1] have been shown to be a promising avenue for practical implementations. Fast adaptation under changing environments and neural decoding capability of an agent have been shown in [2,3] using the actor-critic paradigm. Adaptive classification of event-related potential (ERP) in electroencephalography (EEG) using RL in BMI was proposed in [4]. Moreover, partially observable Markov decision processes (POMDPs) have been applied in the agent to account for the uncertainty when decoding noisy brain signals [5]. In a RLBMI, a computer agent and a user in the environment cooperate and learn coadaptively. The decoder learns how to correctly translate neural states into action direction pairs that indicate the subject's intent. In the agent, the proper neural decoding of the motor signals is essential to control an external device that interacts with the physical environment.
However, to realize the advantages of RLBMIs in practice, there are several challenges that need to be addressed.

Computational Intelligence and Neuroscience
The neural decoder must be able to handle high-dimensional neural states containing spatial-temporal information. The mapping from neural states to actions must be flexible enough to avoid making strong assumptions. Moreover, the computational complexity of the decoder should be reasonable such that real time implementations are feasible.
Temporal difference learning provides an efficient learning procedure that can be applied to reinforcement learning problems. In particular, TD( ) [6] can be applied to approximate a value function that is utilized to compute an approximate solution to Bellman's equation. The algorithm allows incremental computation directly from new experience without having an associated model of the environment. This provides a means to efficiently handle high-dimensional states and actions by using an adaptive technique for function approximation that can be trained directly from the data. Also, because TD learning allows system updates directly from the sequence of states, online learning is possible without having a desired signal at all times.
Note that TD( ) and its variants (least squares TD( ) [7], recursive least squares TD [8], incremental least squares TD( ) [9], Gradient TD [10], and linear TD with gradient correction [11]) have been mostly treated in the context of parametric linear function approximation. This can become a limiting factor in practical applications where little prior knowledge can be incorporated. Therefore, here, our interest focuses on a more general class of models with nonlinear capabilities. In particular, we adopt a kernel-based function approximation methodology.
Kernel methods are an appealing choice due to their elegant way of dealing with nonlinear function approximation problems. Unlike most of the nonlinear variants of TD algorithms which are prone to fall into local minima [12,13], the kernel based algorithms have nonlinear approximation capabilities yet the cost function can be convex [14]. One of the major appeals of kernel methods is the ability to handle nonlinear operations on the data by an implicit mapping to the so called feature space (reproducing kernel Hilbert space (RKHS)) which is endowed with an inner product. A linear operation in the RKHS corresponds to a nonlinear operation in the input space. In addition, algorithms based on kernel methods are still reasonably easy to compute based on the kernel trick [14].
Temporal difference algorithms based on kernel expansions have shown superior performance in nonlinear approximation problems. The close relation between Gaussian processes and kernel recursive least squares was exploited in [15] to provide a Bayesian framework for temporal difference learning. Similar work using kernel-based least squares temporal difference learning with eligibilities (KLSTD( )) was introduced in [16]. Unlike the Gaussian process temporal difference algorithm (GPTD), KLSTD( ) is not a probabilistic approach. The idea in KLSTD is to extend LSTD( ) [7] using the concept of duality. However, the computational complexity of KLSTD( ) per time update is ( 3 ) which precludes its use for online learning.
An online kernel temporal difference learning algorithm called kernel temporal differences (KTD)( ) was proposed in [17]. By using stochastic gradient updates, KTD( ) reduces the computational complexity from ( 3 ) to ( 2 ). This reduction along with other capacity control mechanisms such as sparsification make real time implementations of KTD( ) feasible.
Even though nonparametric techniques are inherently of growing structure, these techniques produce better solutions than any other simple linear function approximation methods. This has motivated work on methods that help overcome scalability issues such as the growing filter size [18]. In the context of kernel based TD algorithms, sparsification methods such as approximate linear dependence (ALD) [19] have been applied to GPTD [15] and KLSTD [20]. A Quantization approach proposed in [21] has been used in KTD( ) [22]. In a similar flavor, the kernel distance based online sparsification method was proposed for a KTD algorithm in [23]. Note that ALD is ( 2 ) complexity, whereas quantization and kernel distances are ( ). The main difference between the quantization approach and the kernel distance is the space where the distances are computed. Quantization approach uses criterion of input space distances whereas kernel distance computes them in the RKHS associated with the kernel [23].
In this paper, we investigate kernel temporal differences (KTD)( ) [17] for neural decoding. We first show the advantages of using kernel methods. Namely, we show that convergence results of TD( ) in policy evaluation carry over KTD( ) when the kernel is strictly positive definite. Examples of the algorithm's capability for nonlinear function approximation are also presented. We apply the KTD algorithm to neural decoding in open-loop and closed-loop RLBMI experiments where the algorithm's ability to find proper neural state to action map is verified. In addition, the trade off between the value function estimation accuracy and computation complexity under growing filter size is studied. Finally, we provide visualizations of the coadaptation between the decoder and the subject highlighting the usefulness of KTD( ) for reinforcement learning brain machine interfaces. This paper starts with a general background on reinforcement learning which is given in Section 2. Section 3 introduces the KTD algorithm and provides its convergence properties for policy evaluation. This algorithm is extended in Section 4 to policy improvement using -learning. Section 5 introduces some of the kernel sparsification methods for the KTD algorithm that address the naturally growing structure of kernel adaptive algorithms. Section 6 shows empirical results on simulations for policy evaluation, and Section 7 presents experimental results and comparisons to other methods in neural decoding using real data sets for both, open-loop and closed-loop RLBMI frameworks. Conclusions are provided in Section 8.

Reinforcement Learning Brain Machine Interfaces and Value Functions
In reinforcement learning brain machine interfaces (RLBMI), a neural decoder interacts with environment over time and adjusts its behavior to improve performance [1].  Figure 1: The decoding structure of reinforcment learning model in a brain machine interface using a -learning based function approximation algorithm.
The controller in the BMI can be considered as a neural decoder, and the environment includes the BMI user ( Figure 1). Assuming the environment is a stochastic and stationary process that satisfies the Markov condition, it is possible to model the interaction between the learning agent and the environment as a Markov decision process (MDP). For the sake of simplicity, we assume the states and actions are discrete, but they can also be continuous.
At time step , the decoder receives the representation of the user's neural state ( ) ∈ X as input. According to this input, the decoder selects an action ( ) ∈ A which causes the state of the external device to change, namely, the position of a cursor on a screen or a robot's arm position. Based on the updated position, the agent receives a reward ( + 1) ∈ R. At the same time, the updated position of the actuator will influence the user's subsequent neural states; that is, going from ( ) to ( + 1) because of the visual feedback involved in the process. The new state ( + 1) follows the state transition probability P given the action ( ) and the current state ( ). At the new state ( + 1), the process repeats; the decoder takes an action ( + 1), and this will result in a reward ( + 2) and a state transition from ( + 1) to ( + 2). This process continues either indefinitely or until a terminal state is reached depending on the process.
Note that the user has no direct access to actions, and the decoder must interpret the user's brain activity correctly to facilitate the rewards. Also, both systems act symbiotically by sharing the external device to complete their tasks. Through iterations, both systems learn how to earn rewards based on their joint behavior. This is how the two intelligent systems (the decoder and the user) learn coadaptively, and the closed loop feedback is created. This coadaptation allows for continuous synergistic adaptation between the BMI decoder and the user even in changing environments [1].
The value function is a measure of long-term performance of an agent following a policy starting from a state ( ). The state value function is defined as and action value function is given by where R( ) is known as the return. Here, we apply a common choice for the return, the infinite-horizon discounted model that takes into account the rewards in the long run but weighs them with a discount factor to prevent the function from growing unbounded as → ∞ and provides mathematical tractability [24]. Note that our goal is to find a policy : X → A which maps a state ( ) to an action ( ). Estimating the value function is an essential step towards finding a proper policy.

Kernel Temporal Difference( )
In this section, we provide a brief introduction to kernel methods followed by the derivation of the KTD algorithm [17,22]. One of the contributions of the present work is the convergence analysis of KTD( ) presented at the end of this section.
The inner product in the high-dimensional feature space can be calculated by evaluating the kernel function in the input space. Here, H is called a reproducing kernel Hilbert space (RKHS), for which the following property holds, The mapping implied by the use of the kernel function can also be understood through Mercer's Theorem [26]. The implicit map allows one to transform conventional linear algorithms in the feature space to nonlinear systems in the input space, and the kernel function provides an implicit way to compute inner products in the RKHS without explicitly dealing with the high-dimensional space.
In supervised learning, by treating the observed input sequence and the desired prediction as a sequence of pairs ( (1), ), ( (2), ), . . . , ( ( ), ) and making ≜ ( +1), we can obtain the updates of function after the whole sequence of inputs has been observed as Here, Δ = [ − ⟨ , ( ( ))⟩] ( ( )) are the instantaneous updates of the function from input data based on the kernel expansion (5).
The key observation to extend the supervised learning approach to the TD method is that the difference between desired and predicted output at time can be written as where ( + 1) ≜ . Using this expansion in terms of the differences between sequential predictions, we can update the system at each time step. By replacing the error − ( ( )) in (7) using the relation with temporal differences (8) and rearranging the equation as in [6], we obtain the following update: In this case, all predictions are used equally. Using exponential weighting on recency yields the following update rule: Here, represent an eligibility trace rate that is added to the averaging process over temporal differences to emphasize on the most recently observed states and to efficiently deal with delayed rewards. The above update rule (10) is called kernel temporal difference (KTD)( ) [17]. The difference between predictions of sequential inputs is called temporal difference (TD) error, TD ( ) = ( ( + 1)) − ( ( )) .
In reinforcement learning, the prediction ( ) = ( ( )) can be considered as the value function (1) or (2). This is how the KTD algorithm provides a nonlinear function approximation to Bellman's equation. When the prediction ( ) represents the state value function, the TD error (11) is extended to the combination of a reward and sequential value function predictions. For instance, in the case of policy evaluation, the TD error is defined as TD ( ) = ( + 1) + ( ( + 1)) − ( ( )) .
3.3. Convergence of Kernel Temporal Difference( ). It has been shown in [6,27] that for an absorbing Markov chain, TD( ) converges with probability 1 under certain conditions. Recall that the conventional TD algorithm assumes the function class to be linearly parametrized satisfying = ⊤ . KTD( ) can be viewed as a linear function approximation in the RKHS. Using this relation, convergence of KTD( ) can be obtained as an extension of the convergence guarantees already established for TD( ). When = 1, by definition, the KTD( = 1) procedure is equivalent to the supervised learning method (7). KTD(1) yields the same per-sequence weight changes as the least square solution since (9) is derived directly from supervised learning by replacing the error term in (8). Thus, the convergence of KTD(1) can be established based on the convergence of its equivalent supervised learning formulation, which was proven in [25]. Proof. Since by (8) the sequence of TD errors can be replaced by a multistep prediction with error ( ) = − ( ), the result of Proposition 1 also applies to this case.
In the case of < 1, as shown by [27], the convergence of linear TD( ) can be proved based on the ordinary differential equation (ODE) method introduced in [28]. This result can be easily extended to KTD( ) as follows. Let us consider the Markov estimation problem as in [6]. An absorbing Markov chain can be described by the terminal and nonterminal sets of states T and N, transition probabilities between nonterminal states, the transition probabilities from nonterminal states to terminal states, the vectors representing the nonterminal states, the expected terminal returns from the th terminal state, and the probabilities of starting at state . Given an initial state ∈ N, an absorbing Markov chain generates an observation sequence of vectors 1 , 2 , . . . , , where the last element of the sequence corresponds to a terminal state ∈ T. The expected outcome given a sequence starting at ∈ N is given by Computational Intelligence and Neuroscience where [ ] denotes the th element of the array , is the transition matrix with entries [ ] = for , ∈ N, and [ℎ] = ∑ ∈T for ∈ N. In linear TD( ), a sequence of vectors 1 , 2 , . . ., is generated. Each one of these vectors is generated after having a complete observation sequence; that is, a sequence staring at state ∈ N and ending at state ∈ T with the respective return . Similar to linear TD( ), in KTD( ) we have a sequence of functions 1 , 2 , . . ., (vectors in a RKHS) for which we can also write a linear update of the mean estimates of terminal return after sequences have been observed. If is the actual function estimate after sequence , and +1 is the expected function estimate after the next sequence, we have that where Analogously to [27], the mean estimates in (16) converge appropriately if H has a full set of eigenvalues with negative real parts, for which we need K to be full rank. For the above to be true, it is required the set of vectors { ( )} ∈N to be linearly independent in the RKHS. This is exactly the case when the kernel is strictly positive definite as shown in the following proposition.
Proof. If is strictly positive definite, then ∑ ( , ) > 0 for any set where ̸ = , for all ̸ = , and any ∈ R such that not all = 0. Suppose there exists a set { } for which { ( )} are not linearly independent. Then, there must be a set of coefficients ∈ R not all equal to zero such that which contradicts the assumption.
The following Theorem is the resulting extension of Theorem in [27] to KTD( ).

-Learning via Kernel Temporal Differences( )
Since the value function represents the expected cumulative rewards given a policy, the policy is better than the policy when the policy gives greater expected return than the policy . In other words, ≥ if and only if ( , ) ≥ ( , ) for all ∈ X and ∈ A. Therefore, the optimal action value function can be written as * ( ( ), ( )) = max ( ( ), ( )). The estimation can be done online. To maximize the expected reward [ ( + 1) | ( ), ( ), ( + 1)], one-step -learning update was introduced in [29], At time , an action ( ) can be selected using methods such as -greedy or the Boltzmann distribution, which are popular for exploration and exploitation trade-off [30].
When we consider the prediction as action value function with respect to a policy , KTD( ) can approximate the value functioñusing a family of functions of the form Here,̃( ( ), = ) denotes a state-action value given a state ( ) at time and a discrete action . Therefore, the update rule for -learning via kernel temporal difference ( -KTD)( ) can be written as We can see that the temporal difference (TD) error at time includes reward and action value function terms. For singlestep prediction problems ( = 1), (10) yields single updates for -KTD( ) of the form: Here, ( ( )) = ( ( ), = ) and TD ( ) denotes the TD error defined as TD ( ) = + ( ( + 1)) − ( ( )), and ( ) is an indicator vector of size determined by the number  of outputs (actions). Only the th entry of the vector is set to 1 and the other entries are set to 0. The selection of the action unit at time can be based on a greedy method. Therefore, only the weight (parameter vector) corresponding to the winning action gets updated. Recall that the reward corresponds to the action selected by the current policy with input ( ) because it is assumed that this action causes the next input state ( + 1).
The structure of -learning via KTD(0) is shown in Figure 2. The number of units (kernel evaluations) increases as more input data arrives. Each added unit is centered at the previous input locations (1), (2), . . . , ( − 1).
In the reinforcement learning brain machine interface (RLBMI) paradigm, kernel temporal difference( ) helps model the agent (see Figure 1). The action value function can be approximated using KTD( ), for which the kernel based representations enhance the functional mapping capabilities of the system. Based on the estimated values, a policy decides a proper action. Note that the policy corresponds to the learning policy which changes over time in -learning.

Online Sparsification
One characteristic of nonparametric approaches is their inherently growing structure which is usually linear in the number of input data points. This rate of growth becomes prohibitive for practical applications that handle increasing amounts of incoming data over time. Various methods have been proposed to alleviate this problem (see [31] and references therein). These methods, known as kernel sparsification methods, can be applied to the KTD algorithm to control the growth of the terms in the function expansion, also known as filter size. Popular examples of kernel sparsification methods are the approximate linear dependence (ALD) [19], Surprise criterion [32], Quantization approach [21], and the kernel distance based method [23]. The main idea of sparsification is to only consider a reduced set of samples, called the dictionary, to represent the function of interest. The computational complexity of ALD is ( 2 ), where is the size of the dictionary. For the other methods mentioned above, the complexity is ( ). Each of these methods has its own criterion to determine whether an incoming sample should be added to the current dictionary. The Surprise criterion [32] measures the subjective information of exemplar { , } with respect to a learning system Γ: Only samples with high values of Surprise are considered as candidates for the dictionary. In the case of the Quantization approach introduced in [21], the distance between a new input ( ) and the existing dictionary elements ( − 1) is evaluated. The new input sample is added to the dictionary if the distance between the new input ( ) and the closest element in ( − 1), is larger than the Quantization size . Otherwise, the new input state ( ) is absorbed by the closest existing unit. Very similar to the quantization approach, the method presented in [23] applies a distance threshold criterion in the RKHS. The kernel distance based criterion given a state dictionary Computational Intelligence and Neuroscience 7 ( − 1) adds a new unit when the new input state ( ) satisfies following condition; For some kernels such as Gaussian, the Quantization method and the kernel distance based criterion can be shown to be equivalent.

Simulations
Note that the KTD algorithm has been introduced for value function estimation. To evaluate the algorithm's nonlinear capability, we first examine the performance of the KTD( ) in the problem of state value function estimatioñgiven a fixed policy . We carry out experiments on a simple illustrative Markov chain initially described in [33]. This is a popular experiment involving an episodic task to test TD learning algorithms. The experiment is useful in illustrating linear as well as nonlinear functions of the state representations and shows how the state value function is estimated using the adaptive system. To assess the performance, the updated estimate of the state value functioñ( ) is compared to the optimal value function * at the end of each trial. This is done by computing the RMS error of the value function over all states where is the number of states, = 13.
Stepsize scheduling is applied as follows: where 0 is the initial stepsize, and 0 is the annealing factor which controls how fast the stepsize decreases. In this experiment, 0 = 100 is applied. Furthermore, we assume that the policy is guaranteed to terminate, which means that the value function is well-behaved without using a discount factor in (3), that is, = 1.
In KTD( ), we employ the Gaussian kernel: which is a universal kernel commonly encountered in practice. To find the optimal kernel size, we fix all the other free parameters around median values, = 0.4 and 0 = 0.5, and the average RMS error over 10 Monte Carlo runs is compared. For this specific experiment, smaller kernel sizes yield better performance since the state representations are finite. However, in general, applying too small kernel sizes leads to over-fitting and slow learning. In particular, choosing a very small kernel makes the algorithm behave very similar to the table look up method. Thus, we choose the kernel size ℎ = 0.2 to be the largest kernel size for which we obtain similar mean RMS values as for smaller kernel sizes. After fixing the kernel size to ℎ = 0.2, the experimental evaluation of different combinations of eligibility trace rates and initial step sizes 0 are observed. Figure 4 shows the average performance over 10 Monte Carlo runs for 1000 trials.
All values with optimal stepsize show good approximation to * after 1000 trials. Notice that KTD( = 0) shows slightly better performance than KTD( = 1). This may be attributed to the local nature of KTD when using the Gaussian kernel. In addition, varying the stepsize has a relatively small effect on KTD( ). The Gaussian kernel as well as other shiftinvariant kernels provide an implicit normalized update rule which is known to be less sensitive to stepsize. Based on Figure 4, the optimal eligibility trace rate and initial stepsize value = 0.6 and 0 = 0.3 are selected for KTD with kernel size ℎ = 0.2.
The learning curve of KTD( ) is compared to the conventional TD algorithm, TD( ). The optimal parameters employed in both algorithms are based on the experimental evaluation. In TD( ), = 1 and 0 = 0.1 are applied. The RMS error is averaged over 50 Monte Carlo runs for 1000 trials. Comparative learning curves are given in Figure 5.
In this experiment, we confirm the ability of KTD( ) to handle the function approximation problem when the fixed policy yields a state value function that is linear in the state representation. Both algorithms reach the mean RMS value of around 0.06. As we expected, TD( ) converges faster to the optimal solution because of the linear nature of the problem. KTD( ) converges slower than TD( ), but it is also able to approximate the value function properly. In this sense,    Figure 6).
Again, to evaluate the performance, after each trial is completed, the estimated state valuẽis compared to the optimal state value * using RMS error (26). For KTD( ), the Gaussian kernel (28) is applied and kernel size ℎ = 0.2 is chosen. Figure 7 shows the average RMS error over 10 Monte Carlo runs for 1000 trials.
The combination of = 0.4 and 0 = 0.3 shows the best performance, but the = 0 case also shows good performances. Unlike TD( ) [6], there is no dominant value for in KTD( ). Recall that it has been proved that convergence is guaranteed for linearly independent representations of the states, which is automatically fulfilled in KTD( ) when the kernel is strictly positive definite. Therefore, the differences are rather due to the convergence speed controlled by the interaction between the step size and the elegibilty trace.
The average RMS error over 50 Monte Carlo runs is compared with Gaussian process temporal difference (GPTD) [15] and TD( ) in Figure 8. The purpose of GPTD implementation is to have comparison among kernelized value function approximations. Here, the applied optimal parameters for KTD( ) are = 0.4, 0 = 0.3, and ℎ = 0.2, for GPTD, = 1, 2 = 0.5, and ℎ = 0.2, and for TD( ), = 0.8 and 0 = 0.1. The linear function approximation, TD( ) (blue line), cannot estimate the optimal state values. KTD( ) outperforms the linear algorithm as expected since the Gaussian kernel is strictly positive definite. GPTD also learns the target state values, but the system fails to reach as low error values as KTD. GPTD is sensitive to the selection of the covariance value in the noise, 2 ; if the value is small, the system becomes unstable, and larger values cause the the learning to slow down. GPTD models the residuals, the difference between Computational Intelligence and Neuroscience  expected return and actual return, as a Gaussian process. This assumption does not hold true for the Markov chain in Figure 6. As we can observe in Figure 8, KTD( ) reaches to the mean value around 0.07, and the mean value of GPTD and TD( ) are around 0.2 and 1.8, respectively.
In the synthetic examples, we presented experimental results to approximate the state value function under a fixed policy. We observed that KTD( ) performs well on both linear and nonlinear function approximation problems. In addition, in the previous section, we showed how the linear independence of the input state representations can affect the performance of algorithms. The use of strictly positive definite kernels in KTD( ) implies the linear independence condition, and thus this algorithm converges for all ∈ [0, 1]. In the following section, we will apply the extended KTD algorithm to estimate the action value function which can be employed in finding a proper control policy for RLBMI tasks.

Experimental Results on Neural Decoding
In our RLBMI experiments, we map the monkey's neural signal to action-directions (computer cursor/robot arm position). The agent starts at a naive state, but the subject has been trained to receive rewards from the environment. Once it reaches the assigned target, the system and the subject earn a reward, and the agent updates its neural state decoder. Through iteration, the agent learns how to correctly translate neural states into action-directions.

Open-Loop RLBMI.
In open-loop RLBMI experiments, the output of the agent does not directly change the state of the environment because this is done with prerecorded data. The external device is updated based only on the actual monkey's physical response. In this sense, we only consider the monkey's neural state from successful trials to train the agent. The goal of these experiments is to evaluate the system's capability to predict the proper state to action mapping based on the monkey's neural states and to assess the viability of further closed-loop experiments. From 96-channel recordings, a set of 185 units are obtained after sorting. The neural states are represented by the firing rates of each unit on 100 ms window. There is a set of 8 possible targets and action directions. Every trial starts at the center point, and the distance from the center to each target is 4 cm; anything within a radius of 1 cm from the target point is considered as a valid reach.

Agent.
In the agent, -learning via kernel temporal difference ( -KTD)( ) is applied to neural decoding. For -KTD( ), we employ the Gaussian kernel (28). After the neural states are preprocessed by normalizing their dynamic range to lie between −1 and 1, they are input to the system. Based on the preprocessed neural states, the system predicts which direction the computer cursor will move. Each output unit represents one of the 8 possible directions, and among the 8 outputs one action is selected by the -greedy method [34].
The action corresponding to the unit with the highest value gets selected with probability 1 − . Otherwise, any other action is selected at random. The performance is evaluated by checking whether the updated position reaches the assigned target, and depending on the updated position, a reward value is assigned to the system.

Results on Single
Step Tasks. Here, the targets should be reached within a single step; rewards from the environment are received after a single step, and one action is performed by the agent per trial. The assignment of reward is based on the 1-0 distance to the target, that is, dist( , ) = 0 if = , and dist( , ) = 1, otherwise. Once the cursor reaches the assigned target, the agent gets a positive reward +0.6, else it receives negative reward −0.6 [35]. Exploration rate = 0.01 and discount factor = 0.9 are applied. Also, we consider = 0 since our experiment performs single step updates per trial. In this experiment, the firing rates of the 185 units on 100 ms windows are time-embedded using 6th order tap delay. This creates a representation space where each state is a vector with 1295 dimensions. We start with the simplest version of the problem by considering only 2-targets (right and left). The total number of trials is 43 for the 2-targets. For -KTD, the kernel size ℎ is heuristically chosen based on the distribution of the mean squared distances between pairs of input states; let = [‖ − ‖ 2 ], then ℎ = √ /2. For this particular data set, the above heuristic gives a kernel size ℎ = 7. The stepsize = 0.3 is selected based on the stability bound that was derived for the kernel least mean square algorithm [25], where is the gram matrix. After 43 trials, we count the number of trials which received a positive reward, and the success rate is averaged over 50 Monte Carlo runs. The performance of the -KTD algorithm is compared withlearning via time delayed neural net ( -TDNN) and the online selective kernel-based temporal difference learning algorithm ( -OSKTD) [23] in Figure 9. Note that TDNN is a conventional approach to function approximation and has already been applied to RLBMI experiments for neural decoding [1,2]. OSKTD is a kernel-based temporal difference algorithm emphasizing on the online sparsifications.
Both -KTD and -OSKTD reach around 100% success rate after 2 epochs. In contrast, the average success rate of -TDNN slowly increases yet never reaches the same performance as -KTD. In the case of -OSKTD, the value function updates require one more parameter 2 to decide the subspace. To validate the algorithm's capability to estimate proper policy, we set the sparsified dictionary as the same size as the number of sample observations. In -OSKTD, we observed that the subspace selection parameter plays an important role in terms of the speed of learning. It turns out that for the above experiment, smaller subspaces allow faster learning. In the extreme case of -OSKTD, where only the current state is affected, the updates become equivalent to the update rule of -KTD.
Since all the experimental parameters are fixed over 50 Monte Carlo runs, the confidence interval for -KTD can be simply associated with the random effects introduced by the -greedy method employed for action selection with exploration, thus, the narrow interval. However, with -TDNN a larger variation of performance is observed, which shows how the initialization, due to local minima, influences the success of learning; it is observed that -TDNN is able to approximate the -KTD performance, but most of the times, the system falls on local minima. This highlights one of the advantages of KTD compared to TDNN, which is the insensitivity to initialization. Table 1 shows average success rates over 50 Monte Carlo runs with respect to different number of targets. The first Computational Intelligence and Neuroscience row corresponds to the mean success rates displayed on Figure 9 (red solid line). This is included in the Table 1  One characteristic of nonparametric approaches is the growing filter structure. Here, we observe how filter size influences the overall performance in -KTD by applying Surprise criterion [32] and Quantization [21] methods. In the case of the 2-target center-out reaching task, we should expect the filter size to become as large as 861 units after 20 epochs without any control of the filter size. Using the Surprise criterion, the filter size can be reduced to 87 centers with acceptable performance. However, Quantization allows the filter size to be reduced to 10 units while maintaining performance above 90% for success rates. Figure 10 shows the effect of filter size in the 2-target experiment using the Quantization approach. For filter sizes as small as 10 units, the average success rates remain stable. With 10 units, the algorithm shows similar learning speed to the linearly growing filter size, with success rates above 90%. Note that quantization limits the capacity of the kernel filter since less units than samples are employed and thus it helps to avoid over-fitting.
In the 2-target center-out reaching task, quantized -KTD shows satisfactory results in terms of initialization and computational cost. Further analysis of -KTD is conducted on a larger number of targets. We increase the number of targets from 2 to 8. All experimental parameters are kept the same as for the 2-target experiment. The only change is step-size = 0.5. The 178 trials are applied for the 8-target reaching task.
To gain more insight about the algorithm, we observe the interplay between Quantization size and kernel size ℎ. Based on the distribution of squared distances between pairs and Quantization sizes ( = 1, 110, 120, 130) are considered. The corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in Figure 11.
With a final filter size of 178 (blue line), the success rates are superior to any other filter sizes for every kernel sizes tested, since it contains all input information. Especially for small kernel sizes (ℎ ≤ 2), success rates above 96% are observed. Moreover, note that even after reduction of the state information (red line), the system still produces acceptable success rates for kernel sizes ranging from 0.5 to 2 (around 90% success rates).
Among the best performing kernel sizes, we favor the largest one since it provides better generalization guarantees. In this sense, a kernel size ℎ = 2 can be selected since this is the largest kernel size that considerably reduces the filter size and yields a neural state to action mapping that performs well (around 90% of success rates). In the case of kernel size ℎ = 2 with final filter size of 178, the system reaches 100% success rates after 6 epochs with a maximum variance of 4%. As we can see from the number of units, higher representation capacity is required to obtain the desired performance as the task becomes more complex. Nevertheless, results on the 8target center-out reaching task show that the method can effectively learn the brain state-action mapping for this task with a reasonable complexity.

Results on Multistep Tasks.
Here, we develop a more realistic scenario; we extend the task to multistep and multitarget experiments. This case allows us to explore the role of the eligibility traces in -KTD( ). The price paid for this extension is that now, the eligibility trace rate selection needs to be carried out according to the best observed performance. Testing based on the same experimental set up employed for the single step task, that is, a discrete reward value is assigned at the target, causes extremely slow learning since not enough guidance is given. The system requires long periods of exploration until it actually reaches the target. Therefore, we employ a continuous reward distribution around the selected target defined by the following expression: where ( ) = exp[( − ) ⊤ C −1 ( − )], ∈ R 2 is the position of the cursor, reward = 1, and reward = −0.6. The mean vector corresponds to the selected target location and the covariance matrix, which depends on the angle of the selected target as follows: for target index one and five, the angle is 0, two and six are for − /4, three and seven are for /2, and four and eight are for /4. (Here, the target indexes follow the location depicted on Figure 6 in [22].) Figure 12 shows the reward distribution for target index one. The same form of distribution is applied to the other directions centred at the assigned target point.
Once the system reaches the assigned target, the system earns a maximum reward of +1 and receives partial rewards according to (30) during the approaching stage. When the system earns the maximum reward, the trial is classified as a successful trial. The maximum number of steps per trial is limited such that the cursor must approach the target in a straight line trajectory. Here, we also control the complexity of the task by allowing different number of targets and steps. Namely, 2-step 4-target (right, up, left, and down) and 4-step 3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps according to each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.
Larger number of steps results in lower success rates. However, the two cases (two and four steps) obtain an average success rate above 60% for 1 epoch. The performances show all directions can achieve success rates above 70% after convergence, which encourage the application of the algorithm to online scenarios.

Closed-Loop RLBMI Experiments.
In closed loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's action direction by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (food reward) and the decoder (positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each one of the systems of the RLBMI in the overall performance.

Environment.
During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations denoted as A trial and B trial. The monkey was taught to associate changes in motor activity during A trials and produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700∼1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial: red light for left direction or B trial: green light for right direction). The robot arm goes to a home position, namely, the center position between the two targets. Its gripper shows an object (food reward such as waxworm or marshmallow for A trial and undesirable object (wooden bead) for B trial). To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms, and to make the robot reach to the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink and the monkey would receive a food reward (for both the A and B targets).
After the monkey is trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode arrays, Tucker Davis Technologies, FL) is surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded. These neural states become the inputs to the neural decoder. All surgical and animal care procedures were In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states. Differently from the user pretraining sessions, the monkey is not required to perform any movement. During the realtime experiment, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates on a 2 sec window following the go signal.

Agent.
For the BMI decoder, we use -learning via kernel Temporal Differences ( -KTD)( ). One big difference between open-loop and closed-loop applications is the amount of accessible data; in the closed-loop experiments, we can only get information about the neural states that have been observed up to the present. However, in the previous offline experiments, normalization and kernel selection were conducted offline based on the entire data set. It is not possible to apply the same method to the online setting since we only have information about the input states up to the present time. Normalization is a scaling procedure that interplays with the choice of the kernel size. Proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is automatically selected given the inputs.
The Gaussian kernel (28) is employed, and the kernel size ℎ is automatically selected based on the history of inputs. Note that in the closed-loop experiments, the dynamic range of states varies from experiment to experiment. Consequently, the kernel size needs to be re-adjusted each time a new experiment takes place, and it cannot be determined beforehand. At each time, the distances between the current state and the previously observed states are computed to obtain the output values,̃in this case. Therefore, we use the distance values to select the kernel size as follows: Using the squared distance between pairs of previously seen input states, we can obtain an estimate of the mean distance. This value is also averaged along with past kernel sizes to obtain the current kernel size. Moreover, we consider = 1 and = 0 since our experiments perform single step trials. Stepsize = 0.5 is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output from the decoder.

Results.
The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward +1, otherwise, it receives negative reward −1. Table 2 shows the decoder performance over 4 days in terms of success rates. Each day corresponds to a separate experiment. In Day 1, the experiment has a total of 20 trials (10 A trials and 10 B trials). The overall success rate was 90%. Only the first trial for each target was incorrectly assigned.   Note that at each day, the same experimental set up was utilized. The decoder was initialized in the same way at each day. We did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for the 2 experiments.
Although the success rate for Day 3 is not as high as Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and thevalues converge to the desired values for each action. Because this is a single step task and the reward +1 is assigned for a successful trial, it is desired for the estimated action valuet o be close to +1.
It is observed that the TD error and -values oscillate. The drastic change of TD error or -value corresponds to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target (the top plots in Figure 14). However, this assessment does not show what causes the change in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how neural states affect the overall performance. Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will cause detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task, and if the user gives improper state information or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.
Using the proposed methodology introduced in [36], we can observe how the decoder effectively learns a good state to action mapping, and how neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials, and how the agent adapts the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the first two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy, as well.
We observe the behavior of two systems at the beginning, intermediate, and final stages of the experiment by using the neural states that have been observed as well as the learned decoder up to the given stage. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than in Day 1. This suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We are also able to see this phenomenon from the plots (right column in Figure 15). We can see that most of the neural states that were misclassified appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system effectively learns. Note that the initially misclassified A trials (red stars in Figure 15(d) which are located near the estimated policy boundary) are assigned to the right direction when learning has been accomplished ( Figure 15(f)). It is a remarkable fact that the system adapts to the environment online.

Conclusions
The advantages of KTD( ) in neural decoding problems were observed. The key observations of this kernel-based learning algorithm are its capabilities for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm ( -KTD( )) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments to perform reaching tasks.
In open-loop experiments, results showed that -KTD( ) can effectively learn the brain state-action mapping and offer performance advantages over conventional nonlinear function approximation methods such as time-delay neural nets. We observed that -KTD( ) overcomes main issues of conventional nonlinear function approximation methods such as local minima and proper initialization.
Results on closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it does not depend on the initialization neither require any prior information about input states. Also, parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is powerful for practical applications due to its nonlinear approximation capabilities in online learning.
The observation and analysis of KTD( ) give us a basic idea of how this algorithm behaves. However, in the case of -KTD( ), the convergence analysis remains challenging since -learning contains both a learning policy and a greedy policy. For -KTD( ), the convergence proof for -learning using temporal difference (TD)( ) with linear function approximation in [37] can provide a basic intuition for the role of function approximation on the convergence of -learning.