Dynamical Motor Control Learned with Deep Deterministic Policy Gradient

Conventional models of motor control exploit the spatial representation of the controlled system to generate control commands. Typically, the control command is gained with the feedback state of a specific instant in time, which behaves like an optimal regulator or spatial filter to the feedback state. Yet, recent neuroscience studies found that the motor network may constitute an autonomous dynamical system and the temporal patterns of the control command can be contained in the dynamics of the motor network, that is, the dynamical system hypothesis (DSH). Inspired by these findings, here we propose a computational model that incorporates this neural mechanism, in which the control command could be unfolded from a dynamical controller whose initial state is specified with the task parameters. The model is trained in a trial-and-error manner in the framework of deep deterministic policy gradient (DDPG). The experimental results show that the dynamical controller successfully learns the control policy for arm reaching movements, while the analysis of the internal activities of the dynamical controller provides the computational evidence to the DSH of the neural coding in motor cortices.


Introduction
Conventional computational studies of motor control adopt a spatial representation of the body's state, where the control command is generated by properly integrating the information of state feedback or state residual at a specific instant of time (Figure 1(a)) [1][2][3]. The parameters of the controller are adapted to capture the spatial patterns of the state input to generate the desired control command under a certain optimality principle [1,4]. Recent neuroscience studies, however, have provided an alternative perspective, which emphasizes that the motor cortex serves as a dynamical system that generates the control command through its internal dynamics.
In recent neurophysiological studies aiming at deciphering the neural coding mechanisms in the motor cortex, Churchland and colleagues put forward a dynamical system hypothesis (DSH) to interpret the neural activities in the primary motor cortex (M1) during movement generation [5,6]. Under this doctrine, the activities of the motor network are driven mainly by the interaction of the neural population within the network rather than responding to an external kinematic state. And the time-varying control command can be extracted from the population dynamics of the motor network [6,7]. The motor cortex under the view of DSH can be described as a time-continuous dynamical system: ( ) = ( ( )) + ℎ( ( )), where ( ) is the neural response during movement,( ) is its derivative, and ( ) is the external input. Additionally, the motor network undergoes a preparatory period before movement execution, through which the network activity converges to a task-specific state. The converged state, in turn, is utilized as the initial state of the network in the movement execution period. The control command can be extracted sequentially when the network evolves under its own dynamics in the absence of the external input. With this arrangement, the computation of the control command can be embedded into a spatiotemporal transformation mechanism; that is, when the task-specific initial state is given, which is a spatial representation, the subsequent temporal evolution of the network and thus the output command can be unfolded deterministically.
Inspired by these findings, we proposed a computational model that exploited a dynamical controller to realize the spatiotemporal transformation mechanism. The dynamical 2 Computational Intelligence and Neuroscience The dynamical controller generates the control signal ( ) by its internal dynamics. Note that the dynamical controller loops by itself and theoretically the initial state 0 and the goal state are sufficient to generate the control command, with or without the feedback state (dotted arrow). (c) The dynamical controller is trained using DDPG with the reward information ( ) from the environment (shown as the Env box). The broken arrow indicates that the controller parameters are tuned with the gradients from DDPG.
controller was initialized with the task information and evolved under the constraint of its dynamics to elicit the desired control command (Figure 1(b)). Namely, the controller mapped the input task information to a temporal function of control commands. In the present study, the recurrent neural network (RNN) was adopted as the dynamical controller for its capability to approximate the dynamical system [8]. Accordingly, the learning of the control policy could be achieved by modifying the dynamics of the RNN controller, which was governed by the connectivity of the RNN. Some previous related models have achieved the overall goal of mapping the motor task information to the control command sequence [9,10]. Berniker and Kording verified the possibility of retrieving the command and state sequence from the task parameters using deep networks of the autoencoder architecture [9]. However, the computation that was achieved in their approach was essentially a spatiospatial transformation, where the task information was fed forwardly through the deep network and control commands for all time steps were output simultaneously, with each output dimension for a single time step. This method requires the control signal (for different time steps) to be accessible from different spatial locations, which is not biologically plausible. Also, previous computational models that incorporated the RNN dynamics for motor control captured the essential properties of motor circuits effectively [11,12]. The multiphasic responses of the individual neurons and rotational patterns of the neuronal population were successfully reproduced. However, these models relied on the examples of optimal control policy given in advance as the training data to learn the controller. This is not plausible in biological control, which is learned in a trial-and-error manner; that is, no pregiven examples or immediate error is available for the synaptic adaptation.
To train the dynamical controller in the way of trial and error, we embedded it in the framework of reinforcement learning (RL) (Figure 1(c)). Among various RL methods, the policy gradient methods are suitable to learn the control policy in a continuous action and state space [13,14]. Specifically, the deep deterministic policy gradient (DDPG) [15,16] method was exploited in the present study to learn the control policy, because of the following two advantages. First, it does not require a probabilistic representation of the control policy and thus is compatible with the deterministic policy in motor control. Second, DDPG contains a term of the gradient of the action-value function with respect to the action, which is compatible with the learning method of the dynamical controller (detailed in the Methods and Materials).
The proposed model was evaluated in the task of arm reaching movement control. The results showed that the model successfully generated the accurate control commands and was able to reproduce the biomechanical properties and the neural responses found in the physiological experiments of arm reaching. Thus, this model could be a beneficial attempt to understand the neural substrate of motor control and for the engineering application in the field of robotic control.
The main contributions of this paper are as follows. First, a computational model that incorporates the neural mechanism of the spatiotemporal transformation from the task information to the motor commands was implemented, which provided a new paradigm for motor representation. Second, the proposed model did not rely on the pregiven sample data for training, but it learned the control policies through self-generated movements.
Q  (s(t + 1),   (t + 1)) Figure 2: Schematic illustration of the deep deterministic policy gradient method. The critic network approximates the value function ( , ) by minimizing the TD error . The actor network is updated with the gradient ∇ from the critic. Two sets of actors and critics are exploited for stability, shown as the boxes of "actor" and "critic" and the boxes of "target actor" and "target critic," respectively. The target critic and target actor are updated by "soft update" for stability (arc dashed arrows).

Deep Deterministic Policy Gradient.
As the dynamical controller is embedded in the framework of DDPG, it is necessary to first introduce the DDPG method as a context. To define the terminology and symbols involved in DDPG, we first introduce the background of the Markov decision process (MDP).
Consider an MDP which consists of the state from state space S, the action from action space A, a stationary transition dynamics distribution ( +1 | , ) that defines the dynamics of the environment, and a reward function ∈ S × A. An agent optimizes a policy that is parameterized by ∈ R to maximize the cumulative rewards along the trajectory. At each time step, the state of the environment is updated with the action, and a reward ( , ) is returned to the agent. The policy is improved upon the sequential information of state, action, and reward: ]. The objective function can also be defined as the action-value function with the initial state and action: Policy gradient (PG) methods directly adapt the parameters of the policy to maximize the objective function [13,14], among which the deep deterministic policy gradient algorithm (DDPG) is developed for RL problems with continuous action and deterministic policy [15,16]. It adopted an actorcritic architecture: the actor generates the control action and the critic estimates the value function of the policy. The actor and the critic are approximated with parameterized functions, = ( | ) and = ( , | ), respectively, where and are the parameters for the actor and for the critic. Maximization over a continuous action space is achieved by altering the actor in the direction of the action gradient ∇ ( , | ). Accordingly, the deterministic policy gradient can be given as where ∇ is the gradient of the policy with respect to the actor parameter . As shown in Figure 2, the action-value function is updated with temporal difference (TD). The critic parameter is updated to minimize the TD error , which is defined by the difference between the predicted value of the current time step and the target value: For learning stability, there are two sets of actors and critics. The additional set, which is denoted as target actor and target critic, is not a direct copy of the original set, but it slowly tracks the original set. In (3), the target value + ( +1 , +1 ) is computed from the target networks. The update from the parameters of the original actor and critic networks , to the parameters of target networks , is determined by where is a leaky factor with ≪ 1 and defines how quickly , follows , . In this study, the instant reward was comprised of several terms of penalties:   (6)). , , and denoted the input, the recurrent, and the output weights, respectively. Note that only the task information (the start and target state) was fed to the network at the initial time step.
, where is the episode length, ( * − ( )) ( * − ( )) penalizes the error of the final state ( ) from that target state * , and discourages the action amplitude through the trajectory. The components of state and action will be further described in the Arm Model section.

The Implementation of Actor and Critic.
The critic was implemented with a multilayer perceptron (MLP), and the actor, which served as the dynamical controller in the proposed model, was implemented with a recurrent neural network (RNN). The critic network took the current state , target state * , and action as inputs and approximated the action-value function ( , ) as output. The critic network had two hidden layers, each with 300 nodes. The state and target state * were fed to the input layer, while the action was fed to the first hidden layer. Both hidden layers had Rectified Linear Units (ReLU) activation, which was defined as ( ) = max(0, ), and the output units had linear activation.
The RNN actor consisted of 100 interconnected units, which were fed with the concatenation of the start state 0 , target state * , and the index of force field , and outputted the action , that is, 6 muscle activation instances (detailed in the Arm Model section). Figure 3 shows the connection of the actor network unfolded in time. The activity of the recurrent units ( ) had tanh activation: where and denote the input and recurrent weights, ( ) ensures that the input was only applied in the first time step, and tanh activation was defined as tanh( ) = (1 − exp(− ))/(1 + exp(− )). To keep the muscle activation positive, the output units had sigmoid activation: ( ) = ( ⋅ ( )), where is the weight of output projection, and the sigmoid function was defined as ( ) = 1/(1 + exp(− )).

The HF
Learning for the RNN Actor. The RNN actor was trained using Hessian free learning (HF). By incorporating the second-order information of the error surface during optimization, HF learning is effective in eliminating the "exploding and vanishing gradient problem" in training RNNs [17]. There are two main steps in HF learning: firstly, a Gauss-Newton matrix is constructed from the outer product of the cost function with respect to the network parameters to represent the second-order information. Secondly, a conjugate gradient (CG) was conducted upon the Gauss-Newton matrix to optimize the network parameters.
To integrate the HF learning into this model, three main steps were taken. First, the gradients of the action-value with respect to muscle activation in each time step ∇ ( ) were derived from the DDPG procedure. And the gradients were transmitted to the actor RNN through the output units (dashed arrows in Figure 3), which were back-propagated through time to the parameters of the network to get the gradient of with respect to the network parameters : where is the time horizon of the movement and is the weights [ , , ] of the RNN actor. Second, the Gauss-Newton matrix was constructed by the outer product of the parameter gradient: = . Third, the network parameters were optimized with CG iteration upon the Gauss-Newton matrix . The Gauss-Newton matrix stayed the same within the loop of CG iteration, and it was updated for each batch of 64 reaches. To keep the stability of CG iteration and to regularize the optimized parameters, we incorporated Tikhonov regularization in this step. Specifically, the original Gauss-Newton matrix was added with a multiple of the Computational Intelligence and Neuroscience 5 identity matrix: = + , where is the gain of the regularization and is the identity matrix with the dimensionality of the parameters. And the CG iteration would be conducted on the new Gauss-Newton matrix.

Arm
Model. The 2-link revolute model of the primate arm has been elaborately described in the literature. Here, we adopted the physics formulations given by [4]. The arm plant had two joints and was actuated by three pairs of muscles.
A detailed description of the muscular properties is given in the Appendix. The arm state = [ 1 2̇1̇2 ] was represented by the joint angles and angular velocities, where is the th joint angle anḋis the angular velocity of the th joint. The torques on each joint could be calculated by a nonlinear function of the muscle activation and the muscle state (see the Appendix). Driven by the joint torque , the angular acceleration in the joint space was subject to the following equation: where ( ) is the inertia matrix and depended on the joint angles,̈is the angular acceleration, ( ,) is a vector defining the centripetal and Coriolis forces depending on both the joint angles and the joint angular velocities, is the joint friction matrix with respect to the angular velocities, and is the joint torque. The force field could be applied

Results
We examined the capability of the proposed model to generate the optimal control policy for planar arm reaching. The RNN actor generated muscle activity as the control command to drive the limb model. Each reaching movement had a fixed time horizon = 25; namely, all episodes had the same length of 25 time steps. For each time step, the plant state , the action , the new state +1 , and the network activities were stored to a data buffer with a buffer size of 10 5 . The discounted factor was kept at 0.99. To avoid the correlations between samples, the critic network and the target critic network were trained using the scrambled samples from the data buffer. The critic networks and the RNN actor network were updated for every minibatch, which was a buffer of 64 episodes. The gain of the regularization to the network weights was set to 0.03. The regularization term was not trivial, since it could prevent the output torque and network activity from dramatically changing. Without the regularization, the reaching trajectories would become heavily curled.
The model was first trained for randomly generated reaches. The start and end points of each reach were confined in the work space of a circle centered at [0.0265, 0.3833] and with a radius of 20 cm. The start and end points were described by the Euclidean coordinates of the arm's endeffector. As the arm state was represented in the joint space, we first transformed the target position from the Euclidean space to the joint space. The transformation was achieved by iterating the randomly initialized joint state along the direction where the Euclidean error of the current Euclidean state was minimized. For each trial of movement, a pair of start and target points was drawn randomly from the work space. The uniform distribution of the start and end points excludes the bias that may be introduced by the preference of the task parameters. The reaching trajectories of the randomly sampled points are illustrated in Figure 4(a). The cumulative reward versus the number of episodes was plotted in Figure 4(b). The red curve shows the mean cumulative reward over 15 randomly initialized models, and the area of the mean ± standard deviation is shaded in blue. The training was validated by the center-out reaching task (Figure 4(c)).
In the center-out task, the reaches always started from the center and aimed at a surrounding point with a fixed distance of 18 cm. The surrounding target points were arranged evenly along a circle with an angular interval of /8.
The model successfully generated accurate control commands to drive the arm for center-out reaches after training, and it captured some essential biomechanical and neural properties resembling the real arm reaching movements. Specifically, the trained model elicited relatively straight hand trajectories and bell-shaped velocity profiles (Figures 4(c) and 5(a)). The circled lines in Figure 4(c) depict the trajectories generated by the controller. Circles were plotted with the same time interval along each trajectory, and the density of the dots reflected the hand velocity. For clarity, the velocity profiles of the same set of movements were also plotted. The hand velocities for all movements displayed a bell-shaped profile and were insensitive to the movement amplitude and direction. Figures 5(b) and 5(c) show the muscle activity and the responses of an individual neuron in the controller network during movement for reaches of 16 different directions. The muscle activities and neural responses displayed a smooth change across different directions of reaching movements. These features were in accordance with findings of the biomechanical experiments in the literature [2]. Considering that the control commands could be identified with the task information, the continuity of the control commands across different movements indicated the continuity in the task information.
Adaptation of the controller under a force field yielded curled trajectories. The force field could be applied by manipulating the matrix in (7)  circles of the trajectories) went clockwise, and the trajectories curved anticlockwise. This could be attributed to the process of reoptimization under the force field. In the theory of reoptimization, the optimal command overcompensated the perturbing force at the early stage of movements [18,19]. This could be verified in Figure 6(c), where the difference between the joint torques in the force field force field and the torques without the force field had larger amplitudes at the early stage of movements.
The population dynamics of neurons in the dynamical controller during motor generation were also analyzed and compared with the findings from physiological studies of DSH. The neural activities were analyzed with a newly developed method, jPCA analysis [5], which had found the consistent rotational structure in the population activities ( Figure 7). The jPCA analysis is developed to verify the DSH by extracting the rotational transformation underlying the aggregate neural states [5,6]. The computation behind this method is to fit the derivative of neural states( ) with a product of the transition matrix and the neural stateṡ ( ) = skew ⋅ ( ), and the transition matrix is constrained to be skew-symmetric ( skew = − skew ). The dominant rotations are those corresponding to the first few pairs of the eigenvalues of the fitted skew matrix, which have imagery values and come in pairs. The rotational trajectories could be obtained by projecting the population activities onto the eigenvector of the dominant eigenvalues.
The results showed that there were coherent rotational transformations for the neural states in the jPC space (projection on the first pair of the eigenvectors of the fitted skew Computational Intelligence and Neuroscience  matrix). The neural states underwent rotation with the same frequency, and the different target states * were related to the different initial phases and amplitudes. Take the neural activities in Figure 7(a) as an example; the first pair of the eigenvalues were [0.138 , −0.138 ], and the dominant frequency was 0.022/time step. The dominant rotation accounted for a major part of the variance of the total neural activities (52% on average for all trials). This was in accordance with the findings from neurophysiological studies [5,6], in which the quasi-oscillations were the most prominent features in the neural responses during reaching in M1 circuitry. The jPCA was applied to quantitatively relate the quasi-oscillation to the sparse rotational structures, thus attributing it to the neural dynamics of the M1 network. The common presence of rotational dynamics in both the dynamical controller and its biological counterpart (M1 circuitry) suggested that the present study can also provide computational evidence to the DSH of motor processing in the motor cortex.

Discussion
This study implemented a computational model of motor control based on the DSH of neural coding mechanism in motor cortex. The motor generation was achieved through a spatiotemporal transformation mechanism with a dynamical controller. The control commands could be unfolded from the dynamics of the controller network when its initial state was specified with the task information. Motor learning was implemented with the DDPG framework. And the two processes were integrated by embedding the dynamical controller in the DDPG as the actor network, which allowed the model to be trained in a trial-and-error manner. As shown from the experiments of arm reaching movement control, the model successfully generated an accurate control policy that captured the key properties found in the biomechanic experiments. And the analysis of the internal dynamics of the controller provided the computational evidence for the dynamical system hypothesis of the neural coding in motor cortices during movement generation.
The model proposed in this study has a loose resemblance to the specializations of the biological nervous system, where the DDPG learning corresponds to the rewarding circuitry [20,21], and the RNN actor corresponds to the cortical motor areas. It has been revealed that the nigrostriatal dopamine neurons in the basal ganglia represent the temporal difference error (TD error) [22], as the learning of the critic in RL. These different modules interact with each other to fulfill the cognitive tasks of motor control and motor learning. In our model, this interaction between the learning modules can be recognized as the propagation of the DPG gradients ( , )/ , which are transmitted from the critic network to the output units of the RNN actor at each time step for the modification of the actor parameters ( Figure 3). The supervised learning of the RNN controller and the reinforcement learning of DDPG are thus integrated in the same framework.
The generation of time-varying control commands from an initial state in the dynamical state can be regarded as a spatiotemporal transformation. This mechanism can be applied in the hierarchical control in future studies. With the spatiotemporal transformation as the low-level controller, the higher-level control module only needs to send out the motor goal in terms of spatial information, leaving the computation of the time-varying control commands to the low-level control. Under this hierarchical architecture, a complex movement can be implemented by sequentially initializing the low-level controller.

Appendix
As shown in Figure 8, the arm plant is driven by three pairs of muscles. One pair consists of biarticulate muscles (from shoulder to forearm), while the other two, respectively, actuate the shoulder or elbow joint (monoarticular shoulder flexors, monoarticular shoulder extensors, monoarticular elbow flexors, monoarticular elbow extensors, biarticular flexors, and biarticular extensors). Each pair of muscles is a pair of flexor and extensor. The activation is provided by the network output ( ). Given the muscle activation, the force exerted is also nonlinearly dependent on muscle length and contraction speed V: With the muscle tension and the anatomical configuration, which is reflected by the moment arm matrix , the joint torque can be calculated by ( ) = ⋅ Force ( ) . (A. 2) The moment arm matrix describes the relationship between joint torques and muscle forces under certain gesture. The trivial variations of the moment arms due to the change of joint angle are ignored for simplification. The moment arm matrix is given as The columns of correspond to six muscles. The muscle length is fitted using the function of current deviance from the optimal joint angle 0 and the optimal length 0 . The matrix 0 of size 2 × 6 indicates the optimal angle of the two joints for each of the six muscle groups. Similarly, six columns in 0 indicate the optimal length for each of the six muscles. Same as in the matrix , zero-value elements in 0 and 0 represent the anatomical absence of the corresponding muscles. For the th muscle group, the dependence of current length on the deviance is given as follows: The derivative of muscle length, that is, the muscle contraction velocity, can be achieved using a weighted summation of the joint angle velocitẏ, which is also parameterized by the moment arm matrix and the optimal length 0 : Computational Intelligence and Neuroscience

11
The muscle length and velocity are normalized by dividing 0 and thus they can be seen as relative length and velocity.
The nonlinearity terms ( ) and V ( ( ),̇( )) implement the scaling effect of the fascicle force-length relationship and force-velocity relationship. The ( ) function and the V ( ( ),̇( )) function are approximated as follows: