Decentralized Reinforcement Learning Robust Optimal Tracking Control for Time Varying Constrained Reconfigurable Modular Robot Based on ACI and Q-Function

A novel decentralized reinforcement learning robust optimal tracking control theory for time varying constrained reconfigurable modular robots, based on the action-critic-identifier (ACI) and the state-action value function (Q-function), is presented to solve the continuous time nonlinear optimal control problem for this strongly coupled, uncertain robotic system. The dynamics of the time varying constrained reconfigurable modular robot is described as a synthesis of interconnected subsystems, and a continuous time state equation and Q-function are designed in this paper. Combining the ACI with RBF networks, the global uncertainty of each subsystem and the HJB (Hamilton-Jacobi-Bellman) equation are estimated: a critic-NN and an action-NN are used to approximate the optimal Q-function and the optimal control policy, respectively, an identifier is adopted to identify the global uncertainty, and RBF-NNs are used to update the weights of the ACI-NNs. On this basis, a novel decentralized robust optimal tracking controller is proposed for each subsystem, so that the subsystem tracks the desired trajectory and the tracking error converges to zero in finite time. The stability of the ACI and of the robust optimal tracking controller is confirmed by Lyapunov theory. Finally, comparative simulation examples are presented to illustrate the effectiveness of the proposed ACI and decentralized control theory.


Introduction
A reconfigurable modular robot can transform its configuration according to different external situations and task requirements. Following the concept of modular design and the decentralized control of subsystems, a reconfigurable modular robot can complete tasks efficiently in different situations by changing its structure, without redesigning the control law. At the same time, reconfigurable modular robots possess good accuracy and flexibility.
Many scholars have studied the dynamics and control methods of reconfigurable modular robots. A novel VGSTA-ESO based decentralized ADRC control method for reconfigurable modular robots is proposed in [1]: by designing a high-precision VGSTA-ESO to estimate the nonlinear terms of the dynamic model and the interconnection terms of the subsystems, joint trajectory tracking control is implemented. Based on computed torque, a robust fuzzy neural network controller is proposed in [2] to address the model uncertainty arising in the modeling process. In [3], a decentralized adaptive fuzzy sliding mode control method for reconfigurable modular robots is presented: a fuzzy logic system approximates the unknown dynamics of the subsystem, and a sliding mode controller with an adaptive scheme is designed to counteract both the interconnection term and the fuzzy approximation error. A decentralized adaptive neural network control algorithm for reconfigurable manipulators is proposed in [4], where neural networks approximate the unknown dynamic functions and interconnections in the subsystems through an adaptive algorithm. A new distributed control method is proposed in [5], which uses a decomposition algorithm to decompose the robot dynamic system into a number of dynamical subsystems, with an adaptive sliding mode controller designed to offset the impact of model uncertainty. An observer based decentralized adaptive fuzzy controller for reconfigurable manipulators is proposed in [6]: by designing a state observer, the adaptive fuzzy systems used to model the unknown dynamics of the subsystems and the interconnection terms can be constructed from the state estimates. Nevertheless, these methods require the dynamics of the reconfigurable modular robot system to be fully, or at least partially, known, a requirement that is hard to satisfy. Moreover, because the reconfigurable modular robot system contains strongly coupled model uncertainties and interconnection terms among the subsystems, the processing load on the controller increases and larger time delays and calculation errors are easily produced, so designing controllers with the methods and algorithms above is overly complicated.
In recent years, as one of the most effective methods for control problems of strongly coupled continuous time nonlinear systems, the reinforcement learning algorithm has received extensive attention from scholars. Reinforcement learning [7, 8] is a learning method that maps situations to actions so as to maximize a numerical reward signal. Compared with supervised learning, reinforcement learning does not need a mentor signal in each state, but learns in the process of interacting with the environment. Because of its adaptive optimization capability for nonlinear models under uncertainty, reinforcement learning has a unique advantage in solving optimization and control problems for complex models [9-11]. Zhang and his team presented an infinite horizon optimal tracking control scheme for discrete-time nonlinear systems via the greedy HDP iteration algorithm [12-14]: through a system transformation, the optimal tracking problem is converted into an optimal regulation problem, and the greedy HDP iteration algorithm is introduced to solve the regulation problem with a rigorous convergence analysis. Then, a data-driven robust approximate optimal tracking control was proposed using adaptive dynamic programming, together with a data-driven model established by a recurrent neural network that reconstructs the unknown system dynamics from available input-output data [15]. After this, they designed a fuzzy critic estimator to estimate the value function for nonlinear continuous-time systems [16]. On this basis, the synchronization problem for an array of neural networks with hybrid coupling and interval time varying delay is addressed with an augmented Lyapunov-Krasovskii functional method [17]. An FRL scheme using only the immediate reward, together with sufficient conditions, is adopted to analyze the convergence of the optimal task performance. Bhasin
presents a neural network control of a robot interacting with an uncertain viscoelastic environment [18]; in [18], a continuous controller is developed for a robot that moves in free space and then regulates the new coupled dynamic system to a desired setpoint. Khan et al. present an implementation of model-free Q-learning based on a discrete model reference compliance controller for a humanoid robot arm [19], where the reinforcement learning scheme uses a recently developed Q-learning algorithm to derive an optimal policy online. Patchaikani et al. propose an adaptive critic based real-time redundancy resolution scheme for kinematic control of redundant manipulators [20]: the kinematic control of the redundant manipulator is formulated as a discrete-time input affine system, and an optimal real-time redundancy resolution scheme is then proposed. Although research on reinforcement learning algorithms has developed rapidly in recent years, there are still deficiencies. For example, when there are multiple subsystems in the global system, the methods above cannot handle the impacts of the interconnection terms between the subsystems. Meanwhile, the methods above mostly address the learning and optimization problems of the system itself; when external constraints act on the system, these methods are no longer applicable. Therefore, how to design a robust reinforcement learning optimal control method in the presence of external constraints and coupled subsystems is an urgent problem to be solved.
In this paper, we present a novel continuous time decentralized reinforcement learning robust optimal tracking control theory for the time varying constrained reconfigurable modular robot. Combining the ACI with RBF-NNs, the critic-NN is used to estimate the optimal Q-function, the action-NN is proposed to approximate the optimal control policy, and the identifier is adopted to identify the global uncertainty, so that the HJB equation can be estimated with a bounded and convergent estimation error. Firstly, since a decentralized control method is adopted, each joint subsystem owns a separate controller, which greatly reduces the processing load of the controllers. Secondly, because the time varying constraints can be compensated within the subsystems, the proposed method is suitable for reconfigurable modular robots in time varying constrained external environments. Thirdly, the proposed control method compensates for the impacts of the model uncertainties and the interconnection terms on the system, so that the subsystems track the desired trajectories and the tracking errors converge to zero in finite time.

Problem Formulation
Assume that the time varying external constraint on the end-effector of the reconfigurable modular robot is

$$\Psi(q, t) = 0, \qquad (1)$$

where $q \in \mathbb{R}^n$ is the vector of joint displacements, $\Psi : \mathbb{R}^n \to \mathbb{R}^m$, and $m$ is the dimension of the external limiting conditions. Under the time varying constraint, the dynamics of a reconfigurable modular robot can be presented as follows:

$$M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) + F(q,\dot{q}) = \tau + J_{\Psi}^{T} f(q,t). \qquad (2)$$

Here $M(q) \in \mathbb{R}^{n \times n}$ is the inertia matrix, $C(q,\dot{q})\dot{q} \in \mathbb{R}^{n}$ is the Coriolis and centripetal force, $G(q) \in \mathbb{R}^{n}$ is the gravity term, $F(q,\dot{q})$ denotes the unmodeled dynamics including friction terms and external disturbances, $\tau \in \mathbb{R}^{n}$ is the applied joint torque, and $J_{\Psi}^{T} f(q,t)$ is the contact force generated by contact between the end of the reconfigurable modular robot and the external constraint.
After introducing the $m$ constraints for the robot working in free space, because of the limitation of (1), the system loses $m$ degrees of freedom. Therefore, the degrees of freedom of the robot change from $n$ to $(n-m)$, so that only $(n-m)$ independent joint displacements are needed to fully describe the constrained motion. Partition the joint vector into independent displacements $q_1 \in \mathbb{R}^{n-m}$ and the remaining dependent displacements; substituting this partition into (1), the dependent displacements can be solved in terms of $q_1$, so (3) can be described fully by the joint displacement $q_1$, as in (6). Differentiating (6) gives (7), from which the second derivative of $q$ is obtained easily as (9). Putting (7) and (9) into (2) and rearranging, (2) can be decomposed into subsystem form, in which $M_i(q_i)$, $C_i(q_i,\dot{q}_i)$, $G_i(q_i)$, $F_i(q_i,\dot{q}_i)$, and $\tau_i$ are the $i$th elements of $M(q)$, $C(q,\dot{q})$, $G(q)$, $F(q,\dot{q})$, and $\tau$, respectively, and $f_i$ is the constraint force acting on the $i$th joint. So, as shown in Figure 1, each subsystem dynamical model can be formulated in joint space as follows:

$$M_i(q_i)\ddot{q}_i + C_i(q_i,\dot{q}_i)\dot{q}_i + G_i(q_i) + F_i(q_i,\dot{q}_i) + Z_i(q,\dot{q},\ddot{q}) = \tau_i,$$

where $Z_i(q,\dot{q},\ddot{q})$ denotes the interconnection term acting on the $i$th subsystem. Let $x_i = [x_{i1}, x_{i2}]^{T} = [q_i, \dot{q}_i]^{T}$ for $i = 1, \ldots, n$; then (10) can be presented by the following state equation:

$$\dot{x}_{i1} = x_{i2}, \qquad \dot{x}_{i2} = f_i(x_i) + g_i(x_i)u_i + h_i(q,\dot{q},\ddot{q}), \qquad y_i = x_{i1}.$$

Figure 1: The architecture of the time varying constrained reconfigurable modular robot system.
Here $x_i$ is the state vector of subsystem $S_i$, $y_i$ is the output of subsystem $S_i$, and $h_i(q,\dot{q},\ddot{q})$ is the interconnection term of the subsystem; $f_i(x_i)$, $g_i(x_i)$, and $h_i(q,\dot{q},\ddot{q})$ are defined from the subsystem dynamics above. For the time varying constrained reconfigurable modular robot system, we need to design a decentralized robust optimal tracking control policy that makes each subsystem track its desired trajectory, with a bounded and convergent tracking error.
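As a concrete illustration, the subsystem state equation above can be stepped forward numerically. This is only a sketch: the drift $f_i$, input gain $g_i$, and interconnection $h_i$ below are hypothetical stand-ins for a single scalar joint, not the paper's model.

```python
import numpy as np

def subsystem_step(x, u, dt, f, g, h):
    """One forward-Euler step of the subsystem state equation:
    x1' = x2,  x2' = f(x) + g(x)*u + h(x)   (scalar joint, illustrative only)."""
    x1, x2 = x
    dx1 = x2
    dx2 = f(x) + g(x) * u + h(x)
    return np.array([x1 + dt * dx1, x2 + dt * dx2])

# Hypothetical smooth dynamics for one joint subsystem
f = lambda x: -np.sin(x[0]) - 0.5 * x[1]  # drift term (assumed)
g = lambda x: 1.0                         # input gain, bounded (Assumption 1)
h = lambda x: 0.0                         # interconnection term, neglected here

x = np.array([0.0, 0.0])
x = subsystem_step(x, u=1.0, dt=0.01, f=f, g=g, h=h)
```

In a full decentralized simulation each joint would run its own copy of this step, with `h` collecting the coupling from the other subsystems.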

Decentralized Reinforcement Learning Robust Optimal Tracking Control Based on ACI and 𝑄-Function
Assumption 1. The desired trajectory $y_d$, its derivatives $\dot{y}_d$, $\ddot{y}_d$, and the input gain matrix $g_i(x_i)$ are bounded.
Assumption 2. The interconnection terms are bounded, satisfying

$$\left\| h_i(q,\dot{q},\ddot{q}) \right\| \le d_0 \sum_{j=1}^{n} \rho_{ij}\left( \left\| x_j \right\| \right),$$

where $d_0 > 0$ is an unknown constant and each $\rho_{ij}(\|x_j\|) \ge 0$ is an unknown smooth Lipschitz function.
The trajectory tracking error of joint subsystem $i$ can be defined as $e_i = y_i - y_d$. For the continuous time state equation of the subsystem in (18), with its nonlinear functions and interconnection terms, the value function is generally defined as in (21). To simplify notation, we write $e_i$, $u_i$ instead of $e_i(t)$, $u_i(x_i(t))$. Since the trajectory $x_i$ relies on the control $u_i$ of the subsystem for updating, in order to avoid an infinite result from (21), we transform the value function into the form of (22). Thus, the optimal value function of the subsystem can be defined as in (23). Here $r_i(e_i, u_i)$ represents the reward function for the current state,

$$r_i(e_i, u_i) = e_i^{T} Q_i e_i + u_i^{T} R_i u_i, \qquad (24)$$

where $Q_i$ and $R_i$ are positive definite matrices.
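A quadratic reward of the usual $e^{T} Q e + u^{T} R u$ form can be written directly; the weight matrices below are illustrative choices, not values from the paper:

```python
import numpy as np

def reward(e, u, Q, R):
    """Quadratic reward r_i(e_i, u_i) = e_i^T Q_i e_i + u_i^T R_i u_i."""
    return float(e @ Q @ e + u @ R @ u)

Q = np.diag([10.0, 1.0])   # hypothetical state-error weights (positive definite)
R = np.array([[0.1]])      # hypothetical control-effort weight (positive definite)
e = np.array([0.2, -0.1])  # example tracking error
u = np.array([0.5])        # example control input
r = reward(e, u, Q, R)     # = 10*0.04 + 1*0.01 + 0.1*0.25 = 0.435
```

Larger entries in `Q` penalize tracking error more heavily relative to control effort, the usual trade-off when tuning such a reward.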
Typically, recording the value of state-action pairs is more useful than recording the value of states only, since state-action pairs are predictors of the reward. Even if the reward value of a state is low, it does not mean that the value of its state-action pairs is also low; if the state of the subsystem produces a higher reward over a period of time, it can still obtain a higher state-action value. Therefore, from a long term perspective, defining a suitable state-action value function (Q-function) can make actions produce more rewards [21, 22].
According to (23) and (24), the continuous-time optimal Q-function can be defined as in (25).

Assumption 3. The partial derivatives of $Q_i^{*}$ and $r_i(e_i, u_i, t)$ exist and are continuous in the domain.

According to (18) and (24), under the control policy $u_i$, the optimal Q-function satisfies the Hamilton-Jacobi-Bellman equation (26) [23], where $\Phi_i$ denotes the global uncertainty, including the unknown dynamics of the subsystem and the interconnection term, and $\nabla Q_i^{*} = \partial Q_i^{*} / \partial e_i$ denotes the gradient of the optimal Q-function.
Lemma 4 (see [24]). Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). In order to ensure that the minimum of the HJB equation (26) possesses a stationary point with respect to $u_i$, the optimal Q-function and the optimal control policy must satisfy the stationarity condition $\partial H_i(e_i, u_i, \nabla Q_i) / \partial u_i = 0$. The necessary conditions above lead to the following results.
(a) The bounded control policy can guarantee a local minimum of the HJB equation (26) and satisfy the constraints imposed on the control inputs.
(b) The Hessian matrix is positive definite, and the control policy $u_i$ renders the global minimum of the HJB equation.
(c) If an optimal control policy exists, it is unique.
According to Lemma 4, if the reward function is smooth and the optimal control $u_i^{*}$ is adopted, then the HJB equation satisfies (27), and the optimal control can be expressed as in (28). If the optimal Q-function $Q_i^{*}$ is continuous, differentiable, and known, with initial value $Q_i^{*}(0) = 0$, and the optimal control policy $u_i^{*}(x_i)$ and the global uncertainty of the subsystem $\Phi_i(x_i, u_i^{*})$ are known, then the HJB equation in (27) holds and is solvable. In the actual situation, however, $Q_i^{*}$ is not differentiable everywhere, and $u_i^{*}(x_i)$ and $\Phi_i(x_i, u_i^{*})$ are unknown, so it is not feasible to solve the HJB equation analytically. In this paper, we combine the action-critic-identifier (ACI) with an RBF neural network to estimate the optimal control policy, the optimal Q-function, and the global uncertainty of the subsystem. The action-NN is used to estimate $u_i^{*}(x_i)$ and is denoted $\hat{u}_i(x_i)$; $Q_i^{*}$ is estimated by the critic-NN and expressed as $\hat{Q}_i$; then a robust neural network identifier is used to identify $\Phi_i(x_i, u_i^{*})$, denoted $\hat{\Phi}_i(x_i, u_i^{*})$. The block diagram of the ACI architecture is shown in Figure 2.
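The division of labor in the ACI can be sketched as a single evaluation pass. Everything here is a hypothetical skeleton (the basis functions, weight shapes, and names are assumptions); the paper's actual update laws appear in the equations that follow.

```python
import numpy as np

def aci_step(x, e, W_c, W_a, W_f, phi_c, phi_a, phi_f):
    """One evaluation pass of the action-critic-identifier:
    the critic-NN estimates the Q-function, the action-NN the control
    policy, and the identifier-NN the global uncertainty."""
    Q_hat = W_c @ phi_c(e)     # critic-NN: estimate of the optimal Q-function
    u_hat = W_a @ phi_a(x)     # action-NN: estimate of the optimal control
    Phi_hat = W_f @ phi_f(x)   # identifier-NN: estimate of the uncertainty
    return Q_hat, u_hat, Phi_hat

phi = lambda z: np.tanh(z)  # hypothetical shared basis function
W_c = np.array([1.0, 2.0])
W_a = np.array([0.5, 0.5])
W_f = np.array([0.0, 1.0])
x = np.array([0.0, 0.0])
e = np.array([0.0, 0.0])
Q_hat, u_hat, Phi_hat = aci_step(x, e, W_c, W_a, W_f, phi, phi, phi)
```

In the closed loop, the three weight vectors would be adapted online by the gradient-descent and adaptation laws described below, with the HJB identification error driving the critic.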
The estimated HJB equation, obtained by replacing the optimal quantities with their estimates, and the identification error of that equation follow directly. A classic radial basis function neural network is proposed in [25], shown in (31): $y(x) = W^{*T}\varphi(x) + \varepsilon(x)$, where $W^{*}$ denotes the ideal neural network weights, $\varphi(x)$ the basis functions, and $\varepsilon(x)$ the estimation error. With a sufficient number of nodes, if the centers and widths of the nodes are chosen appropriately, any continuous function can be approximated by the RBF-NN. Therefore, the optimal Q-function and the optimal control policy can be expressed in this form, with $\hat{W}_c(t)$ and $\hat{W}_a(t)$ denoting the weights of the critic-NN and the action-NN, and with weight estimation errors $\tilde{W}_c = W_c - \hat{W}_c$ and $\tilde{W}_a = W_a - \hat{W}_a$. The update law of the critic-NN weights is a gradient descent algorithm, in which $\alpha_c > 0$ is the adaptive gain of the neural network; from the definitions above, the corresponding boundedness inequalities can be obtained. The update law of the action-NN weights is likewise developed by gradient descent. According to the estimation error of the action-NN in (36), the optimal control $u_i^{*}(x_i)$ minimizes the optimal Q-function; putting (41) into (42) yields (43). After using the critic-NN and the action-NN to estimate $\hat{Q}_i$ and $\hat{u}_i(x_i)$, we need to design a robust RBF-NN identifier to identify the nonlinear uncertainties of the subsystem. Here $\Phi_i(x_i, \hat{u}_i)$ can be expressed as in (44), where $\kappa(\cdot)$ denotes the basis function of the neural network and $W_f$, $\Lambda_f$ denote the unknown ideal neural network weights. Equation (44) is identified by the robust RBF-NN identifier, yielding (45), in which $\hat{\kappa}$ denotes the estimated basis function and $\hat{W}_f$, $\hat{\Lambda}_f$ the estimated neural network weights. The quantity $\mu_i$
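The two building blocks described above, a Gaussian RBF basis and a gradient-descent critic step on the HJB identification error, can be sketched as follows. The normalized-gradient form used here is a common ADP convention and an assumption; the paper's exact normalization may differ.

```python
import numpy as np

def rbf(x, centers, width):
    """Gaussian RBF basis vector phi(x). With enough well-placed nodes this
    family can approximate any continuous function on a compact set."""
    d2 = np.sum((centers - x) ** 2, axis=1)  # squared distance to each center
    return np.exp(-d2 / (2.0 * width ** 2))

def critic_update(W_c, phi, delta, alpha):
    """Gradient-descent step on the squared HJB identification error delta,
    normalized by (1 + phi^T phi)^2 to keep the step bounded (assumed form)."""
    sigma = phi / (1.0 + phi @ phi) ** 2
    return W_c - alpha * sigma * delta

centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # hypothetical node centers
phi = rbf(np.array([0.0, 0.0]), centers, width=1.0)
W_c = np.zeros(2)
W_c = critic_update(W_c, phi, delta=0.5, alpha=0.1)
```

A positive identification error `delta` pushes the weights down along the active basis directions; the action-NN update would follow the same gradient-descent pattern against its own error signal.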
$\in \mathbb{R}$ is the feedback error term, shown as follows [26], where the control gains are positive constants and $\mathrm{sat}(\cdot)$ is a saturation function. The state estimation error of the identifier-NN can then be expressed accordingly, and a filtered identification error is defined, whose derivative follows. The weights $\hat{W}_f$, $\hat{\Lambda}_f$ of the identifier-NN are updated by adaptation laws with positive constant adaptation gain matrices $\Gamma_W$, $\Gamma_\Lambda$. In order to analyze the convergence of the filtered identification error, the term $\hat{W}_f^{T}\hat{\kappa}\hat{\Lambda}_f^{T}\dot{e}_i$ can be divided into the form of (51), where $\tilde{W}_f = W_f - \hat{W}_f$ and $\tilde{\Lambda}_f = \Lambda_f - \hat{\Lambda}_f$. Putting (51) into (49), (49) reduces to (52), in which the three terms on the right-hand side can be expressed, respectively, as follows.

Theorem 5. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14) and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error is convergent and bounded.
Proof. Define the Lyapunov function as in (58), whose two component functions are given through (62). The derivative of (58) is shown next, where $K[\cdot]$ denotes a Filippov set [27], so $\dot{V}_i$ can be rewritten in the form of (64). Putting (53), (54), and (55) into (64) yields a negative semidefinite bound, so Lyapunov stability theory shows that the system is stable.

In order to make the subsystem of the time varying constrained reconfigurable modular robot track the desired trajectory asymptotically, a novel decentralized reinforcement learning robust optimal tracking controller is designed in this paper, using a robust term to compensate the neural network approximation errors. The robust control term is designed with a positive constant gain, together with its auxiliary term, and the global control law can then be designed as in (70).

Theorem 6. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). If the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted, then the closed-loop system is stable and the desired trajectory is tracked asymptotically by the actual output.
Proof. Design the Lyapunov function as in (71), where $\Xi > 0$ is an undetermined parameter, for $0 \le t < \infty$. The derivative of (71) is computed first; then it can be further transformed into the form of (74). Therefore, we conclude that the derivative of the Lyapunov function is negative, which completes the proof.
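A saturation-based robust compensation term of the kind used above can be sketched as follows. The specific form $u_r = -\eta\,\mathrm{sat}(r)$ and the gain values are assumptions for illustration, not the paper's exact law.

```python
import numpy as np

def sat(z, limit=1.0):
    """Saturation function: linear inside |z| <= limit, clipped outside.
    Using sat(.) instead of sgn(.) softens chattering in the robust term."""
    return np.clip(z / limit, -1.0, 1.0)

def robust_term(r, eta, limit):
    """Hypothetical robust compensation u_r = -eta * sat(r), eta > 0,
    sized to dominate the bounded NN approximation errors."""
    return -eta * sat(r, limit)

u_r = robust_term(0.5, 2.0, 1.0)  # small error: linear region, u_r = -1.0
```

Inside the boundary layer `|r| <= limit` the term acts like proportional feedback; outside it, it saturates at `-eta`, which is the property the stability argument relies on.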

Simulations
In order to verify the validity and convergence of the proposed decentralized reinforcement learning robust optimal tracking control method based on the ACI, and to study the convergence of the error by comparing simulation results, two different configurations of the time varying externally constrained reconfigurable modular robot are applied in this paper, shown in Figures 3 and 4.
To facilitate the analysis of the configurations above, we transform them into analytic charts, shown in Figures 5 and 6, where $l_1$, $l_2$, and $l_4$ are the lengths of the links and $l_3$ is the distance between the time varying constraint joint and the base module.
The time varying constraint can be defined as a kind of column that rotates with a certain degree of freedom; the constraint equations of configurations A and B are written in terms of the angle $\theta(t)$ between the time varying constraint and the $x$-axis. The initial joint positions are $q_1(0) = 2$, $q_2(0) = 2$ in configuration A and $q_1(0) = 2$, $q_2(0) = 2$ in configuration B, and the initial joint velocities are zero. The unmodeled dynamics of configurations A and B is designed as

$$F(q,\dot{q}) = \begin{bmatrix} \dot{q}_1 + 10\sin(3q_1) + 2\,\mathrm{sgn}(\dot{q}_1) \\ 1.2\,\dot{q}_2 + 5\sin(\cdot) \end{bmatrix}. \qquad (77)$$

The desired trajectory of configuration A is $q_{d1} = 0.5\cos(t) + 0.2\sin(3t)$ for joint 1, with the remaining trajectories given in (79). Due to the limit of the dynamic external constraints, the variation of joint 1 in configuration B is zero.
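The configuration A joint 1 desired trajectory stated above can be generated numerically; the time grid below is arbitrary:

```python
import numpy as np

def q_d1(t):
    """Desired trajectory of configuration A, joint 1:
    q_d1(t) = 0.5*cos(t) + 0.2*sin(3t)."""
    return 0.5 * np.cos(t) + 0.2 * np.sin(3.0 * t)

t = np.linspace(0.0, 10.0, 1001)  # 10 s horizon, 10 ms resolution (arbitrary)
traj = q_d1(t)                    # reference signal fed to the subsystem
```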
In order to confirm that the adopted method can be applied to different configurations, and to verify the tracking performance for the subsystem desired trajectories under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method, comparative simulations are carried out in this part with the classic RBF neural network control method and with the ACI based decentralized reinforcement learning robust optimal tracking control method, respectively.
Figures 7-14 show the joint tracking and error curves obtained by using the classic RBF neural network [4] to compensate the effects of the dynamic nonlinear terms and the interconnection terms in the subsystems. Figures 7 to 10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classical neural network method requires a longer training process and parameter adaptation process. The error curves in Figures 11-14 show that, when the time varying constraints exist, their effects on the joint subsystems cannot be well compensated by the classical neural control method for the reconfigurable modular robot. When the joint output variables become larger, the reconfigurable modular system does not exhibit good robustness, and the tracking errors grow. Figures 15-22 show the joint tracking and error curves obtained by adopting the ACI to identify the optimal Q-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in the subsystems. Figures 15 to 18 show that the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds under the proposed robust reinforcement learning optimal control method, owing to the excellent identification ability of the ACI, which identifies the uncertainties contained in the subsystems in a short time. The error curves in Figures 19-22 show that the tracking error is very small and converges to zero in a short time. Besides, when the proposed decentralized control method is adopted for the joint subsystems, good robustness is manifested.
Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. These two figures show that the proposed decentralized control method fully satisfies the reachability requirement of the reconfigurable modular robot; moreover, singular displacements of the joint subsystems and the end-effector do not appear. The parameters used in the ACI are listed in Table 1. The simulation results show that, compared with the classic RBF neural network controller, the decentralized robust optimal tracking control method based on the ACI and reinforcement learning can be applied to different configurations of the time varying constrained reconfigurable modular robot; the joint variables track the desired trajectories within a very short time in both configurations, and the fluctuation of the error convergence range is minimal.

Conclusions and Future Work
In this paper, combining the ACI with RBF neural networks, a novel decentralized reinforcement learning robust optimal tracking control theory has been proposed for time varying constrained reconfigurable modular robots and used to solve the continuous time nonlinear optimal control problem for this strongly coupled, uncertain robotic system. Firstly, we build the model of the subsystems with time varying external constraints and describe the global robot system as a synthesis of interconnected subsystems. Secondly, the ACI is used to estimate the HJB equation and the global uncertainty: a continuous-time optimal Q-function takes the place of the traditional optimal value function, the optimal Q-function and the optimal control policy are approximated by the critic-NN and the action-NN, and the global uncertainty is identified by the identifier. Thirdly, we design a novel decentralized robust optimal tracking controller, so that the desired trajectory is tracked and the tracking error converges to zero in finite time. On this basis, two kinds of Lyapunov functions are designed to confirm the stability of the ACI and of the subsystems. Finally, to confirm the superiority of the proposed control theory, comparative simulation examples are presented for two different configurations of the robot.
In the future, more complex configurations of time varying externally constrained reconfigurable modular robots can be considered, together with decentralized controllers offering higher control accuracy and error precision.

Figure 3 :
Figure 3: Configuration A for simulation.

Figure 5 :
Figure 5: The analytic chart of configuration A.

Figure 6 :
Figure 6: The analytic chart of configuration B.

Figure 7 :
Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network.

Figure 15 :
Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI based reinforcement learning.

Figure 17 :
Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI based reinforcement learning.

Figure 19 :
Figure 19: Tracking error curve of configuration A joint 1 with ACI based reinforcement learning.

Figure 22 :
Figure 22: Tracking error curve of configuration B joint 2 with ACI based reinforcement learning.

Table 1 :
Parameter list of action-critic-identifier.