Adaptive Reinforcement Learning-Enhanced Motion/Force Control Strategy for Multirobot Systems

This paper presents an adaptive reinforcement learning (ARL-) based motion/force tracking control scheme consisting of an optimal motion dynamic control law and a force control scheme for multimanipulator systems. Specifically, a new additional term and an appropriate state vector are employed in designing the ARL technique for time-varying dynamical systems, with the online actor/critic algorithm established by minimizing the squared Bellman error. Additionally, the force control law is designed after computing the constraint force coefficient via the Moore-Penrose pseudo-inverse matrix. The tracking effectiveness of the ARL-based optimal control in the closed-loop system is verified by theoretical analysis. Finally, simulation studies are conducted on a system of three manipulators to validate the physical realization of the proposed optimal tracking control design.


Introduction
In recent years, the tracking control problem for robotic systems has received remarkable attention from both academia and industrial automation [1,2]. Among different schemes, robotic systems can mainly be separated into two categories: (1) mobile robotics [3,4] and (2) robot manipulators [5]. Because of the growing requirements for high productivity, networked robotic systems, known as collaborations of many robots, can fulfill different tasks through mobile manipulators [6], tractor trailers [3], bilateral teleoperators [5], and cooperating mobile manipulators (CMMs) [7]. It is worth emphasizing that, unlike nonlinear multiagent systems [8,9], which study the relations between agents, CMMs consider the interaction with a rigid object. In the literature on CMMs' control objectives, two categories can be distinguished: multiple mobile robot manipulators cooperatively carrying a common object with unknown parameters and disturbances [7,10-18]; and setups in which one manipulator tightly holds the object with its end effector while the remaining manipulators' end effectors follow trajectories on the surface of the object [19,20]. Most of the existing control literature on networked robotics focuses on the implementation of nonlinear controllers. Based on Lyapunov's direct technique, the authors in [5] presented a robust controller for bilateral teleoperation under time-varying delays and without relative motion. Recently, the backstepping technique for networked robotics has attracted much attention [3,11,21]. In [3], a backstepping method is utilized to implement an adaptive trajectory tracking control design for a tractor trailer, where the proposed control scheme can stabilize cascade control systems. For the purpose of easily computing the backstepping method and handling the full-state constraint, the authors in [21,22] pointed out the appropriate term α in the asymmetric barrier Lyapunov function (ABLF).
This backstepping method is extended in [11] to deal with unknown dead zone nonlinearity and a time-varying term using an input-driven filter and a Nussbaum function, respectively. Another consideration addressed in the existing literature is the holonomic/nonholonomic constraint [12,19,20,23,24]. Motivated by this approach, the corresponding motion dynamic model and constraint force equation are established for finding appropriate controllers [12,19,20,23,24]. As a result, a robust adaptive controller was realized by estimating the derivative of the Lyapunov function and combining the constraint force control to obtain the motion/force control scheme for multimanipulator systems [12]. Due to dynamic uncertainties and external disturbances, fuzzy and neural network (NN) methods are popularly considered for estimating dynamic uncertainties to guarantee system stability [7,10,20]. However, the work in [23] utilized an estimation scheme in the absence of fuzzy systems and NNs to design a decentralized controller for cooperating manipulators. In contrast to this technique, the authors in [7,12,25] proposed centralized control laws based on the consideration of the dynamics of m manipulators. The scheme in [7] extends neural network-based centralized coordination control to achieve the tracking of each mobile manipulator by considering additional actuators using servo motors. However, these nonlinear control solutions do not take into account challenges such as input-state saturation, and their control objectives differ from the traditional tracking control problem. Therefore, we propose motion/force control with a progressive optimality principle to improve the tracking performance.
In the implementation of the optimality principle, an important aspect is developed for robotic systems with the consideration of the robot-environment interaction [26-28]. In control design for robotic systems, there always exists an interaction between the end effector and the environment, which may lead to various challenges for system performance. Therefore, the control scheme should provide a solution that accounts for these interactions. Needless to say, control of the robot-environment interaction has received much attention from researchers for several decades [28]. Due to the effect of the interaction between the robot's end effector and the unknown environment, the desired trajectory in the joint space needs to be modified by the optimality approach [26-28]. In [26], the virtual desired trajectory is generated by adaptive output feedback optimal control for an unknown impedance model. A different approach for finding the virtual reference trajectory was developed using Q-learning [27]. Furthermore, based on ARL-based admittance control, the authors in [28] proposed a complete control structure with an additional torque estimation term and an adaptive controller. An extension of the robot-environment interaction is the physical human-robot interaction [15,29]. In [29], the dynamic model of an n-dimensional manipulator system is established in the task space with the consideration of the interaction between the robot's end effector and the human or environment. Moreover, a desired admittance model was further established in the task space to obtain the virtual desired trajectory under a constraint region via a soft saturation function [29]. A different discussion of the human-robot interaction with the estimation of human intention was given in [15]. Furthermore, the motion/force control law in [15] was proposed using the computation of the constraint force without the optimality problem.
An optimization approach for the dual-arm manipulator with the consideration of the relative Jacobian was presented in [16] using a QP solver. The extension from dual-arm manipulators to multiarm robots is discussed with the classical backstepping structure, and an optimal control law is constructed to tackle optimal behaviors involving obstacle avoidance and energy saving [11]. Additionally, the authors in [17] studied multiarm systems under time-varying communication delay by task-space synchronization with a fixed threshold. For the different networked setting of bilateral teleoperation systems, the optimality principle was also considered in control design [30]. However, the disadvantage of the control law in [30] is that it does not consider solving the Riccati equation. Furthermore, the optimality principle in robotic systems has mostly focused on the interaction between the human or robot's end effector and the environment. This article proposes the consideration of optimal control of trajectory tracking in the motion/force control problem.
To find an optimal control minimizing a given performance index in dynamical systems, it is necessary to solve the so-called Hamilton-Jacobi-Bellman (HJB) equation obtained from the dynamic programming principle. However, it is hard to solve this HJB equation analytically because it leads to a nonlinear partial differential equation. Among the numerical methods that have been considered for solving the HJB equation, a remarkable iterative structure has been developed to solve it online based on the reinforcement learning (RL) principle inspired by machine learning [27,31-39]. To implement a numerical algorithm for solving the HJB equation, there are two major directions: the online actor/critic technique [31,32] and the off-policy technique using integral reinforcement learning (IRL) [38,39]. The first approach, online actor/critic, builds on the property that the Hamiltonian H(x, u*, ∂V*/∂x) equals 0. As a result, the iterative algorithm can be formulated through the Bellman error by solving an optimization problem. For designing the ARL scheme in linear dynamical systems [31] and in nonlinear systems [32,40], the optimal control input is found via the Kronecker product and via approximating neural networks (NNs), respectively. Furthermore, this technique has been extended to several situations, such as goal representation heuristic dynamic programming (GRHDP) with a multivariable tracking scheme [35] and uncertain discrete-time systems using NNs [33]. The second approach, the off-policy technique, is implemented by considering the deviation of the performance index at two different times with appropriate data collection [38,39]. It is worth emphasizing that almost all existing works discussed the ARL design for time-invariant systems [27,33,35,36,38,39].
In contrast, establishing the optimal control problem for robotic systems requires the consideration of ARL in time-varying systems because of the time-varying desired trajectory [31,32]. Furthermore, ARL-based optimal control with the actor/critic learning structure has so far been discussed in [41] only for time-invariant systems. The key idea is to establish the simultaneous learning of neural networks (NNs) in the actor and critic parts [41]. The work in [32] developed the actor/critic learning structure for manipulators by combining the sliding mode control (SMC) technique in the absence of constraint force. Regarding multimanipulator systems, the proposed control law in [12] only implemented robust adaptive control based on the traditional nonlinear control technique and a model separation technique, without considering the optimality principle.
In this paper, we investigate the ARL-based motion/force tracking control problem for a multimanipulator system in the presence of disturbance. The control goal is to obtain the unification of the optimality principle and the tracking problem. The main contributions of this paper can be summarized as follows: (1) Unlike [26,27,34,41,42], which study optimal control for time-invariant systems, we propose an ARL-based optimal control design for the time-varying nonlinear dynamical system arising from the time-varying desired trajectory, utilizing the online actor/critic technique for the motion dynamic model of multimanipulator systems. Furthermore, this work addresses the full motion/force control problem by extending the control scheme with a force control term. (2) Different from numerous existing nonlinear control approaches [10, 12, 13, 17, 20, 23-25, 43, 44], the proposed optimal control algorithm unifies the optimization and tracking problems, as discussed through theoretical analysis and simulation studies.
The rest of this article is organized as follows. Section 2 provides the problem statement. Section 3 describes the ARL-based motion/force control design for multimanipulator systems with tracking analysis. Simulation studies on the multirobot system with three manipulators are presented in Section 4, and concluding remarks are given in Section 5.

Preliminaries and Problem Statement
We investigate n two-link manipulators, as shown in Figure 1, comprising n robots interacting through each pair of manipulators, the unknown environment with the original Cartesian coordinate system O(X, Y, Z), and the corresponding coordinate frame E_i of the end effector of each robot. The interaction between each pair of manipulators leads to a constraint condition described by

||OE_i - OE_j|| = d_{ij},  (1)

where OE_i is the vector from the origin to the end effector E_i under the Cartesian coordinate system and d_{ij} is the distance between the end effectors of the ith and jth robot manipulators. According to (1), combining all the constraints in this multimanipulator system, we obtain the constraint vector

Ξ(η) = 0,  (2)

where Ξ(η) ∈ R^{m×1} and η is the joint vector of the n two-link manipulator systems. It can be seen that there are two main cases in general robots. First, there exist no constraint forces when there is no interaction between the robot and the rigid object. Second, the interaction leads to the existence of a constraint force; we only consider the constrained motion under the following assumption.
Assumption 1. The number of joints is larger than the number of constraint conditions, i.e., 2n > m.
In [12], the dynamic equation of n two-link robot manipulators with the constraint force arising from the interaction between each pair of manipulators is described as

M(η)η̈ + C(η, η̇)η̇ + G(η) + d = τ + J^T(η)λ,  (3)

where M(η) is the inertia matrix, C(η, η̇) collects the Coriolis/centripetal terms, G(η) is the gravity vector, d denotes the external disturbance, J(η) = ∂Ξ/∂η ∈ R^{m×2n}, and λ ∈ R^{m×1} is the vector of Lagrange multipliers. The control objective is to obtain not only the tracking of the joints to the desired trajectories but also the constant distance between the end effectors of each pair of manipulators. Moreover, the motion/force control objective also includes the tracking of the Lagrangian coefficient λ in (3). Furthermore, the proposed optimality-based motion/force control design requires the unification of the optimal control and tracking problems.

Remark 1. It is worth emphasizing that, different from classical nonlinear control approaches [7,17,19,20,23,24,43], the approach in this paper is developed from the optimal control technique for uncertain systems based on adaptive/approximate reinforcement learning (ARL). Moreover, regarding the mathematical model of multimanipulators, this work is an extension of Chen et al. [11] by considering the holonomic constraint force with separation of the joint vector.

Motion/Force Control Design with ARL
In this paper, the complete control structure in Figure 2 is investigated with two independent parts, including the ARL-based optimal control for the motion dynamic model and the force control scheme for tracking the constraint coefficient. Moreover, the convergence and stability of the whole system are analyzed.

ARL-Based Motion Controller.
For the purpose of designing the motion controller in multimanipulator systems, the motion dynamic model is discussed after eliminating the constraint forces. Moreover, because trajectory tracking control yields a time-varying system description, the tracking error model needs to be modified for developing the ARL-based optimal control design.
The independent joint variables are introduced via the transformation

η = φ(ξ),  (4)

where ξ ∈ R^{(2n−m)×1} is the vector of independent joint variables. Taking the derivative of (4), we imply that

η̇ = Λ(ξ)ξ̇,  (5)

where Λ = ∂φ/∂ξ ∈ R^{2n×(2n−m)}. With the purpose of eliminating the constraint force in the motion dynamic model, we use the Moore-Penrose pseudo-inverse matrix of J(η), J⁺ = J^T(JJ^T)^{−1}, to obtain the matrix Q = I − J⁺J satisfying QJ^T = 0. Multiplying both sides of (3) by the matrix Q and utilizing QJ^T = 0 leads to

QM(η)η̈ + QC(η, η̇)η̇ + QG(η) + Qd = Qτ.  (6)

According to (5) and (6), the motion dynamic can be expressed as (7). Regarding the independent joint coordinates, the dynamic model (8) can be represented after multiplying both sides of (6) by [I_{2n−m}, 0] ∈ R^{(2n−m)×2n}.

Remark 2. This paper only discusses the situation in which the number of constraint conditions is smaller than the number of joints (2n > m), as stated in Assumption 1. In the case 2n = m, no independent joints exist; the above modifications cannot be implemented, and the control method must be developed with different approaches [45]. Additionally, no constraint forces are considered in the special case m = 0.

The motion control design is proposed in (9) and (10). With assumption (11) and with υ_d a function of the desired trajectory, system (12) becomes a continuous-time affine system with the typical form

Ż = F(Z) + G(Z)u,  (13)

where Z is the vector of state variables.

[Figure 2: Control structure (ARL-based optimal control, force control scheme, cooperating manipulators).]

Remark 3. Due to the time-varying reference in trajectory tracking, the closed-loop system is inherently time-varying. However, the framework using the additional term (11) and the choice of the state variable vector Z yields the time-invariant systems (12) and (13).
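The constraint-elimination step above can be sketched numerically. The construction Q = I − J⁺J with J⁺ = J^T(JJ^T)^{−1} follows the text; the numeric Jacobian below is an arbitrary illustrative example, not the paper's model.

```python
# Sketch of the constraint-elimination step for a full-row-rank constraint
# Jacobian J of shape (m, 2n). Multiplying the dynamics by Q removes the
# J^T(eta) lambda term, since Q J^T = 0.
import numpy as np

def null_space_projector(J: np.ndarray) -> np.ndarray:
    """Return Q = I - J^+ J, which satisfies Q J^T = 0 for full row rank J."""
    J_pinv = J.T @ np.linalg.inv(J @ J.T)   # J^+ = J^T (J J^T)^{-1}
    return np.eye(J.shape[1]) - J_pinv @ J

# Example: m = 2 constraints, 2n = 6 joints (a random full-row-rank Jacobian).
rng = np.random.default_rng(0)
J = rng.standard_normal((2, 6))
Q = null_space_projector(J)

print(np.allclose(Q @ J.T, 0.0))   # True: the constraint force is annihilated
print(np.allclose(Q @ Q, Q))       # True: Q is a projector
```

Note that Q is an orthogonal projector onto the null space of J, which is exactly why the multiplier term drops out of (6).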
Therefore, in contrast to the existing methods in [36], the proposed control easily handles the situation of time-varying systems by eliminating the term −∂V*/∂t.

The Optimal Control Problem.
In this section, we consider the general class of time-varying systems (14) with the associated cost function defined as

J(Z, u) = ∫_t^∞ r(Z(τ), u(τ)) dτ,  (15)

where r(Z(τ), u(τ)) = Z^T QZ + u^T Ru with weighting matrices Q = Q^T ≥ 0 and R = R^T > 0 of appropriate dimensions. The optimal problem for the motion model of the multimanipulators (12) and (13) is to find an admissible control policy [31,32] that minimizes this value. This is because the control policy u(Z) needs to guarantee the existence of the estimated Bellman function V(Z) obtained from the HJB equation.

Definition 1.
A control law u(Z) is said to be admissible with respect to the performance index (15) on a compact set Ω, denoted u(Z) ∈ Υ(Ω), if u(Z) is continuous, stabilizes (14), and keeps J(Z, u) bounded for every Z ∈ Ω. It is worth emphasizing that, in the general case of the time-varying system (14), the optimal control strategy needs to be considered as a time-varying function u*(Z, t). Hence, the time-varying Bellman function V*(Z(t), t) is defined at an arbitrary time as the minimum of the performance index (15) over admissible controls. The HJB equation can be achieved by taking the time derivative of V*(Z(t), t) in two different ways. First, it is computed directly along system (14), giving (17). Second, thanks to the dynamic programming principle, the derivative of V*(Z(t), t) with respect to time is also computed from the performance index, giving (18). According to (17) and (18), the first part of the HJB problem is given in (19). Moreover, the second part of the HJB problem is developed from the performance index: by the dynamic programming principle, the index over [t, ∞) splits into the running cost over [t, t + Δt] plus V*(Z(t + Δt), t + Δt), so the Bellman function can be rewritten accordingly.
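The displays omitted from this derivation can be reconstructed as follows. This is a hedged reconstruction from the surrounding definitions, assuming the affine form Ż = F(Z) + G(Z)u and the running cost r(Z, u) = Z^T QZ + u^T Ru stated in the text:

```latex
V^{*}(Z(t),t) = \min_{u \in \Upsilon(\Omega)} \int_{t}^{\infty}
  \big( Z^{T}QZ + u^{T}Ru \big)\,\mathrm{d}\tau ,
\qquad
\frac{\mathrm{d}V^{*}}{\mathrm{d}t}
  = \frac{\partial V^{*}}{\partial t}
  + \Big( \frac{\partial V^{*}}{\partial Z} \Big)^{T}
    \big( F(Z) + G(Z)u^{*} \big).
```

Since dynamic programming gives dV*/dt = −r(Z, u*), equating the two expressions yields the time-varying HJB equation

```latex
0 = \frac{\partial V^{*}}{\partial t}
  + \Big( \frac{\partial V^{*}}{\partial Z} \Big)^{T}\big( F(Z) + G(Z)u^{*} \big)
  + Z^{T}QZ + u^{*T}Ru^{*},
\qquad
u^{*} = -\tfrac{1}{2}\, R^{-1} G^{T}(Z)\, \frac{\partial V^{*}}{\partial Z},
```

where u* follows from stationarity of the Hamiltonian with respect to u.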

Mathematical Problems in Engineering
As Δt → 0⁺, one can derive (24).

Remark 4. It is worth noting that the term ∂V*(Z(t), t)/∂t in (19) and (24) is the main obstacle in developing the ARL algorithm, and there has been very little research considering a direct method for this challenge. The authors in [36] developed solutions for ARL-based time-varying optimal control using a data-driven framework and a Newton-Leibniz formula-based function approximation method. In this paper, we indirectly handle the time-varying tracking error model through the consideration of systems (12) and (13). Therefore, the two parts of the HJB problem can be modified by dropping the partial time derivative, yielding (25) and (26).

ARL-Based Control Design for Independent Joints.
Because of the difficulty of solving HJB equation (25) directly to obtain the corresponding optimal control algorithm, an approximate solution with a neural network (NN) is used to develop the ARL design for the motion dynamic model of the multimanipulators. Thanks to the smoothness of the functions V*(Z) and u*(Z), they can be described over any compact domain C ⊆ R^12 as

V*(Z) = W^T ψ(Z) + ϵ(Z),  (27)

where W ∈ R^N is an uncertain ideal NN weight vector, N is the chosen number of neurons, ψ(Z) ∈ R^N denotes an activation function vector satisfying ψ_j(0) = 0 and (∂ψ_j/∂Z)|_{Z=0} = 0, ∀j = 1, ..., N, and ϵ(Z) ∈ R is the estimation error of the Bellman function V*(Z). Due to the unknown ideal NN weight vector W ∈ R^N, it is necessary to obtain the corresponding adjusting mechanisms Ŵ_a, Ŵ_c for estimating the actor/critic NNs to achieve the optimal control design without finding the HJB solution analytically (for more details, see [41]). Moreover, the smooth NN activation function vector ψ(Z) ∈ R^N can be chosen based on the description of the multimanipulators (see Section 4). In [41], thanks to the Weierstrass approximation theorem, not only V*(Z) but also ∂V*(Z)/∂Z can be uniformly approximated, with ϵ(Z), ∂ϵ(Z)/∂Z → 0 as N → ∞. For a fixed number N, the approximated Bellman function of the critic NN, V̂(Z), and the approximated optimal control policy of the actor NN, û(Z), are represented in (28). It should be noted that properties (19) and (24) of the Hamiltonian H(Z, u, ∂V/∂Z) = r(Z, u) + (∂V/∂Z)(F(Z) + G(Z)u), with the optimal controller u*(Z) and corresponding value function V*(Z), guarantee that the adjusting mechanisms of the actor NN weights Ŵ_a and critic NN weights Ŵ_c are simultaneously tuned to minimize the squared Bellman error δ_hjb and the corresponding integral, respectively. According to the property of HJB problem (25), it follows that H*(Z, u*, ∂V*/∂Z) = 0.
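The conditions ψ_j(0) = 0 and (∂ψ_j/∂Z)|_{Z=0} = 0 can be illustrated with the vector of quadratic monomials z_i z_j. The paper leaves ψ(Z) to be chosen from the multimanipulator description (Section 4), so this basis is only one plausible choice, not the authors' exact network:

```python
# A hedged illustration of an activation vector psi(Z) satisfying
# psi_j(0) = 0 and (d psi_j / d Z)|_{Z=0} = 0: quadratic monomials z_i z_j.
import numpy as np

def quadratic_basis(Z: np.ndarray) -> np.ndarray:
    """psi(Z) = [z_i z_j for i <= j], giving N = n(n+1)/2 neurons for n states."""
    n = Z.shape[0]
    return np.array([Z[i] * Z[j] for i in range(n) for j in range(i, n)])

def critic_value(Wc: np.ndarray, Z: np.ndarray) -> float:
    """Approximated Bellman function V_hat(Z) = Wc^T psi(Z)."""
    return float(Wc @ quadratic_basis(Z))

Z = np.array([1.0, 2.0, -1.0])
psi = quadratic_basis(Z)            # N = 6 features: [1, 2, -1, 4, -2, 1]
print(quadratic_basis(np.zeros(3))) # all zeros: psi(0) = 0 as required
```

Each monomial is smooth with vanishing value and gradient at the origin, so V̂(Z) = Ŵ_c^T ψ(Z) inherits V̂(0) = 0.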
After obtaining the error between the approximated functions V̂(Z), û(Z) and the optimal quantities V*(Z), u*(Z), the Bellman error δ_hjb is represented in (29), where σ(Z, u) = (∂ψ/∂Z)(F(Z) + G(Z)u) is the regression vector of the critic part. Based on the results in [31], the adjusting mechanism of the critic NN is computed in (30), where ν, k_c ∈ R are constant positive coefficients and λ(t) ∈ R^{N×N} is an estimated symmetric matrix given by the solution of differential equation (31), where t_s^+ denotes the resetting time enforcing a lower bound λ_min on λ(t). Following [31], to keep λ(t) positive definite and to prevent the covariance wind-up problem, the covariance matrix λ(t) is chosen as in (32). Additionally, the adjusting mechanism of the actor NN part in (33) is obtained by minimizing the squared Bellman error.

Remark 5. The ARL technique has been investigated with many different approaches, such as off-policy integral reinforcement learning (IRL) [46,47] and Q-learning [42]. However, there is a wide difference between the online actor/critic technique in this paper and IRL or Q-learning. The off-policy IRL requires the complete data collection between two sequential sampling times in computing the optimal law [46,47]. In contrast, the online actor/critic method in (29), (30), (35), and (33) requires only initial values to implement the computation. For example, this article designs the control law from HJB problems (25) and (26), whereas the off-policy IRL method considers the relation between the deviation of integrals at two sampling times and the collected data under the dynamic programming law. In addition, the Q-learning method uses a Q-function in terms of not only the state variables but also the control inputs, which differs from the Bellman function [42]. This property can result in difficulties in implementing the Q-learning technique for continuous-time systems.
Additionally, compared with the data-driven method in [48], this work provides a neural network-based technique that avoids the Kronecker product in estimating the actor/critic terms. Actor/critic-based approaches have been discussed in [49,50] for nonlinear affine systems using the residual error δ_hjb. However, in view of the identifier considered in [49,50], the computation of the residual error δ_hjb and the training laws of the actor/critic weights differ between the proposed method and the work in [49,50].
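The critic adaptation described above can be sketched numerically. This is a hedged illustration of one normalized-gradient critic step driven by the squared Bellman error, in the spirit of (29)-(33); the paper's exact laws may differ in normalization details, and the regression vector and all gains below are illustrative placeholders, not the paper's model.

```python
# Toy critic update minimizing the squared Bellman error delta_hjb.
import numpy as np

def bellman_error(Wc, Z, u, sigma, Qm, Rm):
    """delta_hjb = r(Z, u) + Wc^T sigma, with r = Z^T Q Z + u^T R u."""
    r = Z @ Qm @ Z + u @ Rm @ u
    return r + Wc @ sigma

def critic_step(Wc, Lam, sigma, delta, k_c=1.0, nu=0.01, dt=1e-3):
    """Normalized-gradient critic update with a covariance-like matrix Lam."""
    m = 1.0 + nu * sigma @ Lam @ sigma
    dWc = -k_c * (Lam @ sigma) * delta / m**2
    dLam = -k_c * Lam @ np.outer(sigma, sigma) @ Lam / m**2
    return Wc + dt * dWc, Lam + dt * dLam

# Toy data: N = 3 critic weights, 2 states, 1 input.
Z, u = np.array([0.5, -0.2]), np.array([0.1])
Qm, Rm = np.eye(2), np.eye(1)
sigma = np.array([0.3, -0.1, 0.2])  # sigma = (d psi/d Z)(F(Z)+G(Z)u), assumed given
Wc, Lam = np.zeros(3), np.eye(3)

delta0 = bellman_error(Wc, Z, u, sigma, Qm, Rm)
Wc, Lam = critic_step(Wc, Lam, sigma, delta0)
delta1 = bellman_error(Wc, Z, u, sigma, Qm, Rm)
print(abs(delta1) < abs(delta0))   # True: the step reduces the Bellman error
```

The normalization term m prevents the update from blowing up for large regression vectors, mirroring the role of the covariance resetting discussed above.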

The Control Design of Dependent Joints.
After completing the tracking control design of the independent joints, the controller of the remaining dependent joints can be achieved from the relation between the two groups of joint variables, i.e., the independent and dependent joints. According to (8), this leads to representation (36). On the other hand, thanks to the relation τ⁽¹⁾ = [τ_ind^(1)T, τ_dep^(1)T]^T, multiplying both sides of (6) by [0, I_m] ∈ R^{m×2n} gives the dependent joints' dynamics (37), where M″ = [0, I_m]QMΛ and the remaining coefficient matrices are defined analogously. Substituting (36) into (37), we obtain the control law (39) of the dependent joint variables. It is worth noting that the problem of synchronization between multiple joints can be considered through the relation between dependent and independent joints. Therefore, the desired trajectories of the independent joints determine the trajectories of the remaining joints, which depend on these independent joints. Furthermore, this yields relationship (39) between the control inputs of the independent and dependent joints.
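The selector matrices used to split the joint and torque vectors, as in (8) and (37), amount to simple block bookkeeping. The sizes below (2n = 6, m = 2) are illustrative, not the paper's three-manipulator values:

```python
# Selector matrices [I_{2n-m}, 0] and [0, I_m] splitting a torque vector into
# independent and dependent parts.
import numpy as np

n2, m = 6, 2                                                 # 2n joints, m constraints
S_ind = np.hstack([np.eye(n2 - m), np.zeros((n2 - m, m))])   # [I_{2n-m}, 0]
S_dep = np.hstack([np.zeros((m, n2 - m)), np.eye(m)])        # [0, I_m]

tau = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
tau_ind = S_ind @ tau        # torques of the independent joints
tau_dep = S_dep @ tau        # torques of the dependent joints

print(tau_ind)   # [1. 2. 3. 4.]
print(tau_dep)   # [5. 6.]
```

Stacking them back, tau = S_ind^T tau_ind + S_dep^T tau_dep, which is the relation exploited when substituting (36) into (37).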

Constraint Force Control Design.
For the purpose of designing the motion/force control scheme for the n manipulators, the constraint force controller and the motion control law need to be established based on the computation of the coefficient λ in (3). Before demonstrating the motion/force control design, we compute the coefficient λ as a function of the control input τ as follows. According to (3)-(5), we obtain equation (40), where J*^T(ξ) = J^T(φ(ξ)). Moreover, the coefficient λ can be computed after eliminating the acceleration term by multiplying both sides by a matrix N satisfying the elimination condition. Therefore, we obtain the coefficient λ in (43), with N chosen as in (44), where A = J*(ξ)M^{−1}(η). Based on this, the proposed motion/force control design is described in (46). This leads to relations (47). It should be noted that the constraint force coefficient λ depends on the ARL-based motion dynamic control τ⁽¹⁾ described in (47). Therefore, with the purpose of tracking the coefficient λ, we propose the remaining term τ⁽²⁾ in (47) as given in the following. The tracking problem will be proved in the next section.
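The isolation of the constraint-force coefficient λ along the lines of (40)-(44) can be sketched using the standard constrained-dynamics identities M(η)η̈ + Cη̇ + G = τ + J^T λ and Jη̈ + J̇η̇ = 0 (the disturbance term is omitted for clarity). All numeric values below are illustrative placeholders, not the paper's three-manipulator model:

```python
# Hedged sketch: lambda = (J M^{-1} J^T)^{-1} (J M^{-1}(C eta_d + G - tau) - Jdot eta_d).
import numpy as np

def constraint_force(M, C, G, J, Jdot, eta_d, tau):
    """Solve for the Lagrange multiplier lambda of the constrained dynamics."""
    A = J @ np.linalg.inv(M)          # cf. A = J* M^{-1} in (44)
    lhs = A @ J.T                     # m x m, invertible when J has full row rank
    rhs = A @ (C @ eta_d + G - tau) - Jdot @ eta_d
    return np.linalg.solve(lhs, rhs)

# Toy check with m = 1 constraint and 4 joints: if tau exactly cancels the
# Coriolis/gravity terms and the constraint-velocity term vanishes, lambda = 0.
M = np.diag([2.0, 1.0, 1.5, 1.0])
C, G = np.eye(4), np.array([0.0, 9.8, 0.0, 9.8])
J, Jdot = np.array([[1.0, 0.0, -1.0, 0.0]]), np.zeros((1, 4))
eta_d = np.array([0.1, -0.2, 0.3, 0.0])
tau = C @ eta_d + G
lam = constraint_force(M, C, G, J, Jdot, eta_d, tau)
print(np.allclose(lam, 0.0))   # True: no constraint force is needed
```

Since λ depends affinely on τ, the force-tracking term τ⁽²⁾ can be shaped on top of the motion control τ⁽¹⁾, consistent with the decomposition in (47).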

Convergence and Stability Analysis.
With the purpose of considering the stability and convergence of the closed-loop system under the proposed motion/force control design, several steps are implemented. First, the Bellman error δ_hjb, the critic weight estimation error W̃_c = W − Ŵ_c, and the persistence of excitation (PE) condition are given. Second, thanks to the optimal Bellman function and the PE condition, the corresponding Lyapunov function candidate is chosen. Third, the stability of the whole system (Figure 1) and the roles of the constraint force control and the dependent joints' motion control are discussed. According to (27)-(31), the Bellman error function δ_hjb depends on the state variable vector Z(t) as in (49). Combining (49) with (32), the dynamic equation of the critic error W̃_c = W − Ŵ_c is given in (50), with the quantities defined in (51). The nominal system (52) is obtained after eliminating the NN error terms. For the exponential convergence of W̃_c, it is necessary to guarantee the PE condition (53). The following main theorem is proposed, with the assumptions discussed in [41], to estimate the attraction region of the tracking errors.

Theorem 1. Consider the cooperating manipulators in Figure 1 under the bound conditions of the NN, including the ideal weights, the activation function ψ(·), and its derivative as described in [41], and the PE condition of the signal vector Π(t), with the parameters selected to satisfy condition (54), where the coefficients in (54) are given in (55), (56), and (35). Consider the control structure (Figure 2) with the ARL-based control scheme (46), (10), (39), and (30) and the associated adjusting mechanisms (35) and (33). Then, the tracking errors of the closed-loop system converge to the attraction region estimated in the proof.

Proof. First, the tracking effectiveness of the motion tracking control is considered under the ARL-based optimal control scheme for the corresponding tracking error motion models (12) and (13). Due to the satisfaction of PE condition (53), there exists a time-varying function V_c: R^N × [0, ∞) → R satisfying the inequalities (55). Thanks to the Bellman function V*(Z) (20) and the time-varying V_c(W̃_c, t) (55), a Lyapunov function candidate can be chosen to determine the stability of the whole cascade control system (Figure 2) and the tracking of the actor and critic NN weights. It is worth noting that the term (1/2)ρ z_ξ^T z_ξ (with ρ a positive constant coefficient) is an additional term introduced for the tracking of the whole motion control system. Because V*(Z) is a smooth, positive definite function, there exist two K-class functions c_1 and c_2 bounding it. Because of (55) and (8), the strictly proper representation of the Lyapunov function V_L(Z) can be determined. Taking the derivative of V_L along motion system (13) under the control input u(Z) yields (60), with the quantities defined in (61). According to (19), the relation between the Bellman function and the optimal control law can be represented as (62). Using (62), optimal law (28), (35), and (55) in (60), we conclude the chain of inequalities leading to (72). Based on (72), inequality (71) can be rewritten as (73). It is obvious that (d/dt)V_L is negative whenever z(t) lies outside the attraction region, which establishes the convergence of the tracking errors to this region.

Simulation Studies

To facilitate comparison with the existing work [12], the parameters of the multimanipulator system are given as D_12 = 7 m, D_13 = 3.5 m, l_1 = l_2 = 1 m, m_1 = m_2 = 1 kg, r_1 = r_2 = 0.9 m, l_3 = l_4 = 1.5 m, m_3 = m_4 = 2 kg, r_3 = r_4 = 1.2 m, l_5 = l_6 = 1.2 m, m_5 = m_6 = 1.5 kg, and r_5 = r_6 = 1 m.
Because of the two constraints (76), this multimanipulator has four independent joints η_1, η_2, η_3, and η_5 with the desired trajectories η_1d = π/12, η_2d = 1.91π + 0.2 sin(t), η_3d = 0.51π, and η_5d = 0.191π, and the initial values of the joints are η_1(0) = π/6, η_2(0) = 1.92π, η_3(0) = 2π/3, η_4(0) = −0.0565, η_5(0) = π/4, and η_6(0) = −0.0853. According to the given multimanipulator and the above parameters, the models needed for investigating the control design are obtained, including motion dynamic models (7) and (37) and constraint force (43). In light of Theorem 1, the control parameters are given as k_1 = 0.5, η_c = 5, η_a1 = 5, η_a2 = 10, and ν = 0.01. The dynamic model of the three manipulators with the above parameters and the ARL-based motion/force control scheme are implemented via an m-file script and Simulink in MATLAB. The purpose of the simulations is to verify the tracking effectiveness of not only the joint variables but also the constraint force coefficients. Additionally, since the optimal control for the motion dynamic model is implemented by the radial basis function (RBF) network-based actor/critic reinforcement learning algorithm, the convergence of the adjusting mechanisms of the actor/critic weights and the tracking effectiveness of the joint variables are examined. Referring to Figures 4 and 5, the responses of joint variables 1-6 converge, with tracking errors reaching zero within 10 s, which shows that the closed-loop system achieves high precision under the ARL technique. Additionally, the convergence of the adjusting mechanisms of the actor NN and critic NN to the ideal weights (0.0006, −0.0055, 0.0045, 0.0038, 0.014, −0.018, −0.009, 0.019, 0.011, 0.0055, 0.000041, 0.00051, 0.0492, 0.002, 0.000015, 0.0024, 0.0047, 0.00019) is also obtained (Figures 6-11). Finally, the high-performance responses of the constraint force and control inputs are shown in Figures 12 and 13.
It should be noted that these simulation results are conducted for the three manipulators described in the work [12]. However, the results are obviously different from those of [12]: due to the ARL-based approach, the proposed control scheme also considers the convergence of the weights in the actor/critic neural networks (Figures 6-11).

Conclusions
This paper presented an ARL-based motion/force tracking control for multimanipulator systems, which consists of an ARL-based motion control law and a nonlinear force controller. Both the minimization of the performance index and closed-loop stability are guaranteed by theoretical analysis. Moreover, the proposed control structure is able to handle the time-varying model using the additional term. Simulation studies are conducted to illustrate the efficacy of the proposed control structure. Future work is to extend the design to completely uncertain multimanipulator systems by integrating model-free reinforcement learning.

Data Availability
This publication is supported by multiple datasets, which are available at locations cited in the reference section.

Conflicts of Interest
The authors declare that they have no conflicts of interest.