A Q-Learning-Based Parameters Adaptive Algorithm for Formation Tracking Control of Multi-Mobile Robot Systems

This paper proposes an adaptive formation tracking control algorithm optimized by a Q-learning scheme for multiple mobile robots. To handle model uncertainties and external disturbances, a linear extended state observer is designed and used to develop an adaptive formation tracking control strategy. An adaptive method for the sliding mode control parameters, optimized by the Q-learning scheme, is then employed, which avoids the complex parameter tuning process. Furthermore, the stability of the closed-loop control system is rigorously proved by means of matrix properties of graph theory and Lyapunov theory, and the formation tracking errors are guaranteed to be uniformly ultimately bounded. Finally, simulations show that the proposed algorithm achieves a faster convergence rate, higher tracking accuracy, and better steady-state performance.


Introduction
A multi-mobile robot system can exhibit intelligent behaviours through mutual cooperation and achieve work efficiency and fault tolerance that a single robot cannot provide, so it can complete more difficult tasks. The coordinated formation control of multiple mobile robots has received extensive attention due to its important applications in the industrial and medical fields [1]. Most existing control methods for the formation control of multiple mobile robots include behaviour-based control [2], virtual structures [3], and the leader-follower architecture [4][5][6]. As a decentralized control strategy, the leader-follower formation structure has become the preferred choice due to its simplicity and scalability, and it requires less computation and communication resources than other strategies [7]. Mobile robots are divided by movement type into omnidirectional (holonomic) mobile robots (OMRs) and nonholonomic ones [8]. Concerning the uncertainty and disturbance of robot docking, reference [9] proposed a novel robust containment architecture for nonholonomic mobile robot formations with docking capabilities, which realizes multirobot formation maintenance/switching, docking, and collision avoidance. In [10], a dynamic control law was developed for the cooperative target encircling problem of multiple unicycle mobile robots subject to heterogeneous input disturbances generated by a linear exogenous system.
Since nonholonomic mobile robots have fewer controllable degrees of freedom (DOFs) than holonomic mobile robots, geometric constraints are introduced on the robot's motion. Common examples with nonholonomic constraints are unicycles and car-like wheeled mobile robots. In contrast, holonomic mobile robots have the same number of controllable DOFs as total DOFs, which makes them very flexible and able to move within the workspace without geometric constraints (e.g., they can perform rotation and lateral translation simultaneously). A typical example of this class of vehicles is the omnidirectional robot with Mecanum wheels. More details on the types and configurations of mobile robots can be found in [11]. References [12, 13] studied the formation control problem of multiple omnidirectional mobile robots. Reference [12] developed a distributed adaptive control law for a multirobot system in which some mobile robots obtain information from moving targets, using backstepping control technology for formation control. Reference [13] proposed improved collision avoidance and formation control to configure a multirobot system optimized for omnidirectional visual simultaneous localization and mapping. However, the performance of the control schemes in [12, 13] deteriorates when there are uncertainties in the kinematics and dynamics of the omnidirectional mobile robots.
In the research on formation control of complex nonlinear systems, sliding mode control (SMC) is an effective robust control method for suppressing disturbances because it is insensitive to system parameter changes and external disturbances once the system enters the sliding mode. In [14], a nonsingular fast terminal SMC was proposed, which can drive the tracking error to zero in finite time. Reference [15] investigated leader-follower formation control for multiwheel mobile robots by combining a motion controller with a sliding-mode-based dynamic controller. Considering the bounded external disturbances and parameter uncertainties of mobile robots, reference [16] proposed a dual-loop attitude tracking robust controller for mobile robots, using SMC with a modified reaching law to ensure that the actual speed converges within finite time. The main disadvantage of SMC is the chattering phenomenon caused by the discontinuity of the control law. To alleviate chattering, adaptive SMC [17] and higher-order SMC [18, 19] have been proposed. However, these methods may still cause serious chattering, even leading to instability, when the system is exposed to a dynamic environment with large uncertainties and disturbances. Moreover, traditional SMC is conservative to some extent since it ignores information about the uncertainties and disturbances. An effective way to address this problem is to use disturbance estimation and compensation to reduce conservatism and improve control performance. Reference [20] proposed a disturbance observer and super-twisting SMC for multirobot formation. Reference [21] designed an adaptive high-gain observer to estimate the nonlinearity appearing in the dynamics of a wheeled robot. In [22], active disturbance rejection control technology was used to estimate the external disturbance in the inner loop of a double closed-loop strategy.
On the other hand, the velocities of omnidirectional mobile robots often cannot be measured because the sensors needed for controller design are lacking.
Extended state observer (ESO) based active disturbance rejection control (ADRC), proposed by Han [23], is a powerful tool for coping with uncertainties and external disturbances. The key idea of ADRC is that the total disturbances (including internal uncertain dynamics, cross-coupling, and external disturbances), regarded as an extended state of the system, can be estimated by the ESO and then compensated in the control signal. In fact, the ESO is a state observer that estimates both the system states and the total disturbances. Taking advantage of this, the ESO is adopted to estimate the total disturbances, and an ESO-based controller is then constructed to compensate them. Reference [24] employed a nonlinear extended state observer (NESO) to estimate unknown states as well as uncertainties and designed a robust finite-time tracking control scheme for wheeled mobile robots with parameter uncertainties and disturbances. Reference [25] incorporated NESO-based estimation and compensation signals into the closed loop and proposed an NESO-based decoupling control method for the attitude control problem of hypersonic gliding vehicles. In [26], an NESO was used to estimate the system uncertainties, and a saturation-resistant adaptive SMC was designed based on the estimated values to achieve robust trajectory tracking for a wheeled mobile robot. However, it is difficult to find appropriate nonlinear functions to design an NESO in practice. For convenience of theoretical analysis, Gao proposed a linearized, bandwidth-based linear ADRC (LADRC) to simplify parameter tuning and standardize controller tuning [27]. A linear ESO (LESO) with nested inner and outer loops was used in [28] to actively estimate and eliminate generalized disturbances and improve the estimation accuracy in various practical models.
The proposal of LADRC makes the design and adjustment of the controller easier and more effective, and the tracking error is smaller than with some classic control structures [16], which greatly promotes the engineering application of ADRC. However, LADRC loses some design flexibility because its parameters are adjusted solely on the basis of bandwidth. A general LADRC with more tuning parameters was proposed in [29], and LADRC has since been applied to various engineering cases and has become much more popular [30, 31].
Reinforcement learning (RL) has developed rapidly in recent years. As one of the important RL algorithms, Q-learning is off-policy, tabular, model-free, and based on temporal-difference methods [32]. It has the advantages of not relying on models and of learning effectively for complex systems. To improve control performance, some researchers have combined Q-learning with PID control and proposed many effective control methods [33, 34]. For autonomous underwater vehicle systems, reference [35] proposed a Q-learning PID controller based on an RBF neural network, in which the Q-learning network adaptively optimizes the control parameters. Reference [36] incorporated model-based Q-learning into a predictive control setting to provide closed-loop stability during online learning and ultimately improve the performance of the finite-horizon controller. Chen proposed a Q-learning-based parameter adaptation method for an active disturbance rejection controller for ship heading control under multiple uncertainties due to wind, wave, and current disturbances [37].
Inspired by the above statements, this paper investigates the formation tracking control of a multiple omnidirectional mobile robot system. The considered mobile robots have internal modelling uncertainties and external disturbances (regarded together as total disturbances). To handle the disturbances, an LESO is constructed, through which the total disturbances can be effectively estimated. Then, on the basis of a distributed formation tracking control architecture, an LESO-based SMC (LESO-SMC) is designed for each OMR to ensure that the observer errors and the formation neighbourhood errors are uniformly ultimately bounded (UUB). However, LESO-based control is not widely used in practice because adequate methods for LESO parameter adjustment are lacking. In view of this, and to obtain better control performance, Q-learning is employed to optimize the bandwidth parameter of the LESO and the control parameters of the SMC, avoiding the complex parameter tuning process. In addition, a simulation example is given to verify the effectiveness of the proposed method. The main features of the proposed method are summarized as follows: (1) An LESO is constructed to estimate the 'total disturbances' in real time, including both internal parameter uncertainties and external disturbances, and an LESO-SMC-based formation protocol is then developed for the OMR system. The LESO provides distinctly better robustness against the 'total disturbances' by providing accurate input variables to the control system, including the states of the mobile robot at each order as well as the extended state representing the 'total disturbances'. A corresponding improved SMC strategy then compensates the influence of the 'total disturbances', which ensures a faster convergence rate and decreases the conservatism of the traditional SMC.
(2) To take full advantage of the LESO-SMC, an adaptive method for the LESO-SMC parameters, optimized by the Q-learning algorithm, is proposed for the formation tracking control of the OMR system. Q-learning performs online parameter adaptation (including the observer, sliding mode, and controller parameters), which yields better formation tracking performance and avoids the complex parameter tuning process. The organization of this paper is as follows. In Section 2, the dynamic model and some preliminary knowledge are outlined. In Section 3, the proposed controller design based on the LESO and SMC is presented, and both the Q-learning algorithm and the Q-learning parameter tuning process are introduced. The results of numerical simulations are discussed in Section 4, followed by the conclusion of the paper.
Notations: A^T and A^{-1} represent the transpose and inverse of matrix A, respectively. R^n represents the set of n-dimensional real column vectors. I_n denotes the n × n identity matrix, and 0_n denotes the n × n zero matrix. ⊗ stands for the Kronecker product, sign(·) is the sign function, and ‖·‖ represents the Euclidean norm. diag{x_1, x_2, . . . , x_n} denotes the diagonal matrix with diagonal entries x_1, x_2, . . . , x_n. λ_max(A) and λ_min(A) represent the maximum and minimum eigenvalues of matrix A, respectively.

Dynamic Model.
The Euler-Lagrange equation of motion can be used to describe the dynamic behaviour of an OMR. The dynamic model of the ith OMR can be described as [31]

M_i(q_i) \ddot{q}_i + C_i(q_i, \dot{q}_i) \dot{q}_i + G_i(q_i) = \tau_i,   (1)

where q_i = [x_i, y_i, \theta_i]^T \in R^3 represents the position and orientation angle of the ith robot in the world coordinate frame, M_i is the inertia matrix, C_i is the Coriolis and centrifugal term, G_i is the gravitational force, and \tau_i denotes the control input. It is assumed that the robots move on flat ground, so the gravitational force G_i is 0. Considering the unknown dynamic disturbances and model uncertainties, a new dynamic model is obtained as follows:

(M_i + \Delta M_i) \ddot{q}_i + (C_i + \Delta C_i) \dot{q}_i + d_i = \tau_i,   (2)

where \Delta M_i \in R^{3\times 3} and \Delta C_i \in R^{3\times 3} denote uncertainty terms and d_i \in R^3 denotes the unknown disturbance term. The above equation can be rewritten as

\ddot{q}_i = M_i^{-1} \tau_i + f_i,   (3)

where f_i = -M_i^{-1} (C_i \dot{q}_i + \Delta_i) denotes the total disturbances and \Delta_i = \Delta M_i \ddot{q}_i + \Delta C_i \dot{q}_i + d_i is the lumped uncertain term.

Assumption 2.
The uncertain terms \Delta M_i and \Delta C_i are bounded.
As for the LESO design in the following section, the total disturbance is extended as a new state. Define x_{i,1} = q_i, x_{i,2} = \dot{q}_i, and x_{i,3} = f_i \in R^3, where x_{i,3} is the extended state of the total disturbances f_i. The dynamic model (3) can be transformed into the following system:

\dot{x}_{i,1} = x_{i,2},
\dot{x}_{i,2} = x_{i,3} + M_i^{-1} \tau_i,
\dot{x}_{i,3} = h_i,   (4)

where h_i = \dot{f}_i. The virtual leader is described by \dot{q}_l = p_l, \dot{p}_l = u_l, where q_l and p_l indicate the position and velocity of the virtual leader, respectively, and u_l is the control input of the virtual leader, which can be obtained by some followers.

Assumption 3.
The derivative of the total disturbances f_i is bounded by an unknown constant \rho_i, i.e., \|\dot{f}_i\| \le \rho_i.

Remark 1.
Note that f_i in the dynamic model (4) represents unknown terms, such as external disturbances and model uncertainties arising from the mechanism of OMRs. In practical applications, the total disturbances mainly include wheel-ground slipping, modelling uncertainty due to robot load variation, etc.

Remark 2.
In practice, both the speed of the DC motor that drives the OMR forward and its derivative have upper bounds; i.e., q_i, \dot{q}_i, and \ddot{q}_i are all bounded. \Delta_i depends on \dot{q}_i and \ddot{q}_i, so we can conclude that \Delta_i and its derivative are bounded. Similarly, f_i and its time derivative are bounded as well. Therefore, the boundedness assumptions on \Delta_i and f_i are reasonable.

Algebraic Graph Theory.
Consider an omnidirectional mobile robot system consisting of one virtual leader and n followers. Each robot is regarded as a node, and the information exchange among the follower robots is described by a directed graph G with adjacency matrix A = [a_{ij}] and Laplacian matrix L. The leader adjacency matrix is B = diag{b_1, b_2, . . . , b_n}, where b_i > 0 if the ith follower has access to the virtual leader and b_i = 0 otherwise. It is assumed that graph G for the n follower robots contains a directed spanning tree; i.e., there is a vertex (the root node) that can reach all the other vertices through a directed path.
Lemma 1 (see [38]). If G is a directed graph which contains a directed spanning tree and at least the root-node agent has access to the virtual leader, then the matrix L + B is of full rank.
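Lemma 1 can be checked numerically. The sketch below builds L and B for a hypothetical three-follower topology (the paper's Figure 2 is not reproduced here, so the edge set and leader-access weights are assumptions) and verifies that L + B has full rank:

```python
import numpy as np

# Hypothetical topology (illustrative only): follower 1 has access to the
# leader (b_1 = 1), and information flows 1 -> 2 -> 3 among the followers,
# so the graph contains a directed spanning tree rooted at follower 1.
A = np.array([[0, 0, 0],
              [1, 0, 0],
              [0, 1, 0]])            # adjacency: A[i, j] = 1 if j sends to i
L = np.diag(A.sum(axis=1)) - A       # graph Laplacian L = D - A
B = np.diag([1, 0, 0])               # leader adjacency matrix diag{b_1, b_2, b_3}

# Lemma 1: spanning tree + root access to the leader => L + B is nonsingular
assert np.linalg.matrix_rank(L + B) == 3
```

If the leader access is removed (B = 0), L itself is singular (row sums are zero), which is why Lemma 1 requires at least the root node to receive the leader's information.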

Main Results
In this section, in order to achieve better formation tracking control performance, an LESO-SMC scheme is designed for system (3) in the presence of unknown disturbances and model uncertainties, such that all follower OMRs track the virtual leader with a formation configuration given in advance and maintain the same speed as the virtual leader. Furthermore, a parameter adaptation method based on the Q-learning algorithm is incorporated into the LESO-SMC to avoid the complex parameter tuning process, yielding better formation tracking control performance.

Linear Extended State Observer Design.
In this section, we use the LESO to estimate the OMR's total disturbances f_i, which include unknown disturbances and model uncertainties. The LESO for system (4) is designed as follows:

\dot{z}_{i,1} = z_{i,2} + \beta_1 (x_{i,1} - z_{i,1}),
\dot{z}_{i,2} = z_{i,3} + \beta_2 (x_{i,1} - z_{i,1}) + M_i^{-1} \tau_i,
\dot{z}_{i,3} = \beta_3 (x_{i,1} - z_{i,1}),   (6)

where z_{i,1}, z_{i,2}, and z_{i,3} are the estimates of x_{i,1}, x_{i,2}, and x_{i,3}, respectively, and \beta_k (k = 1, 2, 3) is the observer gain to be determined. Following the bandwidth parameterization [27], the gains are chosen as \beta_1 = 3\omega_0, \beta_2 = 3\omega_0^2, and \beta_3 = \omega_0^3, where \omega_0 > 0 is the observer bandwidth.
Define the estimation errors as e_{i,k} = x_{i,k} - z_{i,k} (k = 1, 2, 3); then the estimation error equation is given by

\dot{e}_{i,1} = e_{i,2} - \beta_1 e_{i,1},
\dot{e}_{i,2} = e_{i,3} - \beta_2 e_{i,1},
\dot{e}_{i,3} = h_i - \beta_3 e_{i,1}.   (7)

Let \eta_i = [e_{i,1}^T, e_{i,2}^T/\omega_0, e_{i,3}^T/\omega_0^2]^T \in R^9; equation (7) can be rewritten as

\dot{\eta}_i = \omega_0 (A_1 \otimes I_3) \eta_i + N h_i / \omega_0^2,   (8)

where A_1 = [−3 1 0; −3 0 1; −1 0 0] is Hurwitz and N = [0_3, 0_3, I_3]^T. Slightly different from the proof in [40], the convergence analysis of LESO (6) is given below.
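To make the observer structure concrete, the following sketch simulates a scalar-channel LESO with the bandwidth gains (3ω_0, 3ω_0², ω_0³) from [27] on a toy double-integrator plant; the plant, disturbance signal, and all numerical values are illustrative assumptions, not the paper's model:

```python
import numpy as np

# Scalar-channel LESO sketch: plant x1' = x2, x2' = f + b*u, with f the
# unknown "total disturbance" to be recovered as the extended state.
def simulate_leso(w0=20.0, b=1.0, dt=1e-3, T=5.0):
    b1, b2, b3 = 3 * w0, 3 * w0**2, w0**3   # bandwidth parameterization
    x1, x2 = 0.0, 0.0                        # plant state
    z = np.zeros(3)                          # [pos, vel, disturbance] estimates
    for k in range(int(T / dt)):
        t = k * dt
        f = 0.5 * np.cos(t)                  # unknown disturbance (test signal)
        u = 0.0                              # open loop: observation only
        # plant (forward Euler)
        x1, x2 = x1 + dt * x2, x2 + dt * (f + b * u)
        # LESO driven by the measurement error e = x1 - z1
        e = x1 - z[0]
        z = z + dt * np.array([z[1] + b1 * e,
                               z[2] + b2 * e + b * u,
                               b3 * e])
    return abs(z[2] - 0.5 * np.cos(T))       # final disturbance estimation error

print(simulate_leso())
```

Raising `w0` shrinks the residual tracking error of the extended state (at the price of the peaking and noise-amplification issues discussed in Remark 5).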

Lemma 2. Considering the estimation error dynamics (8), the LESO (6) is bounded stable if the observer bandwidth satisfies \omega_0 > 1.

Proof. Since the matrix A_1 is Hurwitz, there exists a positive definite matrix P_1 satisfying A_1^T P_1 + P_1 A_1 = -I. Let P = P_1 \otimes I_3 and consider a Lyapunov function candidate for (8) as V_{eso} = \eta^T P \eta. The time derivative of V_{eso} along (8) is

\dot{V}_{eso} = \omega_0 \eta^T ((A_1^T P_1 + P_1 A_1) \otimes I_3) \eta + 2 \eta^T P N h / \omega_0^2 = -\omega_0 \|\eta\|^2 + 2 \eta^T P N h / \omega_0^2.

Using Young's inequality, one has 2 \eta^T P N h / \omega_0^2 \le \|\eta\|^2 + \|P N\|^2 \|h\|^2 / \omega_0^4. From Assumption 3, we have \|\dot{f}\| \le \rho, where \rho is an unknown constant. It then follows that

\dot{V}_{eso} \le -(\omega_0 - 1) \|\eta\|^2 + \|P N\|^2 \rho^2 / \omega_0^4.

If \omega_0 - 1 > 0, the residual term \|P N\|^2 \rho^2 / \omega_0^4 is bounded and \dot{V}_{eso} is negative whenever \|\eta\|^2 > \|P N\|^2 \rho^2 / ((\omega_0 - 1)\omega_0^4); hence the estimation error is uniformly ultimately bounded and the proposed LESO is bounded stable. The proof is completed.
To achieve better tracking control performance, an LESO-SMC-based formation control scheme is proposed in this section based on the dynamic models introduced in Section 2.

Sliding Mode-Based Formation Controller Design.
Define the system neighbourhood errors as

e_{i,1} = \sum_{j=1}^{n} a_{ij} [(x_{i,1} - \delta_i) - (x_{j,1} - \delta_j)] + b_i (x_{i,1} - \delta_i - q_l),
e_{i,2} = \sum_{j=1}^{n} a_{ij} (x_{i,2} - x_{j,2}) + b_i (x_{i,2} - p_l),   (15)

where \delta_i = [\delta_{ix}, \delta_{iy}, \delta_{i\theta}]^T \in R^3 denotes the desired relative position between the ith robot and the virtual leader, and q_l and p_l are the position and velocity of the virtual leader.
According to (15), we design the ESO-based sliding mode surface of the overall formation for the ith agent as

S_i = e_{i,2} + \mu e_{i,1},   (16)

where \mu is a positive constant. The time derivative of S_i is

\dot{S}_i = \dot{e}_{i,2} + \mu \dot{e}_{i,1}.   (17)

The formation tracking controller based on the ESO-SMC algorithm for the ith agent is designed to enforce the reaching law \dot{S}_i = -c_1 S_i - c_2 sign(S_i), with the LESO estimate z_{i,3} compensating the total disturbances in the control input \tau_i,

where c_1 and c_2 are positive constants, and u_l is the control input of the virtual leader, which can be obtained by some followers. The designed control structure block diagram is shown in Figure 1.
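A minimal sketch of the reaching-law behaviour on a single scalar channel, assuming a perfect disturbance estimate and the linear surface S = e_2 + μ e_1; all parameter values below are illustrative choices, not the paper's:

```python
import numpy as np

# Scalar error dynamics e1' = e2, e2' = f - u. With u cancelling the
# disturbance and enforcing S' = -c1*S - c2*sign(S), S reaches zero in
# finite time, after which e2 = -mu*e1 drives e1 to zero exponentially.
mu, c1, c2, dt = 4.0, 6.0, 7.0, 1e-4
e1, e2 = 2.0, -1.0                           # initial tracking errors
for k in range(int(3.0 / dt)):
    f = 0.3 * np.sin(k * dt)                 # disturbance
    S = e2 + mu * e1                         # sliding variable
    u = f + mu * e2 + c1 * S + c2 * np.sign(S)   # perfect estimate assumed
    e1, e2 = e1 + dt * e2, e2 + dt * (f - u)
print(abs(e1), abs(e2))                      # both driven near zero
```

The hard `sign` term is what produces chattering in practice; replacing it with a boundary-layer function such as `tanh(S/eps)` trades a small residual error for a smooth control signal.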

Stability Analysis of State Tracking Error Dynamics.
We have introduced the LESO to estimate the states of each order of the OMR together with the uncertainties, and we have designed the SMC-based formation controller. Similar to [41], we now consider the closed-loop system composed of the estimation errors of the observer (6) and the sliding variable (16). The stability analysis of the closed-loop system is given below.

Q-Learning Based Parameter Optimization Process.
The learning process of Q-learning consists of continuous interaction with the environment. First, at time t, an action A_t is selected. The agent then transfers from the original state S_t to a new state S_{t+1} with probability P(S_{t+1}|S_t, A_t). Owing to the environmental interaction, the agent receives a feedback reward R; the time variable is then updated, and the agent repeats the above steps from the new state until the optimal policy is finally obtained. The Q-learning algorithm is shown below.
Parameters: step size \alpha \in (0, 1], small \epsilon > 0, discount factor \gamma \in (0, 1].
Step 1. Initialize Q(s, a) arbitrarily for all states s and actions a.
Step 2. For each training episode, observe the current state S_t, select an action A_t by the \epsilon-greedy policy, obtain the reward R and the next state S_{t+1}, and update

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)].

Step 3. Output the final policy \pi(s) = \arg\max_a Q(s, a).
In this paper, we apply the Q-learning idea to the optimization of the observer and controller parameters. Regarding the formation error and its derivative as the state of Q-learning, dividing the controller parameter selection into a reasonable range, and combining the divided values as selectable actions, a Q-learning algorithm for observer and controller parameter optimization is obtained. The specific steps of the algorithm are as follows: perform the learning of the value function Q according to the algorithm flow above; the learned Q table and the optimal policy \pi(s) = \arg\max_a Q(s, a) are then used for online parameter selection. There are three termination conditions as follows; if any one of them is fulfilled, the Q-value training is terminated.
(a) It is undesirable in practice for the formation error to become too large, in which case it makes little sense to continue iteratively updating the Q table. Hence, when |e_{i,1}| > 10, the training is terminated and reinitialized. (b) The control process has reached a steady state, i.e., |e_{i,1}| < 0.001 and |e_{i,2}| < 0.01; then the training is terminated.

(c) The simulation time reaches t = 500 s; then the training is terminated.
Denote by L_t the number of training runs. When L_t = 800, the training is terminated, and the trained Q table can be obtained for online control.
The parameters to be optimized are the observer bandwidth \omega_0 and the controller parameters c_1, c_2, and \mu; each parameter has 3 possible values, so the number of selectable actions is 3^4 = 81. The formation error and its derivative are each divided into 7 intervals, giving 49 discrete states. Therefore, the value function matrix satisfies Q \in R^{81\times 49} for the formation system.
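As a toy illustration of this parameter-selection idea (not the paper's setup), the sketch below learns a Q table whose states discretize a scalar tracking error and whose actions choose among candidate controller gains for a simple first-order plant; the plant, bin edges, gain candidates, and hyperparameters are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
gains = [0.5, 2.0, 8.0]                   # candidate gain values (actions)
n_states, alpha, gamma, eps = 7, 0.1, 0.9, 0.1
Q = np.zeros((n_states, len(gains)))      # value table: states x actions

def to_state(x):
    # discretize the error into 7 bins, mirroring the interval division idea
    return int(np.digitize(x, [-1, -0.3, -0.05, 0.05, 0.3, 1]))

for episode in range(300):
    x = rng.uniform(-2, 2)                # random initial error
    for step in range(50):
        s = to_state(x)
        # epsilon-greedy action selection
        a = rng.integers(len(gains)) if rng.random() < eps else int(Q[s].argmax())
        x = x + 0.1 * (-gains[a] * x)     # apply the chosen gain for one step
        r = -abs(x)                       # reward penalizes remaining error
        s2 = to_state(x)
        # temporal-difference (Q-learning) update
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

policy = Q.argmax(axis=1)                 # greedy parameter choice per state bin
print(policy)
```

After training, the greedy policy reads a gain out of the table per error bin, which is the same mechanism the paper uses to pick \omega_0, c_1, c_2, and \mu online from the learned Q table.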

Remark 4.
The introduction of Q-learning avoids the manual process of selecting and tuning the controller, observer, and sliding mode parameters. After Q-learning optimization, the convergence performance of the controller and observer can be guaranteed.

Numerical Simulations
In this section, numerical simulation examples are used to illustrate the previous conclusions and the effectiveness of the proposed control scheme. Consider a scenario in which a multiagent system composed of three OMRs (followers) simultaneously tracks a preset target (the virtual leader). The communication topology G is given in Figure 2, from which the corresponding Laplacian matrix L and the adjacency weight matrix B are obtained. The dynamics of each of the three OMRs can be described by the Euler-Lagrange equation (1), with the model parameters listed in Table 1. Here (x_i, y_i) and \theta_i denote the position in the x and y directions and the orientation angle of the ith OMR in the ground coordinate system, respectively; q_i = [x_i, y_i, \theta_i]^T \in R^3 represents the position and orientation angle, \dot{q}_i the linear and angular velocities, and \ddot{q}_i the linear and angular accelerations of the ith robot in the world coordinate frame.
We assume that the initial position states of each OMR are randomly chosen within [−5, 5] × [−5, 5], the initial angle is 0, and the initial velocity is [0, 0, 0]^T. Comparison simulations are carried out with the proposed adaptive LESO-SMC method with and without the Q-learning algorithm, denoted by 'Q-LESO-SMC' and 'LESO-SMC', respectively. The parameters of LESO-SMC are chosen as \omega_0 = 2, c_1 = 6, c_2 = 7, and \mu = 4, and the parameters of Q-LESO-SMC are obtained after online optimization by the Q-learning algorithm in Section 3.4. The other simulation parameters are chosen accordingly. In order to observe the performance of the two controllers, the robot formation is implemented in two cases: the case with constant disturbances and the case with sinusoidal disturbances.
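Using the quoted LESO-SMC settings (\omega_0 = 2, c_1 = 6, c_2 = 7, \mu = 4), a single-channel closed-loop sketch combining a scalar LESO with the sliding mode law can be assembled as follows; the plant, the disturbance signal, and the smoothed sign function are illustrative assumptions, not the paper's full multi-robot model:

```python
import numpy as np

# One follower channel: the LESO estimate z[2] ~ f feeds the sliding mode law.
w0, c1, c2, mu, dt = 2.0, 6.0, 7.0, 4.0, 1e-4
b1, b2, b3 = 3 * w0, 3 * w0**2, w0**3     # bandwidth-parameterized gains
x1, x2 = 3.0, 0.0                          # plant state (tracking-error coords)
z = np.zeros(3)                            # LESO estimates
for k in range(int(10.0 / dt)):
    f = 0.2 * np.sin(k * dt)               # unknown sinusoidal disturbance
    S = x2 + mu * x1                       # sliding variable
    # control: disturbance compensation + reaching law (tanh softens sign
    # to ease chattering; boundary-layer width 0.01 is an assumption)
    u = z[2] + mu * x2 + c1 * S + c2 * np.tanh(S / 0.01)
    e = x1 - z[0]                          # LESO driven by measured position
    z = z + dt * np.array([z[1] + b1 * e, z[2] + b2 * e - u, b3 * e])
    x1, x2 = x1 + dt * x2, x2 + dt * (f - u)
print(abs(x1), abs(x2))                    # near zero despite the disturbance
```

Note that with \omega_0 = 2 the disturbance estimate lags a 1 rad/s sinusoid noticeably; the sliding term absorbs the residual, which is exactly the robustness margin the reaching-law gains c_1 and c_2 provide.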

Case A.
The case with constant disturbances. Consider the robot formation subject to constant external disturbances, described by step functions. The estimation behaviours of the LESO in two dimensions for Case A are shown in Figures 3-5. Here x_{i,1x}, x_{i,2x}, and x_{i,3x} (x_{i,1y}, x_{i,2y}, and x_{i,3y}) denote the position, velocity, and extended state (total disturbances) estimates in the x-direction (y-direction) for the ith agent, i = 1, 2, 3, respectively. In each figure, the error plots are locally enlarged to highlight the convergence time and steady-state errors. From the comparison results, the estimation errors of all three states converge to a small stable region, and the Q-LESO achieves a faster convergence time and a smaller steady-state estimation error than the LESO under constant disturbances. Figure 6 shows the trajectories of the three mobile robots and the virtual leader; ☆ and ○ represent the end positions of each follower robot and the virtual leader, respectively. It can be seen from the figure that the robots form the preset formation with the virtual leader as the center.
To compare the control performance of LESO-SMC and Q-LESO-SMC, the formation tracking performance of each follower under the two controllers is depicted in Figures 7 and 8. Here e_{i,1x} and e_{i,2x} (e_{i,1y} and e_{i,2y}) denote the formation tracking position and velocity errors in the x-direction (y-direction) for the ith agent, i = 1, 2, 3, respectively. It can be seen that the formation tracking error converges faster with Q-LESO-SMC than with LESO-SMC, and a comparison of the steady-state error magnitudes shows that Q-LESO-SMC has a stronger ability to suppress disturbances. The position trajectories of the three OMRs are shown in Figure 9, where ○ represents the starting position of each robot. It can be seen that the three followers quickly track the virtual leader under the proposed Q-LESO-SMC method.

Case B.
The case with sinusoidal disturbances. Consider the robot formation subject to nonlinear external disturbances described by sinusoidal functions, e.g., f_i = [0.5e^{-0.02t} cos(t), 0.2 sin(t), 0.5 cos(t)]^T. The corresponding results are shown in Figures 10-16. The estimation behaviour is shown in Figure 12, where the 'total disturbances' of each follower are approximated by the Q-LESO in a shorter period. In addition, the steady-state amplitude of the estimation error of the Q-LESO is smaller than that of the LESO, as can be seen in the locally enlarged plot.
Similar to Figure 6, Figure 13 shows the trajectories of the three mobile robots and the virtual leader. As can be seen from the figure, the target triangular formation is achieved with the virtual leader at the center. Figures 14 and 15 compare the formation tracking errors of Q-LESO-SMC and LESO-SMC under sinusoidal disturbances. As can be seen, the followers track the virtual leader faster with the Q-LESO-SMC-based formation controller. The steady-state part of the error is locally enlarged, and comparing the steady-state error magnitudes before and after optimization shows that the error amplitude of Q-LESO-SMC is smaller than that of LESO-SMC. Similar to Figure 9 in Case A, the position trajectories of the three OMRs are shown in Figure 16. The three followers quickly track the virtual leader under the proposed Q-LESO-SMC method, and the tracking error is smaller than that under the LESO-SMC method.
Remark 5. In Case B, both the LESO and Q-LESO methods achieve good formation control performance. The comparison results show that the Q-LESO exhibits a faster convergence rate, a smaller tracking error, and better disturbance suppression performance. In addition, to obtain good performance, the LESO needs larger observer gains, which may go beyond the bandwidths of practical systems and make the required control energy infeasible. Moreover, higher observer gains may lead to a bigger overshoot (the so-called peaking phenomenon). Therefore, there is a trade-off between cost and performance.

Conclusions
In this paper, considering the uncertainties and external disturbances in the formation process, an LESO-SMC scheme with Q-learning adaptive optimization is proposed to achieve formation tracking of multiple OMRs. The simulation results show that the proposed control method has the advantages of a faster convergence rate, higher tracking accuracy, and good steady-state performance. However, the LESO brings larger overshoot as the bandwidth increases; we will investigate this problem in future work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.