Discrete Globalised Dual Heuristic Dynamic Programming in Control of the Two-Wheeled Mobile Robot

Network-based control systems have been emerging technologies in the control of nonlinear systems over the past few years. This paper focuses on the implementation of the approximate dynamic programming algorithm in the network-based tracking control system of the two-wheeled mobile robot Pioneer 2-DX. The proposed discrete tracking control system consists of the globalised dual heuristic dynamic programming algorithm, the PD controller, the supervisory term, and an additional control signal. The structure of the supervisory term derives from the stability analysis realised using the Lyapunov stability theorem. The globalised dual heuristic dynamic programming algorithm consists of two structures, the actor and the critic, realised in the form of neural networks. The actor generates the suboptimal control law, while the critic evaluates the realised control strategy by approximating the value function from Bellman's equation. The presented discrete tracking control system works online: the adaptation of the neural networks' weights is realised in every iteration step, and a preliminary learning procedure for the neural networks is not required. The performance of the proposed control system was verified by a series of computer simulations and experiments realised using the wheeled mobile robot Pioneer 2-DX.


Introduction
A rapid development of mobile robotics applications has been observed in the last few years. Autonomous wheeled mobile robots (WMRs) have attracted much attention among researchers and engineers, while the construction of robots, their sensory systems, and control algorithms were developed. One of the most challenging tasks in the implementation of an autonomous WMR is the tracking control problem. It is widely discussed in the literature, where different control strategies [1][2][3][4] are presented, which shows how significant the problem is. Difficulties in the realisation of the desired trajectory by WMRs result from the fact that these control objects are described by nonlinear dynamic equations, in which some parameters of the model can be unknown or can change during the movement because of disturbances. This makes it necessary to apply computationally complex methods that can adjust their parameters during the realisation of the trajectory and assure the required quality of tracking. Artificial intelligence (AI) methods, like neural networks (NNs) [1,2,5,6], are readily applied in the control systems of robots because their weights can be adapted. The development of AI methods makes the implementation of Bellman's dynamic programming (DP) [7] idea possible. This group of methods is called approximate dynamic programming (ADP) algorithms [8][9][10][11][12], adaptive critic designs (ACD), neurodynamic programming algorithms, or actor-critic structures. It belongs to the larger family of methods adapted using the reinforcement learning (RL) idea. According to [9,12], the ADP algorithms family is composed of six main schemes: heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), globalised DHP (GDHP), and action-dependent (AD) versions of these algorithms: action-dependent HDP (ADHDP), ADDHP, and ADGDHP. Very good surveys on ADP are given in [9][13][14][15][16]. ADP algorithms were first described for discrete-time systems [8,9,12] and, a few years later, for continuous-time systems [17][18][19][20][21].
Along with the continuing high interest in RL algorithms, a growing number of their applications can be observed. Challenging applications of RL methods include the control of autonomous robots such as the helicopter [22] or the underwater vehicle [23]. There are implementations of RL algorithms in mobile robot path planning [24], urban traffic signal control [25], and power system control [26], but these are mostly implementations of the Q-learning algorithm [10]. There are not many recent articles concerning ADP algorithms; examples are the application of the ADHDP algorithm to a static compensator connected to a power system [27] and of the HDP and DHP algorithms to target recognition [28]. Application of ADP algorithms in the control of the wheeled mobile robot is presented in [4] and in the trajectory generating process in [29]. In [30,31] the HDP algorithm is applied to the control of a nonlinear system, with some simulation results. Interesting results are shown in [32], where new kernel versions of the HDP and the DHP algorithms were proposed that can obtain better performance than the original ones. Their performance was tested using the inverted pendulum and the ball and plate benchmark systems. The implementation of the GDHP algorithm for the control of a linear object is described in [33] and for the control of a nonlinear system in [3,34,35]; the control problem of the turbo-generator, solved using this algorithm, is presented in [36]. The article [37] summarises novel developments in policy-gradient methods and presents a novel RL architecture, the natural actor-critic (NAC), together with a simulation test on the cart-pole balancing problem. Recent works on ADP algorithms have attempted to implement ADP-based control systems without knowledge of the system model [17][18][19]. Recent advances in this field also include the implementation of ADP algorithms for partially unknown nonlinear systems [19] and robust optimal tracking control for the unknown nonlinear system [38].
The paper presents the application of the ADP algorithm in the GDHP configuration [3][33][34][35] to the tracking control problem of the WMR. The discrete tracking control system guarantees a high tracking performance and a stable realisation of the desired trajectory in the face of disturbances. The GDHP algorithm consists of two structures, the actor and the critic, both realised in the form of random vector functional link (RVFL) NNs [2]. Solutions of tracking control problems presented in the literature are often theoretical considerations; there are not many real applications of ADP algorithms in control problems. The proposed discrete neural tracking control system is used for the tracking control of the WMR Pioneer 2-DX, and a series of computer simulations and experiments was realised to illustrate the performance of the control algorithm.
The results of the research presented in the paper continue the authors' earlier works related to the control of the ball and beam system [39] and the robotic manipulator [40] using the DHP algorithm, tracking control of the WMR [41][42][43][44] using different ADP algorithms, and trajectory generation using ADHDP [45]. The remainder of this paper is organised as follows. The WMR dynamics is given in Section 2. The ADP algorithms family is described in Section 3. In Section 4 the GDHP algorithm implemented in the proposed discrete tracking control system is presented, and in the following section the stability is analysed using the Lyapunov function. In Section 6, the effectiveness of the proposed control algorithm is demonstrated through a numerical illustration and an experiment realised using the WMR Pioneer 2-DX. Finally, Section 7 gives the conclusion.

Dynamical Model of the Wheeled Mobile Robot Pioneer 2-DX
The WMR Pioneer 2-DX, shown in Figure 1(a), is the control object. It is a nonholonomic object whose dynamics is described by nonlinear equations. The WMR is composed of two driving wheels 1 and 2, a third, free-rolling castor wheel 3, and a frame 4 (Figure 1(b)). The movement of the WMR is analysed in the $xy$ plane.
Substituting the WMR dynamics model (3) and the tracking errors (4) into $\mathbf{s}_{\{k+1\}}$, calculated on the basis of (5), gives the filtered tracking error in the form (6), where $\mathbf{z}_{d3\{k\}}$ is the vector of desired angular accelerations, which derives from the expansion of the vector $\mathbf{z}_{d2\{k+1\}}$ using Euler's method. The vector $\mathbf{Y}(\mathbf{z}_{2\{k\}})$ includes all nonlinearities of the controlled object.

Approximate Dynamic Programming
Bellman's dynamic programming (DP) is based on the calculation of the value function, the control law, and the state of the object for every step of the process, from the last to the first. That is why it is not applicable in online control. ADP algorithms are also called adaptive critic designs (ACD) [8][9][10][11][12][13][14][15][16] or neuro-dynamic programming (NDP) algorithms. They derive from the application of NNs to Bellman's approach to optimal control theory, where the value function and the optimal control law are approximated by the critic and the actor, respectively. This approach makes real-time control of dynamical objects possible. The ADP algorithms family is schematically shown in Figure 2. It is composed of six algorithms, which differ from each other in the critic's structure and in the weights adaptation rules of the actor's and the critic's NNs.
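The backward, last-to-first computation described above can be illustrated with a minimal Python sketch. It assumes that the whole sequence of local costs is already known and uses the standard discounted recursion V_k = L_k + gamma * V_{k+1}; the function name `cost_to_go` and the sample costs are hypothetical and are not taken from the paper.

```python
def cost_to_go(local_costs, gamma):
    """Return the discounted cost-to-go for every step of a finite process.

    The recursion V_k = L_k + gamma * V_{k+1} runs from the last step to
    the first, so the full cost sequence must be available in advance.
    """
    values = [0.0] * len(local_costs)
    running = 0.0
    for k in range(len(local_costs) - 1, -1, -1):
        running = local_costs[k] + gamma * running
        values[k] = running
    return values


# Three steps with unit local cost and a discount factor of 0.5.
print(cost_to_go([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

Because the recursion needs the future costs before the first value can be computed, it cannot run online, which is the limitation that the critic's approximation of the value function is meant to remove.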
The basic structure is the HDP algorithm, in which the critic approximates the value function and the actor generates the suboptimal control law. In the DHP algorithm the critic approximates the difference of the value function with respect to the state of the controlled system. The actor has the same structure as in HDP. The complexity of the critic grows proportionally to the size of the state vector, because the difference of the value function with respect to each element of the state vector is approximated by a separate critic's NN, and the critic's weights adaptation law is also more complex. The DHP algorithm assures a higher quality of tracking control in comparison to HDP [43]. The GDHP algorithm is built in the same way as HDP; its characteristic feature is the critic's weights adaptation law. It is based on the minimisation of the value function and its difference with respect to the state and can be seen as a combination of the HDP and the DHP critic's NN adaptation laws. The actor structure is the same as in HDP. The difference in complexity of the three basic ADP algorithms is schematically shown in Figure 3.
In the HDP and the GDHP algorithms the critic is composed of one NN that approximates the value function, while in the DHP algorithm the critic consists of as many NNs as there are elements in the state vector. For example, in the case of the WMR, where the state vector for the system (6) has two elements, the DHP algorithm consists of the actor and the critic realised in the form of two NNs each. In the GDHP algorithm, the actor is composed of two NNs, but the critic is realised in the form of only one NN. The advantage of GDHP over DHP in terms of the complexity of the critic is even more evident for a robotic manipulator with six degrees of freedom. The DHP algorithm implemented in the control system for this object would be composed of the actor and the critic realised in the form of six NNs each, while GDHP would be composed of the actor realised in the form of six NNs and only one NN in the critic structure. The difference in the complexity of the critic structure grows as the size of the state vector of the controlled object increases. The rest of the ADP algorithms are AD versions of the basic algorithms, in which the control law generated by the actor's NN is also an input to the critic's NN.
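The structural comparison above can be tabulated with a short Python sketch. The helper `nn_counts` is a hypothetical name introduced here for illustration; it assumes, as in both examples in the text, that the actor uses one NN per element of the state vector.

```python
def nn_counts(algorithm: str, state_size: int) -> dict:
    """Number of actor and critic NNs for the three basic ADP schemes."""
    name = algorithm.upper()
    if name not in ("HDP", "DHP", "GDHP"):
        raise ValueError(f"unknown basic ADP algorithm: {algorithm}")
    if state_size < 1:
        raise ValueError("state_size must be positive")
    # Only DHP needs one critic NN per state element; HDP and GDHP use one.
    critic = state_size if name == "DHP" else 1
    return {"actor": state_size, "critic": critic}


for size, label in ((2, "WMR"), (6, "6-DOF manipulator")):
    for name in ("HDP", "DHP", "GDHP"):
        print(label, name, nn_counts(name, size))
```

For the six-element case the sketch returns six critic NNs for DHP and one for GDHP, which is the gap in critic complexity that the paragraph describes.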

Globalised Dual Heuristic Dynamic Programming in Tracking Control
The main part of the proposed tracking control system is the GDHP algorithm. There are not many applications of GDHP algorithms in the literature, and the existing publications are mostly theoretical studies [3][33][34][35][36]. In this paper, both the numerical tests and the verification experiments of the neural tracking control system, realised using the WMR Pioneer 2-DX, are presented. The GDHP structure generates the control law that minimises the value function $V_{\{k\}}(\mathbf{s}_{\{k\}})$ [8][9][10][11][12][13][14][15][16], assumed in the form of equation (8), where $n$ is the number of iteration steps, $\gamma$ is a discount factor, $0 < \gamma \le 1$, and the local cost function for the $k$th step is assumed in the form (9), where $\mathbf{R}$ is a positive definite, fixed diagonal matrix. The GDHP algorithm, schematically shown in Figure 4(a), consists of the following:
(i) the predictive model, which predicts the WMR's closed-loop state $\mathbf{s}_{\{k+1\}}$ according to equation (10), where $\mathbf{u}_{\{k\}}$ is the overall tracking control signal of the proposed control system. Its structure derives from the stability analysis presented in the next section. The dynamical model of the controlled system is necessary in the synthesis of the actor's and the critic's weights adaptation laws in the GDHP algorithm;
(ii) the actor, realised in the form of two RVFL NNs that generate the suboptimal control law $\mathbf{u}_{A\{k,l\}} = [u_{A[1]\{k,l\}}, u_{A[2]\{k,l\}}]^T$ and are expressed by formula (11), where $j = 1, 2$, $l$ is the index of the internal loop iteration, $\mathbf{x}_{Aj\{k\}}$ is the input vector of the $j$th actor's NN, which consists of normalised values, with elements in $\langle -1; 1 \rangle$, of the filtered tracking error $\mathbf{s}_{\{k\}}$, the errors $\mathbf{e}_{\{k\}}$, and the desired ($\mathbf{z}_{d2\{k\}}$) and realised ($\mathbf{z}_{2\{k\}}$) angular velocities of the driving wheels, $\mathbf{W}_{Aj\{k,l\}}$ is the vector of output layer weights of the $j$th actor's NN, $\mathbf{S}(\cdot)$ is the vector of sigmoidal bipolar neuron activation functions, and $\mathbf{D}_A$ is the matrix of fixed input weights selected randomly in the NNs initialisation process. The actor's NNs weights are adapted by the gradient method according to equation (12), where $\mathbf{\Gamma}_A$ is the fixed diagonal matrix of positive learning rates. The quality rating was assumed in the form (13), where $\hat{V}_{\{k+1,l\}}(\mathbf{x}_{C\{k+1\}}, \mathbf{W}_{C\{k,l\}})$ is the output of the critic's NN, generated on the basis of the predicted state for the step $k + 1$;
(iii) the critic, realised in the form of one RVFL NN, which estimates the value function (8). It is expressed by formula (14), where $\mathbf{x}_{C\{k\}}$ is the input vector of the critic's NN, scaled by a constant diagonal matrix of positive scaling coefficients, $\mathbf{W}_{C\{k,l\}}$ is the vector of output layer weights of the critic's NN, and $\mathbf{D}_C$ is the matrix of fixed input weights selected randomly in the critic's NN initialisation process. The critic's RVFL NN is schematically shown in Figure 4(b). The critic's weights adaptation procedure in the GDHP algorithm is the most complex in the whole family of ADP structures. It is based on the minimisation of the errors characteristic of the critic's weights adaptation rules of the HDP algorithm and the DHP algorithm, expressed by formulas (15) and (16), which use the constant vector $[1, 1]^T$. The weights of the critic's NN are adapted using the gradient method according to equation (17), where $\mathbf{\Gamma}_C$ is the fixed diagonal matrix of positive learning rates and two further positive constants appear.
The adaptation process of the NNs' weights is an interesting feature of the ADP algorithms. It is realised in the form of an internal loop with the iteration index $l$. In every step $k$ of the discrete control process, the calculations connected with the actor's and the critic's weights adaptation procedure are executed according to the scheme shown in Figure 5.
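The RVFL structure described above (fixed random input weights, bipolar sigmoid neurons, adapted output layer weights) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class name `RVFLNetwork`, the range of the random input weights, and the update rule are hypothetical. In particular, `gradient_step` is a generic gradient step on a squared output error and is not the adaptation law (12) or (17) of the paper, whose exact form is not reproduced here.

```python
import math
import random


class RVFLNetwork:
    """Single-output RVFL NN: y = W^T S(D^T x), with fixed random D."""

    def __init__(self, n_inputs, n_neurons=8, seed=0):
        rng = random.Random(seed)
        # Fixed input weights, drawn once and never adapted
        # (the range [-1, 1] is an assumption of this sketch).
        self.input_weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(n_neurons)]
            for _ in range(n_inputs)
        ]
        # Output layer weights start at zero, as in the reported tests.
        self.output_weights = [0.0] * n_neurons

    def hidden(self, x):
        """Bipolar sigmoid activations; tanh(z/2) equals 2/(1+exp(-z)) - 1."""
        n_neurons = len(self.output_weights)
        return [
            math.tanh(0.5 * sum(self.input_weights[i][j] * x[i]
                                for i in range(len(x))))
            for j in range(n_neurons)
        ]

    def output(self, x):
        return sum(w * s for w, s in zip(self.output_weights, self.hidden(x)))

    def gradient_step(self, x, error, learning_rate):
        """W <- W - rate * error * S(D^T x): gradient of 0.5 * error**2
        for error = output - target (generic rule, assumed here)."""
        activations = self.hidden(x)
        self.output_weights = [
            w - learning_rate * error * s
            for w, s in zip(self.output_weights, activations)
        ]


net = RVFLNetwork(n_inputs=3)
x = [1.0, -1.0, 0.5]   # normalised inputs in <-1; 1>
target = 0.5           # arbitrary target for the demonstration
for _ in range(50):
    net.gradient_step(x, net.output(x) - target, learning_rate=0.1)
print(round(net.output(x), 3))
```

Since only the output layer is adapted, the network output is linear in the adapted weights, which keeps each update cheap enough to repeat several times within one control step.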

Mathematical Problems in Engineering
The actor-critic structure adaptation process is organised in the following way: at the beginning of every $k$th iteration step the internal loop index is set to $l = 0$. The actor's NNs weights are adapted according to the assumed adaptation law (12) by minimisation of the error rate (13). This part of the algorithm, called the "control law improvement routine" [9], leads to the calculation of the actor's NNs weights $\mathbf{W}_{A\{k,l+1\}}$. The next step, called the "value function determination operation," consists of the adaptation of the critic's NN weights according to the assumed adaptation law, by minimisation of the error rate (15), called the temporal difference error (TDE) [12], and the error rate (16). This leads to the calculation of the critic's NN weights $\mathbf{W}_{C\{k,l+1\}}$. Next, the internal loop iteration index $l$ is increased, and a new cycle of the ADP algorithm adaptation is started. In the presented algorithm, the internal loop breaks when the number of internal iterations reaches the assumed maximal number of iteration cycles, or when the error is smaller than the assumed positive limit. When one of these conditions is satisfied, $\mathbf{W}_{A\{k,l+1\}}$ becomes $\mathbf{W}_{A\{k+1,l\}}$ and $\mathbf{W}_{C\{k,l+1\}}$ becomes $\mathbf{W}_{C\{k+1,l\}}$. Next, the index $k$ is increased. The actor's NNs generate control signals and the GDHP structure receives information about the new state of the controlled object. In the next sections the index $l$ is omitted for the sake of simplicity.
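The order of operations in the internal loop can be summarised with the following Python sketch. It only reproduces the control flow described above (actor update, critic update, index increment, two stopping conditions); the function `run_inner_loop`, its callable arguments, and the toy error model are hypothetical placeholders rather than the adaptation laws of the paper.

```python
def run_inner_loop(adapt_actor, adapt_critic, current_error,
                   max_iterations, error_limit):
    """Run one internal adaptation loop and return the number of cycles.

    adapt_actor   -- "control law improvement routine"
    adapt_critic  -- "value function determination operation"
    current_error -- callable returning the monitored error
    """
    iteration = 0                      # internal index starts at zero
    while True:
        adapt_actor()
        adapt_critic()
        iteration += 1
        if iteration >= max_iterations or abs(current_error()) < error_limit:
            return iteration


# Toy stand-ins: each actor update halves a scalar error (assumption).
state = {"error": 1.0}


def toy_actor_update():
    state["error"] *= 0.5


def toy_critic_update():
    pass


cycles = run_inner_loop(toy_actor_update, toy_critic_update,
                        lambda: state["error"],
                        max_iterations=10, error_limit=0.1)
print(cycles)  # 4
```

After the loop ends, the weights from the last internal cycle would be carried over as the starting weights of the next discrete step, as described in the text.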

Stability Analysis
This paper focuses on the implementation of the ADP algorithm in the network-based tracking control system of the two-wheeled mobile robot Pioneer 2-DX. The proposed discrete tracking control system consists of the GDHP algorithm, the PD controller, the supervisory term, and the additional control signal.
The filtered tracking error $\mathbf{s}_{\{k\}}$ was defined in the form (5), where $\mathbf{\Lambda}$ is a positive definite, fixed diagonal matrix selected so that its eigenvalues lie within the unit disc. Consequently, if the filtered tracking error (5) tends to zero, then all the tracking errors go to zero. The filtered tracking error $\mathbf{s}_{\{k+1\}}$ can be expressed as (6), where the vector $\mathbf{Y}(\mathbf{z}_{2\{k\}})$ includes all nonlinearities of the controlled object.
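The statement that a vanishing filtered tracking error forces the tracking errors to zero can be checked numerically with a small Python sketch. Definition (5) is not reproduced in this text, so the sketch assumes a commonly used scalar form, s = e2 + lambda * e1, together with an Euler relation e1_{k+1} = e1_k + h * e2_k between the angle error e1 and the velocity error e2. Both relations, and the function name, are assumptions made for illustration only.

```python
def simulate_zero_filtered_error(e1_initial, lam, h, steps):
    """Propagate the angle error while the filtered error is held at zero.

    Assumed relations (not taken from the paper):
        s = e2 + lam * e1        (filtered tracking error)
        e1_next = e1 + h * e2    (Euler discretisation)
    """
    e1 = e1_initial
    history = [e1]
    for _ in range(steps):
        e2 = -lam * e1           # follows from s = 0
        e1 = e1 + h * e2         # so e1 is scaled by (1 - h * lam)
        history.append(e1)
    return history


# lam = 0.5 and h = 0.01 s are the values quoted in the simulation section.
errors = simulate_zero_filtered_error(1.0, lam=0.5, h=0.01, steps=1000)
print(errors[0], errors[-1])
```

Under these assumptions the error is multiplied by 1 - h * lambda = 0.995 in every step, so it decays geometrically as long as that factor stays inside the unit disc.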
Let us define the control input $\mathbf{u}_{\{k\}}$ as (18), where $\hat{\mathbf{Y}}(\mathbf{z}_{2\{k\}})$ is an estimate of the unknown function.
Then, the closed-loop system becomes (19), where the functional estimation error is given by $\tilde{\mathbf{Y}}(\mathbf{z}_{2\{k\}}) = \hat{\mathbf{Y}}(\mathbf{z}_{2\{k\}}) - \mathbf{Y}(\mathbf{z}_{2\{k\}})$. Equation (19) relates the filtered tracking error to the functional estimation error. In general, the filtered tracking error system (19) can also be expressed as (20). If the bound associated with (20) holds, with a positive constant as the limit, then the following stability results hold.
Let us consider the system given by (3). Let the control action be provided by (18) and assume that the functional estimation error and the unknown disturbance are bounded. The filtered tracking error system (6) is stable provided that condition (21) holds, which is stated in terms of the maximum eigenvalue of the matrix $\mathbf{K}_D$. Let us consider the Lyapunov function candidate (22). Its first difference is (23). Substituting the filtered tracking error dynamics (19) into (23) results in an expression that involves two positive constants. This further implies that the closed-loop system is uniformly ultimately bounded (UUB) [47]. The maximum eigenvalue of the PD controller gain matrix $\mathbf{K}_D$ has to be selected using (21) in order for the closed-loop system to be stable. This outer-loop signal is viewed as the supervisor's evaluation feedback to the actor and the critic. In the NN actor-critic control scheme derived in this paper there is no preliminary offline learning phase. The weights are simply initialised at zero, so that the control system is initially just the PD controller. Therefore, the closed-loop system remains stable until the NNs begin to learn.
The proposed discrete tracking control system is composed of the GDHP structure that generates the control signal $\mathbf{u}_{A\{k\}}$, the PD controller ($\mathbf{u}_{PD\{k\}}$), the supervisory term ($\mathbf{u}_{S\{k\}}$), and the additional control signal $\mathbf{u}_{E\{k\}}$. The structure of the supervisory term derives from the stability analysis performed using the Lyapunov stability theorem. The additional control signal $\mathbf{u}_{E\{k\}}$ derives from the discretisation of the WMR dynamics model. The overall tracking control signal was assumed in the form (27), where $\mathbf{K}_D$ is a fixed diagonal matrix of positive PD controller gains, and a further diagonal matrix is used whose elements are defined using a positive constant. The scheme of the discrete neural tracking control system with the actor-critic structure in the GDHP configuration is shown in Figure 6.
The stability analysis was performed under the assumption that the diagonal elements of the diagonal matrix introduced with (27) are equal to 1. Substituting (27) into (6), the closed-loop system equation is expressed by formula (29). The stability analysis was realised using the positive definite Lyapunov candidate function (30), whose discretised derivative was assumed in the form (31). Substituting (29) into (31) gives the form of the difference of the Lyapunov function. On the assumption that all elements of the vector of disturbances are bounded, the supervisory term's control signal was assumed in a form that uses two positive constants. Under the above assumptions the difference of the Lyapunov function (30) is negative definite.

Research Results
Performance of the proposed discrete tracking control system was tested during a series of computer simulations and then verified using the laboratory stand schematically shown in Figure 7.
The laboratory stand consists of the WMR Pioneer 2-DX, the power supply, and a PC equipped with the dSpace DS1102 digital signal processing board and software: dSpace Control Desk and Matlab/Simulink. The WMR Pioneer 2-DX is equipped with a sensory system composed of eight ultrasonic sensors and a scanning laser range finder. The movement of the robot is realised using two independently supplied DC motors with gears (ratio 19.7 : 1) and encoders (500 ticks per shaft revolution). The WMR weighs 9 kg, its frame is 0.44 m long and 0.33 m wide, and its maximal velocity is equal to 1.6 m/s.
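The drive data quoted above can be turned into wheel angles and velocities with a short Python sketch. It assumes that the encoder is mounted on the motor shaft and that each of the 500 ticks is counted once (no quadrature multiplication); these assumptions, the function names, and the sampling period argument are illustrative and not taken from the paper.

```python
import math

TICKS_PER_SHAFT_REV = 500   # encoder resolution quoted in the text
GEAR_RATIO = 19.7           # motor shaft revolutions per wheel revolution


def wheel_angle_rad(ticks):
    """Wheel rotation angle for a given encoder count (assumed mounting)."""
    return 2.0 * math.pi * ticks / (TICKS_PER_SHAFT_REV * GEAR_RATIO)


def wheel_velocity_rad_s(ticks_now, ticks_prev, h):
    """Finite-difference angular velocity over one sampling period h."""
    return (wheel_angle_rad(ticks_now) - wheel_angle_rad(ticks_prev)) / h


# Velocity change caused by a single tick within a 0.01 s period.
print(wheel_velocity_rad_s(1, 0, 0.01))
```

Under these assumptions a single tick within a 0.01 s period corresponds to roughly 0.064 rad/s, which suggests why velocity signals obtained by differencing encoder counts tend to be noisy.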

Simulation Results.
Performance of the proposed control system was tested during a series of numerical simulations performed using the Matlab/Simulink software environment. In this section the notation of variables is simplified and the index $k$ is omitted. The same set of parameters was used in the simulations as in the experiment. The time discretisation parameter was equal to $h$ = 0.01 s. In the GDHP structure NNs with eight neurons each were used. The output layer weights of the NNs were set to zero in the initialisation process. The parameters of the PD controller were assumed as $\mathbf{K}_D$ = diag{0.036, 0.036} and $\mathbf{\Lambda}$ = diag{0.5, 0.5}. One must select $\mathbf{K}_D$ using trial and error experiments or computer simulations. In practice, this has not shown itself to be a problem. The PD controller gains were selected heuristically to satisfy (21). Because of the noise that occurs in the signals of the driving wheels' angular velocities (incremental encoders were used for measurement in the experiment), the amplification of the PD gains within the range of condition (21
Figure 8: (a) The desired angles of rotation of wheels 1 and 2, $z_{d1[1]}$ and $z_{d1[2]}$, (b) the desired angular velocities of driving wheels 1 and 2, $z_{d2[1]}$ and $z_{d2[2]}$.
values of parameters were restored. The first change of parameters corresponds to the situation in which the WMR is loaded with an additional mass of 5 kg, and the return to the nominal set of parameters corresponds to the situation in which the additional load is removed.
The desired trajectory of the WMR was computed earlier. Figure 8(a) shows the desired rotation angles of driving wheels 1 and 2, and Figure 8(b) shows the desired angular velocities. Realisation of the presented trajectory results in movement of point A of the WMR along a path in the shape of the digit "8", with a stop phase in the middle point.
The overall tracking control signal u, shown in Figure 9
The desired and realised angular velocities of driving wheels 1 and 2 are shown in Figures 10(a) and 10(b), respectively. The largest differences between the desired and realised angular velocities occur at the beginning of the numerical test. Small changes of the realised angular velocities can be observed at the moments when the parametric disturbances occur.
The desired trajectory was realised with the tracking errors shown in Figures 11(a) and 11(b) for the respective driving wheels. Figures 11(c) and 11(d) show the values of the filtered tracking errors $s_{[1]}$ and $s_{[2]}$, which are minimised by the ADP structure. The highest values of the tracking errors occur at the beginning of the numerical test, when the values of the PD control signals are at their highest and the adaptation of the NNs' weights from zero initial values starts. Next, the control signals of the actor's NNs take the main part in the overall control signals, and the values of the tracking errors are reduced. A noticeable increase of the tracking error values occurs at the times of the simulated disturbances, but it is reduced by the change of the actor's NNs control signals.
Values of the GDHP structure's NNs weights are shown in Figure 12(a) for the first actor's NN, in Figure 12(b) for the second one, and in Figure 12(c) for the critic's NN.In the numerical test, zero initial weights values were used.At the time of the disturbances, changes of weights' values occur as a result of the adaptation performed in order to reduce the tracking errors.

Verification Results.
After the numerical tests were performed, a series of experiments was realised using the WMR Pioneer 2-DX. The control algorithm operated in real time during the experiment, thanks to the application of the dSpace DS1102 digital signal processing board. In the experiment, the same parameters of the control system as in the simulation were used. The values of signals from the experiment were not filtered. The control signals are shown in Figure 13. The first disturbance occurs at 13 s and the second one at 33 s. The PD control signals (Figure 13(c)) are based on the tracking errors calculated from the realised trajectory, which is determined using signals from the incremental encoders. These signals are noisy, which has an effect on the overall control signals (Figure 13
The biggest differences between the desired and realised angular velocities, shown in Figure 14, occur at the beginning of the experiment, when the process of the actor's NNs weights adaptation starts, and at the times when the disturbances occur.
Figure 9: (a) The overall tracking control signals $u_{[1]}$ and $u_{[2]}$, (b) the actor's NNs control signals $U_{A[1]}$ and $U_{A[2]}$, $\mathbf{U}_A = -h^{-1}\mathbf{M}\mathbf{u}_A$, (c) the PD control signals $U_{PD[1]}$ and $U_{PD[2]}$, $\mathbf{U}_{PD} = -h^{-1}\mathbf{M}\mathbf{u}_{PD}$, (d) the supervisory term's control signals ($U_{S[1]}$, $U_{S[2]}$), $\mathbf{U}_S = -h^{-1}\mathbf{M}\mathbf{u}_S$, and the control signals $U_{E[1]}$ and $U_{E[2]}$, $\mathbf{U}_E = -h^{-1}\mathbf{M}\mathbf{u}_E$.
Figure 10: (a) The desired (dashed line) and realised (continuous line) angular velocity of wheel 1, $z_{d2[1]}$ and $z_{2[1]}$, (b) the desired (dashed line) and realised (continuous line) angular velocity of wheel 2, $z_{d2[2]}$ and $z_{2[2]}$.
The tracking quality of the proposed control system was compared to the results obtained by the tracking control systems presented earlier, in which ADP algorithms in the HDP and DHP [43] configurations, or the PD controller ($\mathbf{K}_D$ = diag{1, 1}, $\mathbf{\Lambda}$ = diag{0.5, 0.5}), were used. Every experiment was performed in the same conditions, using the same or analogous values of parameters and the same type of disturbance.
Figure: (a) tracking errors of wheel 1, $e_{1[1]}$ and $e_{2[1]}$, (b) tracking errors of wheel 2, $e_{1[2]}$ and $e_{2[2]}$, (c) the filtered tracking error $s_{[1]}$, and (d) the filtered tracking error $s_{[2]}$.
To evaluate the tracking control quality, the following quality ratings were used:
(i) the average of the maximal values of the filtered tracking error for wheels 1 ($s_{\max[1]}$) and 2 ($s_{\max[2]}$);
(ii) the average of the root mean square errors (RMSE) of the filtered tracking errors $s_{[1]}$ and $s_{[2]}$, where the number of discrete steps is 4500.
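The two quality ratings listed above can be computed as in the following Python sketch. The rating formulas themselves are not reproduced in this text, so the sketch follows their verbal description; taking the maximum of the absolute value, as well as the function names and the sample data, are assumptions made for illustration.

```python
import math


def max_abs(samples):
    """Maximal absolute value of one filtered tracking error signal."""
    return max(abs(v) for v in samples)


def rmse(samples):
    """Root mean square of one filtered tracking error signal."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def quality_ratings(s1, s2):
    """Average the per-wheel ratings for wheels 1 and 2."""
    return {
        "s_max": 0.5 * (max_abs(s1) + max_abs(s2)),
        "rmse": 0.5 * (rmse(s1) + rmse(s2)),
    }


# Hypothetical short signals; the paper uses 4500 samples per wheel.
print(quality_ratings([0.3, -0.4, 0.1], [0.2, 0.0, -0.2]))
```

Lower values of both ratings indicate better tracking, which is how the compared control systems are ranked in the text.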
Values of quality ratings are shown in Table 1.
The average of the maximal values of the filtered tracking error for wheels 1 ($s_{\max[1]}$) and 2 ($s_{\max[2]}$) is shown in Figure 17(a), and the values of the RMSE of the filtered tracking errors $s_{[1]}$ and $s_{[2]}$ are shown in Figure 17(b).
On the basis of the obtained results, a higher quality of tracking can be noticed for the control systems with ADP algorithms in comparison to the PD controller. The goal of the paper was not to demonstrate the maximal quality of tracking control attainable using the highest feasible PD controller gains, but to illustrate the increase of the quality of tracking control after adding to the control system a part that compensates for the nonlinearities of the controlled object. Values of the quality ratings for the control system with the GDHP structure are close to the ones obtained by the control system with the DHP structure. At the same time, the values of the quality ratings are lower than those obtained using the HDP algorithm, which means that the application of a more complex critic's NN weights adaptation rule improves the quality of control.

Conclusion
The paper presents the discrete tracking control system of the WMR Pioneer 2-DX. The main element of the control system is the ADP algorithm in the GDHP configuration. It consists of the actor and the critic, realised in the form of RVFL NNs. The additional elements of the control system, like the PD controller or the supervisory term, assure stability of the tracking control in the case of disturbances, or at the beginning of movement, when the values of the actor's NNs weights are not adequately selected for the controlled system, for example, when the process of preliminary learning was not realised or zero initial weights were applied. The PD controller gains were selected experimentally for the control system with the GDHP algorithm. Next, the experiment for the control system with only the PD controller, with the same parameters, was performed to demonstrate the increase of the tracking control quality for the tracking control system compensating for the nonlinearities of the control object. It is important to indicate that, when a control system with nonlinearities compensation is realised, the primary part of the system is the nonlinear compensator. The nonlinear compensator, realised in the form of the GDHP algorithm, compensates for the nonlinearities of the controlled object, as well as for the parametric and structural disturbances.
Figure 17: (a) Average of maximal values of the filtered tracking error for wheels 1 ($s_{\max[1]}$) and 2 ($s_{\max[2]}$), (b) RMSE of the filtered tracking errors $s_{[1]}$ and $s_{[2]}$.
The GDHP algorithm has the same structure as HDP, and its critic's structure is simpler than in DHP. In the GDHP algorithm the critic's NN weights are adapted using a more complex adaptation law, which combines the critic's NN weights adaptation rules of the HDP algorithm and the DHP algorithm. This feature assures a high quality of tracking, higher than the quality of tracking obtained using the control system with the HDP algorithm and close to the quality of tracking of the control system with the DHP algorithm, which is a significant advantage. The presented control system is stable; the values of the errors and of the NNs' weights are bounded. Even when zero initial NN weights are applied, or in the case of disturbances, the proposed control system guarantees a stable tracking process. The discrete tracking control system works online and does not require a process of preliminary learning of the NNs. Performance of the control system was verified by a series of numerical tests and experiments realised using the WMR Pioneer 2-DX.

Figure 2: The scheme of the approximate dynamic programming algorithms family.

Figure 3: (a) Scheme of the actor's and the critic's structure complexity in HDP and GDHP, (b) scheme of the actor's and the critic's structure complexity in DHP.

Figure 5: Schematic conception of the ADP structure adaptation process.

Figure 6: Scheme of the tracking control system.

Figure 7: Scheme of the laboratory stand.
) does not improve tracking control quality and can lead to instability. The matrix $\mathbf{R}$ in the cost function was set to $\mathbf{R}$ = diag{1, 1}, the discount factor was equal to $\gamma = 0.5$, and the diagonal elements of the learning rate matrices of the actor's NNs and the critic's NN were equal to 0.1 and 0.9, respectively, for each of the eight neurons; the two positive constants of the critic's weights adaptation law were both equal to 1. The two parameters of the supervisory term were set to 3 and 0.09. The maximal velocity of point A of the WMR's frame was equal to 0.4 m/s. During the movement of the WMR two parametric disturbances were simulated (marked on the diagrams by ellipses): the first at 12.5 s, when the nominal set of parameters was changed to [0.1343, 0.0945, 0.037, 0.0001, 2.296, 2.296]$^T$, and the second at 32.5 s, when the nominal
(a), consists of the control signals generated by the actor's NNs, $\mathbf{u}_A$ (Figure 9(b)), the PD control signals $\mathbf{u}_{PD}$ (Figure 9(c)), the supervisory term's control signals $\mathbf{u}_S$, and the additional control signals $\mathbf{u}_E$, both shown in Figure 9(d). At the beginning of the numerical test, the values of the PD control signals are large. Next, they are reduced during the NNs adaptation process, and the control signals of the actor take the main part in the overall control signals. At the time of the first parametric disturbance, a change in the values of the generated control signals can be observed. The additional load changes the dynamics of the WMR, so realisation of the desired trajectory requires higher values of the control signals. The influence of the disturbance on the WMR's dynamics is compensated by the actor's NNs control signals. Analogously, the change of the WMR's parameters at the time of the second disturbance, which simulates removal of the additional load, is compensated in the generated control law by a reduction of the actor's NNs control signal values.
(a)).In contrast, the actor's NNs control signals (Figure 13(b)) and residual control signals (Figure 13(d)) are smooth.As it was observed in the simulation, at the time of the disturbances, the values of the actor's NNs control signals changed to compensate the effect of the WMR's dynamics change.
) and 15(b); filtered tracking errors are shown in Figures 15(c) and 15(d). The values of the errors are noisy because of the method used to measure the movement parameters. The errors are at their highest at the beginning of the experiment. The change of the load transported by the WMR has a noticeable influence on the trajectory realisation process. The method of placing the load on the WMR and removing it has a big influence on the temporary values of the errors. The increase of the error values results in the adaptation of the actor's and the critic's NNs weights in order to minimise the tracking errors. Values of the NNs' weights are shown in Figure 16. At the time when the WMR transports an additional load, values of the

Table 1: Values of quality ratings.