Attitude Control with Auxiliary Structure Based on Adaptive Dynamic Programming for Reentry Vehicles

This paper presents an attitude control scheme combined with adaptive dynamic programming (ADP) for reentry vehicles with high nonlinearity and disturbances. Firstly, the nonlinear attitude dynamics is divided into inner and outer loops according to time scale separation and the cascade control principle, and a general sliding mode control method is employed to construct the main controllers for both loops. Considering the shortcomings of the main controllers in handling nonlinearity and sudden disturbances, an ADP structure is introduced into the outer attitude loop as an auxiliary. The ADP structure utilizes neural network estimators to minimize the cost function and generate optimal signals through online learning, so as to compensate for the main controllers' limited adaptation speed and accuracy. Then, the stability is analyzed by the Lyapunov method, and the parameter selection strategy of the ADP structure is derived to guide implementation. In addition, this paper puts forward techniques to speed up ADP training. Finally, simulation results show that the control strategy with ADP possesses stronger adaptability and faster response than that without ADP for the nonlinear vehicle system.


Introduction
Attitude control for reentry vehicles has been a hotspot in the field of aerospace. The complex operating conditions and the high nonlinearity of the vehicles themselves bring great challenges to attitude control. Fortunately, researchers continue to explore and improve control schemes around these challenges, developing a series of practical control technologies.
For the control of space vehicles, a number of schemes have been investigated. Some linear control methods, such as linear parameter varying (LPV) control [1] and the linear quadratic regulator (LQR), focus on linearizing the aircraft model. However, due to the highly nonlinear and coupled dynamic characteristics, the capabilities of these linear control methods on actual nonlinear coupled vehicles are limited. Besides, some nonlinear control methods are widely employed, such as nonlinear dynamic inversion [2], sliding mode control, and the backstepping method [3,4]. Although these nonlinear control techniques can effectively deal with the nonlinear nature of vehicles, they still struggle and lack adaptability in the face of complex and changeable disturbances without other auxiliary means. Therefore, in the recent development of vehicle control, more and more adaptive technologies have been favoured by researchers [5].
For the purpose of improving controller robustness by designing adaptive mechanisms, observer-based adaptive control technology and other intelligent methods (such as adaptive fuzzy control and iterative learning) have emerged one after another [6][7][8]. In particular, in recent years, thanks to the vigorous development of artificial intelligence, reinforcement learning (RL) has attracted more and more attention and has shown strong performance in solving adaptive and optimal control problems [9][10][11]. In the control domain, reinforcement learning takes the form of approximate or adaptive dynamic programming (ADP), which learns by interacting with the environment to determine the optimal actions that minimize a cost function over a period of time [12]. One of the core approaches is the critic-action (CA) design, which approximates the cost function and obtains the optimal actions by solving the Hamilton-Jacobi-Bellman equation with function estimators [13]. ADP contains a variety of structural classifications, including heuristic dynamic programming [14], dual heuristic programming [15], and action-dependent heuristic dynamic programming (ADHDP), among others, which have seen preliminary exploration and achievements in the field of vehicle control [16]. Specifically, Luo et al. developed a direct heuristic dynamic programming (dHDP) scheme for longitudinal control of hypersonic vehicles and introduced fuzzy neural networks to enhance the learning ability and robustness of dHDP [17].
There is also an application of ADDHP to the optimal control of attitude maneuvers for three-axis spacecraft [18]. Some creative researchers improve ADP by redefining two optimization objectives and apply ADP, through dual optimization indexes, to the in-orbit reconfiguration of the vehicle attitude system under multitask constraints [19]. Moreover, ADP can be combined with traditional methods, such as nonlinear filters [20] and sliding mode control [21], to implement a data-driven ADHDP auxiliary control scheme for the speed and altitude system of an air-breathing hypersonic vehicle [21]. In [22], a switching adaptive active anti-interference control technique based on a reduced-order observer and ADP is proposed, considering the parameter uncertainty and external disturbances of variable-structure near-space vehicles. Furthermore, aiming at the guidance and control problem of the vertical take-off and landing (VTOL) system with multivariable disturbances, an online kernel DHP robust control strategy based on sparse kernel theory is designed for VTOL vehicles [23]. Most of the above control strategies with ADP utilize neural network estimators to approximate the cost function and optimal control law online, while Zhou et al. creatively put forward an incremental ADP (iADP) combining the advantages of the incremental control method and ADP [24].
This iADP is based on the Markov decision process and Bellman's optimality principle to directly derive the explicit expression of the optimal control law, greatly simplifying the design process of ADP, and it has been successfully applied to satellites [25] and aircraft [26]. Similarly, Sun and van Kampen also proposed an incremental model-based DHP technology for vehicle control, replacing the model network in traditional DHP with an incremental model [27,28].
In a word, the development of ADP in the field of vehicle control is rapidly deepening and expanding [16], but as far as the current literature is concerned, ADP is still rarely applied to the control of all three channels' attitude angles of a vehicle. Moreover, most of the literature rarely mentions the internal weight convergence, parameter selection, training speed, and other practical issues of ADP based on critic-action networks, yet these are problems worth attention. Therefore, this paper contributes by employing the ADP framework in the control of all three-channel attitude angles of a reentry vehicle. Inspired by the use of ADP as an auxiliary controller [21], this paper presents a framework combining a conventional controller and ADP, where ADP serves as an auxiliary means to enhance the rapidity and adaptivity of the whole attitude system. In addition, the internal convergence of the ADP structure and its parameter selection rules are discussed in depth. Regarding implementation, this paper considers improvement measures to speed up ADP training, which are offered to interested researchers for future discussion. The rest of this paper is organized as follows. Firstly, the nonlinear dynamics of the three-channel attitude control system of the reentry vehicle is established in Section 2. Then, in Section 3, the control strategy based on the dual-loop main controller plus ADP is elaborated in detail. In Section 4, some implementation issues are taken into consideration. Finally, the simulations and conclusions are presented in Sections 5 and 6, respectively.

Nonlinear Model
To describe the attitude change during the reentry phase, we give the rotation equations of the vehicle around its center of mass, including the rotational dynamics and attitude kinematics. They determine the attitude angles of the vehicle around the center of mass and the angular rates of the three channels during flight. Considering the influence of Earth's rotation on attitude control, a three-degree-of-freedom nonlinear attitude model in the body coordinate system can be obtained [29], where α, β, and μ represent the angle of attack, sideslip angle, and bank angle, respectively; p, q, and r are the roll, pitch, and yaw rates, respectively; M_x, M_y, and M_z denote the roll, pitch, and yaw control torques, respectively; I_ij (i = x, y, z; j = x, y, z) is the rotational inertia; ϕ, θ, χ, and ϑ are the longitude, latitude, heading angle, and flight path angle, respectively; and Ω_E is the Earth's rotation angular velocity.
In actual control, the vehicle can be regarded as an ideal rigid body. Considering that the Earth's rotation rate is far less than that of the vehicle, the rotation of the Earth is ignored. Besides, orbital motion is much slower than attitude motion, so the orbital motion terms of the vehicle are set to ϕ̇ = θ̇ = ϑ̇ = χ̇ = 0. Finally, the simplified dynamics can be obtained. The attitude kinematics above, equation (2), is abbreviated as equation (3). Similarly, the rotational dynamics can be simplified as equation (5), where I ∈ R^{3×3} denotes the inertia matrix, M_c = [M_x, M_y, M_z]^T ∈ R^3 is the vector of control torques, and Ω ∈ R^{3×3} and I ∈ R^{3×3} are defined accordingly. If there exist external disturbances, d_1 and d_2 are introduced into the vehicle system as follows, where d_1 ∈ R^3 and d_2 ∈ R^3 represent the external disturbances. Obviously, the attitude tracking control problem of the reentry vehicle can be described as
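As a numerical sketch, the simplified rigid-body rotational dynamics I ω̇ = −ω × (Iω) + M_c + d_2 described above can be propagated with a simple explicit Euler step. The inertia values, torques, and step size below are illustrative placeholders, not the paper's laboratory model:

```python
import numpy as np

def rotational_dynamics(omega, I, M_c, d2=np.zeros(3)):
    """Angular acceleration from I * omega_dot = -omega x (I @ omega) + M_c + d2."""
    return np.linalg.solve(I, -np.cross(omega, I @ omega) + M_c + d2)

# placeholder inertia matrix and a single explicit Euler integration step
I = np.diag([554.0, 1136.0, 1376.0])   # kg*m^2, illustrative values only
omega = np.array([0.01, 0.00, 0.02])   # body rates p, q, r in rad/s
M_c = np.array([5.0, -2.0, 1.0])       # control torques in N*m
dt = 0.01
omega_next = omega + dt * rotational_dynamics(omega, I, M_c)
```

In a full simulation this step would be interleaved with the attitude kinematics (3) to propagate the attitude angles as well.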

Controller Design
In the previous section, the nominal attitude model of the reentry vehicle was established by equations (3) and (5), which can be reorganized as equations (9a) and (9b). This section will devise a controller with an auxiliary structure according to this vehicle model. It is well known that the attitude angles change more slowly than the angular rates. Therefore, according to the principles of time scale separation and cascade control, equations (9a) and (9b) can be divided into the attitude angle slow loop (9a) and the angular rate fast loop (9b), also known as the outer loop and inner loop, respectively. In this section, the ADP-based controller will be presented, and the overall control strategy is shown in Figure 1.
As shown in Figure 1, there are two control loops. The outer loop is an attitude control loop with two controllers. Controller 1 generates the main angular rate instruction ω_s according to the guidance instruction c_d, and the ADP controller outputs the control instruction u_ADP according to the attitude angle error; together they yield the angular rate command ω_c. Then, ω_c serves as the reference instruction for the inner angular rate loop, so that controller 2 of the inner loop generates the control torque M_c, which acts on the vehicle to output the actual attitude angles and complete the control task.
In this paper, outer loop controller 1 and inner loop controller 2 are implemented based on conventional sliding mode control and serve as the main controllers. To increase the performance of the outer loop main controller, the ADP controller acts as an auxiliary and adopts an action-dependent structure, namely ADHDP. Note that ADHDP belongs to the category of ADP, so it is simply called ADP in this paper. The output of the ADP serves as a supplementary reference signal for the inner loop. The focus of this paper is the auxiliary role of the ADP structure; the main controllers could of course be designed with other methods, but how to select them is not the focus of this paper. It should be pointed out that the ADP auxiliary controller is introduced only into the outer loop, mainly because the outer loop variable is the attitude angle, the inner loop variable is the angular rate, and the attitude angle changes more slowly than the angular rate. Therefore, in each iteration, the iterative speed of the ADP is more easily matched with the update speed of main controller 1. An ADP auxiliary controller with the same structure could perhaps be introduced into the inner loop as well; its rationality and effectiveness will be researched and verified in future work.
In the following subsections, according to the cascade control strategy, the outer loop controllers are designed first, including main controller 1 and the ADP-based auxiliary controller. After the reference command signal ω_c is obtained from the outer loop controllers, the inner loop controller 2 is presented.

Main Controller 1.
The control objective of the outer loop is to make the actual attitude angle c track c_d within the desired accuracy. First, take the tracking error e_c = c_d − c ∈ R^3. The sliding switching surface S_c ∈ R^3 of the outer loop can be selected as follows, where ρ_c = diag(ρ_c1, ρ_c2, ρ_c3) ∈ R^{3×3} and ρ_ci > 0, i = 1, 2, 3, are the parameters to be designed [30]. Obviously, on the sliding surface S_c = 0, the tracking error e_c is guaranteed to converge uniformly. In order to ensure the asymptotic convergence of the outer loop tracking error to the sliding surface, the virtual control law must be designed. First, take the derivative of S_c. Then take the Lyapunov function L_1 and its derivative. By Lyapunov stability, L̇_1 < 0 has to be guaranteed. Therefore, the sliding mode reaching law can be chosen as follows, where the designed parameter τ_c > 0 and sign(S_c) = [sign(S_c1), sign(S_c2), sign(S_c3)]^T denotes the sign function. Combining equations (12) and (15), the virtual control law of the outer loop can be obtained as equation (17). In order to avoid or reduce the sliding mode chattering caused by the sign function in equation (17), a smooth continuous function can be adopted instead of the sign function. Because the saturation function is one of the simplest and most effective choices, the virtual control law is redesigned as equation (18), where sat(S_c/ξ_c) = [sat(S_c1/ξ_c), sat(S_c2/ξ_c), sat(S_c3/ξ_c)]^T denotes the saturation function with boundary layer width ξ_c > 0. Therefore, under control law (18), the attitude angles can track the commands, and the error e_c converges uniformly. Next, ω_s is provided as the main reference signal to the inner loop.
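The saturation-based reaching law that replaces the sign function can be illustrated with a minimal scalar sketch; the gains, boundary layer width, and step size below are illustrative, not the paper's values:

```python
import numpy as np

def sat(s, xi):
    """Saturation function: s/xi inside the boundary layer |s| <= xi, +/-1 outside."""
    return np.clip(s / xi, -1.0, 1.0)

def reaching_term(S, tau, xi):
    """Chattering-reduced reaching law -tau * sat(S/xi), replacing -tau * sign(S)."""
    return -tau * sat(S, xi)

# drive a scalar sliding variable S_dot = -tau * sat(S/xi) toward zero
S, tau, xi, dt = 1.0, 2.0, 0.1, 0.01
for _ in range(500):
    S += dt * reaching_term(S, tau, xi)
```

Outside the boundary layer the sliding variable decreases at the constant rate τ (as with the sign function); inside it, the motion becomes a smooth exponential decay, which is what suppresses chattering.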

ADP Auxiliary Controller.
The idea of ADP is to use function estimators to approximate the performance index and the control strategy that satisfy the principle of optimality. By designing a critic-action structure, the critic network approximates the performance index J (the cost function), where J is defined as the forward accumulation of the utility function U with discount factor λ [20,21], and U is usually defined as a quadratic. It can be seen that the cost function is then a quadratic convex function, whose local minimum is also the global minimum. The action network obtains the optimal control law u* by minimizing J. In this paper, only the auxiliary ADP controller is added to the outer loop to compensate for the attitude angle error left by main controller 1. ADP outputs u_ADP (with the same dimension as ω_s), and the sum of u_ADP and ω_s is fed as the reference instruction to the inner loop. Obviously, the ADP controller is sensitive to the attitude angle error. One can imagine that ADP starts to work when a certain error occurs; when the error meets the threshold requirements, the ADP does not need to work, which balances the loss in accuracy against calculation speed. However, this is not the focus of this paper; it may be discussed in future research, such as the selection and optimization of the threshold.
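The forward-accumulated cost with discount factor λ can be evaluated through the Bellman recursion J(k) = U(k) + λ·J(k+1); a small finite-horizon sketch (the utility values are arbitrary):

```python
def discounted_cost(utilities, lam):
    """J(k) = sum over i >= k of lam**(i-k) * U(i), computed backwards via
    the Bellman recursion J(k) = U(k) + lam * J(k+1)."""
    J = 0.0
    costs = []
    for U in reversed(utilities):
        J = U + lam * J
        costs.append(J)
    return costs[::-1]

costs = discounted_cost([1.0, 1.0, 1.0], 0.5)  # -> [1.75, 1.5, 1.0]
```

The critic network's job is precisely to approximate this quantity online, without waiting for the full future trajectory.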
In Figure 2, ADP adopts a network structure based on ADHDP, which includes an action network, a critic network, and the attitude model (9a). The input of ADP is the attitude error, and the action network generates the control signal u_ADP. At the same time, the critic network approximates J.
The specific design of each network is given below.
(1) Critic Network. In Figure 3, the critic network uses a single-hidden-layer BP neural network with six input nodes, M hidden nodes, and one output node. The input contains the attitude angle error Δc and the signal u_ADP generated by the action network. The output is the estimate Ĵ of the cost function J. Wc1 ∈ R^{M×6} is the weight matrix from the input layer to the hidden layer, and Wc1_ji (i = 1, ..., 6; j = 1, ..., M) represents the weight from the i-th input node to the j-th hidden node. Wc2 ∈ R^{1×M} is the weight matrix from the hidden to the output layer, and Wc2_j, j = 1, ..., M, represents the connection weight from the j-th hidden node to the output. Ch1 ∈ R^{M×1} and Ch2 ∈ R^{M×1} are the input and output vectors of the hidden nodes, respectively. The activation functions of the hidden layer and the output layer are a bipolar sigmoid function and a linear function, respectively. The attitude error is defined below, and the input of the critic network is INc ∈ R^{6×1}. The training of the critic network consists of two parts: the forward calculation and the error backpropagation that updates the network weights. The forward process at step k is given by equation (24), which can be rewritten in matrix form. Based on the Bellman optimality principle, the critic network approximates the cost function of the system. The actual J(k) is defined as the cumulative return from the current state into the future, where λ ∈ (0, 1) is a discount (forgetting) factor indicating the influence of future states on the current strategy, and U is the utility function at each step, defined as a quadratic. The following error E_c can be defined, and the critic network approximates J by minimizing E_c. Therefore, the network weights can be updated through backpropagation of E_c.
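A minimal sketch of the critic side: a single-hidden-layer network with tanh (bipolar sigmoid) hidden nodes and a linear output, trained on a temporal-difference-style error between successive cost estimates. The shapes follow the paper's description (six inputs, M hidden nodes, one output); the learning rate, random seed, and sample values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class Critic:
    def __init__(self, M=8):
        self.Wc1 = rng.uniform(-0.5, 0.5, (M, 6))   # input-to-hidden weights
        self.Wc2 = rng.uniform(-0.5, 0.5, (1, M))   # hidden-to-output weights

    def forward(self, inc):
        self.h = np.tanh(self.Wc1 @ inc)            # bipolar-sigmoid hidden layer
        return float(self.Wc2 @ self.h)             # linear output: estimate of J

    def update(self, ec, lam, zeta=0.05):
        # gradient step on E_c = 0.5 * ec**2 with ec = lam*J(k) - (J(k-1) - U(k));
        # only J(k) depends on Wc2, so dE_c/dWc2 = ec * lam * h^T
        self.Wc2 -= zeta * ec * lam * self.h.reshape(1, -1)

critic = Critic()
inc = np.concatenate([np.array([0.02, -0.01, 0.0]),   # attitude error Delta c
                      np.array([0.10, 0.00, -0.10])]) # u_ADP from the action net
lam, J_prev, U = 0.95, 1.0, 0.2
J_now = critic.forward(inc)
ec = lam * J_now - (J_prev - U)   # prediction error enforcing the Bellman recursion
critic.update(ec, lam)
```

Per the paper's recursion J(k−1) = U(k) + λJ(k), driving this error to zero makes successive estimates consistent with the observed utility.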
(2) Updating the Weights Wc2. Using the gradient descent method, let ΔWc2 be the update increment, where each component of ΔWc2 is represented as in equation (30) and ζ_c(k) ∈ (0, 1) is the learning rate. Equation (30) can be combined and rewritten in matrix form, and then simplified as follows, where the symbol "×" represents the Hadamard product of two matrices, that is, elementwise multiplication, and "·" represents ordinary matrix multiplication. These symbols possess the same meaning wherever they appear in the remainder of this paper.
(4) Action Network. As shown in Figure 4, the action network adopts a single-hidden-layer BP neural network with three input nodes, N hidden nodes, and three output nodes. The network's input is INa = Δc ∈ R^{3×1}, and its output is u_ADP ∈ R^{3×1}. The other parameters are defined analogously to those of the critic network. The activation functions of the hidden and output layers are a bipolar sigmoid function and a linear function, respectively. The training of the action network also includes forward calculation and error backpropagation. Firstly, the forward process is briefly presented. The action network generates an optimal control strategy by minimizing the system cost function J. This goal can be achieved by minimizing the defined error E_a. (5) Updating the Weights Wa2. With the gradient descent method, the update process of Wa2 is given below, where ζ_a(k) represents the learning rate. The connection weight from the j-th hidden node to the i-th output node is denoted as Wa2_ij (i = 1, 2, 3; j = 1, ..., N). The middle term ∂J(k)/∂u_ADPi(k) in equation (38) indicates that the backpropagated signal passes through the critic network when training the action network [31]. Furthermore, from the output and input of the critic network, ∂J(k)/∂u_ADPi(k) can be obtained, where Wc1_(:, i+3) represents the (i + 3)-th column of Wc1. Equation (40) can be rewritten in matrix form, where Wc1_uADP = Wc1(:, 4:6) represents columns 4 to 6 of Wc1, that is, the connection weights between the three input nodes corresponding to u_ADP and all hidden nodes in the critic network. From equations (37)-(41), ΔWa2 can be deduced, and by substituting equation (41) into equation (44), ΔWa1 can be easily obtained.
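The key step above is that the action network's gradient is backpropagated through the critic: ∂J/∂u_ADP is read off the critic's weights and chained into the action network's update. A hedged sketch, with illustrative sizes, seed, and learning rate (not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(1)

# shapes follow the paper: critic takes 6 inputs (3 attitude errors + 3 action
# outputs); the action net has 3 inputs, N hidden nodes, 3 outputs
M, N = 8, 6
Wc1 = rng.uniform(-0.5, 0.5, (M, 6)); Wc2 = rng.uniform(-0.5, 0.5, (1, M))
Wa1 = rng.uniform(-0.5, 0.5, (N, 3)); Wa2 = rng.uniform(-0.5, 0.5, (3, N))

def dJ_du(inc):
    """Backprop through the critic: dJ/du_ADP = (Wc2 * (1 - h^2)) @ Wc1[:, 3:6]."""
    h = np.tanh(Wc1 @ inc)
    return ((Wc2 * (1 - h**2)) @ Wc1[:, 3:6]).ravel()   # shape (3,)

def action_step(d_gamma, zeta_a=0.05):
    """One gradient step minimizing E_a = 0.5 * J^2 through the critic path."""
    global Wa2
    ah = np.tanh(Wa1 @ d_gamma)          # action hidden layer
    u = Wa2 @ ah                         # u_ADP, linear output
    inc = np.concatenate([d_gamma, u])   # critic input: [Delta c; u_ADP]
    J = float(Wc2 @ np.tanh(Wc1 @ inc))  # critic's cost estimate
    grad_u = dJ_du(inc)                  # dJ/du from the critic
    Wa2 -= zeta_a * J * np.outer(grad_u, ah)   # dE_a/dWa2 = J * (dJ/du) * ah^T
    return u

u = action_step(np.array([0.02, -0.01, 0.0]))
```

Only the hidden-to-output update is shown; the Wa1 update chains one further tanh derivative in the same way.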
So far, the training process is complete. The optimal control signal u_ADP output by the action network is combined with the signal ω_s output by outer loop main controller 1, that is, ω_c = ω_s + u_ADP, where the angular rate signal ω_c ∈ R^{3×1} is input as the reference command to inner loop controller 2, and the control torque M_c output by controller 2 operates the vehicle to complete the attitude control task.

Inner Loop Controller.
To ensure that the actual angular rate ω stably tracks the expected reference angular rate ω_c, and similarly to controller 1, the sliding variable for inner loop controller 2 is selected as follows, where e_ω = ω_c − ω ∈ R^{3×1} and ρ_ω = diag(ρ_ω1, ρ_ω2, ρ_ω3) ∈ R^{3×3} with ρ_ωi > 0, i = 1, 2, 3. In order to ensure that the inner loop tracking error e_ω asymptotically converges to the sliding surface S_ω = 0, the actual control law M has to be designed. First, take the derivative of S_ω and the Lyapunov function L_2. By Lyapunov stability, L̇_2 < 0 has to be guaranteed. Therefore, the dynamics Ṡ_ω can be chosen as follows, where the designed parameter τ_ω > 0 and sign(S_ω) = [sign(S_ω1), sign(S_ω2), sign(S_ω3)]^T denotes the sign function.
According to equations (47) and (49), the actual control law of the inner loop can be obtained. Similarly, a continuous saturation function is chosen to replace the sign function to reduce chattering. Therefore, the actual control law is rewritten as follows, where sat(S_ω/ξ_ω) = [sat(S_ω1/ξ_ω), sat(S_ω2/ξ_ω), sat(S_ω3/ξ_ω)]^T denotes the saturation function with boundary layer width ξ_ω > 0. Under the actual control law (52), L̇_2 < 0 holds. That is, the actual attitude angular rate ω converges asymptotically to the expected angular rate ω_c.
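A hedged sketch of what an inner-loop torque command of this type looks like: an equivalent-control term cancelling the gyroscopic coupling −ω × (Iω) plus a saturated reaching term. The sliding-variable form with an integral term is one common choice for a first-order loop, not necessarily the paper's exact equation; all gains are illustrative:

```python
import numpy as np

def sat_vec(S, xi):
    """Elementwise saturation with boundary layer width xi."""
    return np.clip(S / xi, -1.0, 1.0)

def inner_loop_torque(omega, omega_c, omega_c_dot, e_int, I, rho, tau, xi):
    """Equivalent control + saturated reaching term for the rate loop.
    Assumes a sliding variable of the form S = e_w + rho @ integral(e_w)."""
    e_w = omega_c - omega                      # rate tracking error
    S = e_w + rho @ e_int                      # assumed sliding variable
    # cancel gyroscopic torque, then shape the error dynamics and drive S -> 0
    return np.cross(omega, I @ omega) + I @ (omega_c_dot + rho @ e_w
                                             + tau * sat_vec(S, xi))

I = np.diag([554.0, 1136.0, 1376.0])           # placeholder inertia, kg*m^2
rho = np.diag([2.0, 2.0, 2.0])
M_c = inner_loop_torque(omega=np.array([0.01, 0.0, 0.02]),
                        omega_c=np.array([0.05, 0.0, 0.0]),
                        omega_c_dot=np.zeros(3),
                        e_int=np.zeros(3), I=I, rho=rho, tau=1.0, xi=0.05)
```

The structure mirrors the outer loop: a model-based term plus the same chattering-reduced reaching law.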

Implementation Issues
In Section 3, the design of the ADP auxiliary controller was completed, but the parameter selection and training speed of ADP cannot be ignored in practical applications. So, in this section, some implementation issues of the ADP structure are discussed, including parameter selection for the networks and techniques for speeding up training.

Network Parameters and Their Convergence.
It is clear that the critic network with a single hidden layer and randomly initialized weights can approximate J with arbitrarily small error, that is, lim_{k→∞} ‖J(k) − Ĵ(k)‖ = 0. Similarly, the action network with randomly initialized weights can minimize the cost function J, and its output can approximate the optimal control law u*_ADP. In other words, both the critic network and the action network evolve towards the optimum to achieve their goals. Furthermore, considering equations (25) and (34), it is the adjustment of the network weights Wc1, Wc2, Wa1, and Wa2 that drives the network outputs to their desired optimal values. That is, when the optimal control strategy u*_ADP is obtained, the network weights also reach the optimal weights [32], where Wc* and Wa* represent the optimal weights of the critic and action networks, respectively.

Lemma 1. In the critic and action networks, the weights Wc and Wa are ultimately uniformly stable and approach the optimal weights Wc* and Wa*.
Proof. The treatment of the input-to-hidden weights is analogous to that of the hidden-to-output weights. To simplify the elaboration, this paper only presents the uniform stability proof for Wc2 and Wa2, the hidden-to-output weights. Let the optimal weights corresponding to Wc2 and Wa2 be Wc2* and Wa2*, respectively, and assume they are bounded: ‖Wc2*‖ ≤ κ_c and ‖Wa2*‖ ≤ κ_a, where κ_c and κ_a are positive constants. Equation (28) can be rewritten accordingly, and from equations (29) to (31), the update of Wc2 can be rewritten as well; similarly for the update of Wa2. First, the Lyapunov method is adopted to analyse the convergence of Wc2, where W̃c2(k) = Wc2(k) − Wc2*(k) is the error between the actual and optimal weights. Then, the first-order difference of V_c is expressed below, and according to equation (56), equation (60) can be obtained. In addition, denote the approximation error between the actual and optimal outputs as
Denote the approximation error of the action network between the actual and optimal outputs analogously. Furthermore, set V_δ(k) = (1/2)‖δ_c(k − 1)‖², and then

From the above derivation, we can finally take the total Lyapunov function V(k). Selecting the parameters as in equation (67), equation (68) holds, where D_2 represents the corresponding bound. Furthermore, applying the Cauchy-Schwarz inequality, we obtain an estimate in which the subscript "max" represents the upper bound of the corresponding parameter's 2-norm, e.g., ‖Wc2‖ ≤ Wc2_max. Therefore, ΔV(k) ≤ 0 holds for all k. This indicates that the actual weights converge to the optimal weights; in other words, the weight errors δ_c and δ_a are uniformly bounded. This also results in a stable ADP system and an optimal output. Furthermore, note that the components of Ch2 and Ah2 are limited to [−1, 1] due to the activation functions of the hidden nodes. According to equation (67), the networks' parameters should then satisfy equation (74), which provides simple and intuitive guidance for selecting the network structure and learning rates while maintaining the stability of the weights and the ADP structure.

Improvement in Implementation.
In the previous literature, when training feedforward networks, all weights usually need to be adjusted, so there are serious dependencies between layers. Moreover, gradient-descent-based algorithms are widely applied to the learning of feedforward neural networks. However, gradient-descent-based learning is usually slow and time-consuming due to improper learning steps, or it is easily overtrained and falls into local minima.
In order to make the training process as time-saving as possible and to better match online training with practical applications, two ideas can be considered. The first is based on Igelnik and Pao's theory [34]: for a single-hidden-layer feedforward neural network, if the input-to-hidden weights are randomly initialized and kept constant, then as long as the number of hidden nodes is sufficient, the approximation error of the network can be made arbitrarily small. The second is based on the extreme learning machine (ELM) proposed by Huang et al. [35,36]: for a single-hidden-layer feedforward neural network, the input-to-hidden weights are initialized randomly and kept constant, the hidden nodes are arbitrarily chosen, and the hidden-to-output weights are determined analytically via the Moore-Penrose inverse, without the layer-by-layer partial derivative calculations of gradient descent. The extreme learning approach has been shown to be tens or even thousands of times faster than ordinary gradient descent, and it can effectively reduce complexity and avoid local minima [37].
To facilitate implementation, this paper adopts the first idea to improve performance; that is, the weights Wc1 and Wa1 are randomly initialized in a finite interval and kept constant, and only the weights Wc2 and Wa2 are adjusted by the gradient descent algorithm, which effectively avoids excessive time consumption. As for the idea based on the extreme learning machine, it is only mentioned here without in-depth discussion, due to the limited space of this paper and the lack of theoretical guidance for its application to vehicles. We may provide further analysis and more rigorous theory to support its application in practical vehicle control in future research.
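A quick sketch contrasting these ideas: freeze the randomly initialized input-to-hidden weights and fit only the output weights, here solved in one ELM-style least-squares step on a toy regression task. The target function, sizes, and ranges are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# toy data: approximate a smooth target of 6 inputs
X = rng.uniform(-1.0, 1.0, (200, 6))
y = np.sin(X.sum(axis=1))

M = 50
W1 = rng.uniform(-1.0, 1.0, (M, 6))         # random input-to-hidden weights, frozen
H = np.tanh(X @ W1.T)                       # hidden-layer outputs (200 x M)
W2, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights in one analytic step
train_mse = float(np.mean((H @ W2 - y) ** 2))
```

The gradient-descent variant used in this paper updates W2 iteratively instead of solving for it in closed form, but the dependency structure is the same: only the output layer is trained.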

Simulations
In this section, the control strategy with ADP derived above is applied to vehicle attitude control, and the effectiveness of the designed strategy is verified by comparison with the conventional controller without ADP.
According to a vehicle model in our laboratory, the inertia matrix I and the common parameters are taken as follows. External disturbances [10, 20, 10] will be added at 10 s. Figure 5 presents the tracking results of the three attitude angles. As can be seen from these figures, the controller with ADP is more responsive than the controller without ADP. For example, the controller with ADP can accurately track instructions within 100 steps with less overshoot, while the controller without ADP requires about 200 steps. When the external disturbances are added at 10 s, the controller with ADP also responds more quickly and with less overshoot. From these results, it can be seen that ADP improves the performance of the system. The controller with ADP shows faster response and less overshoot, which benefits from the ADP structure's auxiliary action on the outer loop. Through a training process that meets the expected threshold, the ADP structure generates the auxiliary optimal control signal to compensate for the deficiency of outer loop main controller 1 in eliminating attitude error. Figures 6-11 show the training process of the ADP network. Specifically, Figures 6-9 show the dynamic adjustment of the network weights, and Figures 10 and 11 show the estimated cost function output by the critic network and the optimal control signal output by the action network. Compared to Figure 5, Figures 6-9 show the rapid adjustment of the network weights at the beginning stage to achieve the purpose of tracking instructions. As the system output gradually keeps up with the instructions, the weights converge to the optimal weights (W*, as demonstrated in Section 4) and remain stable. When the external disturbances are added at 10 s, the network weights are adjusted again and tend towards other optimal weights. This shows that ADP produces auxiliary output that takes effect at the beginning and whenever disturbances appear.
According to the thinking and analysis in Section 4.2, when implementing this control strategy, one can randomly initialize the input-to-hidden weights (Wc1 and Wa1) and keep them constant. During training, adjusting only the weights Wc2 and Wa2 not only achieves the same optimal control goal but also greatly reduces time consumption. Figures 12 and 13 show the corresponding weight changes. Simultaneously, Figure 14 compares the time consumption across 12 simulation groups. It can be concluded that the average time consumption of keeping the input-to-hidden weights (Wc1 and Wa1) fixed and adjusting only the hidden-to-output weights (Wc2 and Wa2) is 31.9% lower than that of adjusting all weights. Although the sample in Figure 14 is limited, combined with the analysis in Section 4.2 and neural network theory, the effectiveness of this idea in reducing time consumption and improving efficiency is significant.
Furthermore, Figures 15-19 show the tracking control results for time-varying attitude commands using the controller with ADP. Pulsed disturbances d = [5, 1, 1]^T and d = [10, 2, 2]^T are introduced at 10 s and 20 s, respectively, as indicated by the yellow arrows in the figures. From Figures 15-17, it can be seen that the controller with the ADP auxiliary structure makes the actual attitude angles accurately track the commands. Figures 18 and 19 show the weights of the action and critic networks in ADP. The weights of the action network are dynamically adjusted to output the optimal auxiliary control signal u_ADP in real time, as shown in Figure 20. From these results, it can be seen that the controller with the ADP auxiliary structure has good dynamic stability performance.

Conclusions
Combining reinforcement learning, one of the most active research areas at present, this paper presents an ADP-based attitude control methodology for reentry vehicles, applying ADP to three-channel attitude control. First, a nonlinear model of the three-channel attitude system is established and divided into inner and outer loops according to the principle of time scale separation. Both the inner and outer loops utilize a conventional sliding mode controller as the main controller, and an auxiliary ADP framework is introduced into the outer loop. When facing the vehicle's nonlinearity and sudden disturbances in particular, the main controller tends to underperform due to its lack of sufficient adaptability; at such times, the auxiliary role of ADP is fully exerted. Because ADP uses a critic network and an action network, the ADP structure has good learning ability: it generates the optimal auxiliary signal immediately after learning the tracking error, compensating for the deficiency of the main controller and improving the adaptability and response speed of the entire control system. For implementation, this paper discusses ADP parameter selection strategies and some techniques for speeding up training, and stability is proved by the Lyapunov method. Finally, simulation results for step and time-varying commands demonstrate the effectiveness of the designed scheme for the nonlinear attitude system.
In future work, we will focus on switching or event-triggered strategies for this dual-controller structure. If the ADP auxiliary structure were event-triggered rather than time-triggered, the consumption of ADP's time and system resources would be greatly reduced, improving efficiency.

Data Availability
Some data used in this article are confidential, but other public data can be obtained by contacting li_xu@hust.edu.cn.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.