Model-Free Attitude Control of Spacecraft Based on PID-Guide TD3 Algorithm



Introduction
With the rapid development of space technology, the structure and composition of On-Orbit Servicing Spacecraft (OOSS) are becoming increasingly complex, and their performance is constantly improving. The effectiveness of attitude control determines the success or failure of the service mission and the life of the spacecraft. An OOSS is characterized by a flexible multibody structure, liquid sloshing, and fuel consumption, and it needs to change its structure and parameters according to the mission requirements. For example, in the case of mass loss due to fuel consumption, the rate of mass change of the spacecraft is a function of the control applied and the actuator hardware characteristics. The motion of a space manipulator causes mass displacement, which changes the inertial parameters and disturbs the attitude stability of the servicing spacecraft body. In addition, the spacecraft is affected by gravity gradient torque, aerodynamic torque, radiation torque, and other unknown disturbance torques. These peculiarities give the attitude dynamics of the OOSS characteristics such as uncertainty, time variation, strong nonlinearity, and high-order multivariable coupling [1]. In many complex space missions, it is even difficult to establish an accurate mathematical model, which increases the difficulty of attitude control.
The classical attitude control methods include PID control [2], adaptive control [3], sliding mode control [4], Lyapunov control [5], optimal control [6], and robust H∞ control [7]. These control algorithms have achieved good results in simulation experiments and practical applications. However, traditional attitude control algorithms require a relatively accurate system model and carefully designed controller parameters. When the system cannot be modeled completely or the environment changes greatly, the performance of the controller degrades. For systems with unclear or completely unknown mathematical models, intelligent control methods with self-learning ability are a promising choice.
Reinforcement learning (RL) is a branch of machine learning that is closely related to dynamic programming and optimal control theory [8]. The basic idea of RL is to explore the optimal strategy through the interaction between agent and environment, so as to maximize the return [9]. Classical RL, such as Q-learning, discretizes the action and state spaces and solves the problem with a lookup table [10]. However, practical control problems often have continuous action and state spaces that are difficult to discretize. High-dimensional continuous state and action spaces increase the computational burden and lead to the so-called curse of dimensionality (CoD). Deep reinforcement learning (DRL) addresses the CoD problem by introducing neural networks to approximate the value function and the policy [11]. In 2015, Lillicrap et al. proposed the Deep Deterministic Policy Gradient (DDPG) algorithm to solve control problems with continuous action spaces [12]. DRL became widely known in 2016 when Google's AlphaGo defeated the top international Go player Lee Sedol.
Up to now, DRL has been used in robots [13], UAVs [14,15], energy [16,17], transportation [18], and other control fields. DRL based on the Markov decision process (MDP) provides an effective way to realize intelligent control of spacecraft. In the process of self-learning, DRL optimizes the parameters of the neural networks iteratively, which removes the need to design controller parameters by hand, enables the controller to adapt to changing software, hardware, and environment, and allows its performance to be improved continuously by changing the reward function. As space exploration missions become more frequent and complex and spacecraft travel farther from the earth, DRL technology with fast learning and self-regulation ability will play an increasingly important role in spacecraft attitude control systems. Based on the above discussion, a model-free attitude control scheme based on DRL is proposed, and a mixed reward function for spacecraft attitude control is designed to realize attitude stabilization control and attitude tracking control of spacecraft. Because the TD3 algorithm explores the optimal strategy relatively slowly when it uses no prior knowledge, the PID-Guide TD3 algorithm is proposed, which uses a PID controller to guide the exploration process of the DRL algorithm, so as to significantly speed up training and improve the convergence accuracy of the algorithm. Because RL is difficult to deploy in the actual environment, this paper proposes that the algorithm be pretrained on the ground and its parameters then fine-tuned on orbit, so as to save training time and computing resources and achieve better results quickly.
The rest of the paper is organized as follows. The problem statement and control objectives are given in Section 2. The DRL controller of spacecraft attitude is designed, and the PID-Guide TD3 algorithm is proposed, in Section 3. Section 4 demonstrates the effectiveness of the proposed algorithm through three simulation cases. Conclusions are given in Section 5.

Problem Statement and Preliminaries
2.1. Problem Statement. The paper mainly studies the attitude control problem of OOSS under complex conditions such as an uncertain inertia matrix, a motion model that may be impossible to establish, and external disturbances. The control objective is to achieve attitude stabilization control under bounded disturbances, that is,

$$\lim_{t \to \infty} p(t) = p_d, \qquad \lim_{t \to \infty} \omega(t) = \omega_d,$$

where $p(t)$ is the attitude angle, $\omega(t)$ is the angular velocity, $p_d$ is the target attitude angle, and $\omega_d$ is the target angular velocity. The characteristics of this control problem are as follows: (A) it is difficult to establish accurate dynamic and kinematic models; (B) there are unknown external disturbances, such as perturbation and solar wind; (C) the actuator is subject to saturation.

2.2. Attitude Representation. The earth-centered inertial coordinate system is defined as shown in Figure 1. When the attitude of the spacecraft is in a stable state, the spacecraft body-fixed frame coincides with the orbital coordinate system. When attitude motion occurs, the orientation of the body-fixed frame relative to the orbital coordinate system is the attitude of the spacecraft. There are usually five ways to describe the attitude of the spacecraft: Euler angles, rotation matrices, quaternions, Rodrigues Parameters (RP), and Modified Rodrigues Parameters (MRP). The different descriptions can be converted into one another, and their characteristics can be found in [19,20].

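2.3. Spacecraft Dynamics. The kinematic equation of spacecraft attitude described by MRP is given by

$$\dot{p} = \frac{1}{4}\left[\left(1 - p^{T}p\right)\omega + 2\,p \times \omega + 2\left(p^{T}\omega\right)p\right].$$

Figure 1: Coordinate definition.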

International Journal of Aerospace Engineering
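The matrix form of the above equation is as follows:

$$\dot{p} = \frac{1}{4}\left[\left(1 - p^{T}p\right)I_{3\times3} + 2p^{\times} + 2pp^{T}\right]\omega,$$

where $p = [p_1\ p_2\ p_3]^{T}$ is the Modified Rodrigues Parameters and $\omega = [\omega_x\ \omega_y\ \omega_z]^{T} \in \mathbb{R}^{3}$ is the angular velocity of the spacecraft relative to the earth-centered inertial frame, expressed in the body-fixed frame. $I_{3\times3}$ is the third-order identity matrix. $p^{\times}$ is the skew-symmetric matrix of $p$, defined as

$$p^{\times} = \begin{bmatrix} 0 & -p_3 & p_2 \\ p_3 & 0 & -p_1 \\ -p_2 & p_1 & 0 \end{bmatrix}.$$

The attitude dynamics equation can be expressed by the following equation:

$$J\dot{\omega} + \omega^{\times}J\omega = M_c + M_d,$$

where $J \in \mathbb{R}^{3\times3}$ is the inertia matrix of the spacecraft and $\omega^{\times}$ is the skew-symmetric matrix of $\omega$. $M_c \in \mathbb{R}^{3}$ is the control torque. $M_d \in \mathbb{R}^{3}$ is the total disturbing torque, including the gravity gradient torque, aerodynamic torque, magnetic disturbing torque, and radiation torque.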
Remark 1. The dynamic model mentioned here is only used as a simulation of the space environment and will not be used directly in the controller.
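As an illustration of how such a simulation model could be stepped in time, the following minimal sketch evaluates the MRP kinematics and the rigid-body dynamics for a diagonal inertia matrix. The explicit Euler integrator, the restriction to a diagonal inertia matrix, and all function names are assumptions made for this sketch; the paper does not state which integration scheme its simulator uses.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as lists."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    """Inner product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def mrp_rate(p, w):
    """MRP kinematics: dp/dt = 1/4 [(1 - p.p) w + 2 p x w + 2 (p.w) p]."""
    pp, pw, pxw = dot(p, p), dot(p, w), cross(p, w)
    return [0.25 * ((1.0 - pp) * w[i] + 2.0 * pxw[i] + 2.0 * pw * p[i])
            for i in range(3)]


def omega_rate(w, j_diag, torque):
    """Dynamics J dw/dt = -w x (J w) + M for a diagonal J (assumption).

    `torque` is the total torque M_c + M_d.
    """
    jw = [j_diag[i] * w[i] for i in range(3)]
    gyro = cross(w, jw)
    return [(torque[i] - gyro[i]) / j_diag[i] for i in range(3)]


def euler_step(p, w, j_diag, torque, dt=0.1):
    """One explicit Euler step (assumed integrator) of size dt seconds."""
    dp, dw = mrp_rate(p, w), omega_rate(w, j_diag, torque)
    p_next = [p[i] + dt * dp[i] for i in range(3)]
    w_next = [w[i] + dt * dw[i] for i in range(3)]
    return p_next, w_next
```

For example, `euler_step([0.3, 0.1, -0.2], [0.0, 0.0, 0.0], [120.0, 100.0, 120.0], [1.0, 0.0, 0.0])` advances the state by one 0.1 s step under a 1 N·m torque about the body x-axis.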

DRL Controller Design for Spacecraft
3.1. Background. Deep Deterministic Policy Gradient (DDPG) is an actor-critic method that uses neural networks to approximate the policy function and the Q-function, namely, the policy network and the Q network. Therefore, it can deal with problems that have a continuous action space. At the same time, the use of experience replay and target networks alleviates the convergence difficulties of actor-critic methods [12]. While DDPG can sometimes achieve great performance, it is frequently sensitive to hyperparameters and other kinds of tuning [21]. A common failure mode for DDPG is that the learned Q-function begins to overestimate Q values dramatically, which then breaks the policy, because the error in the Q-function is propagated into the training of the policy network. The Twin Delayed DDPG (TD3) algorithm is an improved version of DDPG that uses three techniques to address these problems [22].
The first one is Double Q Networks. TD3 learns two Q-functions and uses the smaller of the two Q values to form the targets in the Bellman error loss functions; thus, the overestimation of the value function is reduced.
The second one is "Delayed" Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-function.
The last one is Target Policy Smoothing. TD3 adds noise to the target action to reduce sensitivity and instability, making it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action. The combination of these three techniques significantly improves performance of TD3 relative to baseline DDPG.
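The first and third techniques can be sketched together in the computation of the critic target. The snippet below is a minimal illustration, not the authors' implementation: the networks are stood in for by plain Python callables, and the default values of `gamma`, `sigma`, and `c` are placeholders chosen for this sketch rather than the hyperparameters of this paper.

```python
import random


def clip(x, lo, hi):
    """Clamp a scalar to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(target_actor, next_state,
                           sigma=0.2, c=0.5, a_max=1.0, rng=random):
    """Target policy smoothing: add clipped Gaussian noise, then saturate."""
    action = target_actor(next_state)
    return [clip(a + clip(rng.gauss(0.0, sigma), -c, c), -a_max, a_max)
            for a in action]


def td3_target(reward, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, sigma=0.2, c=0.5, a_max=1.0, rng=random):
    """Clipped double-Q target: y = r + gamma * min(Q1', Q2')."""
    a_next = smoothed_target_action(target_actor, next_state,
                                    sigma, c, a_max, rng)
    q_min = min(target_q1(next_state, a_next),
                target_q2(next_state, a_next))
    return reward + gamma * q_min
```

Taking the minimum of the two target critics is what counteracts overestimation, while the clipped noise keeps the target from depending on a single, possibly erroneous, peak of the Q-function.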

3.2. End-to-End Attitude Control Based on TD3 Algorithm. The training goal of the DRL model is to maximize the total return when the agent interacts with the environment. A DRL controller has a natural similarity with traditional control but uses different terms to represent the same concepts, as shown in Figure 2. DRL is composed of agent, environment, state, action, and reward. The agent learns the optimal policy by interacting with the environment and takes the optimal action to maximize the long-term reward, whereas in traditional control the controller (policy) is designed by experts. The state feedback signal corresponds to the observation of the environment, and the reference signal is built into the reward function and the observation.

Environment and State. In order to implement DRL, the simulation environment of spacecraft attitude motion should be established according to the attitude dynamic and kinematic equations of the spacecraft. The simulation environment includes the environment dynamic model and the interface between the agent and the environment. The input of the environment is the action output by the agent, and the output is the observed state and reward signal after the action is performed. For the spacecraft attitude control problem, the system states are selected as the attitude angle $p = (p_1\ p_2\ p_3)^{T}$ and the angular velocity $\omega = (\omega_x\ \omega_y\ \omega_z)^{T}$. The reference signal includes the target states $p_T = (p_{T1}\ p_{T2}\ p_{T3})^{T}$ and $\omega_T = (\omega_{Tx}\ \omega_{Ty}\ \omega_{Tz})^{T}$. The system state $s$ and the error signal $e$ are used as the observations, where $e = (p_T - p,\ \omega_T - \omega)^{T} = (dp_1\ dp_2\ dp_3\ d\omega_x\ d\omega_y\ d\omega_z)^{T}$. The action space is the continuous three-dimensional control torque, with each component $a \in [-1, 1]$ N·m.
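The observation and action conventions above can be sketched as follows. The ordering of the entries in the observation vector (state first, then error) and the function names are assumptions of this sketch; the text only specifies which quantities are observed.

```python
def build_observation(p, w, p_target, w_target):
    """Stack state s = (p, w) and error e = (p_T - p, w_T - w) into 12 values."""
    error = ([pt - pi for pt, pi in zip(p_target, p)] +
             [wt - wi for wt, wi in zip(w_target, w)])
    return list(p) + list(w) + error


def clip_action(action, a_max=1.0):
    """Saturate each torque component to [-a_max, a_max] N*m."""
    return [max(-a_max, min(a_max, a)) for a in action]
```

With the initial attitude $(0.3, 0.1, -0.2)^{T}$ and zero targets, the first three error entries are simply the negated attitude components.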

Reward Function. The reward is the signal by which the agent measures its performance against the task goal. A reasonable reward function is the key for the agent to learn an effective policy, and it determines the convergence speed and stability of RL. Typically, a positive reward is offered to encourage certain behaviors of the agent and a negative reward is offered to deter others [23]. Reward functions can be divided into three types: continuous, discrete, and mixed. A continuous reward function varies continuously with observation and action. In general, continuous reward signals improve convergence during training and allow a simpler network structure. A discrete reward function varies discontinuously with observation and action. This type of reward signal slows down convergence and requires a more complicated network structure. Discrete rewards are usually implemented as events that occur in the environment. For example, if the agent exceeds a certain threshold, it may receive a positive reward; if a certain performance constraint is violated, it is penalized. The more commonly used reward signal is the mixed reward signal, which combines continuous rewards and discrete rewards. Discrete reward signals keep the system away from bad states, and continuous reward signals improve convergence by providing smooth rewards near the target state.
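To make the idea of a mixed reward concrete, the sketch below combines a continuous error-and-effort term, a continuous bonus that only acts near the target, and a discrete out-of-range penalty. Every coefficient, the threshold `near`, the angle `limit`, and the functional form itself are hypothetical choices for illustration; this is not the reward function used in this paper.

```python
def mixed_reward(error, action, angle,
                 k_u=0.1, near=0.1, bonus=1.0, limit=1.0, penalty=100.0):
    """Illustrative mixed reward (all constants are assumptions).

    error  : tracking error components
    action : control torque components
    angle  : attitude components checked against the allowed range
    """
    # Continuous part: smaller error and smaller control effort are better.
    r_cont = -sum(abs(e) for e in error) - k_u * sum(abs(a) for a in action)

    # Continuous bonus: extra gradient once every error component is small.
    max_err = max(abs(e) for e in error)
    r_near = bonus * (near - max_err) / near if max_err < near else 0.0

    # Discrete part: fixed penalty when the attitude leaves the allowed range.
    r_disc = -penalty if max(abs(x) for x in angle) > limit else 0.0

    return r_cont + r_near + r_disc
```

The discrete term dominates when the range is violated, whereas close to the target only the two continuous terms shape the reward.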
A mixed reward function composed of three terms, $r_1$, $r_2$, and $r_3$, is designed for spacecraft attitude control, where $\|\cdot\|_1$, $\|\cdot\|_{-\infty}$, and $\|\cdot\|_{\infty}$ denote vector norms. $r_1$ is a continuous reward, which stabilizes the system state and reduces fuel consumption. The smaller the error, the greater the reward $r_1$. The second term in $r_1$ represents energy consumption. When the attitude of the spacecraft is very close to the target attitude, the error changes little, and the change of $r_1$ becomes insignificant. Therefore, the continuous reward $r_2$ is introduced to increase the reward gradient when the absolute value of each error component is less than 0.1, so as to guide the attitude angle to approach the target value quickly and accurately. $r_3$ is a discrete reward, which keeps the attitude angle within the allowed range and increases training speed.

During training, transitions are stored in an experience replay buffer; the agent randomly samples a mini-batch from it and feeds those samples to update the actor and critic networks. The experience replay buffer allows the agent to learn from previous experiences and improves sample efficiency. Random sampling breaks the correlation between samples and makes the learning process of the agent more stable [24]. TD3 uses a total of six neural networks, namely, the actor network $\pi_{\phi}$, the actor target network $\pi_{\phi'}$, the critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$, and the critic target networks $Q_{\theta_1'}$ and $Q_{\theta_2'}$. The role and update rules of each network are as follows.
For the actor network, the deterministic policy gradient is used to optimize the parameters, and the loss function is defined as

$$L(\phi) = -\frac{1}{N}\sum Q_{\theta_1}\bigl(s, \pi_{\phi}(s)\bigr).$$

The target networks are updated by the soft update method:

$$\theta_i' \leftarrow \tau\theta_i + (1 - \tau)\theta_i', \quad i = 1, 2, \qquad \phi' \leftarrow \tau\phi + (1 - \tau)\phi'.$$

3.3. PID-Guide TD3 Algorithm. TD3 is a model-free algorithm, which does not use any prior knowledge and constantly explores through interaction with the environment to obtain the optimal strategy. It is time-consuming for the agent to find the optimal policy without any prior knowledge because of problems such as the partial observability of environmental feedback, the sparsity of rewards, and the high-dimensional state and action spaces. In order to speed up training and improve the convergence stability of the algorithm, a PID-Guide TD3 algorithm is proposed in this section. The core idea of the PID-Guide TD3 algorithm is as follows. In the current state $s$, two actions are generated, one by the actor network and one by the PID controller. Then, the critic network is used to evaluate the two actions, and the action with the higher value is actually executed. In fact, any model-free controller can guide TD3 in its policy search. Figure 3 describes the structure of PID-Guide TD3. PID is a model-free controller based on error feedback. PID requires precise design of its parameters, so a change of environment can seriously degrade its performance. The formula of the PID algorithm is

$$a_{PID}(t) = K_p e(t) + K_i \int_0^{t} e(\xi)\,d\xi + K_d \dot{e}(t),$$

where $K_p$, $K_i$, and $K_d$ are positive definite matrices containing the control parameters, which are called the proportional, integral, and differential coefficients, respectively. The pseudocode of PID-Guide TD3 is given in Algorithm 1.
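The guided action selection described above can be sketched in a few lines. The critic is represented by an arbitrary callable, and the tie-breaking rule (the actor's action is kept when both values are equal) is an assumption of this sketch, since the text only states that the higher-valued action is executed.

```python
def select_guided_action(state, actor_action, pid_action, critic):
    """Return the candidate action that the critic values more highly.

    critic(state, action) -> float is a stand-in for the critic network.
    The second return value names the source of the chosen action.
    """
    q_actor = critic(state, actor_action)
    q_pid = critic(state, pid_action)
    if q_pid > q_actor:
        return pid_action, "pid"
    return actor_action, "actor"  # ties fall back to the actor (assumption)
```

Early in training, when the actor is close to random, the PID action tends to be selected whenever the critic rates it higher, so the replay buffer fills with more informative transitions; as the actor improves, its own action is chosen more often.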
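The main steps of the PID-Guide TD3 algorithm are as follows:
(1) Randomly initialize the critic networks $Q_{\theta_1}$, $Q_{\theta_2}$ and the actor network $\pi_{\phi}$ with weights $\theta_1$, $\theta_2$, $\phi$, and initialize the target networks $Q_{\theta_1'}$, $Q_{\theta_2'}$, $\pi_{\phi'}$ with weights $\theta_1' \leftarrow \theta_1$, $\theta_2' \leftarrow \theta_2$, $\phi' \leftarrow \phi$
(2) Initialize the replay buffer $R$
(3) Start a new episode, set the target position, randomly reset the environment, and get the initial state $s$
(4) For every step, select an action according to the current policy with exploration noise: $a = \pi_{\phi}(s) + \varepsilon$, where the exploration noise $\varepsilon \sim N(0, \sigma)$
(5) Select another action according to the PID controller: $a_{PID} = K_p e + K_i \int e\,dt + K_d \dot{e}$
(6) Evaluate both actions with the critic network and execute the action with the higher value
(7) Observe the reward $r$ and the new state $s'$
(8) Store the transition $(s, a, r, s')$ in $R$
(9) Sample a random mini-batch of $N$ transitions from $R$
(10) Compute the target action: $\tilde{a} = \mathrm{clip}\bigl(\pi_{\phi'}(s') + \mathrm{clip}(\varepsilon, -c, c), -a_{max}, a_{max}\bigr)$, where $c$ is the scale of the clipped noise and $a_{max}$ is the maximum control torque
(11) Compute the target Q value: $y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a})$, where $\gamma$ is the discount factor
(12) Update the critic networks by $\theta_i \leftarrow \arg\min_{\theta_i} \frac{1}{N}\sum \bigl(y - Q_{\theta_i}(s, a)\bigr)^2$
(13) Update the actor network by the deterministic policy gradient, less frequently than the critic networks
(14) Update the target networks by $\theta_i' \leftarrow \tau\theta_i + (1 - \tau)\theta_i'$, $\phi' \leftarrow \tau\phi + (1 - \tau)\phi'$, where $\tau$ is the update rate for the target model

3.4. Pretraining and Fine-Tuning. When the state space and action space are too large, the DRL algorithm is difficult to apply directly in space tasks because of its low learning efficiency and the difficulty of training the networks. In order to shorten the training time and avoid dangerous states during exploration, this paper proposes to use the pretraining and fine-tuning method from deep learning to further improve the learning efficiency. Pretraining in deep learning means training a model before it starts performing a particular task. Pretraining imitates the way human beings process new knowledge. The weights saved from the previous network are used as the initial weights for the new experiment. In this way, the old knowledge helps new models perform new tasks from old experience instead of from scratch. A well-established paradigm is to pretrain models using large-scale data and then to fine-tune the models on target tasks that often have less training data.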
Pretraining has enabled state-of-the-art results on many tasks, including object detection, image segmentation, and action recognition [25].
The pretraining and fine-tuning method makes it possible to deploy the DRL controller on orbit. The real space environment differs from the simulation environment: it contains uncertainties such as unknown disturbances and unknown inertial parameters. For the spacecraft attitude control task, an agent that has been pretrained on the ground needs only a small amount of on-orbit training to fine-tune its parameters and obtain good performance.

Simulation and Results
In order to verify the effectiveness and superiority of the proposed PID-Guide TD3 algorithm, the simulations are carried out in this section. The following numerical simulations are organized.
Case 1: in the ideal environment without external disturbances, the agent is trained to realize the attitude stabilization control and attitude tracking control of spacecraft, respectively.
Case 2: on the basis of Case 1, the existence of unknown disturbance torques is considered.
Case 3: on the basis of Case 2, the PID-Guide TD3 algorithm is used to accelerate training and improve convergence stability.
The experiments are conducted in the OpenAI Gym simulation environment. The step size of the Gym simulator, which specifies the duration of each physics update step, is set to 0.1 s to obtain accurate simulations. The inertia matrix is set as $J = \mathrm{diag}(120, 100, 120)$ kg·m². The initial states of the system are selected as $p_0 = (0.3, 0.1, -0.2)^{T}$ and $\omega_0 = (0, 0, 0)^{T}$ rad/s. The target states are $p_d = (0, 0, 0)^{T}$ and $\omega_d = (0, 0, 0)^{T}$ rad/s. The maximum value of the control torque is $a_{max} = 1$ N·m.
The policy and value functions are approximated by four-layer neural networks with ReLU activations on each hidden layer. The composition of the networks is shown in Table 1.
The hyperparameter settings during the implementation of the algorithm are given in Table 2. All network architecture and hyperparameters used in the three cases are the same.

Case 1: End-to-End TD3 Algorithm under Ideal Environment. In order to verify the performance of the End-to-End DRL controller and the effectiveness of the reward function proposed in Section 3.2, the agent is trained in an ideal environment without external disturbances to realize attitude stabilization and attitude tracking control of the spacecraft, respectively. Figure 4 shows the learning process of the End-to-End TD3 algorithm. It can be seen that the algorithm basically converges after 100 episodes of training, and the reward for each episode finally stabilizes at 15000.
In this case, the control strategy output by the agent, namely, the control torque, is shown in Figure 5. The attitude angle reaches the target value, and the angular velocity is basically zero. The convergence speed is fast, and the amount of overshoot is small. The error curve of the attitude angle is shown in Figure 5(d). The error level drops to $10^{-4}$, which shows that the control accuracy is relatively high.
The objective of attitude tracking control is to track the desired attitude with $p_{T0} = (0.1, 0.05, -0.15)^{T}$ and $\omega_T(t) = [0.02\cos(t/15),\ 0.02\sin(t/20),\ 0.02\sin(t/15)]^{T}$ rad/s. The simulation results are illustrated in Figure 6. It can be seen that both the attitude and the angular velocity converge to their targets rapidly, which indicates that the tracking objective is accomplished with the proposed End-to-End DRL controller.
With unknown disturbance torques considered (Case 2), the speed of the algorithm is slightly reduced, but it still has good convergence accuracy. It can be seen that the disturbance torques are completely compensated by the controller.

Case 3: PID-Guide TD3 Algorithm under Unknown External Disturbances. In order to verify the superiority of the PID-Guide TD3 algorithm proposed in Section 3.3, the training speed and stability of PID-Guide TD3, End-to-End TD3, and the pretraining/fine-tuning method are compared. The training process and testing process are illustrated in Figures 9 and 10. The simulation results are illustrated in Figure 11. A PID-like controller designed for spacecraft attitude stabilization is used as the guide controller [26]. The weighting coefficients are chosen as $K_p = 40I_{3\times3}$, $K_i = 0I_{3\times3}$, and $K_d = 60I_{3\times3}$.
From Figures 9 and 10, it is evident that all three algorithms achieve good convergence performance after training. However, compared with the End-to-End TD3 algorithm, PID-Guide TD3 trains significantly faster and reaches higher convergence accuracy. The PID controller itself does not have very high performance, but it can guide TD3 to produce better sample data at the beginning of training, thus speeding up the training process.
Further, it can also be observed that the pretrained agent needs only a few episodes of training to adapt to the new environment while avoiding dangerous states during exploration. The benefits of pretraining extend beyond quick convergence, since pretraining can also improve model robustness and the handling of uncertainty. Therefore, the pretrained DRL controller can be deployed to the spacecraft and its parameters then fine-tuned on orbit, instead of being trained from scratch.
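A guide controller with these gains can be sketched as below. The exact control law of [26] is not reproduced in this paper, so the form used here is an assumption: the proportional and integral terms act on the attitude error, the derivative term acts on the angular velocity error as a stand-in for the error rate, the gains are treated as scalars because each matrix is a multiple of the identity, and the output is saturated at the maximum torque.

```python
class PIDGuide:
    """PID-like attitude guide controller (illustrative form, see lead-in)."""

    def __init__(self, kp=40.0, ki=0.0, kd=60.0, a_max=1.0, dt=0.1):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.a_max, self.dt = a_max, dt
        self.integral = [0.0, 0.0, 0.0]

    def reset(self):
        """Clear the integral state at the start of an episode."""
        self.integral = [0.0, 0.0, 0.0]

    def torque(self, p_err, w_err):
        """Compute the saturated torque from attitude and rate errors."""
        out = []
        for i in range(3):
            self.integral[i] += p_err[i] * self.dt
            u = (self.kp * p_err[i]
                 + self.ki * self.integral[i]
                 + self.kd * w_err[i])
            out.append(max(-self.a_max, min(self.a_max, u)))
        return out
```

With the initial error $p_d - p_0 = (-0.3, -0.1, 0.2)^{T}$ and zero rate error, every axis of this sketch saturates at the 1 N·m limit, which illustrates why the guide by itself is a coarse controller and mainly serves to steer early exploration.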

Conclusions
A DRL-based control approach is proposed to handle the model-free attitude control problem of OOSS under the guidance of the mixed reward system. The PID-Guide TD3 algorithm based on prior knowledge is proposed to increase the training speed and learning stability of the TD3 algorithm. In addition, the pretraining and fine-tuning method is proposed to realize the deployment of the DRL controller in space. Simulation results show that the DRL controller can achieve high-precision attitude stabilization and attitude tracking control with fast response and small overshoot. The learning curves of the three DRL methods illustrate that the proposed PID-Guide TD3 algorithm learns faster and converges more accurately than the baseline TD3 algorithm. The pretraining and fine-tuning method allows the controller to adapt very quickly to uncertain and unknown disturbances in the actual environment.


Data Availability
All data, models, or code generated or used during the study are available from the corresponding author by request.

Conflicts of Interest
The authors declare no conflicts of interest.