Reinforcement Learning for Computational Guidance of Launch Vehicle Upper Stage

This manuscript investigates the use of a reinforcement learning method for the guidance of launch vehicles and a computational guidance algorithm based on a deep neural network (DNN). Computational guidance algorithms can deal with emergencies during flight and improve the success rate of missions, and most of the current computational guidance algorithms are based on optimal control, whose calculation efficiency cannot be guaranteed. However, guidance-based DNN has high computational efficiency. A reward function that satisfies the flight process and terminal constraints is designed, then the mapping from state to control is trained by the state-of-the-art proximal policy optimization algorithm. The results of the proposed algorithm are compared with results obtained by the guidance-based optimal control, showing the effectiveness of the proposed algorithm. In addition, an engine failure numerical experiment is designed in this manuscript, demonstrating that the proposed algorithm can guide the launch vehicle to a feasible rescue orbit.


Introduction
This manuscript studies the computational guidance algorithm based on DNN. Lu [1] proposed the concept of "computational guidance and control," in which the generation of guidance and control commands relies extensively on onboard computation and does not require a specified reference trajectory.
So far, most of the research on computational guidance is based on the optimal trajectory planning problem. The primary aim of the trajectory planning algorithm is to solve the optimal control problem (OCP), which is generally based on nonlinear dynamics and achieves specific performance indicators under the constraints of state and control variables. The solution to the problem is mainly achieved using indirect [2][3][4] and direct [5][6][7] methods. The indirect method solves the optimal control problem by using the classical variational method and Pontryagin's minimum principle to derive the necessary first-order conditions of the optimal control and transform the problem into a two-point boundary value problem (TPBVP) [8]. However, the convergence of the numerical iteration is extremely sensitive to the initial value, and the TPBVP is difficult to solve. Therefore, the indirect method cannot be directly applied to launch vehicles' guidance systems without simplification.
The direct method transforms the continuous optimal control problem into a nonlinear programming problem and uses a numerical method to directly optimize the performance index [9][10][11]. In 2007, JPL proposed a lossless convexification technique for powered descent guidance of the Mars lander [12]. After that, a systematic summary of the research and development of lossless convexification was presented in [13]. Unfortunately, only a few nonconvex constraints are amenable to lossless convexification. For problems where lossless convexification cannot be used, a sequential convexification method was proposed. However, this method is based on linearization, requires multiple iterations, and is also sensitive to the initial guess. Nevertheless, because convex optimization algorithms solve convex problems quickly, trajectory planning based on them has been widely studied in recent years, for example for planetary landing [14], rocket ascent guidance [15], and entry guidance [16].
In recent years, with the application of machine learning methods in various fields, researchers in the aerospace field have also begun to pay attention to machine learning, especially deep learning and reinforcement learning. DNNs are among the most versatile and powerful machine learning tools, thanks to their unique capability of accurately approximating complex nonlinear input-output functions when provided with a sufficiently large amount of data consisting of sample input-output pairs (i.e., a training set) [17]. The term "G&CNet" (namely, guidance and control network) was coined by the European Space Agency [18] to refer to an onboard system that provides real-time guidance and control functionalities to the spacecraft by means of a DNN, which replaces the traditional control and guidance architectures [19]. To reduce the sensitivity of the indirect method to the initial guess, a method was proposed in [20] to obtain a good initial guess through a DNN, and the numerical experimental results showed that this improved the computational speed of the indirect method. Carlos and Dario [21] directly applied the deep learning method to the optimal control problem, and the numerical experimental results showed that the trajectory obtained by the DNN architecture was close to the optimal one. This work opened up the possibility of using a DNN to directly drive the state-action selection. To solve the 2D trajectory optimization problem of a hypersonic vehicle, the authors of [22] proposed a DNN architecture. The idea in [22] was similar to that in [21], where the DNN was used to obtain the mapping relationship between state and control. Compared with solving the optimal control problem in the traditional way, the DNN can ensure real-time performance of the algorithm.
A fast approach to generating time-optimal asteroid landing trajectories was presented in [23], and a DNN was developed to approximate the gravitational field of asteroids, and the corresponding time consumption of gravity calculation in trajectory propagation was significantly reduced.
The above methods use supervised learning (SL) to train the DNN. However, SL needs a large set of expert samples, such as state-control pairs, which are obtained by solving the OCP, and constructing such a training dataset creates a heavy computational load. Another approach to training a DNN is reinforcement learning (RL). RL does not require prior computation to generate expert samples. In RL, samples are collected from the interaction between the agent and the environment, and the agent evaluates and improves its current performance through the reward obtained by interaction. Therefore, the reward function is the key; researchers may never get the desired results if the reward function is not designed well. In [24], the performance of behavioral cloning (BC) and RL was investigated on a linear multi-impulsive rendezvous mission. An interactive deep reinforcement learning (DRL) algorithm with an actor-indirect method architecture was presented in [25] to train the DNN-driven controller for optimal control of the landing problem. In [26], the authors applied reinforcement learning to a Mars landing guidance system to directly generate guidance commands. In [27,28], the authors applied the RL metalearning framework to optimize an integrated and adaptive guidance and control system for exoatmospheric and endoatmospheric interception problems, and the numerical results showed the system was robust to the parasitic attitude loop. In [29][30][31], the authors used the RL metalearning framework in vehicle landing problems with disturbances such as sensor noise and actuator failure, and the numerical results showed that RL metalearning could handle these disturbances well. A robust trajectory design method based on reinforcement learning was proposed in [19], and the experimental results showed that good results could be obtained with different models.
In [32], the image-based reinforcement metalearning was applied to solve the lunar landing task with uncertain dynamic parameters, and the numerical results showed that the resulting closed-loop guidance policy was effective even if the environment was partially observed. The image-based reinforcement metalearning was also used in the autonomous guidance of an impactor in a binary asteroid system, and the numerical results showed that the guidance system was robust and could be applied to almost all test scenarios [33].
Once a neural network is trained, it only needs to perform simple matrix multiplications when in use. Compared with guidance-based optimal control, which requires solving the optimal control problem online, the calculation time is negligible. Thus, the method based on machine learning has real-time performance. For the guidance of the launch vehicle ascent phase, a guidance algorithm based on reinforcement learning is proposed in this manuscript. In Section 2, the background of reinforcement learning is introduced, and in Section 3, the guidance-based reinforcement learning framework is proposed, combined with the dynamics equations. Section 4 presents the experimental results and a discussion.

Reinforcement Learning
2.1. Markov Decision Process. The Markov decision process (MDP) is a mathematical model of a sequential decision problem. In an environment where the system state has the Markov property, it is used to model the possible stochastic policies and rewards of agents. The complete MDP is usually described by $(s, a, R, P)$, where $s$ represents the state set $\{s_0, s_1, \cdots, s_n\}$, $a$ represents the action set $\{a_0, a_1, \cdots, a_n\}$, $R$ represents the scalar reward, and $P$ represents the state transition probability of the environment $P(s, a, s')$. In reinforcement learning, the agent is the learner and decision-maker of the whole system, and the state is the description of environmental information; the action is the agent's response to the environment, and the reward is the evaluation of the action by the environment. The agent observes the environment and selects the appropriate action according to the obtained state information. The environment receives the action, makes corresponding feedback, and enters a new state. The agent obtains the reward from the environment and adjusts the next action.

International Journal of Aerospace Engineering
The agent and environment interact at each time step. The mapping from state to action is called the policy, which is expressed as:
$$\pi(a \mid s) = P(a_t = a \mid s_t = s)$$
The goal of reinforcement learning is to find the optimal action policy. The more positive feedback an agent receives in the learning process, the better the policy it learns. Therefore, the weighted cumulative sum of the reward value of each step over time is defined as the return, which is expressed as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
where $\gamma$ represents the discount factor. By maximizing the long-term return $G_t$, the corresponding best action policy can be obtained. To describe the long-term value when executing the policy at the state $s$, the expectation of the return at this time is defined as the state-value function:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]$$
To measure the value of executing action $a$ at the state $s$, the action-value function can be defined as:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right]$$
According to the Bellman equation, the value function can be decomposed as follows:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \right]$$
Similarly, the Bellman equation form of the action-value function can be obtained:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right]$$
According to Bellman's principle of optimality, the policy whose value function is maximal is the optimal policy. Therefore, the Bellman equations of the optimal state-value and action-value functions can be expressed as:
$$V^{*}(s) = \max_{a} \mathbb{E}\left[ R_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \right], \qquad Q^{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a \right]$$
According to whether the environment model (state transition probability) is known or not, reinforcement learning can be divided into model-based and model-free methods. Generally speaking, because the model-free method does not make full use of the empirical knowledge obtained in learning, the convergence speed is slower than in the model-based method. However, the model-free method is one of the most important learning techniques in reinforcement learning because of its small amount of calculation per iteration and good adaptability to dynamic unknown environments.
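The discounted return defined above can be computed for a finite episode with a single backward sweep. The following is a minimal sketch, not code from the manuscript; the function name `discounted_returns` and the sample rewards are illustrative assumptions, and `rewards[t]` plays the role of $R_{t+1}$.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t of one finite episode.

    rewards[t] is the reward received after acting at step t (R_{t+1}).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Illustrative episode (values are made up for the example).
example_returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(example_returns)
```

The backward recursion is the finite-horizon counterpart of the Bellman decomposition of the value function: each return is the immediate reward plus the discounted return of the next step.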

Policy Gradient Method.
Reinforcement learning algorithms can be divided into value-function-based and policy-gradient-based methods according to the optimization objectives. The algorithm based on the value function finds the optimal policy by maximizing the state-value function or action-value function, such as Q-learning [34] and SARSA [35]. The algorithm based on policy gradient parameterizes the policy using a nonlinear function and maximizes the cumulative reward by directly iterating the policy, such as policy gradient (PG) [36] and REINFORCE [37].
In 2015, Mnih et al. [38] first proposed the deep Q network (DQN), which achieved end-to-end learning by introducing an experience replay mechanism and constructing an independent target network. DQN learned a successful policy directly from high-dimensional perceptual input, and the algorithm was applied to Atari games with great success. However, there were still some unavoidable problems, such as overestimation of the Q value, low sample utilization, and poor learning stability. In 2016, Hasselt et al. designed a double Q network structure [39], in which two Q networks were responsible for selecting and evaluating actions, respectively; double DQN effectively avoided the overestimation caused by the greedy strategy in the DQN algorithm and had better performance. Schaul et al. proposed a DQN algorithm based on the prioritized experience replay mechanism, which used priority sampling instead of uniform sampling and improved the convergence speed by replaying important transitions more frequently [40]. Wang et al. proposed a dueling DQN algorithm [41], which evaluated states and actions on two separate branches of a network and combined them at the output layer for Q-value estimation, which could obtain a better policy evaluation than the traditional DQN. For partially observable scenes, Hausknecht and Stone proposed a deep recurrent Q network [42] algorithm, which used long short-term memory (LSTM) in the DQN structure. DeepMind proposed the Rainbow algorithm [43] in 2017, which integrated six DQN-based methods, including double DQN and dueling DQN, and could achieve better results than any one of them. The algorithms mentioned above improve DQN from different perspectives, but they all still require a discrete action space.
In reinforcement learning tasks with a continuous action space, the continuous action space needs to be discretized to obtain the value function, which causes a curse of dimensionality in the action space. Moreover, the value function iteration method usually uses a greedy strategy to update the value function, which makes the agent learn a fixed policy. To solve the above problems, a policy-based method was proposed, which estimates the gradient of the objective function with respect to the policy parameters, then uses the gradient ascent algorithm to optimize the parameters and finally obtains the optimal policy. The approximate expression of the policy function can be written as:
$$\pi_\theta(a \mid s) = P(a \mid s, \theta)$$
where $\theta$ represents the parameters of the policy. To solve for these parameters, the expected reward of the agent, $J(\theta)$, is introduced as the objective function, and the following expression is used to update them:
$$\theta_{t+1} = \theta_t + \alpha \nabla \hat{J}(\theta_t)$$
where $\alpha$ represents the learning rate and $\nabla \hat{J}(\theta_t)$ is the gradient of the objective function.
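According to the policy gradient theorem, $\nabla \hat{J}(\theta_t)$ is rewritten as:
$$\nabla \hat{J}(\theta_t) = \mathbb{E}_{\pi_\theta}\left[ \sum_{k} \nabla_\theta \log \pi_\theta(a_k \mid s_k)\, Q^{\pi_\theta,k}(s_k, a_k) \right]$$
where $Q^{\pi_\theta,k}(s, a)$ is the action-value function, and the expression is:
$$Q^{\pi_\theta,k}(s, a) = \mathbb{E}_{\pi_\theta}\left[ G_k \mid s_k = s, a_k = a \right]$$
$Q^{\pi_\theta,k}(s, a)$ can be estimated without bias by the Monte Carlo method. Although this estimate is unbiased, it has a large variance, which affects the convergence speed of the algorithm.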
To solve the above problems, an actor-critic method was proposed. The actor is responsible for updating the policy gradient and executing the actions calculated by the policy.

The critic is responsible for scoring the actor through the evaluation mechanism and then feeding the score back to the actor to guide it to update the policy gradient.
Trust region policy optimization (TRPO) [44], proposed by Schulman et al., is a kind of actor-critic method. According to this method, the gradient of the reward objective function can be transformed into the following expression:
$$\nabla \hat{J}(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{k} \Psi_k \nabla_\theta \log \pi_\theta(a_k \mid s_k) \right]$$
The above expression can be regarded as a generalized actor-critic framework, where $\Psi_k$ is the evaluator.
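To further improve the stability of the learning process and reduce the variance in the policy gradient estimation, a baseline function that does not introduce bias is considered. Generally, the state-value function $V^{\pi_\theta}(s)$ is selected as the baseline function. By using the baseline function, an advantage function $A^{\pi_\theta}(s, a)$ can be obtained, and the expression is:
$$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$$
Next, the advantage function is estimated by using the temporal-difference error (td-error); the expression is:
$$\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$$
where $r$ is the immediate reward and $s'$ is the next state. It can be proved that td-error is an unbiased estimator of the advantage function by the following expression:
$$\mathbb{E}_{\pi_\theta}\left[ \delta^{\pi_\theta} \mid s, a \right] = \mathbb{E}_{\pi_\theta}\left[ r + \gamma V^{\pi_\theta}(s') \mid s, a \right] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)$$
Therefore, by estimating the advantage function by td-error, the policy gradient can be obtained as:
$$\nabla \hat{J}(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \delta^{\pi_\theta} \right]$$
In the above expression, the advantage function is brought into the reward objective function, and the effect caused by the change of the state-value function is removed from the action-value function, so the variance is reduced. In this method, a neural network can be set up to approximate the policy and evaluation functions.
The TRPO algorithm updates the policy with the largest step that still guarantees monotonic improvement, using the KL divergence to constrain the distance between the new policy and the old policy. The algorithm does not tune the step size directly, but uses a surrogate loss function, which finally transforms the reinforcement learning policy update problem into the following optimization problem:
$$\max_{\theta}\; \mathbb{E}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)} A^{\pi_{old}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}\left[ D_{KL}\left( \pi_{old}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta_{KL}$$
where $\delta_{KL}$ is the bound on the KL divergence. The TRPO algorithm uses Taylor expansion to expand the constraints and uses the conjugate gradient method to optimize the network parameters, which can ensure monotonic improvement of the policy model during optimization.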
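However, the theory of the algorithm is complex, and it is not easy to implement and debug. To solve this problem, Schulman et al. made a first-order approximation of the TRPO algorithm and proposed the proximal policy optimization (PPO) algorithm. The expectation is approximated by using the Monte Carlo method, so the objective function becomes:
$$J(\theta) = \hat{\mathbb{E}}_t\left[ \frac{\pi_t(a \mid s)}{\pi_{old,t}(a \mid s)} A_t \right]$$
where the ratio of the new policy to the old policy, $\pi_t(a \mid s)/\pi_{old,t}(a \mid s)$, is denoted by $r_t(\theta)$, and the objective function is transformed into:
$$J(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta) A_t \right]$$
The PPO algorithm rewrites the objective function in the TRPO algorithm as:
$$J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta) A_t,\; \mathrm{clip}\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) A_t \right) \right]$$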

where $A_t = Q^{\pi_{old},t}(s, a) - V^{\pi_{old},t}(s)$. The clip function limits the range of $r_t(\theta)$ to $[1 - \epsilon, 1 + \epsilon]$, which ensures that each update will not fluctuate too much, and $\epsilon$ is a hyperparameter. The PPO algorithm adds the objective of the value function to the optimization objective, and the expression is:
$$J^{PPO}(\theta) = J^{CLIP}(\theta) - c\, \hat{\mathbb{E}}_t\left[ J^{V}_t(\theta) \right]$$
where $J^{CLIP}(\theta)$ denotes the clipped objective, $c$ is the value function coefficient, and $J^{V}_t(\theta)$ is the squared error between the current value function estimate and the obtained reward-to-go, whose average over the samples is the mean squared error, and which is expressed as:
$$J^{V}_t(\theta) = \left( V_\theta(s_t) - G_t \right)^2$$
In practice, the process of PPO is as follows (shown in Figure 1):
(1) Rollouts phase. First, run n episodes in the environment with the current policy and generate a batch of trajectories; each trajectory is associated with a single episode and includes the corresponding states, actions, and rewards
(2) Update phase. The policy optimization algorithm updates the policy using a batch of trajectories (rollouts). Then, the network's parameter $\theta$ is updated by the following expression:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J^{PPO}(\theta)$$
(3) The training is stopped when a user-defined iteration number is reached
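The td-error advantage estimate and the clipped objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the sample numbers, and the use of log-probabilities to form the ratio $r_t(\theta)$ are choices made for the example.

```python
import math


def td_errors(rewards, values, next_values, gamma):
    """One-step td-errors r + gamma * V(s') - V(s), used as advantage estimates."""
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]


def ppo_clip_objective(logp_new, logp_old, advantages, eps):
    """Sample average of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                    # r_t(theta)
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# Made-up two-sample batch for illustration only.
adv = td_errors(rewards=[1.0, 0.0], values=[0.5, 0.2],
                next_values=[0.2, 0.0], gamma=0.9)
logp_old = [math.log(0.5), math.log(0.5)]
logp_new = [math.log(0.75), math.log(0.25)]  # probability ratios 1.5 and 0.5
objective = ppo_clip_objective(logp_new, logp_old, adv, eps=0.2)
print(adv, objective)
```

In the first sample the advantage is positive and the ratio 1.5 exceeds $1 + \epsilon$, so the clipped term is used and the objective gains nothing from pushing the ratio further. In the second sample the advantage is negative and the ratio 0.5 is below $1 - \epsilon$, so the clipped term is again the smaller one.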

Dynamics Model.
In the ascent process of the launch vehicle, the flight time in the atmosphere is short, and deviation in the atmosphere can be corrected by the guidance of the upper stage. Thus, the guidance of the upper stage determines the orbit insertion accuracy of the launch vehicle. Therefore, this manuscript focuses on the guidance of the upper stage of the launch vehicle. The dimensionless equations of motion of a three-dimensional (3D) launch vehicle can be expressed in the launch-inertial coordinate frame as follows (Equation (24)):
in which $r = [x, y, z]$ is the position of the launch vehicle in the launch-inertial coordinate frame, which is normalized by the radius of Earth $R_0 = 6378145$ m, and $v = [v_x, v_y, v_z]$ is the velocity of the launch vehicle in the launch-inertial coordinate frame, which is normalized by $\sqrt{R_0 g_0}$, in which $g_0 = 9.81$ m/s$^2$. The position from Earth's center to the vehicle, $r_e = [x + R_{ex}, y + R_{ey}, z + R_{ez}]$, is normalized by $R_0$, and $R_e = [R_{ex}, R_{ey}, R_{ez}]$ is the dimensionless position from Earth's center to the origin of the launch-inertial coordinate frame, which is the launch point. $m$ is the mass of the launch vehicle. $T$ is the thrust magnitude; as in most launch vehicles, the mass flow is uncontrollable; therefore, the thrust magnitude $T$ is uncontrollable during the same flight phase [15,47]. $I_{sp}$ is the specific impulse of the engine. $\varphi$ and $\psi$ are the pitch and yaw angles, respectively, which define the direction of the thrust vector in the launch-inertial coordinate frame. The differentiation in Equation (24) is with respect to dimensionless time normalized by $\sqrt{R_0/g_0}$. To apply reinforcement learning to launch vehicle guidance problems and satisfy the constraints of the flight process, this manuscript uses the optimal control formulation, which can represent initial, terminal, and process constraints.
The guidance problem of the upper stage of the launch vehicle can be written as follows:
Problem: minimize the cost function in Equation (25), subject to Equation (24) and Equations (26) to (29), with $t_f$ free,
in which $r = [x, y, z]$ and $v = [v_x, v_y, v_z]$. $r_0$ and $r_f$ represent the initial and terminal position, respectively. $v_0$ and $v_f$ represent the initial and terminal velocity, respectively.
t 0 and t f are the start and final time, respectively. And m dry is the dry mass of the launch vehicle. Equation (25) is the cost function. Equation (26) and Equation (27) represent the initial and terminal constraints. The minimum and maximum values of pitch and yaw angle rates are presented in Equation (28). In this manuscript, the angle constraint is regarded as a hard constraint. Once the constraint is violated, the current episode is stopped, and a large negative reward is returned, then a new episode is started. Equation (29) represents the fuel constraint. When the fuel runs out, the current episode is stopped.
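The normalization used for the equations of motion can be made concrete with a short calculation. The sketch below uses only the constants stated in the text, $R_0$ and $g_0$; the helper name `to_dimensionless` and the sample state are illustrative assumptions. With these constants the velocity unit is roughly 7.9 km/s and the time unit roughly 806 s.

```python
import math

R0 = 6378145.0  # Earth radius used for length normalization, m (from the text)
G0 = 9.81       # reference gravitational acceleration, m/s^2 (from the text)

LENGTH_UNIT = R0
VELOCITY_UNIT = math.sqrt(R0 * G0)  # m/s
TIME_UNIT = math.sqrt(R0 / G0)      # s


def to_dimensionless(position_m, velocity_mps, time_s):
    """Scale a dimensional state (SI units) to the dimensionless variables."""
    r = [x / LENGTH_UNIT for x in position_m]
    v = [x / VELOCITY_UNIT for x in velocity_mps]
    return r, v, time_s / TIME_UNIT


# Hypothetical state: 200 km above the launch-frame origin, 7 km/s downrange.
r_nd, v_nd, t_nd = to_dimensionless([0.0, 200e3, 0.0], [7000.0, 0.0, 0.0], 100.0)
print(VELOCITY_UNIT, TIME_UNIT, r_nd, v_nd, t_nd)
```

Scaling this way keeps positions, velocities, and times at comparable magnitudes, which is generally helpful for numerical integration and for neural network inputs.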

Implementation Details.
This section describes the techniques we use to apply reinforcement learning. For the policy and value networks, we design a neural network with tanh activations on each hidden layer. In the policy network, the input layer has n , and $h_{v3} = 5$, respectively, where $h_{v1}$, $h_{v2}$, and $h_{v3}$ represent the size of each hidden layer. This structure has been studied in aerospace trajectory optimization, such as Mars landing [26] and Earth-Mars transfer orbit [19]. To generate the corresponding action, the policy uses a Gaussian distribution with mean $\pi_\theta(a_k \mid s_k)$ and a diagonal covariance matrix as the action distribution. Moreover, the Adam optimizer is used to adjust the learning rate of the policy and value networks. A method similar to the PPO2 algorithm [45] in OpenAI Baselines is used to approximate the KL divergence. The expression is as follows:
According to the suggestions of [26], we adjust the parameters according to the KL divergence between policy updates, represented by kl. In addition, $\epsilon_{max}$ and $\epsilon_{min}$ are set to 0.5 and 0.01, respectively. We also adjust the parameters according to Equation (31), in which $\zeta_{max}$ and $\zeta_{min}$ are set to 10 and 0.1, respectively.
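The Gaussian policy described above, a tanh network that outputs the mean of a diagonal Gaussian from which the action is sampled, can be sketched as follows. This is a minimal pure-Python illustration and not the authors' network: the layer sizes, the random initialization, the fixed log standard deviations, and all function names are assumptions made for the example.

```python
import math
import random


def mlp_forward(x, layers):
    """Forward pass; layers is a list of (W, b). tanh on hidden layers, linear output."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def sample_action(obs, layers, log_std, rng):
    """Sample from a Gaussian with mean mlp_forward(obs) and diagonal covariance."""
    mean = mlp_forward(obs, layers)
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(mean, log_std)]


def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [8, 16, 16, 2]  # hypothetical: 8 observations, 2 actions (pitch and yaw rate)
layers = [random_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
action = sample_action([0.0] * 8, layers, log_std=[-1.0, -1.0], rng=rng)
print(action)
```

During training the sampled action provides exploration; at test time a common choice is to use the mean directly, which corresponds to calling `mlp_forward` alone.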
To apply the reinforcement learning method to launch vehicle ascent guidance, the observation, action, and reward are designed in combination with the dynamics model. In the research of reinforcement learning for aerospace guidance, there is no unified choice for observation. In [48], the authors designed $s = \{r - r_{ref}, v - v_{ref}\}$ for learning, in which the subscript ref represents the reference trajectory. In [26], the authors used a similar idea: they designed a velocity field $v_{targ}$ that mapped the lander's position to a target velocity for learning, which achieved good results. Unfortunately, the construction of $v_{targ}$ is not general, and it cannot be applied to all problems. However, this method provides an inspiration: if a good reference state can be designed, good learning efficiency and final results can be obtained. In [19], the authors did not use a reference trajectory; the state of the vehicle was regarded as the observation, and good results were obtained. Combined with the motion equations introduced in Section 3.1, the expression of the observation designed in this manuscript is as follows:
The guidance commands of the launch vehicle are generally the pitch angle and yaw angle, but the angular rate is limited. If these angles are used as the action of the neural network, the angular rate is not easy to control. To satisfy the angular rate constraint, we use the angular rate as the action and the attitude angle as a part of the observation.
In addition, it should be noted that stop conditions need to be designed in reinforcement learning. In the research of reinforcement learning guidance algorithms [24,30], terminal velocity or position constraints are usually used as stop conditions for each episode. However, in low earth orbit (LEO) missions studied in this manuscript, the semimajor axis is one of the indicators of engine shutdown. If the semimajor axis of the orbit at the current time exceeds the semimajor axis of the target orbit, the guidance system sends an engine shutdown command. In this manuscript, the current episode is terminated if the semimajor axis of the orbit at the current time exceeds the semimajor axis of the target orbit.
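The stop condition based on the semimajor axis can be sketched from the specific orbital energy. The example below is an assumption-laden illustration rather than the manuscript's code: it works in SI units with the standard Earth gravitational parameter, takes the position measured from Earth's center ($r_e$ in the text), and uses hypothetical function names.

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2


def semimajor_axis(r_vec, v_vec, mu=MU_EARTH):
    """Semimajor axis from position (relative to Earth's center) and velocity."""
    r = math.sqrt(sum(x * x for x in r_vec))
    v_squared = sum(x * x for x in v_vec)
    energy = 0.5 * v_squared - mu / r  # specific orbital energy
    return -mu / (2.0 * energy)        # meaningful for bound orbits (energy < 0)


def episode_should_stop(r_vec, v_vec, a_target):
    """Terminate once the instantaneous semimajor axis exceeds the target value."""
    return semimajor_axis(r_vec, v_vec) > a_target


# Hypothetical check: a circular orbit of radius 7000 km has a = 7000 km.
radius = 7.0e6
v_circ = math.sqrt(MU_EARTH / radius)
a_circ = semimajor_axis([radius, 0.0, 0.0], [0.0, v_circ, 0.0])
stop_early = episode_should_stop([radius, 0.0, 0.0], [0.0, 0.9 * v_circ, 0.0], radius)
print(a_circ, stop_early)
```

Because the thrust continuously adds energy during ascent, the semimajor axis grows over the burn, which is why crossing the target value is a natural shutdown signal.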

Reward Function.
In [49], the authors presented the hypothesis that the maximization of total reward may be enough to understand intelligence and its associated abilities. A suitable reward can make the agent learn knowledge faster and better, but how to design a suitable reward function is one of the difficulties of reinforcement learning, especially in the aerospace guidance field. In the launch vehicle guidance problem, the thrust magnitude cannot be controlled, and the thrust direction can only be controlled by the attitude of the vehicle, which makes the problem difficult to solve. Thus, although many scholars now use mathematical optimization algorithms as the basis for computational guidance, there are few engineering applications because the problem is not easy to solve, and the calculation time is too long to be applied online.
A common practice is to give a reward after running an episode. However, reinforcement learning randomly selects the control during the training. If the reward is only based on the final result, it is very likely that the terminal condition will never be satisfied. This is called the sparse reward problem. This problem is generally solved using inverse reinforcement learning, where the reward function for each step is learned from expert demonstrations. In this problem, solutions obtained by mathematical optimization algorithms such as convex optimization can be used as expert demonstrations, but the calculation time of the mathematical optimization algorithm is uncertain, and therefore, it cannot be well applied to inverse reinforcement learning.
In the ascending flight of the launch vehicle, the velocity and position increase gradually. At each time step, we can reward the agent if it drives the vehicle toward the target point. This method, called reward shaping, was proposed by Ng [50]. Gaudet et al. used this method in Mars landing guidance [26], but the shaping reward constructed by Gaudet et al. cannot be well applied in other fields. Therefore, a simple but effective shaping reward is proposed in this manuscript. The reward function expression is as follows:
where $r_{shape}$ is a negative reward, which represents the distance between the current position $r(t_k)$ and the terminal position $r_f$, and $\lambda_{track}$ is a shaping reward coefficient. The way to maximize the shaping reward, that is, to make it as little negative as possible, is to move the vehicle toward the target point directly. Moreover, because the shaping reward accumulates over the steps, the fewer the steps, the smaller the accumulated negative reward. For a launch vehicle with constant mass flow, the minimum number of steps means the optimal energy. Therefore, the shaping reward designed in this manuscript can not only guide the vehicle to the target but also minimize the number of steps, to achieve the optimal energy. When an episode is stopped, the final reward will be given. We refer to the reward function in [19], and the expression is as follows:
where $r_{done}$ is a negative reward called the final reward, $\lambda_d$ is the final reward coefficient, and $\varepsilon$ is the tolerance on terminal violation. The expression of $\varepsilon$ is as follows [46]:
where $k$ is the current episode number, $\varepsilon_0 = 0.0005$, $\varepsilon_f = 10^{-6}$, $k_i = 0$, and $k_f = 450000$.

The expression of $e_{r,v}$ is as follows:
where $e_r$ and $e_v$ represent the final position error and the final velocity error, respectively. The expressions of $e_r$ and $e_v$ are as follows:
In addition, considering the process constraints on attitude during flight, a penalty function is designed. When the process constraints are violated, the current episode stops immediately and returns the penalty. The penalty function $r_{penalty}$ is given by the following:
To sum up, the total reward function for the guidance problem of the launch vehicle is given by the following:
where $\eta$ is a constant positive reward. In the numerical experiment, we found that without this positive reward, the agent immediately violates the constraint at the beginning of the training. This positive reward is the key to encouraging the agent to continue to move forward. Figure 2 shows how reinforcement learning can be applied to the guidance of the launch vehicle. The DNN obtained by reinforcement learning is called RL-guidance; it outputs the guidance commands, that is, the actions in reinforcement learning. The vehicle flies for a time $\Delta t$ according to the guidance command; the state at $t_k + \Delta t$ is then obtained by the navigation system, and the reward computed by the reinforcement learning model is fed back to the RL-guidance system.
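The per-step part of the reward can be illustrated with a small sketch. The exact expressions of the manuscript are not reproduced here; the code assumes, based only on the verbal description, that the shaping term is the negative distance to the terminal position scaled by $\lambda_{track}$ and that the constant $\eta$ is added at each step. The function names and the numerical values are hypothetical.

```python
import math


def shaping_reward(position, target, lam_track):
    """Negative reward proportional to the distance to the terminal position.

    Assumed form: -lam_track * ||r(t_k) - r_f|| (an illustration, not the paper's equation).
    """
    distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(position, target)))
    return -lam_track * distance


def step_reward(position, target, lam_track, eta):
    """Shaping term plus the constant positive reward eta (assumed combination)."""
    return shaping_reward(position, target, lam_track) + eta


# Hypothetical dimensionless positions and coefficients.
target = [3.0, 4.0, 0.0]
far = step_reward([0.0, 0.0, 0.0], target, lam_track=0.1, eta=0.01)
near = step_reward([3.0, 3.0, 0.0], target, lam_track=0.1, eta=0.01)
print(far, near)
```

Under this assumed form, a step that ends closer to the target always earns a larger reward than one that ends farther away, which is the property the shaping term is meant to provide.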

Experimental Results and Discussion
In this section, we apply the proposed algorithm to the ascent problem of the launch vehicle to verify its validity. All numerical simulations are implemented on a computer with a 4-core Intel Xeon E3-1230 v5 CPU @ 3.4 GHz, and the RL-guidance and the guidance-based optimal control are implemented in Python and Matlab environments, respectively.
The launch vehicle thrust is 2843425 N, and the specific impulse is 3365 m/s. The initial and dry mass are 350306 kg and 83090 kg, respectively. The maximum pitch and yaw angle rate is 5°/s. Table 1 shows the initial and terminal parameters of the numerical experiment. The equations of motion are integrated by the fourth-order Runge-Kutta method with a 0.5 s step, and the guidance step is 1 s.
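The integration scheme stated above, fourth-order Runge-Kutta with a 0.5 s step inside a 1 s guidance step, can be sketched generically. The code below is an illustration and not the simulation used in the manuscript: the function names are assumptions, and a toy scalar equation stands in for the vehicle dynamics.

```python
import math


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def propagate_guidance_step(f, t, y, guidance_dt=1.0, integration_dt=0.5):
    """Hold the command fixed and integrate over one guidance step."""
    n_sub = round(guidance_dt / integration_dt)
    for _ in range(n_sub):
        y = rk4_step(f, t, y, integration_dt)
        t += integration_dt
    return t, y


def toy_dynamics(t, y):
    """Toy stand-in for the equations of motion: dy/dt = -y."""
    return [-yi for yi in y]


t_end, y_end = propagate_guidance_step(toy_dynamics, 0.0, [1.0])
print(t_end, y_end, math.exp(-1.0))
```

In a guidance loop, `f` would close over the current pitch and yaw rate command so that the command is held constant across both integration substeps.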

Policy Optimization.
This section presents the training process of reinforcement learning. Table 2 lists the reward coefficients and the hyperparameters. Rollouts are generated by the interaction between the agent and the environment for 50 episodes; the advantages are then computed, and the value and policy function approximators are updated, using the resulting trajectories. The total number of episodes is 500000, and the training took nearly 30 hours. Figures 3 and 4 show the final position and velocity error curves, respectively. It can be seen that, as the number of training episodes increases, the final error gradually decreases and converges after 400 thousand episodes. As can be seen from Figure 5, the reward gradually increases as the training progresses; the value of the reward increases rapidly in the early stage of training and gradually converges after 400 thousand episodes.

Policy Test
In current research on aerospace computational guidance, online trajectory planning is mostly performed with optimal control solvers such as GPOPS or CVX, which replace the traditional mode of offline planning and online tracking. It should be noted that if the distance between the current point and the final point is less than 10,000 m, the integration step size is reduced from 0.5 s to 0.02 s. This is also common in practice: when the vehicle approaches the target, the integration step size is reduced to improve the final accuracy.
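The step-size switch described above amounts to a simple rule, sketched here in Python. The function names are hypothetical, and the assumption that the 1 s guidance step stays unchanged near the target is ours; the text only states the two integration step sizes and the 10,000 m threshold.

```python
def integration_step(distance_to_target_m, coarse=0.5, fine=0.02,
                     switch_distance_m=10_000.0):
    """Return the integration step (s) for the current distance to the target."""
    return fine if distance_to_target_m < switch_distance_m else coarse


def substeps_per_command(step, guidance_dt=1.0):
    """Number of integration steps flown per guidance command (assumed 1 s)."""
    return round(guidance_dt / step)
```

Under these assumptions, one guidance command covers 2 integration steps far from the target and 50 steps once the vehicle is within 10,000 m of the final point.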
4.2.1. Experiment 1. Figures 6 and 7 compare the position and velocity obtained by the two methods, and Figure 8 compares the flight height. The results obtained by the two methods are basically the same. The final results of the two methods are listed in Table 3. As can be seen in the table, the accuracy of the proposed algorithm is consistent with that of the guidance-based optimal control, which demonstrates the effectiveness of the proposed algorithm. In addition, as mentioned before, although the training time is very long, once training is completed the network only needs to perform a few matrix multiplications at run time. In this experiment, the mean and standard deviation of the time needed to generate a guidance command are 0.00055 s and 0.00008 s, respectively, and the median and maximum are 0.00052 s and 0.0017 s, respectively. For comparison, the current guidance period in engineering applications is about 0.002 s, so the computational efficiency of RL-guidance allows it to be applied online. In contrast, the guidance-based optimal control takes 20 s. Even allowing for the difference between the implementation environments of the two methods, it is still much slower than the proposed algorithm and is difficult to apply online.
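Timing statistics of this kind can be collected with a small harness such as the following sketch. The function names and the stand-in `placeholder_policy` are assumptions; the actual measurement would call the trained DNN, and the numbers obtained depend on the hardware.

```python
import math
import statistics
import time


def placeholder_policy(state):
    """Stand-in for the trained DNN (assumption): one tiny dense layer."""
    weights = [[0.01 * (i + j) for j in range(len(state))] for i in range(2)]
    return [math.tanh(sum(w * s for w, s in zip(row, state)))
            for row in weights]


def time_calls(fn, n=1000):
    """Call fn() n times and summarize the wall-clock time per call (s)."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "std": statistics.stdev(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fits_guidance_period(stats, period_s=0.002):
    """True if even the slowest call finishes within one guidance period."""
    return stats["max"] < period_s


stats = time_calls(lambda: placeholder_policy([0.0] * 7))
print(stats, fits_guidance_period(stats))
```

Comparing the maximum rather than the mean against the guidance period is the conservative choice, since a single late command matters in flight. With the values reported above, the maximum of 0.0017 s is below the 0.002 s period.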
Figures 9 and 10 show the attitude and control curves of the vehicle, respectively. The solutions of the two algorithms are very close. As mentioned before, because the thrust magnitude of the launch vehicle is not adjustable, the thrust direction can only be adjusted through limited attitude changes. This leads to a small solution space, which is why the solutions of the two methods are so close. By contrast, [26] reports an obvious difference between the solutions of GPOPS and reinforcement learning: because the thrust magnitude of the landing vehicle is adjustable and the solution space is large, reinforcement learning may learn other solutions that satisfy the terminal conditions. A small solution space has two consequences. On the one hand, once the DNN is trained, its solution will be very close to the optimal solution, as in this experiment. On the other hand, it is difficult to find a suitable solution during training, which may result in a failure to train a suitable network.

Experiment 2.
In launch vehicle missions, a decline in thrust is one of the fatal faults. If the thrust loss is small, the trajectory can be reconstructed to guide the vehicle to the target orbit. However, if the thrust loss is too large, the trajectory planning problem becomes infeasible and the optimal control algorithm cannot directly give a feasible solution, which means that the guidance-based optimal control cannot give new guidance commands. Many scholars have studied this situation [15,51,52], in which the primary goal of the mission changes from accurately entering the target orbit to staying in an orbit and waiting for rescue. The basic idea is to change the terminal constraints so that the new problem becomes feasible; the new terminal constraints represent a new orbit, which is called the rescue orbit.
In the following experiment, the thrust is reduced by 10%, a fault that is very likely to occur when the upper-stage engine is started.
In the case mentioned above, the remaining energy of the launch vehicle may not be sufficient to send a payload such as a satellite into the target orbit. The guidance-based optimal control transforms the guidance problem into a nonlinear programming problem. If the launch vehicle cannot reach the target orbit because the thrust drops, the original problem is infeasible and has no solution. Therefore, the guidance-based optimal control will not work during the flight. In this case, we assume that the guidance algorithm switches to tracking a reference trajectory that is preplanned under nominal conditions. There are two tracking methods. In the first method, the vehicle flies along the reference trajectory and shuts down at the reference final time; this is the traditional guidance. However, some surplus fuel will remain in the launch vehicle. Therefore, the second method extends the flight time until the fuel is completely exhausted or some other indicator meets the requirements. The difficulty is that no new guidance commands exist once the final time of the reference trajectory is exceeded. In this case, the last group of guidance commands in the reference trajectory is simply held as the subsequent guidance command.
For the first tracking method, Figure 11 shows the flight curves obtained by tracking the reference trajectory after the failure, compared with those of RL-guidance. It can be seen from Figure 11 that the vehicle loses a lot of velocity due to the decline in thrust. To satisfy the semimajor axis requirement, RL-guidance extends the flight time.
For the second tracking method, Figure 12 shows the velocity curves; the green line indicates that the vehicle flies along the reference trajectory with an extended flight time until the semimajor axis exceeds the target semimajor axis. The final velocity of the launch vehicle is basically consistent with the reference terminal velocity. However, the following analysis shows that this method still cannot put the launch vehicle into orbit.
From Figure 13 and Table 4, it can be seen that the new orbit obtained by RL-guidance is very close to the target orbit and far better than the result of the traditional guidance. It should be noted that if the perigee altitude of an orbit is less than 160 km, the orbit is considered inappropriate, because a payload in this orbit will gradually fall into the atmosphere [52]. The purple dash-dotted line indicates the safe orbit mentioned in [52], which is a circular orbit at this 160 km altitude. As can be seen from Figure 13, the red dashed line indicates that the traditional guidance cannot guide the vehicle into an orbit. In addition, even if the flight time is increased, the orbit reached, indicated by the yellow dotted line, still cannot meet the requirements because the altitude of part of the orbit is less than 160 km. The green solid line indicates the orbit reached by RL-guidance when the thrust drops; this orbit can be used as the rescue orbit. Therefore, neither the first nor the second tracking method can put the payload into orbit. Although extending the flight time can increase the velocity, inappropriate guidance commands cannot make the vehicle enter an appropriate orbit. In contrast, RL-guidance can generate new guidance commands according to the current state and guide the launch vehicle to a suitable orbit after the thrust drops.
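The 160 km perigee criterion can be checked directly from a terminal position and velocity using standard two-body relations, as in the sketch below. The gravitational parameter, the spherical Earth radius, and the function names are assumptions introduced for illustration, and the formulas hold only for bound (elliptical) orbits.

```python
import math

MU = 3.986004418e14  # m^3/s^2, Earth's gravitational parameter (assumed)
R_E = 6378137.0      # m, Earth radius, spherical approximation (assumed)


def semimajor_axis_and_eccentricity(r_vec, v_vec):
    """Two-body semimajor axis (m) and eccentricity from inertial r and v."""
    r = math.sqrt(sum(x * x for x in r_vec))
    v2 = sum(x * x for x in v_vec)
    energy = v2 / 2.0 - MU / r  # specific orbital energy
    a = -MU / (2.0 * energy)
    # Specific angular momentum h = r x v.
    hx = r_vec[1] * v_vec[2] - r_vec[2] * v_vec[1]
    hy = r_vec[2] * v_vec[0] - r_vec[0] * v_vec[2]
    hz = r_vec[0] * v_vec[1] - r_vec[1] * v_vec[0]
    h2 = hx * hx + hy * hy + hz * hz
    # max() guards against tiny negative values from rounding.
    e = math.sqrt(max(0.0, 1.0 + 2.0 * energy * h2 / MU ** 2))
    return a, e


def perigee_altitude(r_vec, v_vec):
    """Perigee altitude (m) above the assumed spherical Earth."""
    a, e = semimajor_axis_and_eccentricity(r_vec, v_vec)
    return a * (1.0 - e) - R_E


def is_safe_orbit(r_vec, v_vec, min_perigee_alt_m=160e3):
    """True if the perigee altitude is at least 160 km (criterion of [52])."""
    return perigee_altitude(r_vec, v_vec) >= min_perigee_alt_m
```

This also illustrates why matching the semimajor axis alone is not sufficient: an orbit with the right semimajor axis but too large an eccentricity still has a perigee below the safe altitude.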
Computational guidance-based optimal control needs an extra strategy to find a new orbit [52]. This strategy takes into account various factors, such as an appropriate orbit inclination or longitude of the ascending node, so the orbit it obtains is called the optimal rescue orbit. However, finding the optimal rescue orbit requires many iterations and takes a lot of time.
It remains open to discussion whether it is worth spending so many iterations to obtain the optimal rescue orbit in the event of a thrust failure, and which matters more: an optimal rescue orbit or a feasible one. The proposed RL-guidance algorithm can quickly reach a rescue orbit that is feasible, although it may not be optimal. It continuously generates guidance commands according to the mapping from states to controls learned by the DNN. If the thrust drops, it generates new guidance commands according to the current state and guides the vehicle to a feasible rescue orbit. The proposed RL-guidance algorithm is autonomous, can be used as an alternative method, and is worthy of further research for thrust failures in launch vehicle missions.
According to the two experimental results given in this section, the results of the proposed RL-guidance are consistent with those of the guidance-based optimal control. In addition, the proposed RL-guidance has higher computational efficiency and can be applied online. Regarding the thrust decline, the guidance-based optimal control transforms the guidance problem into an optimization problem. If the thrust drops, the original problem becomes infeasible because the target orbit cannot be reached, and the optimization algorithm cannot give a solution, which means that the guidance will not work during the flight. If the guidance system is then switched to tracking the reference trajectory, the results show that it cannot put the vehicle into a suitable orbit. In contrast, the proposed RL-guidance can generate new guidance commands according to the current state and guide the vehicle to a feasible orbit, which makes rescue possible.

Conclusions
This manuscript proposes a guidance method based on reinforcement learning and intends to demonstrate that reinforcement learning is a viable approach to developing a guidance algorithm for launch vehicles. In computational guidance research, most methods are based on optimal control algorithms, whereas the proposed guidance method is based on a DNN. First, the reward function is designed to cover all constraints. The mapping from state to control is then trained by the state-of-the-art proximal policy optimization algorithm.
Two numerical experiments are designed to test the proposed algorithm. In the first numerical experiment, the results of the proposed algorithm are consistent with those of the guidance-based optimal control. This shows that the proposed algorithm is effective and fast and has the potential for online application. The second numerical experiment aims to demonstrate the ability of the proposed algorithm when the thrust drops. Most current guidance algorithm research is based on optimal control. If the original problem becomes infeasible because the thrust drops, such guidance cannot generate commands. It therefore needs an extra strategy to find a new orbit that makes the programming problem feasible, after which the guidance-based optimal control can output commands. The orbit obtained through this strategy is called an optimal rescue orbit, and finding it takes a lot of computational time. Without aiming for the optimal rescue orbit, the proposed algorithm can guide launch vehicles to a feasible orbit to wait for rescue without any extra strategy. Moreover, the numerical results indicate that the traditional guidance, which uses offline planning and online tracking, cannot deal with this kind of emergency. Therefore, the proposed algorithm can be used as an alternative guidance algorithm, especially in the case of a thrust decline fault. In future research, guiding the launch vehicle to different rescue orbits under different faults will be considered, as well as adding various disturbances to the training. Since such missions are more complex, more training epochs may be required, and therefore parallel computing techniques will be considered.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.