Deep Reinforcement Learning for UAV Intelligent Mission Planning

Rapid and precise air operation mission planning is a key technology for unmanned aerial vehicle (UAV) autonomous combat. In this paper, an end-to-end UAV intelligent mission planning method based on deep reinforcement learning (DRL) is proposed to overcome the shortcomings of traditional intelligent optimization algorithms, such as reliance on simple, static, low-dimensional scenarios and poor scalability. Specifically, suppression of enemy air defense (SEAD) mission planning is described as a sequential decision-making problem and formalized as a Markov decision process (MDP). Then, a SEAD intelligent planning model based on the proximal policy optimization (PPO) algorithm is established and a general intelligent planning architecture is proposed. Furthermore, three policy training tricks, i.e., domain randomization, maximizing policy entropy, and underlying network parameter sharing, are introduced to improve the learning performance and generalizability of PPO. Experimental results show that the model is efficient and stable and can adapt to unknown, continuous, high-dimensional environments. It can be concluded that the DRL-based UAV intelligent mission planning model has powerful planning performance and provides a new idea for research on UAV autonomy.


Introduction
Mission planning is the process of making an operational plan, including a route plan, weapon plan, and avionics plan [1,2]. Intelligent planning capability is an important symbol of unmanned aerial vehicle (UAV) autonomy. With the development of UAV technology, UAVs can fly independently and complete simple missions, such as reconnaissance and strike, which greatly improves efficiency and reduces labor costs. However, for complex cooperative missions, UAV intelligent planning is still a key research issue.
Mission planning is a decision optimization problem that seeks the optimal solution of a mission objective function under certain constraints, such as the shortest route [3], minimum threat [4], and maximum efficiency [5,6]. Because the mission planning problem involves many mutually coupled factors, a large decision space, and nonlinear constraints, traditional mission planning is mostly solved by intelligent optimization algorithms. Xin et al. [3] modeled route planning as a shortest-path problem from the starting point to the target point, established an optimization model based on an ant colony optimization (ACO) algorithm, and searched for the shortest route. Zhang et al. [4] studied the tactical maneuver planning problem: taking the minimum threat to a fighter and the maximum damage effectiveness against ground targets as the optimization objectives, and considering the capability constraints of weapons and equipment, the optimal flight route and weapon delivery time were solved using the multiobjective evolutionary algorithm based on decomposition (MOEA/D) [5]. For the task of allocating UAVs to identify, attack, and evaluate targets, a genetic algorithm (GA) was used to solve for the optimal allocation result [6]. Zhang et al.
[7] studied electronic warfare mission planning: taking the route safety width and the electronic jamming effect of a jammer as the objective function, the multiobjective particle swarm optimization (PSO) algorithm was used to solve it, and the optimal jamming array model was obtained. A search-based intelligent optimization algorithm can find the global optimal or a suboptimal solution of a complex objective function through a parallel optimization mechanism, but its essence is still random search. Each solution can only be searched in a static, known environment (explicit objective function and constraints) and cannot be generalized to dynamic, unknown environments. The computational complexity increases exponentially with problem scale. Therefore, its application has limitations for the rapid planning and dynamic scenarios of future large-scale operations.
In recent years, owing to the expansion of computing power, the emergence of big data [8], and the development of artificial intelligence (AI) algorithms, learning-based methods such as neural networks [9,10] and reinforcement learning (RL) [11] have driven the second wave of AI. From AlphaGo [12] to AlphaGo Zero [13], AlphaZero [14], and AlphaStar [15], DRL has achieved a series of breakthroughs in a range of challenging domains. Among such methods, deep learning (DL) has been used to solve high-dimensional mapping problems, RL has been used to solve sequential decision-making problems, and DRL has been successfully applied to robotics [16,17], autonomous driving [18], real-time strategy (RTS) games [19], and optimization and scheduling [20] problems. A learning-based method, also known as a data-driven method, feeds data to a model to improve its prediction or decision-making performance. Such a method uses a neural network to learn or fit the complex, high-dimensional nonlinear relationship between input and output so as to achieve minimal mean square error or optimal prediction and decision results, and it saves the mapping network parameters to realize offline training and online inferencing. It also has a degree of robustness and generalization for new input data, which makes it suitable for fast and dynamic mission planning. The comparison of the two methods is shown in Table 1.
Therefore, taking high-risk suppression of enemy air defense (SEAD) mission planning [21,22] as a typical example, we propose an end-to-end UAV intelligent mission planning method based on DRL. First, the mission planning problem of a SEAD operation is formalized; then, the basic principles of the DRL algorithm are introduced and a DRL-based intelligent planning model is established. Finally, the superiority and potential value of this method are analyzed and verified by simulation experiments.

SEAD Mission Planning Problem Formulation
A SEAD mission is an operational style of offensive counterair (OCA) combat carried out by an air force. Its goal is to break through the enemy's surface-to-air missile (SAM) threat and strike the enemy's radar or target through cooperative combat between an attacking UAV, called the fighter, and a jamming UAV, called the jammer. A schematic of a SEAD mission is presented in Figure 1.
In Figure 1, the mission is to use the fighter to safely destroy an enemy SAM. However, since the detection range of the enemy SAM exceeds the attack range of the fighter, the fighter faces the threat of being detected and attacked by the SAM. Therefore, a jammer is required to jam the SAM and reduce its detection range; only then can the fighter take the opportunity to attack and destroy the SAM. The jammer must therefore jam at the right position and time, and the fighter must simultaneously attack at the right position and time. The two cooperate to complete the mission.
To summarize, mission planning is essentially a sequential decision-making problem: under different space-time sequence states, a combat unit adopts the optimal decision-making sequence to transfer from the initial situation to the termination situation and achieve the mission objectives.
Therefore, SEAD mission planning can be modeled as an end-to-end sequence optimization problem from the state (position and situation) to the decision (maneuver, attack, and jamming). The optimization goal is to solve for an optimal state-decision sequence that allows the fighter and jammer, through tactical cooperation, to destroy the enemy SAM while ensuring their own safety, as shown in Figure 2.

Deep Reinforcement Learning
3.1. Principles of Reinforcement Learning. RL is a machine learning approach for teaching agents how to solve tasks by trial and error. The main characters of RL are the agent [23] and the environment. The agent perceives an initial state of the environment and then decides on an action to take. The environment changes when the agent acts on it and returns an instant reward signal, while the agent transfers to the next state and continues to choose new actions until reaching a termination state.
The goal of the agent is to maximize the sum of rewards over the entire decision-making process, i.e., to find the optimal policy. Therefore, RL is a decision optimization method. The reinforcement learning framework is shown in Figure 3.
RL is usually modeled as a finite Markov decision process, represented by the five-tuple 〈S, A, R, P, γ〉, in which S is a finite state set, A is a finite action set, R is a reward function, P is a state transition function, and γ is a discount factor used to calculate the long-term discounted reward. When the agent in state s ∈ S at time t takes action a ∈ A according to policy π: S → A, the environment feeds back an instant reward r ∈ R to the agent, and the agent transfers to a new state s′ ∈ S. Tabular reinforcement learning evaluates the state-action value through a discrete Q table, but for continuous, high-dimensional problems it encounters the "curse of dimensionality", which spurred the development of DRL [23–26].
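For illustration, the tabular backup described above can be sketched as a toy finite MDP 〈S, A, R, P, γ〉 with a Q-table update; the two-state dynamics and rewards below are illustrative examples, not part of the SEAD model.

```python
import numpy as np

# A toy finite MDP <S, A, R, P, gamma> with a tabular Q-learning backup.
# The two-state dynamics and rewards below are illustrative, not from the paper.
gamma, alpha = 0.9, 0.5
P = np.array([[0, 1],      # P[s, a] = deterministic next state
              [1, 0]])
R = np.array([[0.0, 1.0],  # R[s, a] = instant reward
              [0.0, 0.0]])
Q = np.zeros((2, 2))       # discrete Q table: one value per state-action pair

def q_update(s, a):
    """One TD backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    s_next = P[s, a]
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    return s_next
```

Because the Q table has one cell per state-action pair, its size grows with the product of the state and action spaces, which is why continuous, high-dimensional problems require function approximation.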

Deep Reinforcement Learning.
DRL combines RL with DL, using a neural network to approximate the policy and value functions and thus solve the high-dimensional mapping problem. The goal of the agent is to find the parameterized policy π_θ with the maximum expected return, where the return is the discounted cumulative reward R(τ) = Σ_{t=0}^∞ γ^t r_t on the trajectory τ = (s_0, a_0, s_1, a_1, ...), and θ is the policy parameter. The optimal policy is

θ* = arg max_θ E_{τ∼π_θ}[R(τ)].  (1)

DRL algorithms divide into three learning paradigms: value-based, policy gradient, and actor-critic. The actor-critic paradigm integrates the value function and the policy gradient, using the value-function error to guide the policy-gradient update and accelerate learning. The policy is updated through the gradient of the expected return:

∇_θ J(θ) = E_{τ∼π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) R(τ)],  (2)

where π_θ(a|s) is the actor and R(τ) the critic; the critic can also take other forms, such as a state-action value function. When the critic takes the TD residual and the value function is approximated by a neural network with parameter ω, the TD residual is

δ_t = r_t + γV_ω(s_{t+1}) − V_ω(s_t),  (3)

the critic is updated by

ω ← ω + α_ω δ_t ∇_ω V_ω(s_t),  (4)

and the actor is updated by

θ ← θ + α_θ δ_t ∇_θ log π_θ(a_t|s_t).  (5)
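The TD residual and the coupled critic/actor steps can be sketched as follows; the gradient vectors and step sizes here are illustrative stand-ins for values a backpropagation framework would compute.

```python
import numpy as np

# One actor-critic update: the TD residual delta guides both the critic
# (value) and actor (policy) steps. Gradients and step sizes are illustrative.
def td_residual(r, v_s, v_s_next, gamma=0.99):
    """delta = r + gamma * V(s') - V(s)."""
    return r + gamma * v_s_next - v_s

def critic_step(omega, grad_v, delta, alpha_w=0.1):
    """omega <- omega + alpha_w * delta * grad_omega V(s)."""
    return omega + alpha_w * delta * np.asarray(grad_v)

def actor_step(theta, grad_logpi, delta, alpha_t=0.1):
    """theta <- theta + alpha_t * delta * grad_theta log pi(a|s)."""
    return theta + alpha_t * delta * np.asarray(grad_logpi)
```

A positive residual (the action turned out better than the value estimate) pushes the policy toward that action; a negative residual pushes it away.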

Proximal Policy Optimization Algorithm.
The proximal policy optimization (PPO) [27] algorithm is a simple, stable, and easy-to-implement actor-critic algorithm; both the Dota 2 AI (OpenAI Five) [28] and the Honor of Kings AI Juewu (Tencent) [29] are implemented with PPO. The PPO algorithm addresses the computational expense of the trust region policy optimization (TRPO) [30] algorithm, which needs enormous calculation to ensure monotonic policy improvement.
Through a first-order approximation, the surrogate loss function is optimized: a new policy is calculated in each iteration and kept close to the old policy, and the policy is optimized in the direction of minimizing the loss function (maximizing the expected return). The PPO algorithm thus achieves a balance between sampling efficiency, algorithm performance, and engineering implementation complexity.
In the PPO algorithm, the critic uses the advantage function A(s, a) to measure the quality of an action, and equation (2) becomes

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a|s) A(s, a)].  (6)

Because PPO is an on-policy method, importance sampling is introduced to improve sample utilization, and samples from the old policy π_θ′ are used to obtain

∇_θ J(θ) = E_{π_θ′}[(π_θ(a|s)/π_θ′(a|s)) ∇_θ log π_θ(a|s) A(s, a)].  (7)

Because π_θ ∇_θ log π_θ = ∇_θ π_θ, equation (7) becomes

∇_θ J(θ) = E_{π_θ′}[(∇_θ π_θ(a|s)/π_θ′(a|s)) A(s, a)].  (8)

The optimization objective function corresponding to this gradient is

J(θ) = E_{π_θ′}[(π_θ(a|s)/π_θ′(a|s)) A(s, a)].  (9)

In practical applications, the expectation is estimated by sampling, and the optimization objective of PPO, i.e., the surrogate loss function, is simplified to

L^CLIP(θ) = E[min(r(θ)A, clip(r(θ), 1 − ε, 1 + ε)A)],  (10)

where the policy update amplitude is limited by the truncation operation, i.e., CLIP, to ensure training stability, r(θ) = π_θ(a|s)/π_θ′(a|s) is the ratio of the new and old policies, and ε is a hyperparameter. Generalized advantage estimation [31] is used to compute the advantage while keeping the variance and bias of the value-function estimate small:

Â_t = Σ_{l=0}^∞ (γλ)^l δ_{t+l},  with  δ_t = r_t + γV(s_{t+1}) − V(s_t).  (11)
The PPO algorithm is executed as shown in Table 2.
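The clipped surrogate loss and GAE above can be sketched numerically as follows; the shapes and hyperparameter values are illustrative, and a real implementation would compute the log-probabilities from the policy network.

```python
import numpy as np

# Minimal numpy sketch of the PPO clipped surrogate loss and generalized
# advantage estimation (GAE); values and shapes are illustrative.
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma*lam*A_{t+1},
    with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * v_next - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """-E[min(r*A, clip(r, 1-eps, 1+eps)*A)], with r = pi_new/pi_old."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies coincide the ratio is 1 and the loss reduces to the negated mean advantage; when the ratio drifts outside [1 − ε, 1 + ε], the clip caps the incentive for further movement.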

UAV Kinematics Equation.
In this paper, we aim to study the feasibility and future value of an end-to-end DRL method in intelligent mission planning. Therefore, we construct a simple two-dimensional (2-D) environment in which the fighter and jammer adopt a three-degree-of-freedom (3-DOF) model with the kinematic equations

ẋ_f = v_f cos φ_f,  ẏ_f = v_f sin φ_f,
ẋ_j = v_j cos φ_j,  ẏ_j = v_j sin φ_j,  (12)

where ẋ_f, ẏ_f, v_f, and φ_f represent the derivatives of the X and Y coordinates, the speed, and the heading of the fighter, respectively, and ẋ_j, ẏ_j, v_j, and φ_j represent the corresponding quantities for the jammer. The heading is a continuous value in (−π, π) and the speed is a continuous value in (0, 1].
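Integrating these kinematics one step forward can be sketched with a forward Euler update; the timestep dt is an assumed simulation parameter, not specified in the paper.

```python
import math

# Forward Euler integration of the planar 3-DOF kinematics:
# x' = x + v*cos(phi)*dt, y' = y + v*sin(phi)*dt; dt is assumed.
def step_uav(x, y, v, phi, dt=1.0):
    """Advance one UAV (fighter or jammer) by one timestep."""
    return x + v * math.cos(phi) * dt, y + v * math.sin(phi) * dt
```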

State Space.
The states of the fighter, jammer, and SAM are defined as their coordinates in 2-D space, which are continuous values represented by (x_f, y_f), (x_j, y_j), and (x_s, y_s), as shown in Table 3.

Action Space.
The actions of the fighter and jammer are their respective heading and speed, as shown in Table 4.
The UAV changes its position in 2-D space by controlling the heading and reaches the attack position by controlling the speed. The fighter's missile firing and the jammer's jamming activation are completed automatically by default when distance conditions are met. In an actual mission, it would be necessary to calculate the firing time, position, and parameters in detail.

Reward Function.
The design of the reward function follows the principle of Occam's razor, i.e., it should be simple and effective. If the jammer suppresses the SAM without entering its missile range, and the fighter destroys the enemy SAM radar without entering its missile range, the agent receives a reward of +1. If the jammer or fighter enters the range of the SAM or flies out of the environmental boundary, it receives a reward of −1. In other circumstances, reward shaping based on experience knowledge is adopted, and a continuous reward on the relative distance is given to guide the agent's learning, where d_fs and d_js represent the distances between the fighter and the SAM and between the jammer and the SAM, respectively, d_s represents the attack range of the SAM, d_f the attack range of the fighter, and d_j the jamming range of the jammer.
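A hedged sketch of this piecewise reward is given below; the shaping coefficient k and the exact ordering of the success/failure checks are illustrative assumptions, with the range values taken from the experimental setup (in km).

```python
# Hedged sketch of the piecewise reward described above. The shaping
# coefficient k and the check ordering are assumptions; ranges (km) are
# taken from the experimental setup: SAM 40 (25 when jammed), fighter 30,
# jammer 50.
D_SAM, D_SAM_JAMMED, D_F, D_J = 40.0, 25.0, 30.0, 50.0

def reward(d_fs, d_js, out_of_bounds=False, k=0.001):
    jamming = d_js <= D_J                     # jammer close enough to suppress
    d_s = D_SAM_JAMMED if jamming else D_SAM  # suppression shrinks SAM range
    if out_of_bounds or d_fs < d_s or d_js < d_s:
        return -1.0                           # inside SAM range, or off-map
    if jamming and d_fs <= D_F:
        return 1.0                            # safe cooperative kill
    return -k * (d_fs + d_js)                 # distance-based reward shaping
```

The shaping term stays small and negative, so it nudges both UAVs toward the SAM without overwhelming the terminal ±1 signals.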

Environment Class Development.
The environment class mainly includes two parts: the step() and reset() functions. The step() function realizes the deterministic state transition according to the UAV kinematics equation and returns the new state, the reward, and the termination judgment. The reset() function resets the fighter and jammer to their initial positions.
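As an illustration, a minimal skeleton of such an environment class might look as follows; the normalized [0, 1] coordinates, observation layout, and the placeholder reward are simplifying assumptions rather than the paper's full logic.

```python
import math

# Minimal skeleton of the environment class: deterministic step() via the
# 3-DOF kinematics and a reset() of initial positions. Coordinates are
# normalized to [0, 1]; the reward here is a placeholder, not the full
# shaped reward.
class SEADEnv:
    def __init__(self, sam=(0.8, 0.8)):
        self.sam = sam
        self.reset()

    def reset(self):
        """Return fighter and jammer to their initial positions."""
        self.fighter = [0.4, 0.4]
        self.jammer = [0.3, 0.3]
        return self._obs()

    def step(self, action, dt=0.05):
        """Deterministic transition; action is (phi_f, v_f, phi_j, v_j)."""
        phi_f, v_f, phi_j, v_j = action
        self.fighter[0] += v_f * math.cos(phi_f) * dt
        self.fighter[1] += v_f * math.sin(phi_f) * dt
        self.jammer[0] += v_j * math.cos(phi_j) * dt
        self.jammer[1] += v_j * math.sin(phi_j) * dt
        done = not all(0.0 <= c <= 1.0 for c in self.fighter + self.jammer)
        reward = -1.0 if done else 0.0  # placeholder for the shaped reward
        return self._obs(), reward, done

    def _obs(self):
        return tuple(self.fighter + self.jammer + list(self.sam))
```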

Domain Randomization.
To improve the robustness of the agent's policy and adapt to diversified inputs, disturbances are added to the inputs during the training stage [32], as shown in equation (14); that is, the agent is trained in different environments with parameter disturbances under each random seed, so that it abstracts higher-level policy features, avoids overfitting a single environment and policy, and learns a final policy that is more robust and generalizes better to new environments.
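A sketch of this per-episode perturbation is given below; the uniform noise scale is an illustrative assumption, not the paper's equation (14).

```python
import numpy as np

# Domain randomization sketch: perturb each unit's nominal start position
# at every episode reset. The noise scale is an illustrative assumption.
rng = np.random.default_rng(0)

def randomized_reset(base_positions, scale=0.05):
    """Add uniform disturbances to nominal start positions, clipped to the map."""
    base = np.asarray(base_positions, dtype=float)
    noise = rng.uniform(-scale, scale, size=base.shape)
    return np.clip(base + noise, 0.0, 1.0)
```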

Maximization Policy Entropy.
Entropy measures the randomness of a random variable: the greater the entropy, the more random the variable. Therefore, while maximizing the cumulative reward, the entropy of the policy is also maximized, keeping the policy as random as possible. The agent can then fully explore the state space, avoiding local minima and discovering multiple feasible schemes to complete the mission, which improves the exploration, robustness, and generalization of the policy. The policy entropy is H(π_θ(·|s)) = −Σ_a π_θ(a|s) log π_θ(a|s), and an entropy bonus is added to the optimization objective during training. The state inputs are normalized and fed into a two-layer fully connected neural network, which outputs the heading and speed of the fighter and jammer. The optimization goal is to maximize the cumulative rewards; the solution is the optimal or suboptimal policy, and the policy network architecture of the intelligent planning model is shown in Figure 4.
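The entropy term can be sketched for a discrete action distribution as follows; in practice the bonus is weighted by a coefficient (here an assumed beta) and added to the PPO objective.

```python
import numpy as np

# Policy entropy for a discrete action distribution. Adding beta * H to the
# objective encourages exploration; beta is an assumed coefficient.
def entropy(probs):
    """H(pi) = -sum_a pi(a) * log pi(a), with the convention 0*log 0 = 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```

A uniform distribution attains the maximum entropy log |A|, while a deterministic policy has zero entropy, so the bonus penalizes premature collapse onto a single action.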
Therefore, this paper adopts the idea of offline training and online inferencing. First, the agent is trained in the environment. After training is completed, the inference module is run to test the planning performance of the agent. The research framework is shown in Figure 5.
A general intelligent planning architecture is proposed, comprising the environment, the planner (agent), and the controller. First, the planner receives the initial situation and interacts with the environment, i.e., offline training. The trained planner can then be applied directly, receiving a new initial situation for online inferencing. The inferred decision sequence, i.e., the planning result, is passed to the controller for execution. The architecture realizes the entire process of intelligent planning, as shown in Figure 6.

Experimental Setup and Training Tricks.
In this paper, the SEAD environment is a square area of 100 × 100 km².
The fighter's position is (40, 40) and its attack range is 30 km; the jammer's position is (30, 30) and its jamming range is 50 km; the SAM's position is (80, 80) and its attack range is 40 km. After jamming, the SAM attack range is reduced to 25 km, as shown in Figure 7. In the experiments, the above ranges are divided by 100 for normalization, which eases neural network training and helps prevent vanishing gradients.
The simulation environment uses Python 3.6 and PyCharm. Intelligent planning experiments for three different scenarios are completed separately. Comparisons with other classical DRL algorithms, robustness studies, and ablation studies are carried out to verify the algorithm's performance.
The hyperparameter settings used in this work are shown in Table 5. The neural network adopts orthogonal initialization, the optimizer is Adam [33], and other training parameters are described in Section 5.1.1.

Normalization.
(1) Advantage function normalization. The advantage values are normalized to improve training stability and policy learning: Â = (A − mean(A))/std(A). (2) Value function normalization. Similarly, the value function loss is also normalized. These policy training tricks are studied and compared in the subsequent simulations.
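The batch normalization of advantages (and, analogously, value targets) can be sketched as follows; the small epsilon guarding against division by zero is a standard implementation detail.

```python
import numpy as np

# Batch-level normalization used for advantages (and analogously for value
# targets): subtract the mean and divide by the standard deviation.
def normalize(x, eps=1e-8):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)
```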

Adaptive Adjustment Parameters
(1) Adaptive learning rate. In the early training stage, a larger learning rate is adopted to accelerate convergence; in the later training stage, a smaller learning rate is adopted to approach the optimum. (2) Adaptive clip value. The clip value changes in step with the learning rate: a larger clip value is allowed to accelerate the policy update in the early training stage, and a smaller clip value is used in the later stage to keep the policy update stable.
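One common reading of these adaptive parameters is a linear decay over training progress, sketched below; the initial values and schedule shape are illustrative assumptions.

```python
# Linear annealing of the learning rate and clip range over training; the
# initial values and the linear schedule are illustrative assumptions.
def linear_anneal(initial, step, total_steps, floor=0.0):
    """Decay linearly from `initial` toward `floor` as training progresses."""
    frac = max(0.0, 1.0 - step / total_steps)
    return floor + (initial - floor) * frac

lr_now = linear_anneal(3e-4, step=500, total_steps=1000)   # halfway: 1.5e-4
clip_now = linear_anneal(0.2, step=500, total_steps=1000)  # halfway: 0.1
```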

Intelligent Planning for Fighter-Jammer Scenario.
In Experiment 1, we set up a fighter-jammer scenario, as described in Section 5.1. The fighter must safely destroy the SAM under jamming cover to complete the mission. The intelligent planning model is trained for 300 episodes, and the trained model is then used for online inferencing to test the intelligent planning and cooperative attack performance.

Offline Training.
Three different random seeds are used to train the PPO policy and value-function networks. We record the average timestep reward during training and compare it with the classical advantage actor-critic (A2C) and TRPO algorithms to obtain the cumulative-reward learning curves shown in Figure 8.
As can be seen from Figure 8, the A2C model has a large variance and the TRPO model has poor convergence. The PPO model used in this work achieves higher episode rewards, more stable training, and smaller episode-reward variance, together with good robustness, so its performance is better.

Online Planning.
The trained model receives the initial situation information of the environment so that its online inferencing performance can be tested; the resulting cooperative attack process is shown in Figure 9.
As can be seen from Figure 9(a), the fighter and jammer cleverly complete a cooperative attack: before the jammer successfully jams, the fighter neither enters the detection range of the SAM nor starts an attack, for its own safety. As can be seen from Figure 9(b), once the jammer jams the SAM and degrades its detection range, the fighter attacks quickly and successfully destroys the SAM, reflecting strong intelligent cooperative planning performance.

Intelligent Planning for Fighter-Decoy Scenario.
To further test the intelligent planning performance of the proposed model, a cooperative attack fighter-decoy scenario is designed in which the SAM attack range is 20 km and that of the fighter is also 20 km, so the fighter cannot directly and safely destroy the SAM and target. The decoy is a low-cost, expendable UAV that can be sacrificed. Therefore, the fighter has to attack cooperatively with the decoy: the decoy is sacrificed first, and the fighter exploits the interval during which the SAM engages the decoy to destroy the SAM and target and complete the mission. The reward function is modified accordingly. It can be seen from the training results that the A2C and TRPO models have large variance, low training stability, and poor convergence, while the PPO model proposed in this paper obtains higher cumulative episode rewards, a smooth and stable learning curve, and excellent convergence performance.

Online Planning.
The trained PPO model is tested in the environment to verify its planning performance. In Figure 11(a), the decoy flies directly toward the SAM to attract SAM radar tracking and attack, while the fighter circles and waits for an opportunity. In Figure 11(b), the fighter exploits the interval during which the SAM tracks and locks onto the decoy to complete its attack quickly, successfully destroying the SAM and the targets. The fighter sacrifices the decoy but completes the mission, which shows that the PPO-based intelligent planning model has a degree of tactical cooperative planning performance.

Robustness Studies.
The robustness of the intelligent planning model is tested next by adding a degree of randomness to the training environment. As shown in Figure 12, the starting positions of the fighter, decoy, and SAM change randomly within a certain surrounding area to test the generalization and robustness of the trained model in an unknown environment.
The test results are shown in Figure 13. In Figure 13, the initial position of the fighter is (20, 20), that of the decoy is (85, 95), and that of the SAM is (55, 60). When this situation is input into the trained model, the fighter still intelligently waits for the decoy to enter the SAM range first and then seizes the opportunity to attack quickly, successfully destroying the SAM and target. This shows that the model has a degree of robustness and generalization to unknown situations, can adapt to an uncertain environment, and has strong practical application value.
As can be seen from Figure 14, the episode rewards are highest when all training tricks are used. Advantage function normalization greatly improves the model's early convergence speed, orthogonal layer initialization improves its final performance, and the other tricks have relatively little impact on model performance.

Conclusions
In this paper, we propose an end-to-end DRL-based UAV intelligent mission planning method. First, a SEAD mission is selected as the research object, and mission planning is described as a sequential decision-making problem. Then, a SEAD mission intelligent planning model based on the PPO algorithm is established, including the design of the UAV state space, action space, and reward function. Three policy training tricks, namely domain randomization, maximizing policy entropy, and parameter sharing, are introduced, and a general intelligent planning architecture is constructed. Finally, two experimental analyses, robustness studies, and an ablation experiment are completed. We conclude that the intelligent planning model based on deep reinforcement learning, which adopts an end-to-end architecture with offline training and online inferencing, can adapt to dynamic situations and is advanced and valuable. In future work, this end-to-end method will be extended to large-scale complex combat scenarios, and multi-agent cooperative planning will be studied in depth.
Figure 8: Learning curves for the fighter-jammer scenario. The shaded region represents the standard deviation of the average evaluation over three trials.

Table 1: Comparison of traditional and learning-based intelligent planning. Traditional intelligent optimization methods are static, not scalable, and limited by computational power; learning-based methods support offline training and online inferencing, handle high-dimensional problems, and are scalable.

Figure 1: Schematic of a SEAD mission (jammer, fighter, and SAM).