A UAV Pursuit-Evasion Strategy Based on DDPG and Imitation Learning

UAV pursuit-evasion strategy based on the Deep Deterministic Policy Gradient (DDPG) algorithm is a current research hotspot. However, this algorithm suffers from low efficiency in sample exploration. To solve this problem, this paper uses imitation learning (IL) to improve the DDPG exploration strategy. A quasiproportional guidance control law is designed to generate effective learning samples, which are used as the data of the initial experience pool of the DDPG algorithm. A UAV pursuit-evasion strategy based on DDPG and imitation learning (IL-DDPG) is proposed; the algorithm draws data from the experience pool for experience replay learning, which improves exploration efficiency in the initial stage of training and avoids excessive useless exploration during training. The simulation results show that the trained pursuit-UAV can flexibly adjust its flight speed and flight attitude to pursue the evasion-UAV quickly. They also verify that the improved DDPG algorithm trains more efficiently than the basic DDPG algorithm.


Introduction
At present, UAVs are increasingly widely used in areas such as sensor networks [1], data security [2], smart network systems [3], intelligent transportation systems [4], automatic identification systems [5], target encirclement control [6], and pursuit-evasion confrontation [7]. The UAV pursuit-evasion confrontation is a game between two drones with conflicting interests. The pursuit-UAV tries to capture the evasion-UAV through a pursuit maneuver strategy, and the evasion-UAV tries to escape through an evasion maneuver strategy.
Methods for the UAV pursuit-evasion strategy include the differential game method [8], the expert system method [9], and the influence diagram decision method [10]. However, a common problem of these methods is that analytical solutions are difficult to obtain. The DDPG algorithm is a policy-based reinforcement learning (RL) method that can use a neural network (NN) for end-to-end learning. Research on pursuit-evasion strategy based on the DDPG algorithm [11] is a current research hotspot.
Based on the DDPG algorithm, Zhang et al. [12] studied the cooperative pursuit of incoming targets by UAV swarm and designed a guided return function for specific pursuit tasks. Song et al. [13] designed a reward function considering the tracking error and trajectory stability for the landing trajectory tracking control problem of UAVs, then proposed a trajectory tracking control method based on DDPG algorithm. The trained result has higher accuracy than the traditional PID control method.
A problem of RL is that the efficiency of sample exploration is low, which makes learning and training inefficient. In the early training stage of reinforcement learning, a relatively large random noise is set for the exploration strategy to improve the exploration ability. But it will also produce a lot of inefficient samples (that is, useless action exploration), resulting in small rewards at the initial training stage. Therefore, how to improve the exploration ability, obtain efficient samples, and improve the utilization rate of samples is an urgent problem to be solved for RL training.
Expert experience and the mixed decision-making technology have been used to accelerate the training process of reinforcement learning. Wang [14] used reinforcement learning algorithm based on expert knowledge to solve the UAV path planning problem. The algorithm used multiple tasks with known environmental parameters to train the UAV and then transferred the trained result knowledge to the training of new tasks to accelerate the training process. Wu [15] studied the UAV reactive obstacle avoidance algorithm based on transfer learning and deep reinforcement learning, which makes the UAV quickly and efficiently respond to unfamiliar scenarios. In order to improve the motor skills of the manipulator and the learning ability of unmanned driving, Lu [16] and Zuo [17] integrated the experience of experts in their respective fields into reinforcement learning algorithm and designed the reinforcement learning algorithm under different tasks. Mu [18] studied the UAV cooperative formation maintenance and collision avoidance method based on the fusion of model knowledge and data training. The switching system based on the consensus theory and the multiagent cooperative collision avoidance method was learned in advance before training, which improves the training efficiency of the UAV formation control method.
Inspired by these works, a UAV pursuit-evasion strategy based on DDPG and imitation learning (IL-DDPG) is proposed. The algorithm avoids excessive useless exploration and converges more quickly. The main contributions of this paper are as follows:
(1) A quasiproportional guidance control law is designed for the instructor to realize effective pursuit. The control law can be used to generate effective learning samples for the pretraining of the IL-DDPG algorithm.
(2) The exploration strategy of the DDPG algorithm is improved. In the pretraining stage, the instructor maneuver samples generated by the quasiproportional guidance control law are used as the data of the initial experience pool. The algorithm draws data from the experience pool for experience replay learning, which improves exploration efficiency in the initial stage of training and avoids excessive useless exploration during training.
The rest of this paper is organized as follows. In Section 2, the system model and problem statement are presented. The UAV pursuit strategy based on DDPG is presented in Section 3. In Section 4, the UAV pursuit strategy based on IL-DDPG is proposed. Then, Section 5 provides the experimental results. Conclusions are given in Section 6.

Problem Description and Modeling
2.1. The Pursuit-Evasion Problem of UAV. In the pursuit-evasion problem, the pursuit-UAV must chase and capture the evasion-UAV, and the evasion-UAV must escape and stay away from the pursuit-UAV.
For this problem, a zero-sum differential game model with control constraints is established. The geometric model of pursuit-evasion is shown in Figure 1.
In Figure 1, P represents the pursuit-UAV, E represents the evasion-UAV, $v_p$ is the speed of the pursuit-UAV, $v_e$ is the speed of the evasion-UAV, $\psi_p$ is the heading angle of the pursuit-UAV, $\psi_e$ is the heading angle of the evasion-UAV, and $\delta$ is the angle of the Line of Sight (LOS); the LOS is the ray from the pursuit-UAV P pointing to the evasion-UAV E. The goal of the pursuit-UAV is to capture the target in the shortest time. The goal of the evasion-UAV is to stay away from the pursuit-UAV and to avoid being captured within the preset time or to maximize the delay before being captured. The standard differential game is described as (1) and (2) [19].
Here, $L$ is the distance between the two UAVs and $T_c$ is the moment when the pursuit-UAV P captures the evasion-UAV E. Equation (1) is the objective function of the pursuit-UAV, and (2) is the objective function of the evasion-UAV.

2.2. The Kinematic Model of UAV. The motion state equations of the UAVs are defined with $\omega_i$ representing the angular velocity of the UAVs and $a_i$ representing the acceleration of the UAVs. The motion control variables of the UAVs are bounded, e.g., $-\omega_{e\max} \le \omega_e \le \omega_{e\max}$, where $v_{p\max}$ and $v_{e\max}$ are the maximum speeds of the UAVs and $\omega_{p\max}$ and $\omega_{e\max}$ are the maximum angular velocities of the UAVs.
Here, $\Delta T$ is the simulation time step, $r_i$ is the turning radius, $r_{i\min}$ is the minimum turning radius, $\Delta\psi_i$ is the maximum turning angle within $\Delta T$, and $n_{i\max}$ is the maximum lateral overload. Therefore, the maximum angular velocity can be obtained.
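To make the kinematic model concrete, the following is a minimal sketch of one simulation step for a single UAV. It assumes the common planar point-mass form (position driven by speed and heading, speed driven by acceleration, heading driven by angular velocity), which is consistent with the symbols defined above but is not recovered from the original equations. The function name `step_uav`, the Euler integration, and the lower speed bound of zero are illustrative assumptions.

```python
import math


def step_uav(state, accel, omega, dt, v_max, omega_max):
    """Advance one UAV by one Euler step of length dt.

    state = (x, y, v, psi); accel and omega are the two control inputs.
    Assumed model: x' = v cos(psi), y' = v sin(psi), v' = accel, psi' = omega.
    """
    x, y, v, psi = state
    # Enforce the angular-velocity bound before integrating.
    omega = max(-omega_max, min(omega_max, omega))
    x_next = x + v * math.cos(psi) * dt
    y_next = y + v * math.sin(psi) * dt
    # Assumed speed range [0, v_max]; the lower bound is a guess.
    v_next = max(0.0, min(v_max, v + accel * dt))
    # Wrap the heading into [-pi, pi].
    psi_raw = psi + omega * dt
    psi_next = math.atan2(math.sin(psi_raw), math.cos(psi_raw))
    return (x_next, y_next, v_next, psi_next)
```

Calling `step_uav` once per time step $\Delta T$ for each UAV gives a simple simulation loop; a higher-order integrator could be substituted without changing the interface.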
The initial state of the UAVs is defined at the initial time. If the distance between the evasion-UAV and the pursuit-UAV is within the capture range of the pursuit-UAV and does not increase, the capture is successful, as shown in (8); the capture range can be the detection range or the attack range of the UAV.
Here, $l_c$ is the capture range of the UAV and $\|d_t\|$ is the 2-norm of the two-dimensional vector $(d_{xt}, d_{yt})$, calculated by $\|d_t\| = \sqrt{d_{xt}^2 + d_{yt}^2}$, where $d_{xt}$ and $d_{yt}$ represent the instantaneous relative distances between the pursuit-UAV and the evasion-UAV along the x-axis and the y-axis at time $t$, respectively. The UAV is assumed to carry onboard GPS equipment and gyroscopes to obtain its own position and speed, namely, $\xi_p = [x_p, y_p, v_p, \psi_p]$. The UAV is also assumed to carry an airborne radar to obtain the position and speed of the detected target, namely, $\xi_e = [x_e, y_e, v_e, \psi_e]$. In order to increase the adaptability of the algorithm, the relative position is used to establish the state space model.
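The capture test described above can be sketched as follows. The helper names are hypothetical, and the "does not increase" condition is interpreted here as comparing the current distance with the distance at the previous step, which is an assumption.

```python
import math


def relative_distance(xi_p, xi_e):
    """2-norm of the relative position (d_x, d_y) between the two UAVs."""
    d_x = xi_e[0] - xi_p[0]
    d_y = xi_e[1] - xi_p[1]
    return math.hypot(d_x, d_y)


def is_captured(d_prev, d_now, l_c):
    """Capture succeeds if the distance is within l_c and is not increasing."""
    return d_now <= l_c and d_now <= d_prev
```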

UAV Pursuit Strategy Based on DDPG
3.1.1. MDP State Space. The MDP state is built from the following relative quantities: $\alpha_p$ and $\alpha_e$ are the angles between the LOS and the speed directions of the pursuit-UAV and the evasion-UAV, respectively; $\alpha_{pe} = \alpha_e - \alpha_p$ is the angle between the speed directions of the two UAVs; $d_{pe}$ is the distance between the two UAVs; $v_p$ is the speed of the pursuit-UAV; and $\Delta v_{pe}$ is the speed difference between the two UAVs.
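As an illustration, the six quantities listed above can be computed from the two UAV state vectors as in the sketch below. The ordering of the state components, the sign conventions of the angles, and the sign of the speed difference (pursuer minus evader) are assumptions made only for this example.

```python
import math


def wrap(angle):
    """Wrap an angle into [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def mdp_state(xi_p, xi_e):
    """Build a six-dimensional relative state from xi = (x, y, v, psi)."""
    x_p, y_p, v_p, psi_p = xi_p
    x_e, y_e, v_e, psi_e = xi_e
    # LOS angle: direction of the ray from P to E.
    delta = math.atan2(y_e - y_p, x_e - x_p)
    alpha_p = wrap(psi_p - delta)       # pursuer heading relative to LOS
    alpha_e = wrap(psi_e - delta)       # evader heading relative to LOS
    alpha_pe = wrap(alpha_e - alpha_p)  # angle between the two headings
    d_pe = math.hypot(x_e - x_p, y_e - y_p)
    dv_pe = v_p - v_e                   # assumed sign convention
    return [alpha_p, alpha_e, alpha_pe, d_pe, v_p, dv_pe]
```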
3.1.2. MDP Action Space. The control input of the UAV is a two-dimensional vector, namely, the action $[a_i, \omega_i]$, where $a_i = \dot{v}_i$ is the acceleration and $\omega_i$ is the angular velocity of the pursuit-UAV and the evasion-UAV. Both $v_i$ and $\omega_i$ satisfy the constraints (4).

(1) MDP state transition function. The state transition function of the MDP is defined on the state and action spaces above.

3.1.3. MDP Reward Function. A combination of sparse reward and guided reward is designed. $R_t$ represents the total reward of the UAV. $R_{t1}$ is a guided reward, in which $d_t$ represents the distance between the pursuit-UAV and the evasion-UAV at time $t$, $d_{t-1}$ represents the distance at time $t - 1$, and $k$ is a proportionality coefficient. $R_{t2}$ represents the sparse reward for the pursuit-UAV being too far away from the evasion-UAV, and $R_{t3}$ represents the sparse reward for the pursuit-UAV completing the task. $R_{t1}$ is the variation of the relative distance between the pursuit-UAV and the evasion-UAV: when the relative distance becomes smaller, the pursuit-UAV gets a positive reward; when the relative distance becomes larger, the pursuit-UAV gets a negative reward.
In the far-distance term, $D_{far}$ represents the relative distance threshold and $R_{far}$ is a large positive constant, which is used to punish the algorithm when the pursuit-UAV's action strategy is incorrect and its distance from the evasion-UAV is too large.
In the completion term, $R_{finish}$ is a large positive constant, which rewards the algorithm when the UAV completes the task.
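The reward structure described above can be sketched in a few lines. The additive combination of the three terms, the exact form $k(d_{t-1} - d_t)$ of the guided term, and all default constants are assumptions chosen to match the verbal description; they are not values taken from the paper.

```python
def reward(d_prev, d_now, captured, k=1.0, d_far=1000.0, r_far=100.0,
           r_finish=100.0):
    """Total reward as guided term plus two sparse terms (assumed additive)."""
    # Guided term: positive when the distance shrinks, negative when it grows.
    r1 = k * (d_prev - d_now)
    # Sparse penalty when the pursuer has drifted beyond the distance threshold.
    r2 = -r_far if d_now > d_far else 0.0
    # Sparse bonus when the pursuit task is completed.
    r3 = r_finish if captured else 0.0
    return r1 + r2 + r3
```

The guided term gives a learning signal at every step, while the sparse terms only fire on the rare far-away and capture events.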
3.2. DDPG Algorithm. The core of reinforcement learning is that the agent obtains rewards by interacting with the environment and adjusts its strategy according to the size of the rewards to optimize its decision-making. Deep reinforcement learning (DRL) combines the approximate fitting of deep learning (DL) with the decision-making optimization of reinforcement learning (RL). The most representative DRL algorithm is the deep Q-network (DQN) algorithm. DQN uses two networks with the same structure but different parameters. One network generates the current Q value, and the other network generates the target Q value. These two values are then used to minimize the loss function, and the parameters of the current network are copied to the target network after a period of time. DQN uses experience replay to break the correlation of RL data and uses random sampling to extract data from the experience pool for training.
The Deep Deterministic Policy Gradient (DDPG) algorithm which was developed based on the core idea of DQN also uses the Actor-Critic dual network mechanism and combines the advantages of the value function and the strategy function method. 4 International Journal of Aerospace Engineering The DDPG algorithm has four subnetworks, and the network structure of the algorithm is shown in Figure 2.
Actor network and Critic network both have target network (TargetNet) and evaluation network (EvalNet), so DDPG has a total of 4 subnetworks.
The Actor selects the action $\mu(s_t)$ according to its current deterministic policy. The Critic_EvalNet evaluates the value $Q(s_t, \mu(s_t))$ of the current state and the action selected by the Actor, and the Critic_TargetNet evaluates the value $Q'(s_{t+1}, \mu(s_{t+1}))$ of the next state $s_{t+1}$ and the action $\mu(s_{t+1})$ selected by the Actor_TargetNet for $s_{t+1}$. Then, the Actor adjusts its action selection according to the Critic's evaluation of the action [20,21]. $(\theta^{Q}, \theta^{u})$ and $(\theta^{Q'}, \theta^{u'})$ are the EvalNet and TargetNet parameters of the Critic and Actor, respectively. The Actor and Critic use different functions to train and update their parameters.
The Critic uses the mean square error loss function to update the parameters $\theta^{Q}$ of the Critic_EvalNet through the gradient of the neural network, where $K$ is the sample size and $y_i$ is the target value of the $i$-th sample. The Actor uses the gradient of Equation (19) to update the parameters $\theta^{u}$ of the Actor_EvalNet.
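For reference, the Critic loss, the target value, and the Actor gradient are commonly written in the DDPG literature as below. This is the standard formulation, given here on the assumption that it matches the equations intended above; the discount factor $\gamma$ is not defined in this text and is introduced only for the illustration.

```latex
L(\theta^{Q}) = \frac{1}{K} \sum_{i=1}^{K}
    \bigl( y_i - Q(s_i, a_i \mid \theta^{Q}) \bigr)^{2},
\qquad
y_i = r_i + \gamma \, Q'\bigl(s_{i+1}, \mu'(s_{i+1} \mid \theta^{u'}) \mid \theta^{Q'}\bigr)

\nabla_{\theta^{u}} J \approx \frac{1}{K} \sum_{i=1}^{K}
    \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)} \,
    \nabla_{\theta^{u}} \mu(s \mid \theta^{u}) \big|_{s = s_i}
```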
Like DQN, the EvalNet trains its network parameters in real time to update $(\theta^{Q}, \theta^{u})$, and the TargetNet parameters $(\theta^{Q'}, \theta^{u'})$ follow the EvalNet through soft updates. The advantage of the soft update is that it makes training more stable and convergence easier to guarantee. The soft update is described by (20), where $\tau$ is the inertial update rate. A major innovation of DDPG is the use of action noise. Adding random noise to the action generated by the Actor turns the deterministic decision into a random process, which enhances the exploration of the algorithm. Commonly used random noises are Gaussian noise and Ornstein-Uhlenbeck (OU) noise.
OU noise, also called the OU process, is a random process. It explores a certain distance around the mean value in the positive or negative direction. This is conducive to exploring in one direction and can improve the efficiency of exploration and training for inertial systems.

Algorithm 1: The UAV pursuit strategy using DDPG.
Training algorithm for UAV strategy based on DDPG
  initialize experience pool D with memory size M
  initialize the Eval networks of the Actor network and Critic network: $\mu(s; \theta^{u})$ and $Q(s, a \mid \theta^{Q})$
  for episode = 1 to MaxEpisode do
    initialize OU noise $N(t)$
    initialize the states of the pursuit-UAV and evasion-UAV randomly within the set range, and obtain the initial state $s_0$ of the simulation environment
    for t = 1 to MaxStep do
      select the action $a_t = f_{clip}(\mu(s_t \mid \theta^{u}) + N_t)$ of the pursuit-UAV, where $f_{clip}$ is the action constraint processing
      select the maneuver strategy for the evasion-UAV
      input the control signals into the UAVs
      integrate to get the next state of the UAVs, and calculate the environment state $s_{t+1}$
      obtain the immediate reward $r_t$ from the environment
      store the experience sample $[s_t, a_t, r_t, s_{t+1}]$ in D
      randomly sample from D to get a sample set of BatchSize $\{[s_t, a_t, r_t, s_{t+1}]\}$
      update the Eval network parameters $\theta^{Q}$ of the Critic
      update the Eval network parameters $\theta^{u}$ of the Actor
      update the Target network parameters $\theta^{Q'}$ and $\theta^{u'}$ of the Critic network and Actor network by (20)
      if the episode end condition is satisfied, break
    end for
  end for
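Two ingredients of the training loop, OU exploration noise and the soft update of the target parameters, can be sketched as follows. The discretized OU update and the form $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$ are the commonly used versions and are assumed here; the class name, function name, and default noise parameters are illustrative only.

```python
import math
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process (one scalar dimension)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.x = self.mu

    def sample(self):
        # Mean-reverting drift plus Gaussian diffusion; successive samples
        # are correlated, so exploration tends to persist in one direction.
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def soft_update(target_params, eval_params, tau):
    """Move each target parameter a fraction tau toward its Eval counterpart."""
    return [tau * e + (1.0 - tau) * t
            for t, e in zip(target_params, eval_params)]
```

With a small `tau`, the target parameters change slowly, which is the stabilizing effect attributed to the soft update in the text.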
The agent obtains the sample set $(s_t, a_t, r_t, s_{t+1})$ for training the network in the process of interacting with the environment and stores these samples in the experience pool. During training, the agent selects minibatch samples according to the random sampling strategy to train the neural network parameters through experience replay.
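A minimal sketch of such an experience pool is given below, assuming a fixed-capacity buffer that discards the oldest samples first and uniform random sampling; the class name `ExperiencePool` is hypothetical.

```python
import random
from collections import deque


class ExperiencePool:
    """Fixed-size experience pool with uniform random minibatch sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest sample once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Sample without replacement; return fewer if the pool is still small.
        size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)
```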

Imitation Learning.
Model-free and model-based reinforcement learning methods both learn a strategy from scratch that maximizes the cumulative return. For complex tasks, the agent has a huge search space and cannot get meaningful rewards frequently in the initial stage, which leads to a slow convergence rate of reinforcement learning.
IL means that the agent uses the decision data provided by experts to learn the best strategy [22]. It can be used to solve problems for which a reward cannot be specified. IL can be integrated with RL to accelerate strategy learning by providing effective samples through experts' demonstrations.
At present, scholars have successfully verified the feasibility of this method. The deep Q-learning from demonstrations (DQfD) algorithm [23] combines TD updates with supervised classification of the instructor's actions. The demonstrations are used to pretrain the Q network in DQN and, at the same time, are put into the experience pool, so that these expert data accelerate the learning process on a large scale. DQfD has been shown to have better initial performance than DQN. On this basis, the DDPG algorithm is combined with demonstrations in a similar way to construct the DDPGfD algorithm [24].

IL-DDPG Algorithm
DDPG based on imitation learning (IL-DDPG) is designed to solve the maneuver decision-making problem of UAV pursuit-evasion. The design of this algorithm mainly includes two aspects: the algorithm framework and the maneuver strategy of the instructor. Figure 3 shows the algorithm framework of IL-DDPG. In this framework, the instructor's strategy is used to generate a large amount of experience, which is stored in the experience pool in the initial stage. These experiences are then used to train the network by RL. Figure 4 shows the process of UAV offline training and exploration. Before starting any interaction with the environment, IL-DDPG trains only on the demonstrations, which is the pretraining process. A value function that satisfies the Bellman equation is used to imitate the instructor so that it can be updated with the TD error once the UAV starts interacting. The subsequent learning and training of IL-DDPG are consistent with the DDPG algorithm.
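The initial stage of this framework, in which the experience pool is filled with instructor demonstrations before the agent explores on its own, can be sketched as follows. The function signature is hypothetical: `reset`, `step`, and `instructor` stand in for the simulation environment and the instructor strategy, and the pool is represented by a plain list.

```python
def prefill_pool(pool, reset, step, instructor, n_samples, max_steps):
    """Fill `pool` with instructor transitions before any agent exploration.

    reset() -> initial state
    step(state, action) -> (next_state, reward, done)
    instructor(state) -> action chosen by the instructor strategy
    """
    while len(pool) < n_samples:
        s = reset()
        for _ in range(max_steps):
            a = instructor(s)
            s_next, r, done = step(s, a)
            pool.append((s, a, r, s_next))
            s = s_next
            if done or len(pool) >= n_samples:
                break
    return pool
```

After this call, pretraining can draw minibatches from `pool` exactly as in ordinary experience replay, so no change to the update rules is needed.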

Instructor Confront Strategy.
The main improvement of our algorithm is the design of the DDPG initial exploration strategy, that is, the instructor confront strategy. Proportional guidance is one of the missile guidance methods, and it is also often used in the interception of maneuvering targets. Therefore, the proportional guidance method can be used as our instruction strategy.
In the pure proportional navigation method [25,26], the rotational angular velocity of the controlled object's velocity vector is proportional to the rotational angular velocity of the LOS during the guidance process. In the core guidance equation, $K$ is the scale factor with range $(1, \infty)$, and $\varepsilon = 0$ is the ideal control relation describing the guidance method. Figure 5 shows the relative movement relationship of the pure proportional method.
The disadvantage of pure proportional guidance is that the normal overload required to hit the target is directly related to the target speed at the hit point and the UAV's attack mode, which makes it difficult to select the value of $K$. The generalized proportional guidance method can be used to improve the characteristics of proportional guidance.
The normal overload is selected according to the rotational angular velocity of the LOS, namely, $n = K|\dot{d}|\dot{\delta}$. From the resulting normal overload when the UAV hits the target, it can be seen that the required overload at the hit point is independent of the UAV speed and attack direction.
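The proportionality stated above (turn rate of the velocity vector equal to $K$ times the LOS rate) can be sketched numerically as below. The LOS rate is computed from the planar relative kinematics; the function names, the state layout `(x, y, v, psi)`, and the saturation at `omega_max` are assumptions of this example.

```python
import math


def los_rate(xi_p, xi_e):
    """Rotational angular velocity of the LOS from P to E (rad/s)."""
    x_p, y_p, v_p, psi_p = xi_p
    x_e, y_e, v_e, psi_e = xi_e
    r_x, r_y = x_e - x_p, y_e - y_p
    # Rate of change of the relative position (evader minus pursuer).
    dr_x = v_e * math.cos(psi_e) - v_p * math.cos(psi_p)
    dr_y = v_e * math.sin(psi_e) - v_p * math.sin(psi_p)
    # d/dt atan2(r_y, r_x) = (r_x * dr_y - r_y * dr_x) / |r|^2
    return (r_x * dr_y - r_y * dr_x) / (r_x * r_x + r_y * r_y)


def pn_turn_rate(xi_p, xi_e, K, omega_max):
    """Pure proportional navigation command, saturated at omega_max."""
    omega = K * los_rate(xi_p, xi_e)
    return max(-omega_max, min(omega_max, omega))
```

In a pure tail chase the LOS does not rotate and the command is zero; a crossing target rotates the LOS and produces a turn command in the same direction.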
Considering the characteristics of the UAV's capture range, a quasiproportional guidance control law [27] is designed as shown in Figure 6. Compared with pure proportional guidance, it fully considers the difference between UAV guidance and missile guidance.
In Figure 6, the red circle is the effective capture range of the pursuit-UAV, and $l_c$ is the capture radius of the UAV. $\vec{v}_r$ is the relative velocity of the pursuit-UAV and the evasion-UAV, $\vec{v}_r = \vec{v}_p - \vec{v}_e$, and $\psi_r$ is the angle of $\vec{v}_r$. EA and EB are the two guiding boundary lines; $\varepsilon$ and $\varphi$ are the angles of these lines. $\delta$ is the angle of the LOS, and $\gamma$ is the angle between the LOS and the two boundary lines. The quasiproportional guidance law guides the pursuit-UAV P so that the evasion-UAV E falls into the capture range of P. For this purpose, the relative velocity vector $\vec{v}_r$ and its angle $\psi_r$ are controlled. In the guidance process, whether the target is approached along EA or EB depends on the difference between $\psi_r$ and the angle of each boundary line. If $|\psi_r - \varphi|$ is smaller, EA is selected as the guiding boundary; if $|\psi_r - \varepsilon|$ is smaller, EB is selected.
If EB is chosen as the guidance boundary line, the quasiproportional guidance instruction can be derived, with $d$ denoting the distance between the pursuit-UAV P and the evasion-UAV E.
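The boundary selection rule described above can be sketched as follows. Only the selection step is illustrated; the boundary angles `phi` and `eps` are taken as inputs because their construction is not recoverable from this text, and breaking ties in favor of EA is an arbitrary assumption.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    return abs(math.atan2(math.sin(a - b), math.cos(a - b)))


def relative_velocity_angle(v_p, psi_p, v_e, psi_e):
    """Angle psi_r of the relative velocity v_r = v_p - v_e."""
    vr_x = v_p * math.cos(psi_p) - v_e * math.cos(psi_e)
    vr_y = v_p * math.sin(psi_p) - v_e * math.sin(psi_e)
    return math.atan2(vr_y, vr_x)


def select_boundary(psi_r, phi, eps):
    """Pick the guiding boundary whose angle is closer to psi_r."""
    if angle_diff(psi_r, phi) <= angle_diff(psi_r, eps):
        return "EA"
    return "EB"
```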
The state transition equation of maneuver decision control is shown as
The training parameters of algorithms in the experiment are shown in Table 1.
The experiment parameters of the UAV pursuit-evasion game simulation environment are shown in Table 2.
The evasion-UAV adopts the classic escape strategy [28], whose control law involves the calculation of $\xi_e$.
All networks are multilayer feedforward neural networks with a single hidden layer. The numbers of neurons in the layers of the Target-Actor network and the Eval-Actor network are [6, 128, 2]. Their hidden layer uses ReLU as the activation function, and their output layer uses tanh as the activation function. The inputs of the Critic network are the MDP state and the actions generated by the Actor network, so the numbers of neurons in the layers of the Target-Critic network and the Eval-Critic network are [8, 128, 1]. Their activation functions are the same as those of the Actor network. Training uses the Adam optimizer.
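The layer sizes and activations described above can be illustrated with a small forward-pass sketch in plain Python. It shows only the shapes [6, 128, 2] and [8, 128, 1] and the ReLU/tanh activations; the weight initialization and helper names are assumptions, and no training step is included.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (assumed init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def actor_forward(state, hidden, output):
    """6 -> 128 (ReLU) -> 2 (tanh): state to normalized action."""
    h = [max(0.0, z) for z in dense(state, hidden)]
    return [math.tanh(z) for z in dense(h, output)]


def critic_forward(state, action, hidden, output):
    """8 -> 128 (ReLU) -> 1 (tanh): state and action to a scalar value."""
    h = [max(0.0, z) for z in dense(state + action, hidden)]
    return math.tanh(dense(h, output)[0])
```

Because the Actor output lies in [-1, 1], it would still need to be scaled to the acceleration and angular-velocity limits before being applied as a control input.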

Instructor Confront Strategy.
The pursuit-UAV uses only the designed quasiproportional guidance strategy. The evasion-UAV adopts two different strategies: uniform linear motion and the classic escape strategy. The speed of the evasion-UAV is set to 10 m/s. As shown in Figure 7, the evasion-UAV simply escapes in a straight line, and the pursuit-UAV successfully captures the evasion-UAV after adjusting its speed direction. In Figure 8, however, the evasion-UAV escapes successfully. This is because the pursuit-UAV, which uses the quasiproportional guidance method as its pursuit guidance law, needs time to adjust its heading, which creates an opportunity for the evasion-UAV to escape within the predetermined maximum time.
Although the quasiproportional guidance law cannot let the pursuit-UAV capture the evasion-UAV when the evasion-UAV uses the classic escape strategy, it can guide the pursuit-UAV to explore good initial experience as an instructor strategy.

Comparison.
Average reward is used to verify the convergence and effectiveness of the proposed algorithm; it is defined as the average reward over the latest 100 episodes.
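This metric can be computed with a sliding window, as in the sketch below. The class name is hypothetical, and averaging over fewer than 100 episodes at the start of training is an assumption about how the early values are handled.

```python
from collections import deque


class AverageReward:
    """Running mean of the episode reward over the latest `window` episodes."""

    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)

    def add(self, episode_reward):
        """Record one episode reward and return the current average."""
        self.rewards.append(episode_reward)
        return sum(self.rewards) / len(self.rewards)
```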
With the same training parameters and the same experiment parameters, the average rewards of the trained results obtained by the IL-DDPG and DDPG algorithms are shown in Figure 9.
As shown in Figure 9, the IL-DDPG algorithm converges faster than the DDPG algorithm and is more stable.
In order to compare the trained results of the two algorithms under the same initial conditions, the results are used to simulate the UAVs pursuit-evasion process. The simulation results are shown in Figures 10 and 11.
It can be seen from Figures 10 and 11 that the trained results obtained by the IL-DDPG algorithm achieve a shorter capture time.
Furthermore, as shown in Figures 12 and 13, the pursuit-UAV using the IL-DDPG algorithm can adjust its speed and heading in time to capture the evasion-UAV, no matter whether the evasion-UAV adopts uniform linear motion strategy or random motion strategy.
These experiments indicate that the UAV pursuit strategy based on the IL-DDPG algorithm generalizes well, and the trained UAV can successfully complete the pursuit task in the pursuit-evasion game. In Figures 14 and 15, the speed of the evasion-UAV is increased to 11 m/s and 12 m/s, respectively. It can be seen that the pursuit-UAV can still capture the evasion-UAV within the given time, which supports the effectiveness of the IL-DDPG algorithm.

Conclusion
The training algorithm of the UAV pursuit strategy based on the IL-DDPG algorithm introduces a quasiproportional guidance control law as the instructor strategy to improve the exploration efficiency in the early stage of DDPG training and avoids the problems of excessive useless exploration. Simulation results show the effectiveness and generalization of this algorithm.
In future work, we will study how to effectively combine imitation learning with the multiexperience pool technology to accelerate the training of the algorithm.

Data Availability
The numerical data used to support the findings of this study is included within the article.

Conflicts of Interest
The authors declare no conflict of interest.