UAVs Maneuver Decision-Making Method Based on Transfer Reinforcement Learning

Aiming at the 1vs1 confrontation problem in a complex environment where obstacles are randomly distributed, the DDPG (deep deterministic policy gradient) algorithm is used to design the maneuver decision-making method of UAVs. Traditional methods generally assume that all obstacles are known globally. In this paper, a UAV airborne lidar detection model is designed, which can effectively solve the problem of obstacle avoidance when facing a large number of unknown obstacles. On the basis of the designed model, the idea of transfer learning is used to transfer the strategy trained by one UAV in a simple task to a new, similar task, and that strategy is then used to train the strategy of the other UAV. This method improves the intelligence of the UAVs on both sides alternately and progressively. The simulation results show that the transfer learning method can speed up the training process and improve the training effect.


Introduction
On the battlefield, UAVs can play a role in reconnaissance, detection, target tracking, attack interception, damage assessment, and other tasks [1]. UAVs can also be used to intercept enemy UAVs [2]. How both sides should maneuver to achieve their respective task objectives has aroused the attention and research interest of military experts and a large number of scholars.
At present, many experts have proposed different algorithms to solve maneuver decision-making problems in different situations. Among the traditional methods, the main algorithms are the differential game method [3], the expert system method [4], and guidance laws [5]. These methods have shown good effect on simple tasks, but they cannot be applied to complex battlefields where the environment is unknown, and it is difficult to obtain analytical solutions. Therefore, scholars have tried to apply intelligent algorithms to UAV attack and defense confrontation problems, including bionic modeling [6], fuzzy cybernetics [7], and swarm intelligence algorithms [8].
Deep reinforcement learning, as an artificial intelligence technology that combines neural networks and reinforcement learning, is a new type of decision-making method, which has good application prospects for the research of UAV countermeasures. For the scenario of UAV swarms chasing enemy targets, the DDPG algorithm is used to train UAVs to pursue targets [9]. Aiming at the confrontation problem with multiple UAVs, a cooperative decision-making method of multiple UAVs based on the multiagent reinforcement learning algorithm is proposed [10]. An MPPO algorithm is proposed to solve the confrontation problem of a large-scale UAV swarm [11]. A hierarchical framework based on reinforcement learning and two kinds of motion planning strategies is presented for the problem of pursuit and evasion games in the presence of obstacles [12]. Liu and Wang proposed an adversarial decision generation method based on the generative adversarial network for the confrontation between UAVs in a barrier-free environment [13]. Wen and Shi proposed an intelligent decision-making method for multicoupled tasks of cluster UAV confrontation in complex environments [14]. Wang and Guo improved the reward function of the cluster UAV confrontation model and optimized the reward calculation method [15]. These works have verified the feasibility of applying deep reinforcement learning to the UAV confrontation problem. However, most of the current research is carried out under the condition that the scene information is completely known, and the designed strategies are suited to specific confrontation scenarios. If the scene becomes complicated, these approaches may turn ineffective.
In this paper, to solve the problem of obstacle avoidance when facing a large number of unknown obstacles, a UAV airborne lidar detection model is designed, and a 1vs1 maneuver decision-making method based on the DDPG algorithm is proposed. To obtain a better training effect, three training methods are designed based on the idea of transfer learning. The scenarios corresponding to these three training methods are interrelated; that is, the task difficulty increases gradually, and the strategy of the other UAV is fixed while one UAV is trained, so that the confrontation environment of the agent remains relatively stable. We can transfer the relevant experience gained during the interaction between the UAV and the environment into new training scenarios to improve the intelligence of the UAVs on both sides alternately and progressively. The experimental comparison between the transfer and nontransfer methods shows that transfer reinforcement learning gives the two UAVs their own intelligent strategies in a 1vs1 confrontation game. It also shows that the method can speed up the training process and improve the confrontation effect.

Problem Description and Modeling
2.1. 1vs1 Confrontation Problem. The scenario of 1vs1 confrontation can be described as follows: there are one blue UAV and one red UAV in a limited planar area, which are called the attack UAV and the defense UAV, respectively. The purpose of the attack UAV is to break through the interception of the defense UAV and reach the target area (the light red area in the figure) from its initial position (blue flag). The purpose of the defense UAV is to intercept and destroy the attack UAV from its initial position (red flag). As shown in Figure 1, this paper assumes that circular obstacles (black areas) are distributed randomly in the environment. Only when the obstacles are within the detection range of the UAV's airborne radar can the UAV obtain their positions.
In Figure 1, $a$ and $d$ represent the attack UAV and the defense UAV, respectively. $s_{ip} = (x_{ip}, y_{ip})\;(i = a, d)$ represents the position coordinates of the UAVs. $s_{id} = \psi_i\;(i = a, d)$ represents the heading angle of the UAVs. $R_a$ and $R_d$ represent the radar detection radii of the two UAVs, respectively. $(x_{tp}, y_{tp})$ represents the position of the center point of the target area. $R_t$ represents the effective radius of the target area. $s^k_{op} = (x^k_o, y^k_o)$ represents the position of the $k$th obstacle center point. For the convenience of research, there is a battlefield boundary around the limited confrontation environment, and neither UAV can move out of the boundary.
It is assumed that the defense UAV can obtain the position and heading of the attack UAV in real time through the ground surveillance radar, and both sides carry lidar to detect obstacles and boundary of the local environment. It is also assumed that the attack UAV knows the position of the ground target area in advance.

Kinematics Model of UAVs.
It is assumed that the UAVs fly in a two-dimensional plane. The kinematics equations of the UAVs are shown in the following formula:
$$\dot{x}_{ip} = v_i \cos\psi_i,\qquad \dot{y}_{ip} = v_i \sin\psi_i,\qquad \dot{\psi}_i = \omega_i,\qquad \dot{v}_i = a_i\;(i = a, d),$$
where $v_i$ represents the speed of the UAVs, and $a_i$ and $\omega_i$ represent the acceleration and angular velocity of the UAVs, respectively.
The states and control inputs of the UAVs are constrained as
$$x_{\min} \le x_{ip} \le x_{\max},\quad y_{\min} \le y_{ip} \le y_{\max},\quad v_i \le v_{i\max},\quad |a_i| \le a_{i\max},\quad |\omega_i| \le \omega_{i\max},$$
where $x_{\min}, x_{\max}$ and $y_{\min}, y_{\max}$ represent the boundary of the area, $v_{i\max}$ represents the upper limit of the UAV speed, $a_{i\max}$ represents the maximum value of the UAV acceleration, and $\omega_{i\max}$ represents the maximum value of the UAV angular velocity.
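For illustration, the following is a minimal sketch of how the kinematics and constraints above could be integrated in simulation, assuming simple Euler integration with time step ΔT; the function name and argument conventions are illustrative and not the paper's code.

```python
import numpy as np

def step_kinematics(state, a_i, omega_i, dt=1.0,
                    v_max=None, a_max=None, omega_max=None, bounds=None):
    """Euler-integrate the planar UAV kinematics.

    state = (x, y, psi, v); a_i and omega_i are the control inputs.
    v_max, a_max, omega_max and bounds stand in for v_imax, a_imax,
    omega_imax and [x_min, x_max] x [y_min, y_max] (illustrative names).
    """
    x, y, psi, v = state
    if a_max is not None:
        a_i = np.clip(a_i, -a_max, a_max)          # |a_i| <= a_imax
    if omega_max is not None:
        omega_i = np.clip(omega_i, -omega_max, omega_max)  # |omega_i| <= omega_imax

    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    psi = (psi + omega_i * dt) % (2 * np.pi)
    v += a_i * dt
    if v_max is not None:
        v = min(v, v_max)                          # v_i <= v_imax
    if bounds is not None:                         # keep the UAV inside the battlefield
        x = np.clip(x, bounds[0], bounds[1])
        y = np.clip(y, bounds[2], bounds[3])
    return (x, y, psi, v)
```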

Radar Detection Model.
It is assumed that both UAVs are equipped with lidar to detect the circular obstacles and the enemy in the environment. As shown in Figures 2 and 3, the detection area of each UAV is discretized into $m$ state variables. In the figures, $R_i\;(i = a, d)$ represents the UAV radar detection radius, and $\theta_i\;(i = a, d)$ represents the detection angle range. $R^k_o\;(k = 1, \cdots, N_k)$ represents the radius of a circular obstacle, where $N_k$ represents the number of obstacles with different radius sizes. $(x^k_o, y^k_o)\;(k = 1, \cdots, N_o)$ represents the position of the obstacles, where $N_o$ represents the total number of obstacles.
As shown in Figures 2 and 3, in order to better represent the detection state of the radar, the detection angle range of the UAV radar is discretized into $l$ ($l = 7$) directions at equal intervals. In the figure, these are represented by 7 rays, and the length of each ray is $D_n\;(n = 1, \ldots, l)$. The length of a blue ray is the maximum detection radius of the UAV radar, and the length of a red ray is the relative distance between the UAV and the obstacle or boundary detected in the corresponding direction. $x^n_{io}\;(i = a, d)(n = 1, \cdots, m)$ represents the ratio of $D_n$ to the maximum detection radius of the UAV radar. The closer the ratio is to 1, the farther the UAV is from the obstacle or boundary in this direction; otherwise, the UAV is closer to the obstacle or boundary in this direction.
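A minimal sketch of how the normalized ray lengths $x^n_{io}$ could be computed against circular obstacles is given below; the sampling-based ray casting and the helper names are assumptions made for illustration, and boundary checks are omitted.

```python
import numpy as np

def lidar_ratios(x, y, psi, obstacles, R_max, theta_range=np.pi, l=7, n_samples=50):
    """Return x_io^n for n = 1..l: distance to the nearest obstacle along each
    ray, divided by the radar radius R_max (1.0 means nothing detected).

    obstacles is a list of (x_o, y_o, R_o) circles.
    """
    ratios = []
    angles = psi + np.linspace(-theta_range / 2, theta_range / 2, l)
    for ang in angles:
        d = R_max
        for s in np.linspace(0.0, R_max, n_samples):       # march along the ray
            px, py = x + s * np.cos(ang), y + s * np.sin(ang)
            if any(np.hypot(px - xo, py - yo) <= ro for xo, yo, ro in obstacles):
                d = s
                break
        ratios.append(d / R_max)
    return np.array(ratios)
```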

1vs1 Confrontation Maneuver Decision-Making Method Based on Reinforcement Learning
In this paper, the reinforcement learning algorithm of DDPG is used to study the 1vs1 confrontation scenarios. Before using this algorithm, it is necessary to define the state space, action space, and reward function.

The State Space. The attack UAV usually knows the position of the target area in advance. To simplify the input state dimension of the UAV, the position of the target is combined with the radar detection state. As shown in Figure 4, the direction corresponding to the maximum value of the state quantities $x^i_{ao}\;(i = 1, \ldots, l)$ in $s_{ao}$ is determined first (there may be multiple such directions, such as the four blue ray directions in Figure 4), and then the direction with the smallest angle to the UAV-target line of sight is selected as the optimal heading of the attack UAV (the green ray direction in the figure).
The index of this direction is denoted as $c$ ($1 \le c \le 7$), and $x^c_{ao}$ is set to 2, which means that the attack UAV should move in this direction as much as possible.
In summary, the state of the attack UAV includes the UAV's own position, speed, heading angle, the radar's detection state, and the target's direction. Therefore, the state contains 10-dimensional data in total, which is defined as
$$s_a = [s_{ap}, s_{ad}, s_{av}, s_{ao}] = [x_{ap}, y_{ap}, \psi_a, v_a, x^1_{ao}, \ldots, x^l_{ao}].$$
For the defense UAV $d$, the state is similar to that of the attack UAV and is defined as
$$s_d = [s_{dp}, s_{dd}, s_{dv}, s_{do}] = [x_{dp}, y_{dp}, \psi_d, v_d, x^1_{do}, \ldots, x^l_{do}].$$
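The following sketch illustrates one way to assemble the attack UAV state with the optimal-heading encoding described above; the exact ordering and dimension layout of the vector are assumptions consistent with the text, not the paper's implementation.

```python
import numpy as np

def build_attack_state(x, y, psi, v, ray_ratios, target_xy, area_len, psi_rays):
    """Concatenate position, heading, speed, and radar rays into s_a.

    psi_rays holds the absolute heading of each of the l rays. The ray whose
    value is maximal and whose direction is closest to the UAV-target line of
    sight is marked with the value 2 (the optimal-heading encoding).
    Normalisation follows the experiment settings: positions are divided by
    the maximum boundary length, angles by 2*pi.
    """
    rays = np.array(ray_ratios, dtype=float)
    los = np.arctan2(target_xy[1] - y, target_xy[0] - x) % (2 * np.pi)

    best = np.flatnonzero(np.isclose(rays, rays.max()))   # candidate directions
    ang_err = [abs((psi_rays[i] - los + np.pi) % (2 * np.pi) - np.pi) for i in best]
    c = best[int(np.argmin(ang_err))]                     # optimal-heading index
    rays[c] = 2.0

    return np.concatenate(([x / area_len, y / area_len, psi / (2 * np.pi), v], rays))
```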

Action Space.
It is assumed that the attack UAV has stronger maneuverability. The control inputs of both UAVs are acceleration and angular velocity, so the action of UAV $i$ is $[a_i, \omega_i]\;(i = a, d)$.

The Reward Function. Reinforcement learning mostly uses sparse rewards in the field of AI games and has achieved good results [16]. However, a sparse reward cannot make the UAVs learn efficiently at the beginning of the confrontation task. Therefore, the reward function of this experiment is set as a combination of a guided reward and a sparse reward. The guided reward $R_g$ is built from the following terms: $R_d$ represents the variation of the relative distance, where $d_{t-1}$ and $d_t$ represent the relative distance between the UAV and the target at times $t-1$ and $t$, respectively; $R_h$ represents the cumulative deviation of the UAV radar detection state variables $x^n_{io}$ from 1; $R_v$ represents the reward for the current speed of the UAV; and $R_c$ represents the deviation of the current heading $\psi_i$ of the UAV from the optimal heading $\psi_{opt}$.
The sparse reward $R_s$ is designed from the following terms: $R_1$ represents the penalty for the UAV colliding with the boundary; $R^k_o$ represents the radius of the $k$th obstacle; $\mathrm{dis}(\cdot)$ represents the Euclidean distance in two-dimensional space; $R_2$ represents the penalty for the UAV colliding with an obstacle; $R_t$ represents the radius of the target area; $R_f$ represents the attack distance of the defense UAV; and $R_3$ is the reward for the attack UAV reaching the target or the punishment for it being destroyed. The success signal of the defense UAV is that the attack UAV is destroyed.
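As an illustration, a possible combination of the guided and sparse terms is sketched below; the additive form and unit weights are assumptions, since only the meaning of each term is defined here, while the sparse values (−10 and 10) follow the experiment settings described later.

```python
import numpy as np

def reward(d_prev, d_now, ray_ratios, v, v_max, psi, psi_opt,
           hit_boundary, hit_obstacle, reached_goal, destroyed):
    """Guided reward R_g plus sparse reward R_s (assumed additive combination)."""
    # Guided terms
    R_d = d_prev - d_now                                  # progress towards the target
    R_h = -np.sum(1.0 - np.minimum(ray_ratios, 1.0))      # penalty for nearby obstacles/boundary
    R_v = v / v_max                                       # encourage keeping speed
    R_c = -abs((psi - psi_opt + np.pi) % (2 * np.pi) - np.pi)  # heading deviation penalty
    R_g = R_d + R_h + R_v + R_c

    # Sparse terms
    R_s = 0.0
    if hit_boundary:
        R_s += -10.0          # R_1
    if hit_obstacle:
        R_s += -10.0          # R_2
    if reached_goal:
        R_s += 10.0           # R_3 (task completed)
    elif destroyed:
        R_s += -10.0          # R_3 (attack UAV destroyed)
    return R_g + R_s
```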

The DDPG Algorithm.
The DDPG algorithm is a classic reinforcement learning algorithm based on the actor-critic framework [17]. It is a deterministic policy gradient algorithm that borrows the experience replay mechanism and the dual network structure of the DQN algorithm, and it realizes a direct mapping from the continuous state space to the continuous high-dimensional action space through the actor network. The network architecture of DDPG is shown in Figure 5.
As shown in Figure 5, the algorithm mainly includes the interactive environment, the experience pool, and the network module of the algorithm. Before the UAV interacts with the environment, it is necessary to determine the number of layers and nodes of the networks, initialize the current network parameters randomly, and copy the evaluated network parameters to the corresponding target networks for the first time. In each interaction step, the state $s_t$ fed back by the environment is taken as the input of the actor evaluated network, and the action value $\mu(s_t; \theta)$ of the UAV is obtained from the actor network. Gaussian noise is added on this basis to increase the exploration of the action space. Due to the limitation of the UAV's angular velocity, the action of the UAV is the combination of Gaussian noise and motion constraints, which is expressed as
$$a_t = f_{clip}\left(\mu(s_t; \theta) + \mathcal{N}\right),$$
where $f_{clip}$ represents the limitation function of the UAV action and $\mathcal{N}$ is Gaussian noise, which obeys
$$\mathcal{N} \sim N\left(0, \sigma^2\right),$$
where $\sigma^2$ represents the variance of the action noise. The next state of the UAVs is determined by the state transition formula (3), and the corresponding reward is obtained according to the reward function. Then, the network training sample $[s_t, a_t, r_t, s_{t+1}]$ is obtained and stored in the experience pool. If the number of samples reaches the requirement for starting training, the network parameters are trained by random sampling. Specifically, $m$ sets of sample data are taken randomly from the experience pool, where $[s_n, a_n, r_n, s_n']$ represents the $n$th sample. The back propagation algorithm is then used to update the evaluated network parameters.
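A short sketch of the exploration step $a_t = f_{clip}(\mu(s_t; \theta) + \mathcal{N})$ is shown below, assuming a PyTorch actor; the interface and limit names are illustrative.

```python
import numpy as np
import torch

def select_action(actor, state, sigma, a_max, omega_max):
    """a_t = f_clip(mu(s_t; theta) + N), with N ~ Normal(0, sigma^2).

    a_max / omega_max stand in for a_imax / omega_imax (illustrative names).
    """
    with torch.no_grad():
        mu = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    noisy = mu + np.random.normal(0.0, sigma, size=mu.shape)   # Gaussian exploration noise
    limits = np.array([a_max, omega_max])
    return np.clip(noisy, -limits, limits)                     # f_clip: motion constraints
```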
The loss function $J(\omega)$ of the critic evaluated network is calculated as
$$J(\omega) = \frac{1}{m}\sum_{n=1}^{m}\left(y_n - Q(s_n, a_n; \omega)\right)^2,$$
where $\omega$ represents the parameters of the critic evaluated network, $Q(s_n, a_n; \omega)$ represents the evaluation value given by the critic evaluated network for the current state and the action performed, and $y_n$ is defined as
$$y_n = r_n + \gamma Q'\left(s_n', \mu'(s_n'; \theta'); \omega'\right),$$
where $r_n$ represents the reward after the UAV performs action $a_n$, $\gamma$ represents the attenuation coefficient of the reward, and $Q'(s_n', \mu'(s_n'; \theta'); \omega')$ represents the evaluation value of the critic target network. The parameter of the critic evaluated network is updated as
$$\omega \leftarrow \omega - \alpha_C \nabla_\omega J(\omega),$$
where $\alpha_C$ is the learning rate of the critic evaluated network, and $\nabla_\omega J(\omega)$ is calculated as
$$\nabla_\omega J(\omega) = -\frac{2}{m}\sum_{n=1}^{m}\left(y_n - Q(s_n, a_n; \omega)\right)\nabla_\omega Q(s, a; \omega)\big|_{s=s_n, a=a_n}.$$
The parameters of the actor evaluated network are updated as
$$\theta \leftarrow \theta - \alpha_A \nabla_\theta J(\theta),$$
where $\alpha_A$ is the learning rate of the actor evaluated network, and $\nabla_\theta J(\theta)$ is calculated as
$$\nabla_\theta J(\theta) = -\frac{1}{m}\sum_{n=1}^{m}\nabla_a Q(s, a; \omega)\big|_{s=s_n, a=\mu(s_n; \theta)}\,\nabla_\theta \mu(s; \theta)\big|_{s=s_n}.$$
The parameters of the actor target network and the critic target network are updated through a soft update method. Such a slow updating process makes the training more stable. The update is performed as
$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad \omega' \leftarrow \tau\omega + (1-\tau)\omega',$$
where $\tau$ represents the soft update coefficient.
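The critic and actor updates and the soft target update can be sketched in PyTorch as follows; optimizer-based updates replace the explicit gradient formulas above, and all names are illustrative rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG training step on a sampled minibatch (s, a, r, s')."""
    s, a, r, s_next = batch   # tensors of shape [m, ...]

    # Critic: minimise J(w) = mean (y_n - Q(s_n, a_n; w))^2
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximise Q(s, mu(s; theta); w), i.e. minimise its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks
    for net, target in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1 - tau) * p_t.data)
```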

Transfer Learning.
It is common that the trained strategies of deep reinforcement learning can only be applied to specific environments. As the complexity of the task increases, it becomes more difficult for the strategies to adapt to new scenarios. Transfer learning is an approach that makes full use of the knowledge and experience gained in previous related tasks and applies them to new tasks [18]. Transfer learning has a strong ability of model generalization. This idea is also reflected in daily learning: people use their mother tongue to learn foreign languages, people who are familiar with C++ can quickly learn other programming languages, and a solid mathematical foundation is helpful for learning professional courses. All of these build on previous knowledge to continue learning and solve new problems. Different scenarios or tasks in transfer learning are generally called domains. The domains in which experience and knowledge have already been learned are called source domains, and the domains to be learned are called target domains. The definition of transfer learning is as follows.
Given the source domain $D_s$ and the source domain task $T_s$, the knowledge $K_s$ learned in the source domain is used to learn the knowledge $K_t$ in the target domain $D_t$ to complete the task $T_t$ of the target domain.
The idea of transfer learning can also be applied to reinforcement learning. In this paper, the parameter transfer method of transfer learning is used to deal with the 1vs1 confrontation scenario. The core idea of this method is that the agent first learns in a simple task, and once the learned strategy becomes good enough, the difficulty of the agent's task is gradually increased. The agent strategies that are suitable for simple tasks are transferred to more complex tasks to continue learning. This process can effectively reduce the difficulty of exploring complex tasks and avoid the problems caused by sparse rewards.
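A minimal sketch of the parameter transfer step is given below, assuming PyTorch networks whose weights are saved after one training stage and reloaded to initialize the next; the layer sizes, helper, and file names are illustrative.

```python
import torch
import torch.nn as nn

def make_actor(state_dim=10, action_dim=2):
    # Layer sizes follow the network settings reported later; hypothetical helper.
    return nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                         nn.Linear(128, 64), nn.ReLU(),
                         nn.Linear(64, action_dim), nn.Tanh())

# Source domain: the attack UAV is trained alone in the simple scenario, then saved.
source_actor = make_actor()
# ... DDPG training in the simple scenario ...
torch.save(source_actor.state_dict(), "attack_actor_step1.pt")   # illustrative path

# Target domain: initialise from the previously trained parameters instead of
# random weights and continue DDPG training in the harder scenario.
target_task_actor = make_actor()
target_task_actor.load_state_dict(torch.load("attack_actor_step1.pt"))
```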

Confrontation Maneuver Decision-Making Method Based on Transfer Learning.
Aiming at the 1vs1 confrontation model established in Section 2, this paper first lets the UAV learn in a simple environment and gradually transfers the learned experience to more difficult mission scenarios. In the learning process, when one side's strategy is to be trained, the other side initially uses the strategy trained in the previous scenario. After the training is completed, the newly trained strategy is used to train the other side. This alternate training method improves the strategies of the UAVs on both sides progressively. The specific training process is shown in Table 1.
The pseudocode of the strategy training algorithm for DDPG-based 1vs1 confrontation is shown in Table 2.

Experimental Environment and Parameter Settings.
The experimental software packages are PyCharm 2020.1 and Anaconda3, and the experimental program is written in Python. The settings of the confrontation scenario are shown in Figure 1. This paper uses Tkinter, the standard GUI library of Python, to build a two-dimensional environment. The neural networks are constructed with the PyTorch module, version 1.8.1.
The specific parameters of the experimental environment are shown in Table 3. The obstacles are distributed randomly in each episode, and they are limited to a specific area.
The simulation step $\Delta T$ is 1 s. The PyTorch module is used to build the neural networks of this paper, which are all 3-layer fully connected feedforward neural networks. The numbers of neurons in the layers of the actor network are [10, 128, 64, 2], and the numbers of neurons in the layers of the critic network are [12, 128, 64, 1]. The activation function is the ReLU function. To ensure that the action output by the actor network is reasonable, the output of the final layer is passed through the tanh function and multiplied by the maximum action limit value. The network parameters are optimized with the AdamOptimizer module. To reduce the burden of the neural network and speed up training, the state inputs of both UAVs are preprocessed: the position coordinates are divided by the maximum boundary length, and the angle is limited to [0, 2π) and divided by 2π.
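The described network structure can be sketched as follows; the action limits and learning rate in the example are placeholders, since the actual values are listed in Tables 3 and 4.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor network with layer sizes [10, 128, 64, 2]."""
    def __init__(self, max_action):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(10, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, 2), nn.Tanh())
        # max_action = [a_imax, omega_imax]; the tanh output is scaled to these limits
        self.max_action = torch.as_tensor(max_action, dtype=torch.float32)

    def forward(self, s):
        return self.net(s) * self.max_action

class Critic(nn.Module):
    """Critic network with layer sizes [12, 128, 64, 1]: input is (state, action)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(12, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# The paper uses the Adam optimizer; limits and learning rates here are placeholders.
actor, critic = Actor(max_action=[1.0, 0.5]), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```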
The algorithm training parameter settings are shown in Table 4.
In addition, there are two specific conditions of episode termination in this experiment. One is that the number of time steps in which the UAV interacts with the environment reaches the maximum number of time steps per episode. The other is that the UAVs collide with obstacles or boundaries or successfully achieve their required targets. For the sparse rewards, if the UAV collides with an obstacle or boundary, the rewards $R_1$ and $R_2$ are set to −10; if the UAV completes the required task, the reward $R_3$ is set to 10.

Table 1: The training method of 1vs1 confrontation maneuver decision-making based on transfer learning.
Step — Different scenario requirements, from simple to difficult
1. Set that there is only one attack UAV in the battlefield environment and train the UAV to avoid collision with obstacles and boundaries until it can reach the target area.
2. Use the strategy of the attack UAV from step 1 and add a defense UAV to the environment. The maneuverability of the defense UAV is not as good as that of the attack UAV. The defense UAV is trained to avoid collision with obstacles and boundaries and to perform the task of intercepting and attacking the attack UAV.
3. Use the strategy of the defense UAV trained in step 2. It is set that the attack UAV can detect the defense UAV in advance. The transfer strategy and the nontransfer strategy are used for training, respectively.

Table 2: Pseudocode of the 1vs1 countermeasure algorithm based on DDPG.
(1) Randomly initialize the parameters θ and ω of the evaluated networks of the actor and critic. Initialize the experience pool D with capacity M. Initialize the batch size batch_size, the attenuation factor γ, the soft update coefficient τ, the Gaussian noise variance noise, the maximum number of episodes Max_Episode, and the maximum number of steps per episode Max_Step
(2) For episode = 1 to Max_Episode do
(3) Obtain the respective states s_t of both sides according to the initial settings of the simulation environment
(4) For t = 1 to Max_Step do
(5) Feed s_t into the actor evaluated network to get the UAV's action a_t = f_clip(μ(s_t; θ) + N), where f_clip represents the function limiting the upper and lower bounds of the UAV's action
(6) If there is an enemy UAV, the enemy UAV takes its own confrontation maneuver decision according to the description in Table 2, executes action a_ct, and updates its own state s_ct to s_c(t+1)
(7) Select the action according to the ε-greedy strategy, that is, the training UAV randomly selects an action within the action range with a certain probability or takes the action a_t of step (5); then obtain the corresponding reward value r_t and the environment state s_(t+1) at the next moment
(8) Store the sample data [s_t, a_t, r_t, s_(t+1)] of the interaction between the UAV and the environment in the experience pool D
(9) Randomly select batch_size training samples [s_n, a_n, r_n, s_n'] from the experience pool D
(10) Calculate the loss function of the critic evaluated network and update the parameter ω of the critic evaluated network through backpropagation to minimize the loss function
(11) Calculate the loss function of the actor evaluated network and update the parameter θ of the actor evaluated network through backpropagation
(12) Update the parameters θ' and ω' of the actor and critic target networks every C steps
(13) end for
(14) end for
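During each training stage, the non-training UAV simply executes the policy obtained in the previous stage with frozen parameters, which keeps the opponent's behavior fixed while the other side learns; a sketch under the assumption of a PyTorch actor is given below.

```python
import torch

def opponent_action(opponent_actor, opponent_state):
    """Decision of the non-training UAV: use its previously trained policy with
    frozen parameters (no exploration noise, no gradient updates)."""
    opponent_actor.eval()
    with torch.no_grad():
        s = torch.as_tensor(opponent_state, dtype=torch.float32)
        return opponent_actor(s).numpy()
```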

Training Result Analysis.
The purpose of the reinforcement learning algorithm is to train the agent's strategy to maximize its expected cumulative reward. The evaluation index of training results is generally the average episode reward, plotted as a curve of the reward value obtained by the agent against the number of training episodes. The faster the reward value rises and the higher and more stable the converged reward value, the better the training effect. This paper uses the average reward of the last 100 episodes as the reported average reward value; if fewer than 100 episodes have elapsed since the beginning of training, only the average reward of the existing episodes is used.
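The reported curve can be computed as a running average over the most recent episodes, as in the following short sketch.

```python
import numpy as np

def running_average_reward(episode_rewards, window=100):
    """Average reward of the last `window` episodes (or of all episodes seen so
    far, if fewer than `window` are available)."""
    return [float(np.mean(episode_rewards[max(0, i - window + 1):i + 1]))
            for i in range(len(episode_rewards))]
```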
According to the training steps in Table 1, the strategies trained by the UAVs in the simple task scenarios of step 1 and step 2 are used in the scenario of step 3. In step 3, the task difficulty increases gradually, and the transfer and nontransfer methods are used for comparative analysis, respectively. The transfer method is based on the network parameters of the 1500 episodes trained previously. The details are as follows. As shown in Figure 6(a), the attack UAV has prior information of its starting position and goal position in the environment of step 1, and it is trained to avoid obstacles and boundaries. After 1500 episodes of training, the reward curve of the attack UAV is shown in Figure 6(b).
The abscissa of Figure 6(b) represents the number of training episodes, and the ordinate represents the average reward of the most recent 100 episodes. It can be seen from the figure that the UAV is not clear about what it should do at the beginning; it just interacts with the environment exploratively, and the data of these interactions are extremely useful. After the experience pool is filled (about 520 episodes), as the algorithm begins to train, the reward curve rises gradually and starts to show a trend of convergence after 720 episodes, with good stability. As shown in Figure 7(a), in step 2, the defense UAV uses the trained strategy of the attack UAV in step 1 to avoid obstacles and boundaries, and on this basis, the defense UAV is trained to intercept the attack UAV. If the distance from the attack UAV to the target location (yellow) is less than the distance from the defense UAV to the target location, the defense UAV cannot complete the interception and strike mission, because the maneuverability of the attack UAV is better than that of the defense UAV. Therefore, the episode is terminated early, which means that the attack UAV completes its task successfully and the defense UAV fails to defend.
The abscissa of Figure 7(b) represents the number of training episodes, and the ordinate represents the average reward of the most recent 100 episodes. It can be seen from the figure that after the experience pool is filled (approximately 580 episodes), the training curve begins to rise gradually and starts to converge around 850 episodes, with good stability.
In step 3, the defense UAV uses the defensive strategy trained in step 2. It is assumed that the attack UAV can detect the defense UAV with its airborne lidar and treats the defense UAV as an obstacle to avoid. Then, the attack UAV is trained with the strategy of the attack UAV trained in step 1 and with the nontransfer method, respectively. The training results are shown in Figure 8. Similarly, if the distance between the attack UAV and the target position (yellow) is less than the distance between the defense UAV and the target position, the episode is terminated in advance, and it is judged that the attack of the attack UAV has succeeded and the defense of the defense UAV has failed.
The abscissa of Figure 8 represents the number of training episodes, and the ordinate represents the average reward of the most recent 100 episodes. It can be seen from the figure that both the transfer and nontransfer methods converge within a certain period of time. In contrast, the transfer method has a higher episode reward value at the start of training and a higher reward value after convergence.

Experiment Result Analysis.
In this paper, the training results after 1500 episodes are tested with 10000 Monte Carlo runs. The parameters of the trained actor evaluated networks are loaded into the UAVs.
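A sketch of the Monte Carlo test loop is given below, assuming a hypothetical Gym-like wrapper around the Tkinter environment; the environment interface and the success flag are assumptions made for illustration.

```python
import torch

def monte_carlo_success_rate(env, actor, n_runs=10000, max_steps=200):
    """Run the trained (deterministic) actor for n_runs episodes and report the
    fraction of successful ones. `env` is a hypothetical wrapper around the
    Tkinter scenario; `info["success"]` is an assumed termination flag."""
    actor.eval()
    successes = 0
    for _ in range(n_runs):
        state, done, steps = env.reset(), False, 0
        while not done and steps < max_steps:
            with torch.no_grad():
                action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
            state, _, done, info = env.step(action)
            steps += 1
        successes += int(info.get("success", False))
    return successes / n_runs
```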
Three different scenarios are tested. The effects of this test are shown in Figures 9-11 (each small circle represents the current position of the UAV at every time step, which is 5 s). The test result data are shown in Figure 12.

The test results of step 1 show that the attack UAV trained by the presented method can avoid obstacles successfully. The final strategy achieves stable convergence, and the success rate of avoiding obstacles and reaching the designated area is 99.29%.
The test results of step 2 show the success and failure cases of the defense UAV, respectively. As shown in Figure 12, both UAVs can avoid obstacles successfully. The defense success rate of the defense UAV is 55.54%. In most of the failed defense cases, the two UAVs evade from different sides of an obstacle, so the defense UAV cannot intercept effectively. The test results of step 3 show the success and failure cases of the attack UAV, respectively. Compared with the result of the nontransfer method (86.05%), the transfer reinforcement learning method proposed in this paper increases the offensive success rate (87.56%). Moreover, the results of both sides are greatly improved compared to step 2 (43.55%).
1000 Monte Carlo experiments are conducted between the attackers and defenders trained by the traditional MADDPG algorithm and the attackers and defenders trained by the DDPG algorithm based on transfer learning. The experiment results are shown in Figure 13.
As shown in Figure 13, on the attack side, the winning rate of the transfer learning algorithm is 94.2%, which is significantly higher than MADDPG's 45.2% winning rate, while on the defense side, the winning rate of the transfer learning algorithm is 54.8%, which is also significantly higher than MADDPG's 6.8% winning rate. These results demonstrate the effectiveness and superiority of the algorithm proposed in this paper.

Conclusion
In this paper, reinforcement learning is applied to the UAV confrontation problem, and a 1vs1 confrontation method is designed based on the DDPG algorithm. Based on the model, transfer learning is introduced to train the UAVs.

The results show that the proposed method can make training converge faster and can increase the offensive success rate.
Due to its limited mobility, the task success rate of a single defense UAV is not high. Therefore, the next step is to study the maneuver decision-making of multiple defense UAVs against a single offensive UAV on the basis of the method proposed in this paper. In the longer term, optimizing the framework structure of the algorithm, making the environment more complex, and adding more UAVs to the scenario are the development directions.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest.