A Method of Multi-UAV Cooperative Task Assignment Based on Reinforcement Learning

With the increasing complexity of UAV application scenarios, the performance of a single UAV cannot meet the mission requirements. Many complex tasks need the cooperation of multiple UAVs. How to coordinate UAV resources becomes the key to mission completion. In this paper, a task model including multiple UAVs and unknown obstacles is constructed, and the model is transformed into a Markov decision process (MDP). In addition, considering the influence of strategies among UAVs, a multiagent reinforcement learning algorithm based on the SAC algorithm and the centralized training and decentralized execution framework, MA-SAC (Multi-Agent Soft Actor-Critic), is proposed to solve the MDP. Simulation results show that the algorithm can effectively deal with the task allocation problem of multiple UAVs in this scenario, and its performance is better than other multiagent reinforcement learning algorithms.


Introduction
Unmanned aerial vehicles (UAVs) are characterized by strong mobility, a low safety risk coefficient, no need for personnel to take off, repeatability, and so on. UAVs were first used in military fields [1], such as reconnaissance, target strike, air early warning, and electronic jamming. In recent years, UAV technology has developed rapidly, UAVs have become smaller, and their cost has fallen steadily. Therefore, UAVs are increasingly used in civil fields such as sensing [2], cargo transportation, communication relay [3], fire monitoring, and aerial mapping.
With increasingly complex application scenarios, such as the combination with the Internet of Vehicles [4], a single UAV cannot effectively complete complex and diverse tasks. It is important for multiple UAVs to perform tasks collaboratively, not only to meet the requirements of complicated scenarios but also to complete tasks with less time and resource consumption.
Task planning is the most important part of cooperative multi-UAV execution, and task allocation is the basis of task planning. Task assignment takes place in a complex task environment containing several UAVs: after taking full account of the energy consumption, load, nature, role, and other constraints of the UAVs, the UAVs and the various resources are coordinated so that one or more ordered tasks are assigned to each UAV, minimizing time and cost and ensuring, to the maximum extent, that tasks are completed efficiently and successfully. The task allocation problem is generally approximated as a path planning problem [5], that is, how to generate a collision-free path from the starting site to the destination to ensure the safety of the vehicle [6]. However, in the multi-UAV environment, not only collisions between UAVs and obstacles but also collisions between UAVs should be considered. At the same time, as the number of UAVs increases, the environment also becomes more variable. In addition, the action decisions of all UAVs can be regarded as simultaneous, and no UAV can know the current decisions of the other UAVs, so it is more difficult to avoid collisions between UAVs.
Fortunately, reinforcement learning (RL) techniques are emerging to help solve the problem of real-time decision-making in complex and changing environments. The technology allows the drone to learn a strategy that maximizes returns or achieves a specific purpose through constant interaction with the environment.
In this paper, a UAV task allocation model including UAV collision and communication energy consumption is presented; at the same time, an MA-SAC algorithm is proposed to assign tasks and plan paths for UAVs. The specific contributions of this paper are as follows:
(i) A multi-UAV task assignment model based on collision and communication energy consumption is proposed
(ii) Based on this assignment model, the dynamic process of task assignment is transformed into an MDP
(iii) A multiagent reinforcement learning algorithm, MA-SAC, is proposed to solve the MDP
The rest of this article is organized as follows. Section 2 describes the related work. In Section 3, the multi-UAV task assignment model is presented. Section 4 introduces the task assignment algorithm proposed in this paper. In Section 5, simulation is performed and the results are analyzed. Finally, the work of this paper is summarized in Section 6.

Related Work
In the past few years, many researchers have studied multi-UAV task allocation models and the algorithms that solve them. They not only bring the models closer to the increasingly complex real environment but also look for high-performance algorithms. This section introduces relevant work from these two aspects.

Task Allocation Model.
In various scenarios, different task allocation models need to be established based on the problems that the UAVs need to solve. In [7], the problem is modeled as a traveling salesman problem (TSP), which minimizes the total flight time and total range of all UAVs by considering the flight capability of UAVs. Jia et al. [8] construct a heterogeneous UAV cooperative multitask allocation scenario by considering kinematic constraints, resource constraints, time constraints, and the vehicle path model. Song et al. [9] describe the UAV logistics problem as a mixed integer linear programming problem considering UAV flight time, load, and other constraints. In addition, the task allocation problem of multi-UAV is usually described as a multidimensional multiple choice knapsack problem (MMKP) [10,11], a dynamic network flow optimization (DNFO) problem [12], or a multiple processors resource allocation (CMTAP) problem [13,14].

Task Assignment Algorithm.
Task assignment algorithms are mainly divided into optimization algorithms, heuristic algorithms, and reinforcement learning algorithms.
Optimization methods include the Hungarian algorithm [15,16], the branch-and-bound method [17], and other commonly used integer linear programming methods. These algorithms are only applicable to scenarios with simple tasks and a small number of UAVs. Their computational cost grows exponentially as the number of UAVs increases, and these algorithms cannot generate an accurate trajectory for UAVs in complex environments. Heuristic algorithms are proposed as an alternative to optimization algorithms, and include GA [18], ACO, and PSO, which simulate animal behavior in nature. These algorithms are generally combined with other algorithms to solve task assignment problems. In [18], GA is combined with a clustering algorithm to solve the task allocation and path planning problems of multiple UAVs. In [19], the author proposed two improved heuristic algorithms to solve TSP problems: one is the IGA algorithm, obtained by improving the coding rules of the genetic algorithm, and the other is the PSO-ACO algorithm, which combines PSO and ACO. In [20], the author improves the swarm gap algorithm and puts forward three algorithms, location loop (AL), sorting and allocation loop (SAL), and limit and allocation loop (LAL), which solve the task allocation problem of the UAV team in a military operation. However, heuristic algorithms tend to fall into local optima easily, and their real-time performance degrades as the complexity of the environment increases. Therefore, many researchers began to study the application of reinforcement learning in task assignment.
Reinforcement learning is a kind of algorithm that makes an agent learn the optimal strategy through trial and error in the environment. Reinforcement learning has been widely used in UAV mission assignment scenarios over the past few years. In [21], a transaction inspired multiagent reinforcement learning algorithm was proposed to solve the path planning and coordination problems of UAV clusters. In [22], the author proposed a MADOL algorithm to enable multiple UAVs to solve the ambiguous BSN allocation problem in an ambiguous boundary scenario. In [23], a multiagent reinforcement learning framework was developed, which solves the problem of dynamic resource allocation of a UAV communication network in an uncertain environment and balances performance gain against UAV overhead. In [24], the author proposed a multiagent reinforcement learning algorithm, compound-action actor-critic (CA2C), which solves the problem of UAVs performing sensing tasks through cooperative sensing and transmission. In [25], the author proposed an FTA algorithm by combining the DQN algorithm with prioritized experience replay, which effectively solved the problem of UAV task allocation in an uncertain environment. In [11], the author proposed a DDQN-per algorithm to solve the task assignment problem of MCS. However, these single-agent algorithms regard the agents in the environment as independent and cannot train a good agent cooperation model. The MADDPG algorithm [26] adopts the method of centralized training and distributed deployment, which handles cooperation and competition among multiple agents well. In [27], the author proposed an MADDPG algorithm, trained the MADDPG model offline, and then solved the resource allocation problem in the UAV-assisted vehicle network online. However, the DDPG algorithm uses a deterministic strategy, which may fall into a local optimum due to greed.
The SAC algorithm [28] introduces entropy, which requires not only maximum reward but also maximum entropy to enhance the spatial exploration ability of agents. Based on the idea of centralized training and separate deployment, this paper applies the SAC algorithm to the cooperative task assignment environment of multiple UAVs and proposes an MA-SAC algorithm.

Task Assignment Model
Multi-UAV should not only complete each task but also pay attention to their own safety and energy consumption. Figure 1 shows the task allocation framework of multi-UAV. In this paper, the distance from UAV to the mission positions, the collision of UAV, and the communication between UAV and base station are comprehensively considered to establish the task assignment model, and the specific modeling is as follows.

The Distance between the UAV and the Mission.
This paper considers how to assign multiple UAVs to multiple task points and plan safe paths, so as to reduce the total cost while completing the tasks quickly and safely. In this paper, the UAV cluster is represented by the set $V$. The position and track data of each UAV can be obtained by the GPS device carried by the UAV itself, and the data are transmitted to the MEC layer for calculation. For each UAV $v_i \in V$, $(s_{xi}, s_{yi})$ represents its current position. The set of tasks to be completed is represented by $W = \{w_1, w_2, w_3, \ldots, w_n\}$. For each task $w_i \in W$, $(s_{wxi}, s_{wyi})$ represents the task position.
The distance between UAV $v_i$ and mission location $w_j$ is the Euclidean distance between their positions:
$$L_{ij} = \sqrt{(s_{xi} - s_{wxj})^2 + (s_{yi} - s_{wyj})^2}$$
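As a minimal sketch of this distance computation, the following Python function builds the matrix of UAV-to-mission distances. The function name and the sample coordinates are hypothetical and chosen only for illustration; the paper does not specify an implementation.

```python
import math


def mission_distances(uav_positions, task_positions):
    """Return a matrix L where L[i][j] is the Euclidean distance
    from UAV i to mission location j."""
    return [
        [math.hypot(sx - wx, sy - wy) for (wx, wy) in task_positions]
        for (sx, sy) in uav_positions
    ]


# Hypothetical coordinates, not taken from the paper.
uavs = [(0.0, 0.0), (3.0, 4.0)]
tasks = [(3.0, 4.0), (6.0, 8.0)]
L = mission_distances(uavs, tasks)
```

Each row of the matrix belongs to one UAV, so the nearest mission for UAV `i` is simply `min(L[i])`.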

UAV Collision.
In order to simulate the real environment, some obstacles are added to the environment to block the routes of the UAVs. At the same time, collisions between a UAV and other UAVs are considered. As shown in the figure, there is a certain safety buffer area between the UAV and the obstacles. The distance between two UAVs $v_i$ and $v_j$ is the Euclidean distance between their positions, $\sqrt{(s_{xi} - s_{xj})^2 + (s_{yi} - s_{yj})^2}$. Once the distance between UAVs or between a UAV and an obstacle is less than the safety zone, the UAVs are considered to be at risk of collision.
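The safety-zone test can be sketched as follows. This is an illustrative sketch under simplifying assumptions that are not stated in the paper: obstacles are treated as points, and a single hypothetical `safety_radius` is used for both UAV-to-UAV and UAV-to-obstacle checks.

```python
import math


def collision_risks(uav_positions, obstacle_positions, safety_radius):
    """Return the indices of UAVs that are closer than safety_radius
    to another UAV or to an obstacle (obstacles modeled as points)."""
    at_risk = set()
    for i, (xi, yi) in enumerate(uav_positions):
        # Check against every other UAV.
        for j, (xj, yj) in enumerate(uav_positions):
            if i != j and math.hypot(xi - xj, yi - yj) < safety_radius:
                at_risk.add(i)
        # Check against every obstacle.
        for (ox, oy) in obstacle_positions:
            if math.hypot(xi - ox, yi - oy) < safety_radius:
                at_risk.add(i)
    return at_risk


# Hypothetical layout: UAVs 0 and 1 are too close to each other,
# UAV 2 is too close to the obstacle, UAV 3 is safe.
risky = collision_risks(
    [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (10.0, 10.0)],
    [(5.0, 5.2)],
    safety_radius=1.0,
)
```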

UAV Communication.
In order to grasp the status of the UAVs in real time, the communication between each UAV and the base station needs to be considered. The position of the base station is represented by $(B_x, B_y)$. In this paper, the altitude of the UAV above the ground is $h$, and the straight-line distance between the UAV and the base station can be calculated by the following formula:
$$L_{\text{uav-base}} = \sqrt{(s_{xi} - B_x)^2 + (s_{yi} - B_y)^2 + h^2}$$
Transmitting the data collected by UAV sensors consumes the energy of the sensor node [29]. In order to study the energy loss of UAV transmission, we consider the path loss of UAV communication with the base station. In the Friis free space model [30], the relationship between signal transmitting power and signal receiving power can be calculated by the following formula:
$$P_R = \frac{P_T G_T G_R \lambda^2}{(4\pi)^2 d^2 \beta}$$
where $P_R$ is the receiving signal power, $P_T$ is the transmitting signal power, $G_T$ is the transmitting antenna gain, $G_R$ is the receiving antenna gain, $\lambda$ is the signal wavelength, $\beta$ is the system loss factor unrelated to propagation, and $d$ is the propagation distance. In this paper, $d$ is the distance $L_{\text{uav-base}}$ between the UAV and the base station in each time slot. In order to ensure normal communication, the power of the attenuated UAV signal needs to be greater than the receiving power of the base station. Therefore, the signal transmitting power of UAV $v_i$ in each time slot $n$ must be large enough to satisfy this condition. The communication energy consumption of each UAV $v_i$ to complete the task is obtained by accumulating, over the time slot set $N = (n_1, n_2, n_3, \ldots, n_t)$ in which the UAV completes the task, the transmitting power multiplied by the slot duration $\delta$. In this paper, each time slot $n$ corresponds to one step in the simulation, and in this model $\delta$ is set to 1. The total communication energy consumption of the UAV cluster is the sum of the communication energy consumption of all UAVs.
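The communication model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `p_required` (the receiving power the base station needs) is a hypothetical parameter, and the UAV is assumed to transmit at exactly the minimum power that satisfies the Friis relation in every time slot. None of the function names come from the paper.

```python
import math


def base_distance(pos, base, h):
    """Straight-line distance from a UAV at altitude h to the base station."""
    return math.sqrt((pos[0] - base[0]) ** 2 + (pos[1] - base[1]) ** 2 + h ** 2)


def friis_received_power(p_t, g_t, g_r, wavelength, d, beta=1.0):
    """Friis free-space received power with system loss factor beta."""
    return p_t * g_t * g_r * wavelength ** 2 / ((4 * math.pi) ** 2 * d ** 2 * beta)


def min_transmit_power(p_required, g_t, g_r, wavelength, d, beta=1.0):
    """Smallest transmit power whose received power equals p_required
    (the Friis relation solved for the transmit power)."""
    return p_required * (4 * math.pi) ** 2 * d ** 2 * beta / (g_t * g_r * wavelength ** 2)


def communication_energy(track, base, h, p_required, g_t, g_r, wavelength,
                         beta=1.0, delta=1.0):
    """Energy of one UAV: sum over time slots of transmit power times delta.
    Assumes the UAV always transmits at the minimum sufficient power."""
    return sum(
        min_transmit_power(p_required, g_t, g_r, wavelength,
                           base_distance(pos, base, h), beta) * delta
        for pos in track
    )
```

The total energy of the cluster would then be the sum of `communication_energy` over all UAV tracks.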

Task Assignment Algorithm
In this section, we consider the application of reinforcement learning in multi-UAV task allocation, apply a soft actor-critic (SAC) algorithm to a multiagent environment, and propose an MA-SAC algorithm. This algorithm is usually used to solve problems described as a Markov decision process (MDP). So, this section will introduce the MDP of this model, the SAC algorithm, and the MA-SAC algorithm in turn.

Markov Decision Process.
MDP is usually composed of state, action, and reward function. Therefore, the MDP of the model can be described as follows.

State.
In this process, the state space is composed of the position and speed of the UAV, the distance between the UAV and the destination, and the collision risk of the UAV.

Action.
The action space is usually the optional action set of all UAVs in different states. In this model, the action space of each UAV is expressed as {front, back, left, right, hover}.
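A discrete action space of this kind can be sketched as a lookup table from action names to displacements. The axis convention (front is +y, right is +x) and the `step` size are assumptions made for illustration; the paper does not state them.

```python
# Assumed mapping from each action to a unit displacement (dx, dy).
ACTIONS = {
    "front": (0, 1),
    "back": (0, -1),
    "left": (-1, 0),
    "right": (1, 0),
    "hover": (0, 0),
}


def apply_action(position, action, step=1.0):
    """Return the new (x, y) position after taking one discrete action."""
    dx, dy = ACTIONS[action]
    return (position[0] + dx * step, position[1] + dy * step)


new_position = apply_action((2.0, 3.0), "left")
```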

Reward.
In this model, multiple UAVs face multiple tasks, and this paper aims to allocate task targets reasonably and plan a path for each UAV, so that each task can be completed safely and quickly with the minimum total energy consumption. Therefore, for UAV $v_i$, the reward is described in terms of $R_F$, $R_c$, and $R_L$, where $R_F$ is the reward for completing the task, whose value is constant; $R_c$ is the collision reward; and $R_L$ is the distance reward. In order to guide the UAV to the mission point, $R_L$ can be expressed as $R_L = -\min_j L_{ij}$, $j \in \{1, 2, \ldots, n\}$. In the task assignment problem, $w_{ij}$ indicates that mission $w_i$ is carried out by UAV $v_j$, and $v_{ij}$ indicates that UAV $v_i$ performs mission $w_j$. Formula (10) means that only one UAV can be assigned to perform each task, and formula (11) means that each UAV can only perform one task.
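To make the reward terms concrete, here is a hedged sketch. The distance reward follows the expression given in the text; the additive combination of the three terms and the constants `r_finish` and `r_collision` are assumptions for illustration and are not the paper's values. The helper `is_one_to_one` checks the two assignment constraints on a hypothetical 0/1 assignment matrix.

```python
def distance_reward(distances_to_tasks):
    """R_L: the negative distance to the nearest mission location."""
    return -min(distances_to_tasks)


def uav_reward(distances_to_tasks, completed, collided,
               r_finish=10.0, r_collision=-5.0):
    """Assumed additive reward: distance term, plus a constant bonus on
    task completion, plus a penalty when a collision risk is detected."""
    reward = distance_reward(distances_to_tasks)
    if completed:
        reward += r_finish
    if collided:
        reward += r_collision
    return reward


def is_one_to_one(assignment):
    """assignment[i][j] == 1 means UAV i performs task j.
    Valid when every UAV has exactly one task and every task one UAV."""
    rows_ok = all(sum(row) == 1 for row in assignment)
    cols_ok = all(sum(col) == 1 for col in zip(*assignment))
    return rows_ok and cols_ok
```

For instance, an assignment in which two UAVs fly to the same mission fails the column check, which corresponds to the uncoordinated behavior that can leave a mission uncompleted.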

SAC Algorithm.
The SAC algorithm is an off-policy reinforcement learning algorithm. This paper builds on the SAC algorithm proposed in [31]. That algorithm improves the critic network of the first version of the SAC algorithm [32]: it removes the value network and uses two Q networks. Therefore, the SAC algorithm has one actor network, two critic networks, and two target-critic networks. Among them, the actor network gives the corresponding action according to the change of state, and the critic network calculates the Q value to evaluate the action. In order to solve the overestimation problem, the SAC algorithm adopts a pair of independent critic networks and takes the smaller of the two values when updating. In order to stabilize the training of the Q networks, the SAC algorithm introduces a pair of target-critic networks that are updated less frequently than the critic networks.
In order to prevent the strategy from becoming trapped by greedy choices, the random exploration ability of the algorithm must be increased, so SAC introduces entropy regularization. The more uniform the strategy distribution is, the greater the entropy of the strategy, and the stronger the random exploration ability of the algorithm. Therefore, the objective function of the SAC algorithm requires not only the maximum final reward but also the maximum entropy. Its objective function can be expressed as
$$\pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right] \tag{12}$$
where $H(\pi(\cdot \mid s_t))$ is the entropy of the strategy, $r(s_t, a_t)$ is the reward at time $t$, $\alpha$ is the temperature coefficient that weights the entropy term, and $\pi^{*}$ is the optimal strategy.
Figure 2 shows the MA-SAC algorithm that we propose by improving the SAC algorithm based on the multi-UAV task allocation model. The MA-SAC algorithm is based on the actor-critic network framework. In this multi-UAV environment, each UAV has an actor network, a target-actor network, two critic networks, and two target-critic networks, all of which are fully connected neural networks.
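The link between a more uniform strategy and larger entropy can be checked numerically. The sketch below assumes a categorical policy over the five discrete actions, which is an illustrative choice and not necessarily how the paper parameterizes its policy; `alpha` and the probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def soft_reward(reward, probs, alpha):
    """One term of the entropy-regularized objective: r + alpha * H."""
    return reward + alpha * entropy(probs)


uniform = [0.2, 0.2, 0.2, 0.2, 0.2]        # explores all five actions equally
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]    # almost deterministic

h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
```

For the same reward, the uniform policy receives the larger entropy bonus, which is what pushes the agent toward exploration early in training.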

MA-SAC Algorithm.
In the multi-UAV environment, each UAV is not only an agent itself but also a part of the environment of the other UAVs. Therefore, the critic network of each UAV takes as input not only the environmental state but also the actions of the other UAVs, which form a part of the overall environment, in order to calculate the Q value. SAC, like DDPG and other algorithms, introduces the experience replay mechanism to reduce the correlation between data. Therefore, the whole training process is divided into two parts: experience collection and network training. In the experience collection phase, the agent performs the action generated in each step and then stores the tuple of state, action, next state, and reward, $(S, A, S', R)$, in the replay buffer.
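A replay buffer for the experience collection phase can be sketched with the standard library alone. The class and method names are hypothetical; the `ready` check mirrors the condition that training starts once the stored data exceeds the batch size.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (S, A, S', R) tuples for all agents."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest entry when full.
        self.storage = deque(maxlen=capacity)

    def store(self, states, actions, next_states, rewards):
        self.storage.append((states, actions, next_states, rewards))

    def ready(self, batch_size):
        """Training can start once more than batch_size tuples are stored."""
        return len(self.storage) > batch_size

    def sample(self, batch_size):
        """Draw a random batch without replacement to decorrelate the data."""
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(step, "hover", step + 1, 0.0)
batch = buffer.sample(2)
```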
When the amount of data in the replay buffer reaches the threshold, the network training stage begins. At each step, a batch of data is sampled from the replay buffer to update the parameters of the actor networks and critic networks. The actor network is trained by the policy gradient. For each UAV $v_i \in V$, the actor network update target is as follows:
$$J(\theta_i) = \mathbb{E}_{X, a \sim D}\left[ \alpha \log \pi_i(a_i \mid s_i) - Q_i^{\pi}(X, a_1, \ldots, a_n)\big|_{a_i = \pi_i(s_i)} \right]$$
where $\pi_i$ represents the policy network of agent $i$, $\theta_i \in \{\theta_1, \theta_2, \ldots, \theta_n\}$ represents the parameters of the policy network $\pi_i$, and $X$ represents the current status of all agents. Critic networks are updated by minimizing a loss function. The loss function is the mean square error, which can be calculated by the formula
$$L = \mathbb{E}_{(X, a, r, X') \sim D}\left[ \left( Q_i^{\pi}(X, a_1, \ldots, a_n) - y_i \right)^2 \right], \quad y_i = r_i + \gamma\, \mathbb{E}\left[ Q_i^{\pi}(X', a_1', \ldots, a_n')\big|_{a_i' = \pi_i(s_i')} \right]$$
where $X'$ represents the next status of all agents, $a_i'$ represents the next action of agent $i$, $s_i'$ represents the next state of agent $i$, and $\gamma$ is the discount factor.
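The critic update can be illustrated with plain numbers instead of neural networks. In this sketch the target uses the smaller of two next-step Q estimates, in line with the twin-critic idea described for SAC; omitting any entropy term from the target is a simplifying assumption, and all values are hypothetical.

```python
def td_target(reward, next_q1, next_q2, gamma):
    """Target y = r + gamma * min(Q1', Q2'), using the smaller of the
    two critic estimates to limit overestimation."""
    return reward + gamma * min(next_q1, next_q2)


def critic_loss(q_values, targets):
    """Mean squared error between predicted Q values and their targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Hypothetical batch of two transitions.
targets = [
    td_target(reward=1.0, next_q1=5.0, next_q2=3.0, gamma=0.9),
    td_target(reward=0.0, next_q1=2.0, next_q2=4.0, gamma=0.9),
]
loss = critic_loss([3.5, 2.0], targets)
```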
To ensure the stability of training, the parameters of the actor networks and critic networks are transferred to the corresponding target networks in each iteration. Here, the algorithm adopts the soft update method, so in each step only a fraction of the actor and critic network parameters is blended into the corresponding target network, which can be calculated by the formula
$$\bar{\psi} \leftarrow \tau \psi + (1 - \tau)\bar{\psi}$$
where $\bar{\psi}$ is the parameter of the target-critic network, $\psi$ is the parameter of the critic network, and $\tau$ is the update ratio. The pseudocode of the MA-SAC algorithm is demonstrated in Algorithm 1, and the meanings of the parameters are shown in Table 1.
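The soft update is a weighted average of online and target parameters, often called Polyak averaging. A minimal sketch on plain lists of numbers follows; in practice the same rule would be applied to every tensor of the network, and the function name and `tau` value are illustrative assumptions.

```python
def soft_update(target_params, online_params, tau):
    """Blend online parameters into the target: tau * w + (1 - tau) * w_bar."""
    return [
        tau * w + (1.0 - tau) * w_bar
        for w_bar, w in zip(target_params, online_params)
    ]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, tau=0.1)
```

With a small `tau` the target network trails the online network slowly, which keeps the regression targets of the critic stable; `tau = 1` would reduce to a hard copy.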

Experimental Results and Analysis
In this section, the performance of the MA-SAC algorithm in the multi-UAV task assignment environment is studied. We use the PyTorch deep learning framework to simulate this scenario and compare MA-SAC with the MADDPG algorithm. Table 2 shows the relevant hyperparameters of the algorithm simulation in this paper.
In this experiment, we constructed an environment in which multiple UAVs cooperate to complete tasks. The environment consists of three UAVs, three mission positions, one obstacle, and a base station that communicates with the UAVs. Firstly, the MADDPG algorithm proposed in reference [26] is selected to compare the convergence performance.

Mobile Information Systems
Figure 3 shows the convergence process of the MA-SAC algorithm and the MADDPG algorithm during training in this environment. In this experiment, we performed 50,000 training episodes and averaged the rewards every 1,000 episodes. By comparing the two algorithms, it can be found that the proposed MA-SAC algorithm finally converges to around 300, while the MADDPG algorithm finally converges to around 220. It can be seen that the convergence speed of the two algorithms is similar in this scenario, but the convergence result of the MA-SAC algorithm is better than that of the MADDPG algorithm, because the training goal of the MA-SAC algorithm is not only to maximize the reward of the UAV but also to maximize the entropy of the UAV strategy. This increases the ability of the UAV to explore the space, thereby improving the performance of the algorithm.
Algorithm 1: Pseudocode of the MA-SAC algorithm.
Initialize environment
Initialize critic network and actor network
Initialize max episodes, replay buffer, batch size
for episode ∈ [1, episodes] do
    Reset environment
    Get current state $s_i$ for each agent $i$
    for step ∈ [1, steps] do
        Select actions $a_i$ for each agent $v_i$
        Get all agents' next states $s_i'$ and rewards $r_i$
        Store $(a_i, s_i, s_i', r_i)$ in replay buffer $D$
        if $D_{size} > B_{size}$ then
            Sample batch $B$ from replay buffer $D$
            for $v_i$, where $i = 1:N$ do
                Update the critic network
                Update the actor network
                Update the target network according to formulas (15), (16)
            end for
        end if
    end for
end for

To verify the effectiveness of the algorithm in this scenario, we conducted 500 test episodes with the MA-SAC algorithm in this environment and compared it with other algorithms. As shown in Table 3, the task completion rate of the MA-SAC algorithm reaches 95.16%, which is a great improvement compared with that of the COMA and VDN algorithms, and the task completion rate is also increased by 2.4% compared with the MADDPG algorithm.
Figure 4 shows the dynamic assignment process of UAVs in the task area before training. At this time, none of the three UAVs has learned any strategy, so they are in an exploration state in the environment. It can be seen from the routes of the UAVs in the task assignment process that the UAVs do not have clear mission targets at this time and move randomly in space. UAV 2 even collides with an obstacle. Figure 5 shows the rendering of the multi-UAV task assignment process after 20,000 episodes of training with the proposed MA-SAC algorithm. It can be seen that although the UAVs have learned to approach the mission points at this time, there is no coordination between them. Both UAV 2 and UAV 3 flew to the same mission location, so not all missions were completed. Figure 6 shows the task assignment process of the UAVs when the training reaches 50,000 episodes.
At this point, the trained model can already solve the task assignment problem in this environment well. UAVs not only consider their distance when assigning tasks but also take into account the strategies of other UAVs and cooperate with each other to complete all tasks in the mission area. At the same time, UAVs have also learned to stay away from obstacles to reduce their own risks when completing tasks. It can be seen that UAV 2 is relatively close to the obstacle at the beginning, so there is a possibility of collision. In order to ensure its own safety, it first flies away from the obstacle, and then flies to the mission location after reaching the safe area.

Conclusions
In this paper, a multi-UAV cooperative task assignment model in a complex environment is constructed by considering UAV distance, collision, and communication. Meanwhile, we propose an MA-SAC algorithm to solve the model by combining the SAC algorithm of deep reinforcement learning with the multiagent framework of centralized training and decentralized execution. Simulation results show that the MA-SAC algorithm is superior to the MADDPG algorithm in convergence result in the multi-UAV task allocation environment. In terms of task completion rate, the model trained by the MA-SAC algorithm also achieved a better result.
In future work, more complex factors will be considered in the environment, such as weather changes and a communication model better suited to real scenes. We will also study larger-scale dynamic task allocation of UAVs. Since this paper only studies the UAV cooperation scenario, UAV task allocation in the countermeasure scenario will be studied in the future.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.