A Fast Optimal Coordination Method for Multiagents in Complex Environments

Facing implementation problems such as low growth reward, long training time, and poor stability when multiagent learning methods deal with complex environments and larger numbers of agents, this paper proposes a fast optimal coordination method for multiagents in complex environments (FOC-MACE). Firstly, an environment exploration strategy is introduced into the policy network based on the MADDPG method to obtain higher growth rewards. Then, parallel computing technology is adopted in the critic network in order to effectively reduce the training time. Together, these tactics enhance the stability of multiagent learning. Lastly, optimal resource allocation is carried out to realize optimal coevolution of the multiagents and further improve the learning ability of the agent group. To verify the effectiveness of our proposal, FOC-MACE is compared with several current advanced methods in the MPE environment. Three different experiments show that with our method the growth reward is increased by up to 37.1%, the training is sped up significantly, and the stability of the method, represented by the standardized variance, is also improved. In addition, this paper validates the fast optimal coordination method for multiagent systems in UAV scenarios, demonstrating the practical performance of the approach. Through comprehensive experiments and scenario validations, the study confirms the effectiveness of the proposed fast optimal coordination method for multiagent systems in complex environments.


Introduction
The latest progress of multiagent learning methods in complex surroundings helps solve many difficult problems in the field of intelligent systems [1]. However, as the complexity of the environment increases and the number of agents expands, multiagent learning methods continue to face challenges such as low growth reward, long training times, and poor stability, limiting their effectiveness in practical applications [2]. Enhancing the growth reward is crucial for accelerating the learning process, enabling agents to adapt to the environment more quickly and improve their performance, thereby achieving the expected task goals more rapidly. With an increase in the number of agents, the training time also escalates. Shortening the training time offers advantages such as saving computational resources, expediting technological deployment, and improving efficiency; a more efficient algorithm can complete training and generate viable solutions in a shorter timeframe [3, 4]. Stability refers to the ability of a multiagent system to maintain performance under different environmental conditions or in the face of changes. Improving stability makes the system more robust and capable of adapting to various complex and changing scenarios.
In the complex field of multiagent collaborative optimization, there is a scarcity of literature specifically addressing this issue, and clear, relevant research is lacking. However, some scholars have already undertaken related research on agents exploring their environments. The primary objective of multiagent exploration of the environment is to enhance the system's growth reward and improve its stability. Chen et al. [5] proposed an effective model-free, off-policy AC algorithm for robots to achieve skill acquisition and continuous control; by combining the task reward with a task-oriented guidance reward, an agent can explore the environment more intentionally and thus achieve sampling efficiency. Raziei and Moghaddam [6] developed and tested a deep RL framework based on the concepts of task modularity and transfer learning, the hyper-actor soft actor-critic (HASAC), which transfers the learned strategies of previous tasks to a new task through "hyper-actors" to enhance the adaptability of the agent to new tasks. References [7, 8] used the DDPG method to realize the optimal decision-making of robots and UAVs in certain environments. Bougie and Ichise [9] proposed a new curiosity mechanism for deep reinforcement learning so that the agent can better explore its environment. Luis et al. [10] proposed the DDQL method to complete the patrol task of autonomous vehicles. Hou et al. [11] used the twin-delayed deep deterministic policy gradient (TD3) to learn the motion ability of a manipulator. Meanwhile, some scholars pay more attention to ensuring the stability of the method. Liu et al. [12] proposed two knowledge transfer methods based on deep neural networks (i.e., direct value function transfer and NSR-based value function transfer) to accelerate multiagent learning and obtain better asymptotic performance. Zheng et al. [13] extended the weighted double estimator to the field of multiagent learning and proposed an MA-DRL framework, the weighted double deep Q-network (WDDQN), which reduces bias effectively and is able to handle raw visual inputs. Among the above methods, the DDPG method adapts to the environment faster than other methods under certain conditions, performs better in terms of agent exploration, and has been applied to realize cooperation-confrontation of robots and drones. However, due to the complexity of the critic network computation, the DDPG method is more suitable for single-agent environment exploration than for multiagent exploration of complex environments.
In addition to the need to focus on increasing the growth reward and improving stability, as the number of agents increases, the training time also lengthens correspondingly. Therefore, shortening the training time has become a crucial issue. In current research, some scholars focus on enabling learning methods to deal with large numbers of agents. Jadhav and Farimani [14] provided a deep learning framework for extracting agent motion information in constrained multiagent/particle systems such as ants, termites, fish, and other similar constrained multiparticle systems with elastic collision interactions. Zhou et al. [15] proposed the corresponding policy interaction reinforcement learning algorithm (SIQ) to learn the optimal strategy of each agent, used a neural network to estimate the expected cumulative reward of the interaction between an agent and its adjacent agents, and finally verified the effectiveness of the method in a hybrid cooperative-competitive adversarial game. Because more agents increase the complexity of the model and lengthen the training time, research on learning with small numbers of agents is more likely to make progress. Sunehag [16] used a novel value decomposition network architecture to train individual agents and solved the problem of false rewards in multiagent learning. De Souza et al. [17] used deep reinforcement learning to pursue an omnidirectional goal and realized pursuit and evasion actions between multiple agents. The authors of references [18, 19] used the MADDPG method to solve the problem of agent cooperation in the MPE environment. Tampuu et al. [20] brought the deep Q-learning framework to a multiagent environment and studied the interaction and cooperative competition between two learning agents in the famous video game Pong. Sui et al. [21] proposed a new method based on deep reinforcement learning (RL) to solve the anticollision FCCA problem in uncertain dynamic environments. The authors of references [22, 23] applied the SAC algorithm to mobile robot navigation and to the cooperative work of multijoint robots. Zhang et al. [24] proposed an attention value decomposition network (AVDNet) that utilizes the coordination relationships between agents and verified the advanced performance of the method in cooperative scenarios. Zhao et al. [25] proposed a PCD method to solve the problems of a low training success rate and a poor coordination effect in multirobot motion coordination. Among the above methods, when the number of agents is small, the MADDPG method shows good performance, taking into account both growth reward and stability. However, MADDPG's policy network can be disturbed by complex environments, and the computing speed of the critic network tends to slow down as the number of agents increases. In addition, the aforementioned methods do not involve resource allocation for multiple agents. Reasonable resource allocation can further improve the growth reward, training speed, and stability, which helps realize the optimal coordination of multiple agents.
From the above literature, it is evident that existing methods exhibit suboptimal performance when exploring complex training environments. It is noticed that in the policy network of multiagent learning, the agent is usually given only the current surrounding information, which is not conducive to the agent's exploration of a complex environment. Therefore, in this paper, the prediction information of the agent at the next moment is introduced into the policy network to improve the adaptability of the agent to the complex environment and increase the reward. Furthermore, it is found that as the number of agents increases, not only does the training time increase, but possible overfitting of the critic network also emerges. Based on this observation, this paper optimizes the network structure by adopting parallel operation in the critic network to shorten the calculation time and reduce the variance of the training curve. In addition, we found in the experiments that the learning of multiple agents is also closely related to their resource allocation. Reasonable resource allocation, on the basis of environmental exploration and parallel computing, can further improve the performance of the method and finally realize the optimal coordination of multiple agents.
To address these issues, this paper proposes a fast optimal coordination method for multiagents in complex environments (FOC-MACE), which aims to improve the growth reward, training speed, and stability of multiagents in complex environments. The remaining sections of the article are organized as follows. Section 2 introduces the relevant research on the AC-based MADDPG multiagent learning method for addressing challenges in multiagent environments. Section 3 provides a brief overview of the traditional Markov decision process and the AC learning framework, followed by a detailed description of the proposed method. Section 4 presents relevant experiments conducted in the MPE training environment, validating the performance of the proposed algorithm. Section 5 conducts complex scenario validations for FOC-MACE. Finally, Section 6 summarizes the work presented in this paper.

Related Work
The AC-based MADDPG multiagent learning method combines policy search and value evaluation and is suitable for cases with a small number of agents. This type of method has recently received a lot of attention in related research [18, 26]. In these works, two kinds of improvements over this method are proposed. One is to reduce the interference of the environment with the policy network by designing a new reward mechanism, so as to enhance the agent's ability to explore the environment [9, 13, 27]. The other is to change the evaluation strategy of the critic network to obtain a higher response speed together with fewer calculation errors [15, 28, 29]. With a similar focus, the methods proposed in [19, 30, 31] take into account the characteristics of both types of methods and show excellent performance: the growth reward, training speed, and stability of these methods are significantly better than those of other methods. In reference [19], a strategy smoothing technique is introduced into the MADDPG method to reduce the variance of the learned strategies, alleviating the training instability of cooperative and competitive multiagents and significantly improving the stability and performance of training. The authors of reference [30] introduced a new intrinsic motivation mechanism (GICM), which encourages agents to pursue novel and surprising states to explore the current scene more comprehensively; compared with ordinary reinforcement learning algorithms, the resulting algorithm has faster strategy convergence and a higher task success rate. The authors of reference [31] used a parameter asynchronous sharing mechanism and a soft sharing mechanism to balance the exploration of agents and the consistency of similar agents' strategies; their empirical results show that this method can significantly improve the training efficiency of collaborative tasks, competitive tasks, and mixed tasks. Chen et al. [32] proposed a reinforcement learning framework based on global and local attention, addressing decision-making challenges in collaborative tasks involving multiple UAVs; the effectiveness and efficiency of this approach were validated through simulation experiments. However, the experimental environments in the above literature are simple, and multiagent performance verification in complex environments is lacking.
FOC-MACE introduces the environment exploration strategy into the policy network, guides the update direction of the policy network, and enhances the exploration and prediction of the complex environment so as to prevent a low growth reward of the agent in the complex environment and significantly improve performance. In the critic network, parallel operation technology is adopted, and the time-difference function is used to evaluate the network in parallel, aiming to avoid overfitting, improve the training efficiency of the multiagents, and achieve better training speed and stability of the method. In addition, when multiagents face complex environments, multiobjective optimization methods are applied in this paper to optimize the resource allocation, which realizes the optimal coordination of the multiagents and further improves the performance of the method.
The scheme of our FOC-MACE is illustrated in Figure 1. In the interaction process between the agents and the environment, the environment exploration strategy helps improve the growth reward during training, thereby enhancing the exploration ability of the agents. In addition, the parallel operation mechanism improves the training speed and stability of the method by shortening the training time and reducing the variance of the training curve. Finally, the optimal resource allocation is carried out to jointly realize the optimal coordination of the multiagents.
In order to verify the superiority of the proposed method, in Section 4.1 the same MPE training environment used by other methods is applied to produce comparisons on evaluation indicators such as reward, episode, and variance, reporting the exploration ability, training speed, and stability of our method.

Method of This Article
This section first introduces the symbols used, then briefly summarizes the traditional Markov decision process and the AC learning framework, and finally presents a detailed description of the methods proposed in this paper. Section 3.1 is about environmental exploration, Section 3.2 focuses on parallel operation, and Section 3.3 explains the optimal resource allocation.

A Markov decision process can model the interaction between agents and the environment. Each agent $i$ is represented by an independent quintuple $(S, A, P, R, \gamma)$, where $S$ is the state space of the agents, the set of all possible environmental states; $A$ is the action space of the agents, the set of all possible actions that the agents can take; and $P$ is the state transition function. To ensure that the agents' action output is deterministically continuous, the state transition function $p_i$ for agent $i$ specifies the probability of transitioning from state $s_i$ to the next state $s_{i+1}$ after taking action $a_i$. $R$ is the reward function, and $\gamma$ is the discount factor, with $0 < \gamma < 1$; the discount factor adjusts the balance between long-term and immediate rewards. Agent $i$, at time $t$, has the corresponding state $s_i^t$. Agent $i$ uses a certain strategy $\pi_i^t$ to select the action $a_i^t$ and interact with the environment, and its state $s_i^t$ is transferred to the state $s_i^{t+1}$ at the next moment. In addition, according to its own state and the action it performs, the agent obtains a reward at time $t$, that is, $(s_i^t, a_i^t) \rightarrow r_i^t$. This cyclic process is a Markov decision process.
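To make the interaction loop concrete, the following minimal Python sketch accumulates the discounted return of one agent over an episode; the `env` object, its `step` signature, and the `policy` function are illustrative placeholders rather than interfaces from our implementation.

```python
def rollout_return(env, policy, s0, gamma=0.95, horizon=50):
    """Run one episode of the Markov decision process for a single agent and
    accumulate the discounted return sum_t gamma^t * r_t.

    env    -- hypothetical environment exposing step(state, action) -> (next_state, reward)
    policy -- function mapping a state s_t to an action a_t
    s0     -- initial state s_0
    gamma  -- discount factor, 0 < gamma < 1, balancing long-term and immediate rewards
    """
    state, total = s0, 0.0
    for t in range(horizon):
        action = policy(state)                    # a_t = pi_i(s_t)
        state, reward = env.step(state, action)   # (s_t, a_t) -> r_t, s_{t+1}
        total += (gamma ** t) * reward            # discount future rewards by gamma^t
    return total
```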
Similarly, for multiagents, the joint state of the agents at time $t$ is $S^t = (s_1^t, s_2^t, \ldots, s_N^t)$, and the joint strategy is defined analogously as $\pi^t = (\pi_1^t, \pi_2^t, \ldots, \pi_N^t)$. In our work, the experimental environment is a two-dimensional space with continuous attributes. In each iteration, the agent acquires the current state from the environment, selects an appropriate action from the action space using its current policy, applies the chosen action to the environment, and finally obtains a reward from a predefined reward function. Simultaneously, the environment state transitions as actions take place. The design details of the state space, action space, and reward function are as follows:

(1) State space: the information received by the agent from the environment includes both position and velocity information. During the process of reaching the target point or the hunter chasing the prey, different states entail different pieces of information. Positional information encompasses the positions of the hunters $s_{\text{hunt}}^{\text{pos}}$, the prey $s_{\text{prey}}^{\text{pos}}$, the target points $s_{\text{targ}}^{\text{pos}}$, and any obstacles $s_{\text{land}}^{\text{pos}}$. Velocity information includes the velocities of the hunters $s_{\text{hunt}}^{\text{vel}}$ and the prey $s_{\text{prey}}^{\text{vel}}$. Hence, the state space can be represented as $s = \{ s_{\text{hunt}}^{\text{pos}}, s_{\text{prey}}^{\text{pos}}, s_{\text{targ}}^{\text{pos}}, s_{\text{land}}^{\text{pos}}, s_{\text{hunt}}^{\text{vel}}, s_{\text{prey}}^{\text{vel}} \}$.

(2) Action space: in reinforcement learning, the set of effective actions, determined by the environment, constitutes the action space. In our work, the actions of each agent can be decomposed into accelerations along the x and y directions, and the action space vector is defined for the N agents accordingly.

(3) Reward function: in the experimental environment, there are two distinct categories of agents, hunters and prey. The number of agents varies from few to many, and the complexity of the environmental obstacles increases gradually. Equation (3) gives the reward function for the hunters, where $n_r$ denotes the number of prey in the experiment and x and y are the coordinates of each agent. Equation (4) gives the reward function for the prey, where $n_b$ denotes the number of hunters in the experiment. As the distance between a hunter and the prey decreases, the reward for the hunter increases; conversely, as the distance between the prey and the hunters increases, the reward for the prey increases. When a hunter collides with prey, it signifies a successful capture by the hunter, i.e., the prey experiences a capture event. The prey's reward is thus based on maintaining distance from the hunters.
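As an illustration of how these quantities fit together, the sketch below assembles the state vector from the position and velocity terms listed above and uses simple distance-based rewards; the exact constants and functional forms of equations (3) and (4) are not reproduced here, so the reward helpers are illustrative assumptions only.

```python
import numpy as np

def build_state(hunt_pos, prey_pos, targ_pos, land_pos, hunt_vel, prey_vel):
    """Concatenate the position and velocity information into the state vector s."""
    parts = [hunt_pos, prey_pos, targ_pos, land_pos, hunt_vel, prey_vel]
    return np.concatenate([np.asarray(p, dtype=float).ravel() for p in parts])

def hunter_reward(hunter_xy, prey_positions):
    """Illustrative hunter reward: increases as the distance to each prey shrinks."""
    hunter_xy = np.asarray(hunter_xy, dtype=float)
    return -sum(np.linalg.norm(hunter_xy - np.asarray(p, dtype=float))
                for p in prey_positions)

def prey_reward(prey_xy, hunter_positions):
    """Illustrative prey reward: increases as the distance to each hunter grows."""
    prey_xy = np.asarray(prey_xy, dtype=float)
    return sum(np.linalg.norm(prey_xy - np.asarray(h, dtype=float))
               for h in hunter_positions)
```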
In the AC learning method, the Q function is introduced in place of the reward $R^t$. The policy network outputs $a_i^t$ according to $s_i^t$. The critic network calculates the Q function value based on $a_i^t$ and the state $s_i^{t+1}$ obtained from the interaction between the agent and the environment. In addition, both the policy network and the critic network use a loss function composed of Q function values to update their gradients.
The critic network is updated according to the squared error between the target Q value and the actual Q value; its loss function is given in equation (1). Here, $N$ is the total number of agents in the environment, $\theta_Q$ is the critic network parameter, and $y^t$ is the target Q value output by the target critic network, whose formula is given in equation (2). $Q(S^t, A^t; \theta_Q)$ is the actual Q value of the critic network, representing the value of the action $A^t$ under the current state $S^t$.
In equation (2), $\gamma$ is the discount factor, and the objective of the critic network is to minimize this loss; therefore, the parameter $\theta_Q$ of the critic network is updated by gradient descent. The goal of the policy network is to maximize the value of the action, and the gradient of the policy network is computed according to equation (3). In equation (3), except for the action $a^t$ of the current agent, which must be computed in real time using the policy network, the remaining actions can be obtained from the experience replay buffer. Because the goal of the policy network is to maximize the score given by the critic network, the parameter $\theta_u$ of the policy network is updated by gradient ascent.
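Since equations (1)-(3) are referenced but not reproduced above, the following sketch shows the standard quantities they describe, as a hedged reconstruction from the surrounding text: the squared error between target and actual Q values, a one-step bootstrapped target, and the critic score that the policy network ascends. The function names and the averaging convention are assumptions for illustration.

```python
import numpy as np

def critic_loss(y_target, q_actual):
    """Mean squared error between the target Q values y_t (from the target critic)
    and the actual Q values Q(S_t, A_t; theta_Q), in the spirit of equation (1)."""
    y_target, q_actual = np.asarray(y_target, float), np.asarray(q_actual, float)
    return float(np.mean((y_target - q_actual) ** 2))

def target_q(rewards, q_next, gamma=0.95):
    """One-step bootstrapped target y_t = r_t + gamma * Q'(S_{t+1}, A_{t+1}),
    in the spirit of equation (2)."""
    return np.asarray(rewards, float) + gamma * np.asarray(q_next, float)

def actor_objective(q_values):
    """Average critic score of the actions proposed by the policy network;
    gradient ascent on this objective maximizes the action value (equation (3))."""
    return float(np.mean(q_values))
```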
In the learning framework of AC, we identified some problems and propose the following assumptions:

Assumption 1: by introducing the environment exploration strategy into the policy network, the agent can better explore the complex environment and improve the growth reward of the training curve. In the gradient update formula of the policy network, a proportion of the target Q value $y^{t+1}$ at the next moment $t+1$ is introduced to improve the exploration ability of the agent. In equation (3), the gradient update formula of the policy network considers only the actual Q value of the critic network at the current time $t$. However, the optimal decision at the current moment may not be the globally optimal decision, so the policy network of the AC learning framework may suffer from local optima. Introducing the target Q value $y^{t+1}$ at the next moment can avoid such local optima.

Assumption 2: introducing parallel computing technology into the critic network can effectively shorten the training time and avoid overfitting. The amount of computation for the difference error function in equation (1) increases with the number of agents, which affects the training speed of the method. Parallel computing technology can reduce the network computation and improve the training speed.

Assumption 3: optimal resource allocation for complex multiagent systems can effectively improve the performance of the method. Equations (1) and (3) show that the operation of the policy network and the critic network is closely related to the state $S^t$ of the agents.
The training curve is affected by the initial information of the agents. Reasonable resource allocation can further improve the growth reward of the training curve, reduce the training time, and improve stability.
In summary, the environment exploration strategy and parallel computing technology proposed in this paper can cope with multiagents in complex environments and can significantly improve the growth reward, training speed, and stability of the agents. In addition, the optimal resource allocation can solve the optimal coordination problem of multiagents and further improve the performance of agents in complex environments.

Environmental Exploration Strategy.
The policy network of the AC learning method takes as input only the current information of the agent, which means that the agent responds only to the current moment. This behavior is not conducive to the agent's exploration of complex environments. Therefore, according to Assumption 1, we introduce the prediction information of the agent at the next moment into the policy network to improve the adaptability of the agent in the complex environment and the growth reward of the training curve. The critic network combines the state and action information of the other agents to output the evaluation value Q to guide the update of the policy network. On this basis, a proportion of $Q'$ values is introduced to account for the predicted state and action of the agent at the next moment, steer the update direction of the policy network, and improve the adaptability of the agent in the complex environment and the growth reward of the training curve. The gradient formula of the environmental exploration strategy is given in equation (4). Here, $N$ is the number of agents, $\theta_u$ is the policy network parameter, $\theta_Q$ is the critic network parameter, $\theta_{Q'}$ is the target critic network parameter, $\beta = 0.8$ is the self-coefficient, and $\varphi = 0.2$ is the exploration coefficient. Equation (4) includes the estimated information of the agent at time $t+1$, which directs the objective of the policy network toward the subsequent movement of the agent and improves the adaptability of the agent to the complex environment. In addition, the standardized $\beta$ and $\varphi$ control the weights of the $Q$ value and the $Q'$ value, which ensures the gradient balance of the policy network update. The self-coefficient $\beta$ retains the main guiding effect of the actual Q value on the policy network update and also attenuates an overestimated Q value. The exploration coefficient $\varphi$ introduces the $Q'$ value at time $t+1$, which improves the exploration ability of the agent, attenuates an excessive $Q'$ value, and avoids local optima during training.
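The extracted text does not reproduce equation (4); a plausible reconstruction consistent with the description above, with $\beta$ weighting the actual Q value at time $t$ and $\varphi$ weighting the target value $Q'$ at time $t+1$, is written out below and should be read as a hedged sketch rather than the exact published formula.

```latex
\nabla_{\theta_u} J \;\approx\; \frac{1}{N}\sum_{i=1}^{N}
\nabla_{\theta_u}\,\pi_i\!\left(s_i^{t};\theta_u\right)\,
\nabla_{a_i^{t}}\!\Big[\,\beta\,Q\!\left(S^{t},A^{t};\theta_Q\right)
\;+\;\varphi\,Q'\!\left(S^{t+1},A^{t+1};\theta_{Q'}\right)\Big],
\qquad \beta = 0.8,\;\; \varphi = 0.2 .
```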
The goal of the policy network is to maximize the value of the action, and equation (5) gives the gradient formula of the random mutation module for updating the policy network. The difference between the target Q value and the actual Q value is called the time-difference error. Using this error in place of the cumulative discounted reward and punishment is a bootstrapping (guiding) process. Therefore, the parallel operation of the critic network can be realized from two consecutive states and the corresponding reward and punishment values. Equation (7) uses a gradient descent update to ensure the convergence of the critic network.
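One way to realize this parallel evaluation is to compute the time-difference error for all agents in a single batched operation instead of looping over agents; the vectorized form below is a minimal sketch of that idea and is an assumption about the implementation, not the paper's exact code.

```python
import numpy as np

def batched_td_errors(rewards, q_current, q_next, gamma=0.95):
    """Time-difference errors delta_i = r_i + gamma * Q'_i - Q_i for all N agents,
    computed in one vectorized (parallel) call from two consecutive states'
    value estimates and the corresponding rewards."""
    rewards = np.asarray(rewards, dtype=float)      # shape (N,)
    q_current = np.asarray(q_current, dtype=float)  # Q(S_t, A_t) per agent
    q_next = np.asarray(q_next, dtype=float)        # target Q'(S_{t+1}, A_{t+1}) per agent
    return rewards + gamma * q_next - q_current

# Example: TD errors for three agents in a single call.
print(batched_td_errors([1.0, 0.5, -0.2], [0.8, 0.4, 0.1], [0.9, 0.6, 0.0]))
```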
Dropout is a regularization technique proposed by Hinton et al. [34] in 2012 to reduce overfitting during neural network training. Figures 2(a) and 2(b) depict schematic diagrams of a network before and after the introduction of dropout, respectively. The core idea of dropout is to randomly select a portion of neurons to be dropped during each training iteration, so each iteration is akin to training a different subnetwork. This training approach makes the network more robust, effectively preventing the model from overly relying on specific neurons and reducing the risk of overfitting. Introducing dropout can address the overfitting phenomenon that may arise when dealing with a large number of agents in an intelligent system.
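The dropout operation itself is easy to state; the sketch below is a minimal (inverted) dropout implementation in numpy, with an illustrative drop probability rather than the value used in our networks.

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=np.random):
    """Randomly zero a fraction drop_prob of the neurons during training and
    rescale the survivors (inverted dropout), so each iteration effectively
    trains a different subnetwork; at test time the input passes unchanged."""
    activations = np.asarray(activations, dtype=float)
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.binomial(1, keep_prob, size=activations.shape)
    return activations * mask / keep_prob
```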

Optimal Resource Allocation.
For multiagents in complex environments, the initial information of the agents determines their degree of confusion. In addition, the operation of the policy network and the critic network is closely related to the agent state $S^t$, so it is necessary to allocate the multiagent resources reasonably. Based on Assumption 3, firstly, the growth reward and training time are taken as the optimization objectives, and a multiobjective optimization algorithm is used to output the Pareto-optimal group of candidate agent allocations. Then, the entropy weights are calculated according to the importance of the growth reward, training time, and variance, and finally a distributed evolutionary decision is made to output the optimal resource allocation.
Distributed evolutionary decision-making can comprehensively evaluate various indicators and make optimal decisions. For the candidate schemes, distributed evolutionary decision-making uses objective weights to perform grey correlation analysis so as to rank their overall advantages and disadvantages. The optimal resource allocation corresponds to the highest correlation degree, which is calculated by equation (8), where $n$ is the number of candidate schemes, $m$ is the number of indicators, and $p_{ij}$ is the characteristic proportion of the $j$th indicator of the $i$th candidate scheme. In the accompanying formula, $\rho$ is the resolution coefficient, $\rho \in [0, 1]$ with a default of 0.5, and $r_{ij}$ is the 0-1 normalization of $x_{ij}$; $x_{\max,j}$ and $x_{\min,j}$ represent the maximum and minimum values of the $j$th indicator over all schemes, respectively. The higher the correlation degree of a resource allocation scheme, the better its overall evaluation; that is, the optimal coordination of the multiagents is realized. The environment exploration strategy and the parallel computing mechanism address the problems of poor training performance, slow training speed, and poor stability of multiagents in complex environments. In addition, the optimal resource allocation can further improve the performance of multiagents in complex environments and achieve fast optimal coordination of the multiagents.
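The following sketch shows how grey correlation degrees in the spirit of equation (8) can be computed and used to rank candidate allocations. It assumes the indicator matrix has already been oriented so that larger normalized values are better (cost indicators such as training time and variance inverted beforehand), and the weights stand in for the entropy weights described above.

```python
import numpy as np

def grey_correlation_degrees(X, weights, rho=0.5):
    """Rank candidate resource allocations by grey relational analysis.

    X       -- (n, m) matrix: n candidate schemes by m indicators
               (e.g., growth reward, training time, standardized variance)
    weights -- length-m objective weights (e.g., entropy weights)
    rho     -- resolution coefficient in [0, 1], 0.5 by default
    """
    X = np.asarray(X, dtype=float)
    # 0-1 normalization of each indicator column: r_ij in [0, 1]
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    R = (X - x_min) / np.where(x_max > x_min, x_max - x_min, 1.0)
    # Grey relational coefficients against the ideal scheme (all ones)
    diff = np.abs(1.0 - R)
    coeff = (diff.min() + rho * diff.max()) / (diff + rho * diff.max())
    # Weighted correlation degree per candidate; the largest is preferred.
    return coeff @ np.asarray(weights, dtype=float)

# Example with three hypothetical configurations and assumed weights.
degrees = grey_correlation_degrees([[0.45, 0.80, 0.90],
                                    [0.52, 0.80, 0.95],
                                    [0.48, 0.70, 0.85]],
                                   weights=[0.5, 0.25, 0.25])
print(degrees.argmax())  # index of the configuration with the highest degree
```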

Experiment
The experiments were run on a computer with 16 GB of memory, an Intel Core i7 CPU, and the Ubuntu 18.04 operating system. The code is implemented in Python 3.6 with Parl 1.3.1 and Gym 0.10.5. To assess the performance of the proposed algorithm for achieving fast optimal coordination, we conducted six runs for each experimental scenario and report each data point in the tables as the average of the six runs. The hyperparameters employed in the experimental environment are presented in Table 1.
The experimental training environment used in this section is the Particle MPE environment, and the complexity of the environment is set to three different levels to evaluate the learning methods of interest from different perspectives. From the first level to the third level, the complexity increases gradually, and the corresponding experiments are presented in Sections 4.1 to 4.3, respectively. In Section 4.1, our method, FOC-MACE, is compared with other commonly used multiagent learning methods. The experiment in Section 4.2 compares the training effect of our method with the DDPG and MADDPG methods in a more complicated environment. The experiment in Section 4.3 compares the training effect of our method with the MADDPG method in an even more complex environment. All experiments in this section effectively verify that the proposed method achieves faster and optimal collaboration in complex environments.
Since the training curves, namely, the rewards recorded during the process, obtained from different training methods in a complex environment deviate strongly from one another, the variance of the training data of the different methods is standardized to reflect the fluctuation of the training curve after it stabilizes. In the variance standardization formula, $x_i$ is the reward value of a sampling point, $\bar{x}$ is the average reward value of the sampling points, $n$ is the number of sampling points, and $up$ is the growth reward. In the experiments, the data are sampled once the training curve of each method has converged stably, with the sampling points evenly distributed along the curve.
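The variance standardization formula itself did not survive extraction; the helper below encodes one natural reading of the description (sample variance of the converged reward samples scaled by the growth reward $up$), which should be treated as an assumed form rather than the paper's exact equation.

```python
import numpy as np

def standardized_variance(samples, growth_reward):
    """Assumed standardized variance: the variance of the stably converged
    reward samples x_i divided by the growth reward 'up', so that methods with
    different reward magnitudes can be compared on the same footing."""
    samples = np.asarray(samples, dtype=float)
    variance = float(np.mean((samples - samples.mean()) ** 2))
    return variance / float(growth_reward)
```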
4.1. First-Level Complex Environment. Figure 3(a) shows a calibration environment for the different methods, in which three agents move to lock onto and chase their corresponding target points. The distance between an agent and its corresponding target point is inversely proportional to the obtained reward. As shown in Table 2, a comparison is made between the proposed approach and existing advanced learning methods [19, 30, 31, 33] across six aspects: initial reward, maximum reward, stable reward, growth reward, training time (episode), and standardized variance.
In Table 2, the training time of the method in [30] is better, and the growth reward and standardized variance of the method in [33] are better than those of the other baselines. From Table 2, the growth reward of our method is 37.1% higher than that of the method in reference [33]. The training time is the same as that in reference [30], but the growth reward and variance are clearly better. The variance is reduced by an order of magnitude compared with the method in reference [33], indicating excellent environmental exploration ability, fast training speed, and good stability. In order to visually demonstrate the superiority of the method, we standardize the training curves of the other methods so that they have the same starting point as the training curve of our method. As can be seen from Figure 3(b), the growth reward and training speed of our method's training curve are significantly higher than those of the other methods, indicating its excellent exploration ability and training speed in a complex environment.
In order to further compare the different methods, a pursuit environment is set up as shown in Figure 4(a). In this environment, three hunters look for prey, and a hunter's reward is inversely proportional to its distance to the prey. In addition, the prey escapes from the hunters, and its reward is proportional to its distance to the hunters. There is also an obstacle in the environment that can block the moving route between the hunters and the prey, increasing the complexity of the environment.
In Table 3, the growth reward of our method is 10% higher than that of the best performing method [29]. The training time is the same as that of reference [30], but the growth reward and variance are significantly better than those of [30]; the variance is reduced by an order of magnitude compared with [30], indicating excellent environmental exploration ability, training speed, and stability. Similarly, in order to visually demonstrate the superiority of our method, we standardize the training curves of the other methods so that they have the same starting point as our training curves. As can be seen from Figure 4(b), the growth reward and training speed of our method's training curve are significantly higher than those of the other methods, indicating its excellent exploration ability and training speed in a complex environment.

Second-Level Complex Environment.
The more complex environment has not been examined in the research work of [19, 29-31, 33]. The experiments in this section increase the number of agents in the MPE pursuit environment to 10, accompanied by hidden areas and obstacles, aiming to verify the performance of the multiagents in complex environments. When an agent enters the white hidden area, it cannot be detected by agents outside the area; the introduction of the hidden area greatly increases the complexity of the training environment. Figures 5(a)-5(c) show the three different complex environments. Table 4 compares the proposed method with the existing DDPG and MADDPG methods in six aspects: initial reward, maximum reward, stable reward, growth reward, training time (episode), and standardized variance under the three different environments in Figure 5.
As can be seen from Table 4, under the same training time, the growth reward of our method in the environment of Figure 5(a) is 45% and 12.9% higher than those of DDPG and MADDPG, respectively; the variance decreases by 36% and 3% compared with DDPG and MADDPG, respectively. In the environment of Figure 5(b), also compared with DDPG and MADDPG, the growth reward of our method increases by 83% and 30%, respectively; the variance is reduced by an order of magnitude compared with DDPG and by 100% compared with MADDPG. Similarly, in the environment of Figure 5(c), compared with DDPG and MADDPG, the growth reward of our method increases by 36.5% and 16.2%, respectively; the variance is reduced by an order of magnitude compared with DDPG and by 68.4% compared with MADDPG. As explained previously, in order to visually demonstrate the superiority of our method, we perform variance processing on the training curves of the other methods so that they have the same starting point as the training curve of our method. The resulting comparative training curves are shown in Figure 6. It can be seen that the growth reward and training speed of our method's training curve are significantly higher than those of the DDPG and MADDPG methods, indicating its excellent exploration ability and training speed in complex environments.

Three-Level Complex Environment.
In this section, we increase the number of active agents to 20, with two hidden areas and two obstacles. Compared with the second-level complex environment in Section 4.2, the complexity of the experimental environment in this section is greatly increased and the number of agents is doubled. In this experimental environment, the performances of the multiagent fast optimal coordination method and the MADDPG method under different resource allocation conditions are explored. Since the DDPG method is not applicable to a large number of agents, it is not discussed in this section. Taking the growth reward and training time as the optimization objectives, and with 20 agents as the limit, the MOEA method is used to obtain the optimal resource allocation group, as shown in Figure 7.
It can be seen from Table 5 that, under the same training time, the growth reward of our method in the environment of Figure 7(a) is 5% higher than that of MADDPG, and the variance is one order of magnitude lower than that of MADDPG. In Figure 7(b), the growth reward of our method is 24.4% higher than that of MADDPG, and the variance is 5% lower than that of MADDPG. The growth reward of our method in the environment of Figure 7(c) is 15.1% higher than that of MADDPG, and the variance is reduced by an order of magnitude compared with MADDPG. Similarly, in order to visually demonstrate the superiority of our method, we perform variance processing on the training curves of the other methods so that they have the same starting point as the training curve of our method. The resulting comparative training curves are shown in Figures 8(a)-8(c). It can be seen from Figure 8 that the growth reward and training speed of our method's training curve are significantly higher than those of the MADDPG method, indicating its excellent exploration ability and training speed in complex environments. It can also be seen from Table 5 that our method achieves a better growth reward and variance in the environment of Figure 7(b) than in those of Figures 7(a) and 7(c), so it is inferred that the environment of Figure 7(b) corresponds to the optimal resource allocation.
In order to verify the above conclusion, we use the growth reward, training time, and standardized variance as evaluation indicators and compute the grey correlation degree of the configurations of the three scenarios. As shown in Table 6, the correlation degree of configuration 2 is the highest, indicating that the growth reward, training time, and standardized variance of configuration 2 are the best, which realizes the optimal resource allocation and finally achieves the optimal coevolution.

Scenario Verification
In this section, FOC-MACE is verified in complex scenes for possible application. One of the main modes of future air confrontation is confrontation between UAV cluster systems. How multiple UAVs collaboratively pursue escaping UAVs through environmental exploration in as short a time as possible is critical and is an ultimate expectation in air confrontation. Therefore, research on multi-UAV combat in air confrontation has important practical significance.
In this paper, Unity 3D is used to establish a complex UAV group scene, as shown in Figure 9. Compared with the MPE scene, the state space and action space of each UAV change from two-dimensional to three-dimensional. While retaining the 20 agents of the three-level complex environment, the scene also introduces real physical variables such as gravity, wind speed, obstacles, and fog areas, so the training scene becomes more complex than the MPE scene. The environment includes 20 UAV agents, 15 trees, a mountain range, and two hidden fog areas. In this scenario, the blue UAV group aims to capture the red UAVs, while the red UAVs aim to avoid being captured.
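For illustration only, the sketch below extends the earlier two-dimensional state vector to the three-dimensional UAV setting by adding a z component and environment variables such as wind and a fog flag; the actual state layout of the Unity scene is not specified in the text, so every field here is an assumption.

```python
import numpy as np

def uav_state(pos_xyz, vel_xyz, wind_xyz, in_fog):
    """Hypothetical 3-D UAV state: position and velocity gain a z component,
    a wind vector is appended, and a flag records whether the UAV is currently
    inside a hidden fog area."""
    return np.concatenate([np.asarray(pos_xyz, float),
                           np.asarray(vel_xyz, float),
                           np.asarray(wind_xyz, float),
                           [float(in_fog)]])
```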
Figure 10 shows the entropy change curve of the UAV group. Policy entropy is an effective means of representing how "certain" a stochastic strategy is. During training, as the strategy evolves, it becomes more and more specific; at that point, the entropy should gradually decrease and the entropy curve should flatten. A deterministic strategy ensures the stable states of each UAV and thus indicates the optimal coordination of the UAV group.
In Figure 10, as the number of training episodes increases, the policy entropy of the UAV swarm decreases. From the perspective of the action space, the UAVs initially explore various actions as much as possible, and the subsequent minimization of policy entropy implies that various state spaces have also been explored. The policy entropy stabilizes after 20,000 training episodes, which reflects the training speed and stability of the method, and the decreasing trend of the policy entropy corresponds to the increase of the growth reward. In summary, the policy entropy curve shows that our FOC-MACE method is capable of achieving the optimal coordination of the UAV group for the chasing tasks in the above air-confrontation scene.
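The policy entropy plotted in Figure 10 can be computed per agent as the Shannon entropy of the action distribution, as in the short sketch below; the four-action example is illustrative and not the UAVs' actual action set.

```python
import numpy as np

def policy_entropy(action_probs, eps=1e-12):
    """Shannon entropy H(pi) = -sum_a pi(a|s) * log pi(a|s) of one agent's action
    distribution; averaging this over the UAV group gives the curve whose decline
    indicates an increasingly deterministic (more 'certain') strategy."""
    p = np.clip(np.asarray(action_probs, dtype=float), eps, None)
    p = p / p.sum()
    return float(-np.sum(p * np.log(p)))

# Example: a nearly uniform policy early in training vs. a confident one later.
print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # high entropy
print(policy_entropy([0.94, 0.02, 0.02, 0.02]))  # low entropy
```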

Conclusion
Our research results show that when multiagents face complex environments, introducing environmental exploration strategies into the policy network can improve the adaptability of the agents to complex environments, thereby improving the growth reward of the reward curve. Introducing parallel computing technology into the critic network can reduce the amount of network computation and significantly improve the training speed of the agents. Finally, on this basis, the optimal resource allocation of the multiagents is carried out, which realizes the fast and optimal coordination of multiagents in complex environments. In this paper, a series of complex environment experiments is carried out, in which the proposed method is compared with advanced multiagent learning methods and with the MADDPG and DDPG methods. The experimental results show that the fast optimal coordination method for multiagents in complex environments maintains a high growth reward, faster training speed, and better stability in various complex environments. In addition, the proposed method is verified in a UAV swarm scenario, which further confirms the effectiveness of the fast optimal coordination method in complex environments and shows that the proposed method has application prospects in UAV swarm pursuit scenarios.

Figure 3: Calibration environment and training curves of different methods. (a) Calibration environment. (b) Training curve.

Figure 4: Pursuit environment and training curves of different methods. (a) Pursuit environment. (b) Training curve.

Figure 6: Training curves in the different second-level complex environments. (a) Training curve in the environment with 2 hidden areas. (b) Training curve in the environment with 2 hidden areas + 1 obstacle. (c) Training curve in the environment with 2 hidden areas + 2 obstacles.

Table 3: Comparison of the training curves of different methods in the pursuit environment.

Table 4: Comparison of the training curves of different methods in the three complex environments of Figure 5. Columns: Scene, Method, Initial reward, Max reward, Stable reward, Growth reward, Episode, Standardized variance.

Table 5: Comparison of the training curves of different methods in the three complex environments of Figure 7. Columns: Scene, Method, Initial reward, Max reward, Stable reward, Growth reward, Episode, Standardized variance.