Learning Attentional Communication with a Common Network for Multiagent Reinforcement Learning

For multiagent communication and cooperation tasks in partially observable environments, most existing works use only the information contained in the hidden layers of a network at the current moment, which limits the source of information. In this paper, we propose a novel algorithm named multiagent attentional communication with the common network (MAACCN), which adds a consensus information module to expand the source of communication information. We regard the overall network that has performed best so far for the agents as the common network, and we extract consensus knowledge from it. In particular, we use the attention mechanism to combine current observation information with the consensus knowledge and infer more effective information as input for decision-making. Experiments conducted on the StarCraft multiagent challenge (SMAC) demonstrate the effectiveness of MAACCN in comparison to a set of baselines and also reveal that MAACCN can improve performance by more than 20% in a super hard scenario.


Introduction
In real life, multiple autonomous agents need to work together to complete a large number of complex tasks, such as formation control [1], autonomous vehicles [2], unmanned aerial vehicles [3], and multiplayer confrontation games [4]. Multiagent reinforcement learning (MARL) has made significant progress in optimizing the cumulative global rewards in these multiagent systems [5]. However, several open problems are still not well solved in the field of MARL. One of them is learning cooperative behaviours between agents under partial observation.
For the partially observable multiagent cooperative task, one approach is to use value-based reinforcement learning [6]. Independent Q-learning [7] is a typical representative of applying single-agent reinforcement learning methods directly to multiagent problems. Although independent Q-learning has good scalability, it suffers from nonstationarity because the strategies of other agents change constantly during training, so the performance of each agent is often poor. Based on the series of value decomposition methods [5,8,9], the paradigm of centralized training and decentralized execution (CTDE) [10] is adopted to deal with this situation: all agents are trained centrally, and each agent individually uses a distributed policy during execution. Another way is to exploit communication in MARL. The CommNet method proposed in [11] employs a continuous communication channel in which an agent receives the sum of the information transmitted by other agents. The communication module is also used for engaging with other agents inside policy or critic networks [12]. Typically, agents take the current observation or hidden layer information as the raw information input for communication [13], which limits the source of information.
To address these issues, we add an additional common network for each agent to preserve the historically best-performing overall policy network. The overall policy consists of the policies of all agents. The common network with the best historical performance, which represents the optimal overall policy to a certain extent, can predict the goals to be achieved by future agents. The ultimate goal of
(1) We add an additional common network to save the historically optimal overall policy for all agents, which gives each agent another source of information to make decisions
(2) We process communication and consensus information via the attention mechanism in order to extract more effective information
(3) We conduct additional experiments in the DAgger framework to verify the feasibility of the common network as an expert network to gather data and improve the performance of agents on the StarCraft multiagent challenge

Related Work
After reaching outstanding results in the single-agent setting [16], researchers have moved on to the more challenging multiagent environment [17,18]. The most straightforward approach to multiagent learning is to have each agent train and learn on its own [19]. This early approach, known as the independent Q-learning (IQL) method, was one of the first of its kind, although it does not generally fare well in practical applications. Despite this, many studies employ IQL as the baseline for experimental comparison because it is easy to implement and scales well as the number of agents increases.
To extend deep deterministic policy gradient (DDPG) to the field of multiagent systems [20], the multiagent DDPG (MADDPG) algorithm was proposed in [21] for multiagent cooperative tasks. In the training phase, MADDPG has access to the behaviours of all agents, which allows it to handle mixed environments in which each agent has its own local reward. Based on MADDPG, the authors of [22] propose shared memory as a form of communication, and the MADDPG-MD algorithm [23], inspired by dropout, improves robustness.
Another way to deal with the problem of global reward credit allocation is through value function decomposition. The VDN algorithm presented in [8] decomposes a global Q-function into the sum of the individual local Q-functions of the agents, alleviating the problems of lazy agents and credit allocation among agents. Building on VDN, the Q-value mixing network (QMIX) [9] adds a mixing network that makes the decomposition nonlinear while guaranteeing that the global Q-function is monotonic in each local Q-function. The COMA [24] algorithm measures the contribution of an agent to the global reward based on a counterfactual baseline. The QTRAN [25] algorithm relaxes the structural constraints of QMIX so that it can handle more general tasks that are decomposable but do not satisfy the monotonicity condition. Theoretically, Qatten [26] provides a generic decomposition of the value function that explicitly models the influence of each agent on the whole, based on multihead attention. The regularized softmax (RES) algorithm [27] is an enhancement of QMIX that addresses the overestimation of the Q-value by employing the softmax approach when computing the target.
In terms of multiagent communication, the authors of [28] first introduce communication into deep reinforcement learning with the goal of resolving the discrete communication channel problem, combining DQN and IQL for multiagent problems. The CommNet algorithm permits multistep communication, and the gradient can be transmitted back to each agent via the continuous communication channel. BiCNet [29] connects the agents through a bidirectional recurrent network, so that information is exchanged implicitly through its hidden states. To help agents learn to communicate better, the IC3Net proposed in [13] includes a gate for each agent so that it can decide whether or not to communicate with other agents.
In terms of screening effective information, the attention model has been widely and successfully used in computer vision [30], plant species recognition [31], resource allocation [32,33], and reinforcement learning. The ATOC algorithm suggested in [34] gives agents the ability to choose whether or not to contact one another, as well as which other agents they want to connect with. The method of targeted multiagent communication (TarMAC) [12] uses a signature-based soft attention mechanism over several communication rounds to gauge the importance of communication. The G2ANet [35] method employs a two-stage attention network model, using the hard attention mechanism to define the interacting agents and the soft attention mechanism to determine the weight of each interaction, and automatically learns the relationships between the constantly changing agents in large-scale complicated games.
Computational Intelligence and Neuroscience
In the field of multiagent modeling, several earlier efforts have learned the models of other agents through observation. The SOM [36] algorithm uses the agent's own policy to predict the behaviours of adversaries, infers the target information of other agents, and then makes decisions based on this target information. The social influence approach [37] seeks to create coordination and communication in MARL through a unified strategy, giving agents intrinsic rewards for having a causal influence on the behaviour of other agents. However, both of these approaches involve extra work to build the predictive network using supervised learning methods.
Most of these communication algorithms are limited to the hidden layer information of the policy network at the current moment. Our MAACCN algorithm expands the source of communication information by incorporating the common network. MAACCN adopts the algorithm flow of the classical value decomposition series as its fundamental framework and improves both the consensus information module and the communication module, which extract effective communication information for better collaboration among agents. In contrast to the SOM algorithm, we select the common network with the best historical performance as the expert network without additional training. Additionally, we take advantage of the DAgger framework, which gathers information from experts and current policies to improve the training dataset.

Background
This section introduces the ideas behind the DAgger framework and value-based MARL.

DAgger.
The DAgger framework runs the expert policy to be cloned and the novice policy to be taught at the same time, which extends the capabilities of conventional supervised learning techniques [15]. Aggregating additional expert demonstrations reveals more about the reward structure and the underlying model.
DAgger iteratively collects training instances from a combination of the current policy $\pi$ and the expert policy $\pi^*$. During an episode, the two policies interact with the environment jointly, following the new policy given by the decision rule
$$\pi_{\theta_i} = \alpha_i \pi^* + (1 - \alpha_i)\,\pi,$$
where $\alpha_i$ is annealed progressively from 1 to 0.1. In each iteration, DAgger extends the dataset $D$ with the data collected under the new policy $\pi_{\theta_i}$ and uses it to retrain the next policy:
$$D \leftarrow D \cup D_i,$$
where $D_i$ is gathered by the expert policy $\pi^*$ and $D$ is the previous set.
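The mixing rule and the dataset aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear annealing schedule, the toy integer observations, and the names `anneal_alpha`, `mixture_action`, and `dagger_iteration` are hypothetical and do not come from the paper.

```python
import random


def anneal_alpha(iteration, n_iterations, start=1.0, end=0.1):
    """Linearly anneal alpha from `start` to `end` (the linear shape is an assumption)."""
    if n_iterations <= 1:
        return end
    frac = iteration / (n_iterations - 1)
    return start + frac * (end - start)


def mixture_action(obs, expert_policy, current_policy, alpha, rng):
    """Follow the expert with probability alpha, otherwise the current policy."""
    if rng.random() < alpha:
        return expert_policy(obs)
    return current_policy(obs)


def dagger_iteration(dataset, visited_obs, expert_policy):
    """Label the visited observations with expert actions and aggregate: D <- D union D_i."""
    new_samples = [(obs, expert_policy(obs)) for obs in visited_obs]
    return dataset + new_samples


def expert(obs):
    """Hypothetical stand-in for the expert policy pi*."""
    return obs % 2


def novice(obs):
    """Hypothetical stand-in for the current policy pi."""
    return 0


rng = random.Random(0)
dataset = []
for i in range(5):
    alpha = anneal_alpha(i, 5)
    visited = list(range(3))  # toy observations; a real rollout would come from the environment
    actions = [mixture_action(o, expert, novice, alpha, rng) for o in visited]
    dataset = dagger_iteration(dataset, visited, expert)
```

In a real rollout the visited observations would depend on the actions chosen by the mixture policy; the toy loop above keeps them fixed to stay self-contained.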

Reinforcement Learning.
MARL involves many agents and many states. It can be seen as the combination of the Markov decision process, which involves one agent and multiple states, and the matrix game, which involves multiple agents and one state. The evolution of MARL is inextricably linked to game theory [38], and the partially observable multiagent cooperation problem (Dec-POMDP) can be characterized as $\langle N, S, A, R, P, O, \gamma \rangle$. It is defined for $N$ agents by the global states $S$, action sets $A$, reward function $R$, state transition function $P$, observations $O$, and discount factor $\gamma$. In this study, we investigate a purely cooperative task in which each agent receives the same reward, and the overall objective is to discover the most effective cooperative method in order to maximize the cumulative reward $G = \sum_{t=0}^{T} \gamma^{t} r_t$.
The VDN [8] method is a value decomposition structure based on DRQN [39] that learns the action value functions of distinct agents using only the global reward, thereby addressing the issues of spurious rewards caused by partial observability and of lazy agents. The value decomposition function of VDN is as follows:
$$Q_{tot}(\boldsymbol{\tau}, \mathbf{a}) = \sum_{i=1}^{N} Q_i(\tau^i, a^i),$$
where $\tau^i$ and $a^i$ denote the action-observation history and the action of agent $i$.
QMIX [9] is an enhanced version of VDN that employs a mixing network to combine the local agent functions and uses global state information to produce non-negative weights for the mixing network during training. Taking the argmax of the joint action value is the same as taking the argmax of each local action value function:
$$\arg\max_{\mathbf{a}} Q_{tot}(\boldsymbol{\tau}, \mathbf{a}) = \left( \arg\max_{a^1} Q_1(\tau^1, a^1), \ldots, \arg\max_{a^N} Q_N(\tau^N, a^N) \right).$$
QMIX converts the preceding expression into a monotonicity constraint, which is realized by the mixing network. The constraint is
$$\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i \in \{1, \ldots, N\}.$$
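The relation between decentralised greedy action selection and the joint argmax can be checked on a toy example. The sketch below assumes small tabular Q-values and a single-layer mixer whose weights are made non-negative with `abs`; the real QMIX mixing network is deeper and state-conditioned, so this is only a simplified illustration, and all function names are hypothetical.

```python
from itertools import product


def vdn_q_tot(local_qs, joint_action):
    """VDN: Q_tot is the sum of each agent's Q-value for its chosen action."""
    return sum(q[a] for q, a in zip(local_qs, joint_action))


def monotonic_mix(agent_qs, weights, bias):
    """Single-layer mixer; abs() keeps the weights non-negative so dQ_tot/dQ_i >= 0."""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


def greedy_joint_action(local_qs):
    """Decentralised greedy choice: each agent maximises its own Q_i."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in local_qs)


def best_joint_action(local_qs, q_tot_fn):
    """Centralised brute-force argmax of Q_tot over all joint actions."""
    action_ranges = [range(len(q)) for q in local_qs]
    return max(product(*action_ranges), key=lambda ja: q_tot_fn(local_qs, ja))


# Toy Q-tables for two agents (three and two actions); the values are made up.
local_qs = [[0.1, 0.9, 0.3], [0.5, 0.2]]


def mixed_q_tot(qs, joint_action):
    """Q_tot under the monotonic mixer with arbitrary illustrative weights."""
    chosen = [q[a] for q, a in zip(qs, joint_action)]
    return monotonic_mix(chosen, weights=[-2.0, 0.5], bias=0.3)


decentralised = greedy_joint_action(local_qs)
centralised_vdn = best_joint_action(local_qs, vdn_q_tot)
centralised_mix = best_joint_action(local_qs, mixed_q_tot)
```

Because both mixers are monotonic in every local Q-value, the three joint actions computed above coincide, which is what allows decentralised execution.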

Multiagent Attentional Communication with the Common Network
To handle multiagent communication and cooperation more effectively and to ensure the adequacy and efficacy of communication information, we propose the multiagent attentional communication with the common network (MAACCN) learning algorithm. Based on the DAgger framework, it regards the common network as the expert for each agent and employs the attention mechanism to process the consensus information and communication information. This section elaborates on the concept and structure of our algorithm, then introduces the attention unit, and further explores the influence of the common network.

The Framework of MAACCN.
In the problem of multiagent cooperation, agents can make better decisions if they can infer the consensus information of other agents based on their states, actions, and thoughts. From this perspective, we introduce the common network, which is the network with the highest historical performance of the agents and from which we obtain the consensus knowledge. The attention mechanism extracts more effective information from the consensus information, and this information can be incorporated into the policy of each agent to enhance its capacity for decision-making. The framework of the proposed multiagent attentional communication with the common network (MAACCN) algorithm is shown in Figure 1 and is divided into three stages. The first stage of MAACCN is the processing of information features, in which the common network can be accessed by each agent. The common network is the aggregation of the policy networks of all agents, so it contains the policies of all agents.
Therefore, the inputs of the common network are the observations of all agents. The feature $h^i_t$ is obtained by processing the action-observation pair $(o^i_t, a^i_{t-1})$ through the gated recurrent unit (GRU) network. The attention mechanism is used to assign weights to the consensus information $(h^1_t, \ldots, h^n_t)$ of other agents collected through the common network in order to generate the output $c^i_t$. The second stage is the communication module, which expands the communication channel among agents based on the multihead attention mechanism. Each agent broadcasts the information to be conveyed and uses the multihead attention mechanism to select the important information received from other agents, so that it obtains the information that is effective for its own decision-making. Before making a decision, an agent engages in several rounds of communication with other agents to ensure an adequate exchange of information. The third stage integrates the local $Q_i$ functions using the mixing network to obtain $Q_{tot}$.
In the framework of the method described above, all agents share the same set of network parameters, and different types of information can be acquired based on the observations and the ID numbers of agents at different times. Therefore, the historically optimal policy of each agent is the same. In order to alleviate the challenges caused by partial observability, actions and communication information are passed to agents at the next moment. In the second stage, a single GRU network suffices for cyclic and iterative communication, since the two GRU networks that the communication module passes through can share parameters.
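One common way to let a single shared network distinguish agents is to append a one-hot agent ID to each agent's input. The sketch below assumes this construction and a particular concatenation order (observation, last action, ID); the paper does not spell out the exact layout, so `build_agent_input` is a hypothetical helper.

```python
def build_agent_input(observation, last_action_onehot, agent_id, n_agents):
    """Concatenate observation, previous action, and a one-hot agent ID.

    The concatenation order is an assumption made for illustration.
    """
    if not 0 <= agent_id < n_agents:
        raise ValueError("agent_id must be in [0, n_agents)")
    one_hot_id = [1.0 if i == agent_id else 0.0 for i in range(n_agents)]
    return list(observation) + list(last_action_onehot) + one_hot_id


# Two agents with identical observations still receive different inputs.
input_agent_0 = build_agent_input([0.5, 0.2], [0.0, 1.0], agent_id=0, n_agents=3)
input_agent_1 = build_agent_input([0.5, 0.2], [0.0, 1.0], agent_id=1, n_agents=3)
```

With this layout the shared parameters can still produce agent-specific behaviour, because the ID part of the input differs between agents.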
The MAACCN algorithm uses offline updating: the states, actions, rewards, and termination states of the agents interacting with the environment are added to the experience pool, from which a batch of complete episodes is selected for learning. Similar to DQN, our algorithm constructs a target network and copies the parameters of the current network to it every fixed number of steps to calculate the value of the next moment, which can expedite convergence and contribute to the stability of the algorithm. The loss function of the algorithm is
$$L(\theta) = \sum \left( Q_{tot}(\boldsymbol{\tau}, \mathbf{a}; \theta) - \left( r + \gamma \max_{\mathbf{a}'} Q_{target}(\boldsymbol{\tau}', \mathbf{a}'; \theta') \right) \right)^2,$$
where $\theta'$ is a parameter for the target network $Q_{target}$ and the sum runs over the sampled batch. MAACCN is a value-based algorithm, and its procedure is described in Algorithm 1 in the Appendix. In the MAACCN training procedure, the optimal Q-value function estimate is found by minimizing the loss function. During the learning process, the previous common network is replaced by the current policy network whenever the latter performs better.
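The training ingredients named above (squared TD error against a target network, periodic target synchronisation, and replacing the common network when a better policy appears) can be sketched as follows. The discount value, the termination mask, the dictionary representation of parameters, and the names `td_loss`, `sync_target`, and `CommonNetworkKeeper` are assumptions made for illustration, not the authors' implementation.

```python
import copy


def td_loss(q_tot, rewards, next_q_target_max, terminated, gamma=0.99):
    """Sum of squared TD errors over a batch of transitions.

    `terminated` holds 1.0 for terminal transitions, which masks the bootstrap term
    (the mask and the default gamma are assumptions).
    """
    loss = 0.0
    for q, r, q_next, done in zip(q_tot, rewards, next_q_target_max, terminated):
        target = r + gamma * (1.0 - done) * q_next
        loss += (q - target) ** 2
    return loss


def sync_target(online_params):
    """Copy the current network parameters into a fresh target-network copy."""
    return copy.deepcopy(online_params)


class CommonNetworkKeeper:
    """Keeps a frozen copy of the best-scoring policy parameters seen so far."""

    def __init__(self):
        self.best_score = float("-inf")
        self.params = None

    def maybe_update(self, policy_params, test_score):
        """Replace the stored common network only if the new test score is higher."""
        if test_score > self.best_score:
            self.best_score = test_score
            self.params = copy.deepcopy(policy_params)
            return True
        return False
```

A training loop would call `maybe_update` after each evaluation phase and `sync_target` every fixed number of steps; the deep copies ensure that later gradient updates do not alter the stored networks.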

Common Network and Attention Mechanism.
Among the many approaches that can be taken to mitigate the problem of instability in a multiagent environment, it is more helpful to model the behaviours of other agents and infer their consensus information than to simply treat the other agents as part of the environment. According to cognitive science research, humans use the aims, beliefs, and preferences of other groups with whom they interact to make better judgements. Humans mimic the behaviours of others based on their observations, a cognitive process that enables them to better comprehend the consensus and actions of others and to react appropriately in social circumstances. Inspired by this, we add a common network for all agents. In contrast to other methods of modeling the aims or actions of other agents, the common network is a copy of the historically optimal overall network, which is judged according to the average score of each episode during the test. Because of parameter sharing, the common network is the same for every agent. Thus, an agent can deduce from the common network the consensus information of other agents for communication. One of the fundamental assumptions of the algorithm is that the common network can be used temporarily as the optimal policy network, so we investigate the effect of the common network based on the DAgger framework in order to establish the validity of the consensus information. Accordingly, the policy $\pi^*$ of the common network and the current policy $\pi$ jointly decide on the final policy $\pi_{\theta_i}$. The policy network of each agent receives the observation $o^i_t$, which is passed through a fully connected layer and the ReLU function to give $fc^i_t$; this is then input into the GRU recurrent network to generate the output $h^i_t$. The other input of the GRU is the output $h^i_{t-1}$ obtained at the previous time step.
The formula, written here as the standard GRU update, is as follows:
$$r^i_t = \sigma\left(W_r [h^i_{t-1}, fc^i_t] + b_r\right),$$
$$z^i_t = \sigma\left(W_z [h^i_{t-1}, fc^i_t] + b_z\right),$$
$$\tilde{h}^i_t = \tanh\left(W_h [r^i_t \odot h^i_{t-1}, fc^i_t] + b_h\right),$$
$$h^i_t = (1 - z^i_t) \odot h^i_{t-1} + z^i_t \odot \tilde{h}^i_t,$$
where $\sigma$ and $\tanh$ are the activation functions, $W$ and $b$ represent the weight matrices and biases to be trained, and $r$ and $z$ represent the reset gate and the update gate, respectively, which make more effective use of past data and alleviate the constraints of local observation. As visualized in Figure 2, the information from these hidden layers is processed using the multihead attention mechanism so that additional valuable information can be collected to help agents cooperate. The attention unit takes as input $X = (h^i_t, h^1_t, \ldots, h^n_t)$, where the vector $h^i_t$ is the information of an agent at the current time, and uses the scaled dot-product scoring function $\mathrm{score} = key^{T} q / \sqrt{d}$, where $d$ is a constant. The attention function maps a query and a set of key-value pairs to the following output:
$$\mathrm{Attention}(q, key, value) = \sum_{j} \mathrm{softmax}_j\left(\frac{key_j^{T} q}{\sqrt{d}}\right) value_j.$$
The key and value vectors are obtained by applying matrix transformations to the input data $X$. The scores are then computed and normalized into the weights of the values. The final attention information $h^i_c$ is obtained by multiplying the value vectors by the corresponding weights and summing them.
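A single-head version of the scaled dot-product attention described above can be written in plain Python. The sketch omits the learned query, key, and value projections and the multihead split, and it takes $d$ to be the query dimension, which is the usual convention but an assumption here; the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention(query, keys, values):
    """Return the attention output and the weights for one query.

    score_j = key_j . query / sqrt(d); the output is the weighted sum of the values.
    """
    d = len(query)
    scores = [dot(k, query) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Two identical keys receive equal weight, so the output is the mean of the values.
output, weights = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [0.0, 4.0]],
)
```

A key that is better aligned with the query receives a larger weight, which is how an agent emphasises the most relevant hidden states among those it receives.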

Experiments
This section introduces the experimental setting and assesses the performance of MAACCN. We describe the SMAC experiment scenarios, the experimental parameter settings, and the baseline algorithms, and then analyze the performance of MAACCN and the ablation findings to verify the effectiveness of the algorithm. Additionally, the viability of the historically optimal network as the common network is examined.

Setting.
The scenarios of SMAC are carefully designed so that agents must master one or more micromanagement skills in order to defeat their adversaries [40]. Each scenario involves a clash between two forces, with the initial position, number, and kind of units in each force varying from scenario to scenario.
At each time step, each agent receives a local observation of its field of vision, which contains map data within a circular area around the unit. Specifically, the feature vector comprises the attributes of both friendly and hostile units inside the range of view. Under this partial observation, an agent is unable to discern whether the remaining agents are out of sight or dead. The hyperparameters follow previous work [8,9]; Table 1 in the Appendix lists the algorithm parameters.

Evaluation of MAACCN.
This section first presents the results of comparing the MAACCN algorithm with the baseline algorithms in five SMAC scenarios and then investigates these results further.
The game win rate is chosen as the ultimate evaluation criterion, since the goal of these scenarios is to learn how to lead the team under our control to victory. During training, the learned policy is evaluated once every evaluate cycle: the agents play evaluate epoch rounds of test games in the associated scenario, and the win rate is obtained by tallying the number of victories. To preserve generality, each experiment is repeated four times with different random seeds, and 95% confidence intervals are reported. After training, the win rate statistics are used to draw the curves.
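The evaluation protocol above (win rate per evaluation, four seeds, 95% confidence interval) can be illustrated with the standard library. The paper does not state how the interval is computed, so the sketch assumes a Student's t interval across the four runs; 3.182 is the two-sided 95% critical value for 3 degrees of freedom, and the function names are hypothetical.

```python
import math
import statistics


def win_rate(outcomes):
    """Fraction of won evaluation games; `outcomes` is a list of booleans."""
    return sum(1 for won in outcomes if won) / len(outcomes)


def mean_and_ci95(values, t_critical=3.182):
    """Mean and half-width of a 95% confidence interval across runs.

    The default t_critical is valid for exactly four runs (3 degrees of freedom).
    """
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, t_critical * sem


# Made-up win rates of four seeds at one evaluation point.
per_seed_win_rates = [0.2, 0.4, 0.6, 0.8]
mean, half_width = mean_and_ci95(per_seed_win_rates)
```

With a different number of seeds, the critical value has to be replaced by the one for the matching degrees of freedom.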
We conducted a comparative experiment in five scenarios, where the more complicated scenarios require more effective cooperative tactics. As depicted in Figure 3, we add the MAACCN structure to the QMIX algorithm. QMIX is currently recognized as an algorithm with excellent performance: it employs a mixing network to combine the local agent functions nonlinearly, adopts the paradigm of centralized training and decentralized execution, and has good scalability [10,26]. MAACCN-QMIX is the method for multiagent attentional communication with the common network, while QMIX and IQL are the baseline algorithms. In the majority of scenarios, our MAACCN-QMIX algorithm performs significantly better than the baseline methods. In the simple scenario 3s5z and the hard scenario 3s_vs_5z, MAACCN-QMIX converges faster and reaches better final results than the baselines. In the extremely challenging scenario 6h_vs_8z, the existing baseline algorithms achieve win rates close to zero, while our approach achieves an average win rate of almost 25%. Table 2 shows the final performance of the different algorithms during testing (the maximum median of all test results obtained in the last 250k steps of the training process). In the majority of scenarios, the results indicate that agents can cooperate more effectively and attain the best performance by using the MAACCN approach. Because IQL uses the global reward directly to update each policy, it suffers from nonstationarity, which becomes more severe as the number of agents increases and results in a low win rate. For QTRAN, the relaxed constraints may reduce the accuracy of its updates in complex scenarios, which lowers its performance.
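The final-performance metric used for Table 2 can be read in more than one way. The sketch below assumes one plausible reading: at every evaluation point within the last 250k training steps, take the median win rate across runs, then report the maximum of these medians. The function name and the toy numbers are hypothetical.

```python
import statistics


def final_performance(eval_steps, win_rates_per_run, total_steps, window=250_000):
    """Maximum over the final window of the per-evaluation median across runs."""
    medians = []
    for idx, step in enumerate(eval_steps):
        if step > total_steps - window:
            medians.append(statistics.median(run[idx] for run in win_rates_per_run))
    if not medians:
        raise ValueError("no evaluation points fall inside the final window")
    return max(medians)


# Three made-up runs evaluated at four points of a 1M-step training run.
eval_steps = [0, 500_000, 800_000, 1_000_000]
runs = [
    [0.0, 0.9, 0.2, 0.4],
    [0.0, 0.9, 0.3, 0.5],
    [0.0, 0.9, 0.6, 0.1],
]
score = final_performance(eval_steps, runs, total_steps=1_000_000)
```

In this toy example the high win rate at 500k steps is ignored because it lies outside the final window, so early peaks cannot inflate the reported score.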

Ablation.
We focus on three aspects of experimental analysis in particular. The first is to remove the mixing network from the third module of the MAACCN structure and replace it with the VDN summation method, thereby minimizing the effect of the mixing network on the experiment. The second is the ablation experiment of the communication module. One of the contrasted algorithms is TarMAC-VDN, which removes the common network and preserves the multihead attention communication module; another is CommNet-VDN, which removes the communication module and replaces it with the communication structure of the CommNet [11] algorithm. The MAACCN-without-DAgger algorithm abandons the DAgger framework but keeps the common network. The final aspect is the feasibility of using the common network as the expert network to guide the decisions of agents based solely on the DAgger framework.
Figure 4 shows the results of comparing the MAACCN-VDN algorithm and the baseline method in three different scenarios. In all scenarios, MAACCN-VDN converges faster than the baseline algorithms and also performs better. In scenario 3s5z, our MAACCN-VDN algorithm reaches a high win rate first and is markedly superior to the VDN baseline. This suggests that the MAACCN approach is general enough to be combined with other value-based algorithms.
Figure 5 verifies the validity of the communication mode in our algorithm. In all three scenarios, CommNet-VDN performs the poorest, since an agent merely uses the action-observation sets of every agent and simply merges them. Both TarMAC-VDN and MAACCN-VDN perform better than CommNet-VDN, indicating that the hidden layer contains richer information. It also demonstrates that the multihead attention communication structure described in this study is able to extract useful information from a complicated environment.
The performance of the TarMAC-VDN algorithm can catch up with the MAACCN-VDN algorithm in scenario 3s5z, but not in other scenarios. Compared with TarMAC-VDN, the results of MAACCN-without-DAgger demonstrate that extracting consensus information from the common network as communication information is more beneficial than extracting communication information simply from the action-observation sets or the hidden layers of the current network, validating the efficacy of the common network module. The results of scenario 3s_vs_5z and scenario 6h_vs_8z show that MAACCN-VDN outperforms the existing communication methods in terms of performance and convergence speed within the DAgger framework.
We validate the impact of the common network within the DAgger framework. Based on QMIX, the CN-QMIX algorithm adds only the common network without any additional attention mechanisms or communication structures. Compared with the benchmark algorithm QMIX, a significant performance difference can be observed. As the number of training steps increases in Figure 6, the CN-QMIX algorithm tends to converge faster and become more stable. This supports the main hypothesis that the common network can be used as an expert to help agents make better judgements in large and complicated situations.
Table 3 displays the GPU memory requirements of the various algorithms. After each algorithm has been trained for a period of time (epoch = 1000), the GPU memory consumption is obtained by averaging several measurements. The ordinate of Figure 7 represents the average amount of time consumed per epoch by each algorithm. The experiment is conducted on scenario 2s_vs_1sc, as all algorithms can quickly identify the optimal strategy there. In order to minimize interference from other variables, each algorithm runs on a single GPU.
Table 3: Experimental results of GPU memory cost for different algorithms.
Scenes      MAACCN-VDN  TarMAC-VDN  CommNet-VDN  VDN
2s_vs_1sc   1490M       1120M       690M         680M
5m_vs_6m    1550M       1150M       710M         680M
3s_vs_5z    2010M       1480M       750M         730M
The VDN algorithm has the smallest memory and time overhead, whereas the TarMAC-VDN algorithm with a communication module significantly increases memory and time overhead. Compared to the TarMAC-VDN algorithm, the MAACCN-VDN algorithm requires more memory due to the addition of a common network (roughly one third more in the scenarios of Table 3). Since the communication module is essential and the time cost of adding a common network is tolerable, we employ the MAACCN algorithm for better performance.
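The relative memory overhead can be read directly off Table 3. The short sketch below copies the table values (in the unit M used by the table) and computes the ratio between two algorithms; the dictionary layout and the function name are illustrative choices.

```python
# GPU memory cost per scenario and algorithm, copied from Table 3 (unit: M).
MEMORY_M = {
    "2s_vs_1sc": {"MAACCN-VDN": 1490, "TarMAC-VDN": 1120, "CommNet-VDN": 690, "VDN": 680},
    "5m_vs_6m": {"MAACCN-VDN": 1550, "TarMAC-VDN": 1150, "CommNet-VDN": 710, "VDN": 680},
    "3s_vs_5z": {"MAACCN-VDN": 2010, "TarMAC-VDN": 1480, "CommNet-VDN": 750, "VDN": 730},
}


def overhead_ratio(table, scenario, algorithm, reference):
    """Memory of `algorithm` divided by memory of `reference` in one scenario."""
    return table[scenario][algorithm] / table[scenario][reference]


# Ratio of MAACCN-VDN to TarMAC-VDN, i.e. the cost of adding the common network.
common_network_overhead = {
    scenario: round(overhead_ratio(MEMORY_M, scenario, "MAACCN-VDN", "TarMAC-VDN"), 2)
    for scenario in MEMORY_M
}
```

The resulting ratios lie between about 1.33 and 1.36, so the common network adds roughly a third to the memory footprint of the attention-based communication model in these scenarios.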

Conclusion
This paper proposes a consensus storage mechanism that models the consensus information of other agents by adding a common network to store the historically best-performing policy network. The communication module enhances the representation of the entire network via a multihead attention mechanism. The MAACCN algorithm can make full use of the knowledge of the common network, expanding the source of information without restricting the information of the current policy network. In addition, under the DAgger framework, we regard the common network as an expert network to guide the policy of the agent, and the feasibility of this operation is verified through ablation experiments.

B. Parameters
Parameters required for MAACCN training are shown in Table 1. The strategy of an agent is to choose from the controller the action that corresponds to the maximum Q-value. Multiple agents use their own strategies to interact with the environment, collect the sample data of each episode into the experience replay, and then draw batches of data from the buffer to train the agents. At each epoch, the agents generate N episodes of sample data and load them into the memory pool. After receiving a batch of episode data, an agent calculates the current Q-value at each moment of each batch, as well as the target Q-value for the next moment, in the value-based manner. The total loss is calculated by first summing the two portions of Q-values over all agents, then calculating the loss of each batch of each episode at each instant, $l = (Q - (r + \gamma \max_{a} Q_{target}))^2$, and finally summing the loss of all episodes at all times. The loss is finally optimized using the RMSProp optimization approach.
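The data flow in this appendix (greedy action selection, storing complete episodes, sampling batches of episodes) can be sketched with a small episode-level replay buffer. The capacity, the uniform sampling, and the names `EpisodeReplayBuffer` and `greedy_action` are assumptions for illustration; the paper's actual buffer settings are those of Table 1.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


class EpisodeReplayBuffer:
    """Stores complete episodes and samples batches of them uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.episodes = deque(maxlen=capacity)  # the oldest episodes are dropped first
        self.rng = random.Random(seed)

    def add(self, episode):
        """Append one complete episode (any object, e.g. a list of transitions)."""
        self.episodes.append(episode)

    def sample(self, batch_size):
        """Return up to `batch_size` distinct stored episodes."""
        count = min(batch_size, len(self.episodes))
        return self.rng.sample(list(self.episodes), count)


buffer = EpisodeReplayBuffer(capacity=2, seed=0)
for name in ["episode_1", "episode_2", "episode_3"]:
    buffer.add(name)
batch = buffer.sample(batch_size=5)
```

Sampling whole episodes rather than single transitions keeps each sequence intact, which a recurrent (GRU) policy needs in order to unroll its hidden state during training.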

Data Availability
Experiments were conducted on the StarCraft multiagent challenge (SMAC). The open-source environment is described in the paper "the StarCraft multiagent challenge."

Conflicts of Interest
The authors declare that they have no conflicts of interest.