A Value Factorization Method for MARL Based on Correlation between Individuals

Value factorization is a popular method for cooperative multi-agent deep reinforcement learning, which effectively mitigates the explosion of the state-action space dimension and the partial observability problem. However, most existing algorithms consider only the impact of individuals rather than the correlation between individuals, which leads to poor coordination between agents in complex environments. To resolve this problem, this paper proposes CI-VF, a value factorization method for multi-agent deep reinforcement learning based on the correlation between individuals, which effectively promotes coordination between agents. First, individual value function vectors are obtained from the outputs of the individual networks in each round. Second, a Spearman correlation coefficient matrix is calculated from these vectors to measure the degree of correlation between agents, and a joint correlation coefficient is derived from it to optimize the joint value function. The optimized joint value function is then used to train the individual networks. Experimental results show that our method outperforms QMIX and other baselines in various scenarios of the StarCraft Multi-Agent Challenge environment.


Introduction
Cooperative multi-agent tasks in the real world usually involve the cooperation of varying numbers and types of agents, such as resource allocation in wireless sensor networks [1] and vehicle scheduling in intelligent transportation systems [2]. Realizing multi-agent collaboration in such tasks poses a great challenge for which traditional intelligent algorithms [3][4][5] are no longer suitable. One solution is value factorization (VF), a classic multi-agent deep reinforcement learning (MADRL) method, which factorizes the globally shared joint value function into individual value functions based on local observations, urging the agents to learn an optimal joint action policy.
VF is trained under the centralized training with decentralized execution (CTDE) paradigm [6,7], in which the individual networks are optimized with respect to the joint value function while actions are selected greedily according to the individual value functions. Compared with completely centralized training [8], the VF method effectively alleviates the explosion of the state-action space dimension and the partial observability problem.
Compared with independent Q-learning [9], it better handles the non-stationarity of the environment and the multi-agent collaboration problem. The VF method is therefore well worth studying. It was first proposed in an additive form [6], where the joint value function is represented as the simple sum of the individual value functions. Later, Rashid et al. [7] proposed an improved algorithm, QMIX, which represents the joint value function as a monotonic non-linear combination of the individual value functions through a monotonic mixing network. QMIX is a classic and efficient VF method that performs well in the StarCraft Multi-Agent Challenge (SMAC) environment. Many further optimized VF algorithms exist. For example, weighted QMIX [10] addresses the problem that QMIX falls into suboptimal policies due to its monotonicity constraint; QTRAN [11] relaxes QMIX's strong structural constraint; and Qatten [12] tackles multi-agent credit assignment with an attention mechanism.
However, existing VF algorithms consider only each individual's impact on the whole system and ignore the correlation between individuals.
This correlation can be critical, especially in cooperative environments. For example, an attacking agent with low health needs the protection of defensive agents; otherwise, it is extremely vulnerable to being killed by enemies. Although this correlation is implicitly reflected in the constraint linking the joint value function and the individual value functions, that reflection is fuzzy and insufficient.
This paper proposes a MADRL method, CI-VF, to study the correlation between individuals. It uses the Spearman correlation coefficient of the individual value functions to quantify the correlation between individuals and exploits this correlation to guide network training. CI-VF describes the interaction between individuals mainly through the relative variation trends of the individual value functions during training, based on the intuition that if two agents in a cooperative environment are closely related, their individual value functions should be correlated to some extent. By enhancing this correlation, each agent selects actions that are beneficial both to itself and to its partners, so as to learn the optimal joint policy. We evaluate CI-VF on the challenging StarCraft II micromanagement tasks and show that it outperforms QMIX and other baselines in various scenarios. The rest of this paper is organized as follows. Section 2 briefly reviews related work, and Section 3 introduces the basic theory of VF. Section 4 proposes CI-VF, a VF method based on the correlation between individuals. Section 5 presents a comparative experimental analysis of CI-VF and the baselines. Section 6 concludes and discusses future work.

Related Work
In recent years, significant breakthroughs have been made in deep reinforcement learning (DRL), and MADRL has gradually become a research hotspot. MADRL combines multi-agent systems with DRL, aiming to exploit the powerful decision-making and representational capabilities of DRL to solve multi-agent collaboration problems such as UAV (unmanned aerial vehicle) path optimization [13] and network resource allocation [14]. VF is one of the MADRL methods that can solve the multi-agent collaboration problem in cooperative environments. It was first proposed by representing the joint value function as the sum of individual value functions, which achieved initial success in multi-agent reinforcement learning [6]. Since then, researchers have paid increasing attention to VF. QMIX [7] is one of the most classical and effective VF algorithms; it uses a mixing network conditioned on global information to decompose the joint value function into a complex non-linear combination of individual value functions, improving the representational ability, application scope, and stability of the algorithm. Later, realizing that QMIX is easily trapped in suboptimal policies due to its monotonicity constraint, Rashid et al. [10] proposed a weighted QMIX algorithm, WQMIX, which introduces weights on the squared error of the joint value function when updating the networks, so as to obtain the optimal joint policy. Xu et al. [15] proposed an MMD (maximum mean discrepancy) mixing network with global state information, combining distributional reinforcement learning and VF to adapt to highly stochastic environments.
However, although VDN (value decomposition network), QMIX, and WQMIX performed well empirically, they lacked a theoretical basis. Therefore, Yang et al. [12] proposed a cooperative multi-agent reinforcement learning framework, Qatten, which derives the factorization formula of the joint value function theoretically and adopts a mixing network based on multi-head attention to approximate it, enhancing the theoretical soundness of the VF method. Zhao et al. [16] also proposed DQMIX, a novel value-based multi-agent reinforcement learning (MARL) method designed from a distributional perspective, which employs a distribution mixing network to integrate the individual value distributions into a global value distribution, and further proved that DQMIX satisfies the distributional-individual-global-max (DIGM) principle with respect to the expectation of the distribution.
Their experimental results showed that DQMIX copes well with the randomness of long-term returns.
VDN, QMIX, and WQMIX have strong structural constraints because their joint value functions are obtained from additive or monotonic networks. To relax this constraint, Son et al. [11] proposed a general VF method, QTRAN, which transforms the joint value function into a decomposable function with the same optimal actions, thereby improving the scalability of the algorithm. However, although QTRAN has good theoretical guarantees, its empirical performance is poor. To bridge this gap, Son et al. [17] proposed an improved version, QTRAN++, which uses the joint action value obtained from a multi-head monotonic mixing network to imitate the true joint action value obtained from a semi-monotonic mixing network when optimizing the loss function, stabilizing the training process. Later, Wang et al. [18] proposed QPLEX, a MADRL algorithm based on a duplex dueling network architecture. It uses the duplex dueling structure to transform the individual-global-max (IGM) principle into a consistency constraint on advantage functions and introduces an extensible multi-head attention mechanism, improving stability and extensibility without limiting representational ability. Zhou et al. [19] proposed a novel framework, LSF-SAC, which uses a variational inference-based information-sharing mechanism as extra state information to assist the individual value function factorization. By making full use of the shared latent state information, LSF-SAC significantly expands the capacity of value function factorization and performs well in SMAC.
In addition, knowledge from other fields has been introduced into the VF method. Zhang et al. [20] proposed an attention-based method, AVD-Net (attention value decomposition network), which introduces an attention mechanism into VDN and QMIX to learn the correlation between individuals, so as to factorize the joint value function effectively and urge agents to learn cooperative relationships adaptively. Wu et al. [21] combined attention and communication to learn a more general VF formula, enabling multi-agent collaboration in complex environments. Liu et al. [22] proposed an attention relational encoder (ARE) for attentional relational state representation in decentralized MARL; it uses an attention mechanism to aggregate information from neighboring agents and thereby expand each agent's observation space, improving computational efficiency and flexibility and performing well on StarCraft II micromanagement tasks. Iqbal et al. [23] proposed REFIL, a randomized entity-wise factorization method for imagined learning, which randomly selects subsets of observed entities and predicts the utility of each agent in these subsets; by fully exploiting the independence of agents, REFIL greatly improves learning efficiency and extensibility.
Our method also draws on the idea of correlation coefficients in mathematics, which have been applied to multi-agent systems before. Wan et al. [24] proposed an energy-efficient sleep scheduling mechanism with a similarity measure for wireless sensor networks, which divides sensor nodes into categories by measuring the similarity of member nodes through a correlation coefficient, effectively reducing energy consumption. Zhang et al. [25] proposed an agent-level coordination-based MARL method that screens the communication messages between agents by analyzing the correlation between individual agents using the Pearson, Spearman, and Kendall correlation coefficients, effectively addressing the high dimensionality of the input space of the state-action value network.

Background
This section introduces the basic theory underlying VF, mainly including the modeling method, the correlation measure, and classical VF algorithms.

Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
Dec-POMDP is a classical modeling framework for MADRL. A fully cooperative MARL task can be described as a Dec-POMDP defined by a tuple G = ⟨N, S, U, P, r, Z, O, γ⟩ [26], where N denotes the number of agents, s ∈ S denotes the global state, and γ ∈ [0, 1) is the discount factor. At each time step, each agent i selects an action u^i ∈ U, which forms a joint action u ∈ U ≡ U^N. The state transition probability function P(s′ | s, u): S × U^N × S ⟶ [0, 1] describes the transition from state s to state s′. r(s, u): S × U^N ⟶ R is the reward function shared by all agents. Individual observations z ∈ Z are drawn according to the observation function O(s, i): S × N ⟶ Z. Each agent has an action-observation history h^i ∈ H ≡ (Z × U)*, forming a joint action-observation history h = (h^1, . . . , h^N). Each agent selects actions according to a stochastic policy π^i(u^i | h^i): H × U ⟶ [0, 1], and these form a joint policy π = (π^1, . . . , π^N). The joint action-value function is denoted by Q^π(h_t, u_t) = E[Σ_{k=0}^{∞} γ^k r_{t+k} | h_t, u_t].
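To make the notation concrete, the tuple can be sketched as a small Python container. This is an illustrative toy of ours, not part of the original formulation; the class name, the two-state dynamics, and the reward are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative container for the Dec-POMDP tuple G = <N, S, U, P, r, Z, O, gamma>.
# The field names mirror the definition above; this is a sketch, not a simulator.
@dataclass
class DecPOMDP:
    n_agents: int          # N: number of agents
    states: list           # S: global states
    actions: list          # U: individual action set (joint actions live in U^N)
    transition: Callable   # P(s' | s, u): S x U^N x S -> [0, 1]
    reward: Callable       # r(s, u): team reward shared by all agents
    observe: Callable      # O(s, i): per-agent observation function
    gamma: float = 0.99    # discount factor

# Toy instance: 2 agents, 2 states; the team reaches s1 iff someone moves.
g = DecPOMDP(
    n_agents=2,
    states=["s0", "s1"],
    actions=["stay", "move"],
    transition=lambda s, u, s2: 1.0 if (s2 == "s1") == ("move" in u) else 0.0,
    reward=lambda s, u: 1.0 if s == "s1" else 0.0,
    observe=lambda s, i: f"obs({s},{i})",
)
print(g.n_agents, g.reward("s1", ("stay", "stay")))   # → 2 1.0
```

Partial observability is captured by `observe`: each agent sees only `O(s, i)`, never the global state `s` itself.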

Spearman (Rank) Correlation Coefficient.
In statistics, the Spearman correlation coefficient [27,28] is a non-parametric measure of rank correlation, i.e., of the statistical dependence between the ranks of two variables. It measures the extent to which the relationship between two variables can be described by a monotone function.

Definition 1. Suppose there are two groups of sample data X and Y of size N, and we assign ranks rX_i and rY_i to the elements X_i and Y_i, respectively, according to their average positions in descending order over the whole sample. The Spearman correlation coefficient is defined as

r_s = cov(rX, rY) / (σ_rX σ_rY),    (1)

where cov(rX, rY) is the covariance of the ranks and σ_rX, σ_rY are the standard deviations of the ranks. When no elements in the sample share the same value, the Spearman correlation coefficient can also be calculated by

r_s = 1 − (6 Σ_{i=1}^{N} d_i²) / (N(N² − 1)),    (2)

where d_i = rX_i − rY_i is the difference between the corresponding ranks of the i-th pair of elements. Table 1 shows the values and corresponding ranks of elements in both cases.
Since no constraint is placed on the distribution or continuity of the variables, the Spearman correlation coefficient can measure the monotonic correlation between two sets of ordered data. It also indicates the direction of the correlation between X and Y: if Y tends to increase as X increases, the coefficient is positive; otherwise, it is negative or 0. The closer the relationship is to complete monotonic correlation, the greater the absolute value of the coefficient; under complete monotonic correlation, the absolute value is 1.
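Both ways of computing the coefficient, the rank-covariance definition and the rank-difference shortcut, can be checked side by side in a short pure-Python sketch. The helper names are ours, and ties are assumed absent, as the shortcut formula requires.

```python
from statistics import mean, pstdev

def ranks(v):
    """Ascending ranks 1..N (no ties assumed; direction does not change r_s)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r

def spearman_cov(x, y):
    """Rank-covariance definition: r_s = cov(rX, rY) / (sigma_rX * sigma_rY)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = mean((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / (pstdev(rx) * pstdev(ry))

def spearman_diff(x, y):
    """Shortcut valid without ties: r_s = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

x = [10.0, 20.0, 30.0, 40.0]
y = [1.0, 3.0, 2.0, 4.0]
print(spearman_cov(x, y), spearman_diff(x, y))   # both ≈ 0.8
```

Note that only the ranks matter: scaling or monotonically transforming `x` or `y` leaves the coefficient unchanged, which is exactly the distribution-free property discussed above.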

QMIX.
QMIX is a classical VF method that represents the joint action-value function Q_tot(h, u) as a complex non-linear combination of individual action-value functions Q_i based only on local observations. It makes the result of the argmax on Q_tot(h, u) consistent with the argmax on each Q_i by imposing a monotonicity constraint between Q_tot(h, u) and Q_i. That is, when (3) holds, (4) can be derived:

∂Q_tot(h, u) / ∂Q_i(h^i, u^i) ≥ 0, ∀i ∈ {1, . . . , N},    (3)

arg max_u Q_tot(h, u) = (arg max_{u^1} Q_1(h^1, u^1), . . . , arg max_{u^N} Q_N(h^N, u^N)).    (4)
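The monotonic mixing that produces this consistency can be sketched in a few lines of numpy. This is an illustrative toy under stated assumptions: biases, the DRQN agents, and training are omitted, and the hypernetwork is reduced to a single random linear layer whose output is passed through an absolute value to keep the mixing weights non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, state_dim, hidden = 3, 4, 8

# Hypernetwork parameters: one linear layer per generated weight tensor
# (biases omitted for brevity; weights here are random, i.e., untrained).
Hw1 = rng.normal(size=(state_dim, n_agents * hidden))
Hw2 = rng.normal(size=(state_dim, hidden))

def mix(q, state):
    """Monotonic mixing: the hypernetwork maps the global state to mixing
    weights, and abs() keeps them non-negative, which enforces
    dQ_tot/dQ_i >= 0 so that raising any Q_i never lowers Q_tot."""
    w1 = np.abs(state @ Hw1).reshape(n_agents, hidden)
    w2 = np.abs(state @ Hw2)
    h = np.maximum(q @ w1, 0.0)   # ReLU keeps the sketch simple; ELU also works
    return float(h @ w2)

state = rng.normal(size=state_dim)
q = np.array([1.0, 0.5, -0.2])
q_tot = mix(q, state)
# Monotonicity check: increasing one agent's value cannot decrease Q_tot.
assert mix(q + np.array([0.1, 0.0, 0.0]), state) >= q_tot
```

Because every weight downstream of each Q_i is non-negative and the activation is non-decreasing, the greedy joint action decomposes into per-agent greedy actions, which is what allows decentralized execution.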
QMIX adopts three neural networks. (1) Individual network: each individual network is a DRQN [29], which computes the individual value function Q_i(h^i, u^i) from the local observation o_t^i and the last action u_{t−1}^i at each time step. (2) Hypernetwork: the hypernetwork consists of a single linear layer followed by an absolute value activation function; it takes the global state as input and generates the non-negative weights w. (3) Mixing network: the mixing network is a feedforward neural network that mixes the Q_i(h^i, u^i) with the non-negative weights w to produce Q_tot(h, u). The overall architecture of QMIX is shown in Figure 1. QMIX trains the model by minimizing the loss function

L(θ) = Σ_{k=1}^{b} (y_tot^k − Q_tot(h, u, s; θ))²,    (5)

where y_tot = r + γ max_{u′} Q_tot(h′, u′, s′; θ⁻) is the target joint value, b is the batch size, and θ⁻ are the parameters of the target network. As a classical VF method, QMIX achieves good performance in experiments. However, under the monotonicity constraint, QMIX cannot recover the optimal policy in non-monotonic environments. Therefore, the weighted QMIX algorithm, WQMIX [10], was proposed to obtain the optimal joint value function by weighting the squared error of the joint action values when updating the network. WQMIX offers two weighting schemes, called centrally weighted QMIX (CWQMIX) and optimistically weighted QMIX (OWQMIX). The OWQMIX weight, for example, assigns full weight to underestimated joint actions and a small weight α to overestimated ones:

w(s, u) = 1 if Q_tot(s, u) < y_tot, and α otherwise.    (6)
The loss function of WQMIX is then defined as

L(θ) = Σ_{k=1}^{b} w(s, u) (y_tot^k − Q_tot(h, u, s; θ))².    (7)
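The weighting idea can be illustrated with a tiny sketch; only the optimistic variant is shown, and the batch values and α are hypothetical, chosen by us for the example.

```python
def ow_weight(q_tot, y_tot, alpha=0.1):
    """Optimistic weighting: full weight for underestimates (q_tot < y_tot),
    a small weight alpha for overestimates."""
    return 1.0 if q_tot < y_tot else alpha

def weighted_td_loss(batch, alpha=0.1):
    """Mean of w(s, u) * (y_tot - q_tot)^2 over a batch of (q_tot, y_tot) pairs."""
    terms = [ow_weight(q, y, alpha) * (y - q) ** 2 for q, y in batch]
    return sum(terms) / len(terms)

# One underestimated and one overestimated joint value, same squared error 0.25:
batch = [(1.0, 1.5), (2.0, 1.5)]
print(weighted_td_loss(batch))   # → 0.1375
```

The overestimated sample contributes only 0.1 × 0.25 to the loss, so the network is pushed harder toward correcting underestimates, which is how WQMIX escapes the suboptimal policies induced by the monotonicity constraint.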

Correlation between Individuals in Value Factorization
The VF method for MADRL is an effective approach to the multi-agent collaboration problem; it mainly uses the IGM principle and attention mechanisms to evaluate the contribution of each agent, so as to realize multi-agent collaboration. However, existing VF algorithms evaluate only each individual's impact on the whole system and ignore the interaction between individuals. Such interaction is sometimes critical, especially in a cooperative environment [25]. For example, an attacking agent with "weak defense" but "high damage and long range" attributes usually relies on a defensive agent with "strong defense and high health" attributes. Without the protection of the defensive agent, the attacking agent easily becomes the enemies' prime target and is killed effortlessly, significantly reducing combat effectiveness. Therefore, the interaction between individuals should be considered explicitly, urging agents to select actions that benefit not only themselves but also their partners, so as to learn the optimal joint policy.

The interaction between individuals reflects the correlation between individuals, which is embodied in the variation trends of the agents' individual value functions in a cooperative environment. In other words, if two agents interact with each other, the individual value functions corresponding to their selected actions during training should be correlated to some extent; that is, Q_j will increase or decrease as Q_i increases. For example, the individual value of "attack" will increase if an agent receives a positive reward after hitting an enemy. At that moment, if another related agent selects "charge," it is more likely to kill the enemy, and the individual value of "charge" also increases.
To fully account for the interaction between individuals, we propose a novel VF method, CI-VF, to study the correlation between individuals. Our method parameterizes the interaction between individuals using the Spearman correlation coefficient of the individual value functions and takes the parameterized result as an auxiliary term when training the networks, with the aim of improving multi-agent collaboration by reinforcing this correlation. The individual value function vector of each round is a set of ordered data, and the correlation between individuals can be measured from the change trends of these ordered data. It is therefore reasonable and feasible to use the Spearman correlation coefficient to describe the degree of correlation between individual value functions. Using this correlation, each agent selects the action that is best for both itself and its partners, so as to learn the optimal joint policy. The overall framework of CI-VF is shown in Figure 2. The weighted network module represents the weight calculation process for the individual value functions, including the hypernetwork and the attention mechanism. The mixing network module represents the integration of individual value functions into the joint value function. The Spearman correlation module, the main contribution of this paper, represents the quantification of the correlation between individual agents. The module is designed as follows.
In each round, the agents iterate for T time steps. At each time step, agent i selects an action and obtains the individual action value corresponding to that action, denoted Q_t^i. The individual value function vector in each round is denoted Q^i = (Q_1^i, Q_2^i, . . . , Q_T^i). Each individual action value Q_t^i is assigned a rank rQ_t^i according to its average position in descending order, forming a rank vector rQ^i = (rQ_1^i, rQ_2^i, . . . , rQ_T^i). The Spearman correlation coefficient matrix r_s is calculated by

r_s^{ij} = 1 − (6 Σ_{t=1}^{T} (d_t^{ij})²) / (T(T² − 1)),    (8)

where d_t^{ij} = rQ_t^i − rQ_t^j. If the coefficient r_s^{ij} is positive, there is a positive correlation between Q^i and Q^j; thus, Q^j tends to increase as Q^i increases, and vice versa. The higher the absolute value of the coefficient, the stronger the correlation between Q^i and Q^j.
In a cooperative environment, if the correlation between two agents is stronger, their correlation coefficient should be positive and larger. Therefore, we sum the pairwise correlation coefficients to obtain the joint correlation coefficient and take it as an auxiliary term of the joint value function when updating the networks. The joint correlation coefficient r_s^tot is defined as

r_s^tot = Σ_{i≠j} r_s^{ij}.    (9)

Then, we obtain a new joint value function Q′_tot = Q_tot + r_s^tot and the loss function L_r(θ), defined as

L_r(θ) = Σ_{k=1}^{b} (y_tot^k − Q′_tot(h, u, s; θ))².    (10)

The pseudocode of CI-VF is given in Algorithm 1.
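The Spearman correlation module described above can be sketched in pure Python. The helper names and the toy episode values are ours; ties among the per-step values are assumed absent, and ascending ranks are used since the direction of ranking does not change the coefficient.

```python
def ranks(v):
    """Ascending ranks 1..T (no ties assumed)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r

def spearman_matrix(Q):
    """Q[i] holds agent i's chosen-action values (Q_1^i, ..., Q_T^i) for one round.
    Returns the pairwise Spearman matrix r_s[i][j]."""
    n, T = len(Q), len(Q[0])
    R = [ranks(q) for q in Q]
    r_s = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(R[i], R[j]))
            r_s[i][j] = 1.0 - 6.0 * d2 / (T * (T * T - 1))
    return r_s

def joint_coefficient(r_s):
    """r_s_tot = sum over i != j of r_s[i][j], added to the joint value
    as Q'_tot = Q_tot + r_s_tot."""
    n = len(r_s)
    return sum(r_s[i][j] for i in range(n) for j in range(n) if i != j)

# Toy round of T=4 steps: agents 0 and 1 trend together, agent 2 against them.
Q = [[0.1, 0.4, 0.7, 0.9],
     [1.0, 2.0, 3.0, 4.0],
     [4.0, 3.0, 2.0, 1.0]]
r_s = spearman_matrix(Q)
print(r_s[0][1], r_s[0][2], joint_coefficient(r_s))   # → 1.0 -1.0 -2.0
```

In this toy example the anti-correlated third agent drags the joint coefficient negative, so the auxiliary term lowers Q′_tot; training that increases Q′_tot therefore pushes the agents' value trends into alignment.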

Experiments

Settings.
In this section, we evaluate the performance of CI-VF on StarCraft II micromanagement tasks and compare it with classic VF algorithms such as QMIX, VDN, and IQL. During training, our units are controlled by CI-VF, while the enemies are controlled by the built-in AI. All training parameters are the same as those of QMIX in SMAC. CI-VF comprises CIVDN and CIQMIX, the variants built on VDN and QMIX, respectively. We evaluate CIVDN and CIQMIX on four different types of maps, namely 8m, 2s3z, 3s_vs_5z, and MMM2, and compare them with the baselines. The features of these maps are shown in Table 2.

Validation.
First, we evaluate CIQMIX and CIVDN in the easy scenarios, as shown in Figures 3(a) and 3(b); they complete the multi-agent collaboration tasks well in both the homogeneous and the heterogeneous scenario. Compared with previous algorithms, their learning efficiency and stability are slightly improved on the whole, but the improvement is not pronounced, mainly because previous algorithms also work well in easy scenarios. IQL performs significantly worse, mainly because it is a poor multi-agent credit assignment scheme that does not take the interaction between individuals into account. Then, we test CIQMIX and CIVDN in the hard scenario, 3s_vs_5z, with the results presented in Figure 3(c). 3s_vs_5z is a homogeneous and asymmetric scenario in which our agents must learn to keep their distance from the enemies in order to disperse the enemy forces and then surround and kill them one by one.

Mathematical Problems in Engineering
(1) Initialize the network parameters θ and the target network parameters θ⁻ = θ
(2) For each episode do
(3)   For t = 1, . . . , T do
(4)     For i = 1, . . . , N do
(5)       Agent i greedily selects action u_t^i based on Q_i(h_t^i, ·); obtain the joint action u_t
(6)     End for
(7)     Transfer to the next state s_{t+1} and get the total reward r_t after executing u_t
(8)     Store the sample (s_t, o_t, u_t, r_t, s_{t+1}) in the experience pool D
(9)     For i = 1, . . . , N do
(10)      Sample a batch from D and compute Q′_tot = Q_tot + r_s^tot
(11)      Set y_tot = r for terminal s′ and y_tot = r + γ max_{u′} Q′_tot(h′, u′, s′; θ⁻) for non-terminal s′
(12)      Compute the loss L_r(θ)
(13)      Update the network parameters θ by gradient descent
(14)    End for
(15)    Update the parameters θ⁻ = θ every K steps
(16)  End for
(17) End for
ALGORITHM 1: Value factorization method based on correlation between individuals (CI-VF).
This puts high demands on the algorithm. Experimental results show that CI-VF still works well in this asymmetric hard scenario. Compared with previous algorithms, its performance is improved significantly, especially for CIQMIX. This improvement clearly owes to the consideration of the correlation between individuals.
Finally, we test CIQMIX and CIVDN in the super-hard scenario, MMM2, with the results shown in Figure 3(d). MMM2 is a heterogeneous and asymmetric scenario consisting of Medivacs, Marauders, and Marines, in which victory relies on the three unit types working together: the Medivac draws enemy fire and heals the wounded, the Marauders break through the enemies' defense and finish off wounded enemies, and the Marines provide protection and attack the enemy Medivac. It is therefore necessary to take the interaction between individuals into account. The experimental results again show that CI-VF outperforms previous algorithms, further demonstrating the necessity and validity of considering the correlation between individuals.
In short, whether the scenario is homogeneous or heterogeneous, easy or hard, CI-VF performs better than the baselines.
To test the performance of all algorithms clearly and comprehensively, we conduct 5000 test rounds on the trained models to obtain the statistical win rate, i.e., the ratio of winning rounds to total test rounds, with a maximum of 1. The results are shown in Table 3, with the highest values given in bold.
From the table, we find that CI-VF achieves almost the best performance throughout, with final win rates higher than those of the baselines in all scenarios, especially MMM2.

Conclusions and Future Work
In this paper, we propose a MADRL VF method based on the correlation between individuals, which provides a new perspective on the multi-agent credit assignment problem. We first analyzed the necessity of considering the interaction between individuals, then introduced the Spearman correlation coefficient to measure this interaction and analyzed its rationality and feasibility, and finally optimized the joint value function using the joint correlation coefficient obtained from the individual value functions. Experimental results in the SMAC environment show that our method performs better than the baselines in various scenarios.
For future work, it will be worthwhile to investigate whether there are better ways than the Spearman correlation coefficient to measure the interaction between individuals. For example, a correlation network that computes the interaction between individuals from observations and rewards may be worthy of further study.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments
This study was supported by the National Natural Science Foundation of China (61806221).