Decentralized Multiagent Actor-Critic Algorithm Based on Message Diffusion

Abstract. The exponential explosion of joint actions and massive data collection are two main challenges in multiagent reinforcement learning algorithms with centralized training. To overcome these problems, this paper proposes a model-free and fully decentralized actor-critic multiagent reinforcement learning algorithm based on message diffusion. To this end, the agents are assumed to be placed in a time-varying communication network. Each agent makes only limited observations of the global state and joint actions; therefore, it needs to obtain and share information with others over the network. In the proposed algorithm, agents hold local estimations of the global state and joint actions and update them with local observations and the messages received from neighbors. Under the hypothesis of global value decomposition, the gradient of the global objective function with respect to an individual agent is derived. The convergence of the proposed algorithm with linear function approximation is guaranteed by stochastic approximation theory. In the experiments, the proposed algorithm was applied to a multiagent passive location task and achieved superior performance compared to state-of-the-art algorithms.


Introduction
An agent of reinforcement learning (RL) learns skills by trial and error. The agent is situated in an environment that gives responses and rewards corresponding to its actions over discrete time steps. The learning process is modeled as a Markov decision process (MDP) [1,2]. The goal of the agent is to find an optimal policy that maximizes the expectation of the long-term gain without knowledge of the world (model-free [3]). Traditional tabular RL fails to handle situations with large or continuous state spaces, which limits its wider application. Recent theoretical and practical developments have revealed that deep learning (DL) techniques, with their powerful representation ability, can deal with such situations efficiently. Deep reinforcement learning (DRL), a combination of RL and DL, has made remarkable achievements in the fields of chess [4], video games [5], and physical control tasks [6]. At the same time, the use of DRL to solve multiagent problems [7,8] has gradually grown into a new research field, called multiagent deep reinforcement learning (MARL) [9]. A series of recent studies have indicated that MARL algorithms have reached top human levels in multiplayer real-time strategy games [10]. An important problem in MARL is to learn cooperation on team tasks [7], i.e., the MARL cooperation problem. So far, the most popular solution to this problem is centralized training and decentralized execution [11,12]. Methods in this fashion need a global value function [13,14] or a global critic [11] in the training stage and assume the existence of a control center that is able to access the global state [15]. However, due to the constraints of real-world factors such as energy, geographical limitations, and communication ability, it is hard to collect the data of all agents at a center when the number of agents is large.
Fully decentralized MARL algorithms yield promising results in large-scale multiagent cooperation problems, where agents learn and execute actions based on local observations. Independent Q-learning (IQL) [16,17], a simple and scalable algorithm, is a typical fully decentralized MARL algorithm. An IQL agent is trained through its local observations and executes an independent local policy. However, from the perspective of an individual agent, other agents' actions are nonstationary [18], which makes IQL fail in many tasks. In [19], the authors proposed a decentralized multiagent version of the tabular Q-learning algorithm called QD-learning, where each agent is only aware of its own local reward and exchanges information with others, but observes the global state and joint actions. The work in [20] assumes that all the agents are placed in a time-varying network and proposes a fully distributed MARL actor-critic algorithm, in which each agent has its own value function that is parameterized and updated with a weighted sum of other agents' parameters. The main limitation of this algorithm is that it also assumes that the global state and the joint actions of all agents can be obtained directly, which is still difficult to satisfy in many actual scenarios.
The problem addressed in this paper is to mitigate the problems of MARL algorithms caused by centralized training. To this end, this paper proposes a model-free and completely decentralized MARL algorithm based on message diffusion. In the method, all agents are assumed to be in a time-varying communication network, where each agent obtains information from its neighbors and spreads its local observation and action in a diffusion fashion [21]. Each agent has its own reward function, and the global reward is computed as the mean of the local rewards. Each agent is designed as an actor-critic reinforcement learner and maintains a local estimation of the global state and joint actions. These estimated variables are updated with the agent's own observations as well as the messages received from neighbors, namely, in a diffusion style. We leverage stochastic approximation to analyze the convergence of the update process in the proposed algorithm with linear function approximation, and the convergence is guaranteed under reasonable assumptions. The proposed algorithm is evaluated on multiagent passive location tasks, and the results demonstrate its convergence and effectiveness.

Actor-Critic for a Single Agent
The reinforcement learning process can be described by an MDP $\mathcal{M} = \langle S, A, P, R \rangle$, where $S$ represents the set of states and $A$ is the set of actions. $P(s' \mid s, a): S \times A \times S \to [0, 1]$ denotes the state transition probability. The reward function is denoted as $R(s, a): S \times A \to \mathbb{R}$, and the instant reward at time $t$ is $r_{t+1} = R(s_t, a_t)$. The agent's policy is represented as $\pi: S \times A \to [0, 1]$, which is parameterized as $\pi(\cdot, \cdot\,; \theta)$, or $\pi_\theta$ for short, with $\theta$ representing the parameters. The task of the agent is to learn a policy $\pi_\theta$ to maximize the expectation of the long-term reward, i.e., the objective function
$$J(\theta) = \sum_{s \in S} d_\theta(s) \sum_{a \in A} \pi_\theta(s, a)\, R(s, a),$$
where $d_\theta(s) = \lim_t P(s_t = s \mid \pi_\theta)$ represents the stationary state distribution of the Markov chain induced by policy $\pi_\theta$. For policy $\pi_\theta$, the action value function is defined as
$$Q_\theta(s, a) = \mathbb{E}\!\left[\sum_{t \geq 0} \big(r_{t+1} - J(\theta)\big) \,\Big|\, s_0 = s,\ a_0 = a,\ \pi_\theta\right],$$
which represents the expectation of the cumulative reward in the future after visiting state $s$ and taking action $a$. $Q_\theta(s, a)$ is parameterized as $Q_\theta(s, a; \omega)$, or $Q_\theta(\omega)$ for short, with $\omega$ denoting the parameters of the action value function. The state value function is defined from $Q_\theta(s, a; \omega)$, i.e., $V_\theta(s) = \sum_{a \in A} \pi_\theta(s, a)\, Q_\theta(s, a)$, and the advantage function is $A_\theta(s, a) = Q_\theta(s, a) - V_\theta(s)$, which can be regarded as a measure of the advantage of taking action $a$ over the other actions. According to [22], the gradient of the objective function $J(\theta)$ can be computed as
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d_\theta,\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, A_\theta(s, a)\right].$$
The policy parameters are updated with the policy gradient:
$$\theta \leftarrow \theta + \beta\, \psi_\theta\, A_\theta(s, a),$$
where $\beta > 0$ is the step size, and $\psi_\theta = \nabla_\theta \log \pi_\theta(s, a)$ is referred to as the score function of policy $\pi_\theta$.
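To make the update above concrete, the following is a minimal sketch of one actor-critic step for a small discrete MDP, using a tabular softmax policy and a tabular state-value critic; the TD error serves as a one-sample estimate of the advantage $A_\theta(s, a)$. All names and the discounted TD form are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(s, .) for a tabular score matrix theta[s, a]."""
    prefs = theta[s] - theta[s].max()
    e = np.exp(prefs)
    return e / e.sum()

def actor_critic_step(theta, v, s, a, r, s_next,
                      beta_theta=0.1, beta_v=0.1, gamma=0.9):
    """One actor-critic update: the TD error delta estimates the
    advantage, and the actor ascends along the score function
    psi = grad_theta log pi_theta(s, a)."""
    delta = r + gamma * v[s_next] - v[s]   # TD error ~ A(s, a)
    v = v.copy()
    v[s] += beta_v * delta                 # critic update
    pi = softmax_policy(theta, s)
    psi = -pi                              # grad of log pi for softmax scores
    psi[a] += 1.0
    theta = theta.copy()
    theta[s] += beta_theta * delta * psi   # policy gradient ascent
    return theta, v
```

For a positive TD error the probability of the taken action increases, which is exactly the effect of the update $\theta \leftarrow \theta + \beta\, \psi_\theta\, A_\theta(s, a)$.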

Agents in Time-Varying Networks
Definition 1 (Networked agents). Consider a finite directed graph $G_t = (V, C_t)$, where $V$ and $C_t$ represent the set of agents and the set of communication relations, respectively. The number of agents is $N = |V|$, and the set of communication pairs is denoted as $C_t = \{(i, j) \mid i \neq j,\ i, j \in V\}$, with $C_t = [c_t(i, j)]_{N \times N}$ denoting the connection matrix of the agents. The multiagent Markov decision process with networked agents is denoted by $(S, \{A^i\}_{i \in V}, P, \{R^i\}_{i \in V}, \{G_t\}_{t \geq 0})$. There is at least one path between any pair of agents, and each agent can obtain the observations and actions of its neighbors.
At time $t$, the observation of agent $i$ is $s^i_t \in S^i$, where $S^i \subseteq \mathbb{R}^{K_s}$ represents the observation space and $K_s \in \mathbb{N}^+$ refers to the dimension of the observation vector. The observations of all agents together form the global state $s_t = (s^1_t, \dots, s^N_t) \in S$. The action of agent $i$ is $a^i_t \in A^i$, where $A^i \subseteq \mathbb{R}^{K_a}$ and $K_a \in \mathbb{N}^+$ represents the dimension of the action space. The joint action of all agents can be expressed as $a_t = (a^1_t, \dots, a^N_t) \in A$. Agent $i$ holds a local estimation of the global state, $s_{t,i} \in S$, and a local estimation of the joint action, $a_{t,i} \in A$ (if $t$ is not emphasized, they are denoted as $s_i$ and $a_i$, respectively). The reward obtained by agent $i$ is denoted by $r^i_{t+1}$, which is assumed to be bounded. The joint policy, $\pi: S \times A \to [0, 1]$, maps the global state onto the joint action of all agents. Actions among agents are independent, and the joint policy factorizes into the local policies as
$$\pi(s, a; \theta) = \prod_{i \in V} \pi_{\theta^i}(s, a^i),$$
where $\pi_{\theta^i}(\cdot, \cdot): S \times A \to [0, 1]$ is the local policy of agent $i$, which is parameterized by $\theta^i \in \Theta^i \subseteq \mathbb{R}^K$, with $K \in \mathbb{N}^+$ being the dimension of agent $i$'s parameter space. The policy parameters of all agents are collected as $\theta = (\theta^1, \dots, \theta^N) \in \Theta$.

Assumption 2. For any $i \in V$, $s_i \in S$, and $a_i \in A$, the policy function $\pi_{\theta^i}(s_i, a_i) > 0$ is continuously differentiable in $\theta^i \in \Theta^i$. For any $\theta \in \Theta$, the Markov chain $\{s_t\}_{t \geq 0}$ induced by $\pi_\theta(\cdot, \cdot)$ is irreducible and aperiodic.
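As an illustration of the connection matrix in Definition 1, one common (assumed here, not prescribed by the paper) choice for $C_t = [c_t(i, j)]$ is to give each agent equal weight over itself and its current neighbors, making every row sum to one so that diffusion averages estimates rather than amplifying them:

```python
import numpy as np

def connection_matrix(adj):
    """Row-stochastic connection matrix c_t(i, j) from a 0/1 adjacency
    matrix: each agent weights itself and its neighbors equally.
    This is one simple valid choice, not the paper's prescription."""
    A = np.asarray(adj, dtype=float)
    np.fill_diagonal(A, 1.0)   # every agent keeps part of its own estimate
    return A / A.sum(axis=1, keepdims=True)
```

Row-stochasticity guarantees that each diffused estimate stays a convex combination of the neighbors' estimates.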
The policy $\pi_{\theta^i}(\cdot, \cdot)$ is differentiable with respect to the parameter $\theta^i$ so that deep neural networks can be used. Furthermore, the Markov chain induced by policy $\pi_\theta(\cdot, \cdot)$, being aperiodic and irreducible [23], has a stationary distribution.
The task in the multiagent cooperation problem is to find a joint policy that maximizes the expectation of the averaged long-term reward over all agents:
$$J(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=0}^{T-1} \bar{r}_{t+1}\right],$$
where $\bar{R}(s, a) = (1/N) \sum_{i \in V} R^i(s, a)$ represents the global averaged reward function, whose value at time $t$ is $\bar{r}_t = (1/N) \sum_{i \in V} r^i_t$. Then, the global action value function $Q$ is defined by
$$Q_\theta(s, a) = \mathbb{E}\!\left[\sum_{t \geq 0} \big(\bar{r}_{t+1} - J(\theta)\big) \,\Big|\, s_0 = s,\ a_0 = a,\ \pi_\theta\right].$$
The local action value function $Q^i$ of agent $i$ is defined analogously with respect to the local reward $r^i$, and is parameterized as $Q^i(s^i, a^i; \omega^i)$. Since $V(s) = \sum_{a \in A} \pi(s, a; \theta)\, Q(s, a; \omega)$, the local advantage function $A^i$ is defined as
$$A^i(s, a) = Q(s, a; \omega) - \sum_{a^i \in A^i} \pi_{\theta^i}(s, a^i)\, Q\big(s, (a^i, a_{-i}); \omega\big), \tag{9}$$
where $a_{-i}$ (or $a_{t,-i}$ at time $t$) represents the joint actions of all agents except agent $i$.
Assumption 3 (global value decomposition). The global value function can be decomposed into a weighted sum of the local value functions over the agents, namely,
$$Q(s, a; \omega) = \sum_{i \in V} c_i\, Q^i(s^i, a^i; \omega^i),$$
where $c_i$ reflects the importance of the individual action value function $Q^i(\cdot, \cdot)$. Note that the action value function $Q^i(\cdot, \cdot)$ of agent $i$ depends only on its local observation and action.
Assumption 3 is similar to [14]; under it, the following theorem about the gradient of the global objective function with respect to a single agent's parameters holds.
Theorem 4. Under Assumptions 2 and 3, the gradient of the global objective function $J(\theta)$ with respect to the parameters $\theta^i$ of agent $i$ satisfies
$$\nabla_{\theta^i} J(\theta) = \mathbb{E}\!\left[\psi^i \cdot \frac{1}{N} \sum_{j \in V} A^j(s, a)\right],$$
where $\psi^i = \nabla_{\theta^i} \log \pi_{\theta^i}(s_{t,i}, a_{t,i})$ is the score function, and $A^j$ is the local advantage function defined by (9).
The proof of Theorem 4 follows a scheme similar to [22], and the complete proof is provided in Appendix A. Theorem 4 indicates that the gradient of the global objective function $J(\theta)$ can be estimated from the agent's local score function $\psi^i$ and the local advantage functions $A^j$. Although there is a certain deviation between the local observation of a single agent and the global state, in a time-varying network an agent can improve the accuracy of its global state estimation by exchanging messages. In the next section, a completely decentralized multiagent reinforcement learning algorithm based on message diffusion over a communication network is proposed.

Distributed Actor-Critic Based on Message Diffusion
Suppose that, in a group of intelligent agents with communication abilities, each agent expects to obtain more information from the other agents so as to estimate the value function and optimize its policy more accurately, but it can only make partial observations and receive messages via the communication network. In this section, a message diffusion-based distributed MARL algorithm is proposed that lets agents gain global information and spread their observations efficiently. In the algorithm, at time step $t$, each agent $i$ holds a global state estimation $s_{t,i}$ and a joint action estimation $a_{t,i}$. The running estimate of the long-term return is updated by
$$\mu^i_{t+1} = (1 - \beta_{\omega,t})\, \mu^i_t + \beta_{\omega,t}\, r^i_{t+1},$$
where $\mu^i_t$ is an estimation of the long-term return of agent $i$ with a bootstrapping update, and $\beta_{\omega,t} > 0$ is the step size. Agent $i$'s global state estimation is updated in two steps. In step one, the part of the global state estimation corresponding to agent $i$ is replaced by its fresh observation according to (13):
$$s_{t,i} \leftarrow e(i) \otimes s^i_t + \big((1_N - e(i)) \otimes 1_{K_s}\big) \cdot s_{t,i}, \tag{13}$$
$$a_{t,i} \leftarrow e(i) \otimes a^i_t + \big((1_N - e(i)) \otimes 1_{K_a}\big) \cdot a_{t,i}, \tag{14}$$
where $e(i)$ is an $N$-dimensional unit vector with only element $i$ being 1, $\otimes$ represents the Kronecker product, $1_N$ is an $N$-dimensional vector with all elements being 1 ($1_{K_s}$ is defined similarly), and the operation "$\cdot$" in (13) and (14) is the element-wise product. In step two, the global state estimation is updated by message diffusion according to (15):
$$s_{t+1,i} = \sum_{j \in N(i)} c_t(i, j)\, s_{t,j}, \tag{15}$$
and the joint action estimation $a_{t+1,i}$ is diffused in the same way. The parameters of the value function of agent $i$ are updated through
$$\omega^i_{t+1} = \omega^i_t + \beta_{\omega,t}\, \delta^i_t\, \nabla_{\omega^i} Q^i(s_{t,i}, a_{t,i}; \omega^i_t), \tag{16}$$
where $\delta^i_t = r^i_{t+1} - \mu^i_t + Q^i(s_{t+1,i}, a_{t+1,i}; \omega^i_t) - Q^i(s_{t,i}, a_{t,i}; \omega^i_t)$ denotes the local temporal difference error of agent $i$. According to Theorem 4, the policy of agent $i$ is improved via gradient ascent:
$$\theta^i_{t+1} = \theta^i_t + \beta_{\theta,t}\, \psi^i_t\, A^i(s_{t,i}, a_{t,i}), \tag{17}$$
where $\beta_{\theta,t}$ is the step size, $A^i(s_{t,i}, a_{t,i})$ is the local advantage function defined by (9), and $\psi^i_t = \nabla_{\theta^i} \log \pi_{\theta^i}(s_{t,i}, a_{t,i})$ is the score function. Algorithm 1 summarizes the steps of the proposed approach. The algorithm works in an on-policy fashion; i.e., transitions are discarded once they have been used to update the policy parameters.
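The two-step estimate update of (13)-(15) can be sketched directly with the Kronecker-product masks. The helpers below stack each agent's global-state estimate as one flat vector of length $N \cdot K_s$; this layout and the function names are illustrative, not from the paper.

```python
import numpy as np

def replace_own_slot(est_i, own_obs, i, N, Ks):
    """Step one (Eq. (13)): overwrite agent i's slot of its global-state
    estimate with the fresh local observation, keeping the other slots."""
    e = np.zeros(N)
    e[i] = 1.0
    keep = np.kron(1.0 - e, np.ones(Ks))   # (1_N - e(i)) ⊗ 1_{Ks}
    return np.kron(e, own_obs) + keep * est_i

def diffuse(estimates, C):
    """Step two (Eq. (15)): each agent averages its neighbors' estimates
    with the connection weights c_t(i, j); rows of `estimates` are agents."""
    return C @ estimates
```

After step one an agent's own slot always carries its latest observation, while step two mixes in the (possibly staler) views of its neighbors.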

Convergence Analysis
To analyze the convergence of Algorithm 1, we make several assumptions on the policy update and the step sizes. Under these assumptions and linear function approximations, it is shown that both the value function and the policy function in Algorithm 1 converge almost surely.
Assumption 5. The update of the policy parameter $\theta^i$ in (17) includes a local projection operator $\Gamma^i$ that projects $\theta^i$ onto a compact convex set $\Theta^i \subseteq \mathbb{R}^K$.

This assumption is commonly used in the analysis of the transient behavior of stochastic approximation [24,25]. In fact, it is only for the convenience of analysis and is not necessary in experiments.

Assumption 6. The step sizes $\beta_{\omega,t}$ and $\beta_{\theta,t}$ satisfy
$$\sum_t \beta_{\omega,t} = \sum_t \beta_{\theta,t} = \infty, \qquad \sum_t \big(\beta^2_{\omega,t} + \beta^2_{\theta,t}\big) < \infty,$$
and $\beta_{\theta,t} = o(\beta_{\omega,t})$, so that the critic is updated on a faster timescale than the actor.

Assumption 6 is essential for stochastic approximation and other stochastic bootstrapping algorithms. It ensures that the correction at each step becomes smaller and smaller, while the decrease is not so fast that the iterates would converge prematurely regardless of the initial states.
The action value function of agent $i$ is approximated by a linear function family $Q^i(s, a; \omega^i) = (\omega^i)^\mathsf{T} \phi^i(s, a)$, where $\phi^i(s, a) = [\phi^i_1(s, a), \dots, \phi^i_K(s, a)]^\mathsf{T} \in \mathbb{R}^K$ is the feature vector corresponding to the state-action pair $(s, a)$, which is uniformly bounded. The feature matrix $\Phi \in \mathbb{R}^{|S||A| \times K}$ has full column rank, and its $k$th column is $[\phi_k(s, a),\ s \in S,\ a \in A]^\mathsf{T}$. For convenience, $\phi(s_i, a_i)$ is abbreviated as $\phi_i$.
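Under this linear parameterization the critic update (16) takes a particularly simple form, since $\nabla_{\omega^i} Q^i = \phi^i$. The sketch below uses an average-reward TD error with the running return estimate $\mu$, matching the form of $\delta^i_t$ given earlier; the function names are illustrative.

```python
import numpy as np

def q_linear(w, phi):
    """Linear action-value approximation Q(s, a; w) = w^T phi(s, a)."""
    return float(w @ phi)

def critic_update(w, mu, phi, phi_next, r, beta=0.1):
    """One step in the spirit of Eq. (16): with linear features the
    gradient of Q w.r.t. w is phi, so w moves along delta * phi."""
    delta = r - mu + q_linear(w, phi_next) - q_linear(w, phi)
    return w + beta * delta * phi, delta
```

The full-column-rank condition on $\Phi$ is what makes the mean update field nonsingular in the convergence proof below.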

Theorem 7.
Under Assumptions 2, 3, 5, and 6, for any policy $\pi_\theta$, the value parameter sequence $\{\omega^i_t\}$ generated by (16) converges to $\omega^i_\theta$ with probability 1, where $\omega^i_\theta$ depends on the policy $\pi_\theta$.
Proof. With linear function approximation, the update of the parameter $\omega^i_t$ is
$$\omega^i_{t+1} = \omega^i_t + \beta_{\omega,t}\, \delta^i_t\, \phi^i_t. \tag{18}$$
The increment $\delta \phi$ in the update can be seen as the error caused by the current parameter $\omega^i_t$. Construct its square as $Z(\omega^i_t) = (1/2)(\delta \phi)^\mathsf{T} (\delta \phi)$. When $Z(\omega^i_t)$ attains its minimum, $\delta \phi = 0$ holds. Hence, the convergence point of $\omega^i_t$ can be found by optimizing (16) after introducing a random variable $\xi_t$ that estimates $\delta_t \phi_t$:
$$\xi_{t+1} = \xi_t + \beta_{\xi,t} \big(\delta_t \phi_t - \xi_t\big), \tag{19}$$
where $\beta_{\xi,t} > 0$ is the update step size, which satisfies Assumption 6. We assume that the second-order moments of $\phi^i_t$ and $\phi^i_{t+1}$ are bounded.
Rewrite the update process in (19) in the stochastic approximation form
$$\rho_{t+1} = \rho_t + \beta_t\big(h(\rho_t) + M_{t+1}\big), \tag{21}$$
where $\rho_t$ stacks the iterates $(\omega^i_t, \xi_t)$, $h(\cdot)$ is the mean update field driven by a matrix $A$ built from the expected features, and $M_{t+1}$ is a martingale difference noise term.

Since the feature matrix $\Phi$ has full column rank, $A$ is a nonsingular matrix. Let $G = A^\mathsf{T} A$; then $\det(G) = \det(A^\mathsf{T} A) \neq 0$, and the eigenvalues of $G$ are nonzero. Denote an eigenvalue of $G$ by $\lambda \in \mathbb{C}$, $\lambda \neq 0$, with corresponding eigenvector $x$. The real part of the eigenvalue satisfies $\mathrm{Re}(\lambda) = -\sqrt{\eta}\, \|x_1\|^2 < 0$, so the ODE $\dot{\rho} = h(\rho)$ has a globally asymptotically stable equilibrium. According to stochastic approximation theory [26], the update described by (21) converges; thus $\{\omega^i_t\}$ converges almost surely. ∎

Theorem 8. Under Assumptions 2, 5, and 6, the policy parameter $\theta^i$ updated through (17) converges with probability 1 to an asymptotically stable equilibrium of the following ordinary differential equation (ODE):
$$\dot{\theta}^i = \hat{\Gamma}^i\!\left[\nabla_{\theta^i} J(\theta)\right]. \tag{24}$$

Algorithm 1: Multiagent actor-critic algorithm based on message diffusion.
Input: Initialize parameters $\mu^i_0$, $\omega^i_0$, $\theta^i_0$, $\forall i \in V$, and step sizes $\beta_{\omega,t}$, $\beta_{\theta,t}$. Agent $i$ initializes $a^i_0$ and obtains observation $s^i_0$, then initializes the estimations $s_{0,i}$, $a_{0,i}$. Initialize the counter $t \leftarrow 0$.
1: repeat
2:   for $i = 1, \dots, N$ do
3:     Sample and execute an action $a^i_t$ from the policy function $\pi_{\theta^i}(s_{t,i}, \cdot)$, and obtain $s^i_{t+1}$, $r^i_{t+1}$;
4:     Update the reward estimate $\mu^i_{t+1} \leftarrow (1 - \beta_{\omega,t})\, \mu^i_t + \beta_{\omega,t}\, r^i_{t+1}$;
5:     Update the local state $s_{t,i} \leftarrow e(i) \otimes s^i_t + ((1_N - e(i)) \otimes 1_{K_s}) \cdot s_{t,i}$;
6:     Update the local action $a_{t,i} \leftarrow e(i) \otimes a^i_t + ((1_N - e(i)) \otimes 1_{K_a}) \cdot a_{t,i}$;
7:     Send $s_{t,i}$ and $a_{t,i}$ to the neighbors;
8:   end for
9:   for $i = 1, \dots, N$ do
10:    Update the estimation of the global state $s_{t+1,i} \leftarrow \sum_{j \in N(i)} c_t(i, j)\, s_{t,j}$;
11:    Update the estimation of the joint action $a_{t+1,i} \leftarrow \sum_{j \in N(i)} c_t(i, j)\, a_{t,j}$;
12:  end for
13:  for $i = 1, \dots, N$ do
14:    Calculate the temporal difference error $\delta^i_t$;
15:    Update the critic $\omega^i_{t+1} \leftarrow \omega^i_t + \beta_{\omega,t}\, \delta^i_t\, \nabla_{\omega^i} Q^i(\omega^i_t)$;
16:    Update the advantage function $A^i(s_{t,i}, a_{t,i})$;
17:    Update the score function $\psi^i_t \leftarrow \nabla_{\theta^i} \log \pi_{\theta^i}(s_{t,i}, a_{t,i})$;
18:    Update the actor $\theta^i_{t+1} \leftarrow \theta^i_t + \beta_{\theta,t}\, \psi^i_t\, (1/N) \sum_{j \in V} A^j(s_{t,i}, a_{t,i})$;
19:  end for
20:  $t \leftarrow t + 1$;
21: until the algorithm converges.
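The main loop of Algorithm 1 can be sketched for a toy setting with scalar observations and actions. The environment interface, the use of the raw stacked estimate as the critic feature, and the reduction of the actor step to a surrogate gradient signal are all illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def algorithm1_sweep(S_est, A_est, w, mu, theta_grads, C, obs, acts, rews,
                     beta_w=0.05):
    """One sweep of Algorithm 1 (steps 2-19) for N agents with scalar
    observations/actions and a linear critic on the stacked estimate.
    Rows of S_est/A_est hold each agent's global estimates."""
    N = len(obs)
    for i in range(N):                      # steps 3-7
        mu[i] = (1 - beta_w) * mu[i] + beta_w * rews[i]
        S_est[i, i] = obs[i]                # local replacement, Eq. (13)
        A_est[i, i] = acts[i]               # Eq. (14)
    S_new, A_new = C @ S_est, C @ A_est     # diffusion, steps 10-11 / Eq. (15)
    for i in range(N):                      # steps 14-18
        phi, phi_n = S_est[i], S_new[i]
        delta = rews[i] - mu[i] + w[i] @ phi_n - w[i] @ phi
        w[i] += beta_w * delta * phi        # critic step, Eq. (16)
        theta_grads[i] = delta              # placeholder for psi * mean advantage
    return S_new, A_new, w, mu, theta_grads
```

Note how each agent only ever touches its own slot of the estimates plus its neighbors' rows through `C`, which is exactly what makes the scheme fully decentralized.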
Proof. Denote the $\sigma$-field generated by $\{\theta_\tau, \tau \leq t\}$ by $\mathcal{F}_{t,2} = \sigma(\theta_\tau, \tau \leq t)$. Define the random variable $M^i_t$ as the difference between the sampled update direction $\psi^i_t \cdot (1/N)\sum_{j \in V} A^j_t$ and its conditional expectation given $\mathcal{F}_{t,2}$. The update process of (17) is represented with the projection $\Gamma^i$ as
$$\theta^i_{t+1} = \Gamma^i\Big[\theta^i_t + \beta_{\theta,t}\Big(\mathbb{E}\big[\psi^i_t \cdot \tfrac{1}{N}\textstyle\sum_{j \in V} A^j_t \,\big|\, \mathcal{F}_{t,2}\big] + M^i_t\Big)\Big]. \tag{27}$$
Due to the convergence of the policy evaluation (Theorem 7), $\{M^i_t\}$ is a martingale difference sequence. Since $\{\omega^i_t\}$, $\{\psi^i_t\}$, and $\{\phi^i_t\}$ are bounded, $\{M^i_t\}$ is bounded. Based on Assumption 6, it holds that $\sum_t \beta^2_{\theta,t}\, \mathbb{E}\big[\|M^i_t\|^2 \mid \mathcal{F}_{t,2}\big] < \infty$. Furthermore, $\{M^i_t\}$ converges according to the convergence theory of martingale difference sequences [23]; thus, for any $\varepsilon > 0$, the tail sums $\sum_{\tau \geq t} \beta_{\theta,\tau} M^i_\tau$ exceed $\varepsilon$ with vanishing probability. Therefore, (27) satisfies the conditions of the Kushner-Clark theorem [28,29], and the policy parameter $\theta^i$ converges to an asymptotically stable equilibrium of ODE (24). ∎

[Figure 1: Agents need to find an optimal geometry to improve the positioning precision collaboratively. Each agent can only access a partial observation, i.e., the signals received by itself as well as its own position information. An agent can communicate with other agents within 2 km (its neighbors), sharing individual observations. The light gray area shows the physically feasible area for all the agents, and the dark gray area shows where the agents can receive valid signals for passive positioning.]

Experiments
In this section, our proposed method is evaluated from two aspects: (i) an ablation study, which investigates the influence of the number of agents on the performance, and (ii) a comparison study, which examines the advantages of the proposed method over existing methods.

6.1. Multiagent RL for Passive Location Tasks. The experiments are performed with a passive location task environment, a reinforcement learning environment in which agents need to automatically find an optimal geometry to improve the positioning precision. The environment was introduced by [27], where all the agents are controlled by a single brain that maps the global observation into joint actions. We modified the environment into a multiagent one by limiting the observation so that each agent can only access its own radio signals and position. Furthermore, each agent has a distinct brain consisting of an actor and a critic, learning and making decisions independently.
The whole scheme of the environment is shown in Figure 1. Consider a circular region with a radius of 6 km, at the center of which is a transmitter that emits radio signals all the time. The area within 1 km of the transmitter is a forbidden area, filled with the blue grid. Each agent is equipped with a radio receiver that can intercept wireless signals to estimate the position of the transmitter. All the radio receivers are assumed to have the same performance in processing wireless signals. According to the sensitivity of the radio receivers, when agents go beyond a distance of 4 km from the transmitter, nothing can be received. So it is better for the agents to optimize the geometry within a region closer to the transmitter, shown as the dark gray area in Figure 1. Considering the multipath and interference of electromagnetic propagation, there are three low signal-to-noise ratio (SNR) regions, where the signals received by the agents are contaminated by strong noise, leading to low positioning precision. The position of the transmitter is estimated in two steps: first, figuring out the time lag of signal propagation for each pair of agents, and second, estimating the transmitter's position that satisfies the time lags obtained in the first step with least squares algorithms. The task of the agents is to navigate to an optimal geometry configuration step by step, avoiding the low SNR and forbidden regions and improving the positioning precision.
6.2. Setup. We model the passive location task as a multiagent decision-making problem, on which we evaluate our proposed method. Following the key components of reinforcement learning, the observation, action, and reward function are defined as follows.

6.2.1. Observation. The observation of agent $i$ consists of the radio signals received by its own receiver and its own position information.

6.2.2. Action. The action of agent $i$ is its own position adjustment, $[\Delta x_i, \Delta y_i]$. Hence, the agent will move to $[x_i + \Delta x_i, y_i + \Delta y_i]$ in the next time step. In the experiment, actions are clipped to the interval $[-50, 50]$.
6.2.3. Reward. All the agents share the same reward in each time step. The reward function reflects the positioning precision and is defined by
$$r_t = -\frac{1}{N_{\mathrm{est}}} \sum_{k=1}^{N_{\mathrm{est}}} \left\| p^\star_k - p^\star \right\|,$$
where $p^\star$ is the position of the transmitter, $p^\star_k$ is the $k$th estimate of $p^\star$, and $N_{\mathrm{est}}$ denotes the number of estimations in each time step. In the experiment, $N_{\mathrm{est}} = 100$.
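A concrete sketch of this shared reward follows; since the exact expression is garbled in the source, the sign convention and Euclidean norm below are assumptions rather than the paper's definitive formula.

```python
import numpy as np

def location_reward(p_true, p_estimates):
    """Shared per-step reward reflecting positioning precision:
    negative mean Euclidean error of the N_est position estimates
    (an assumed instantiation of the paper's reward)."""
    errs = np.linalg.norm(np.asarray(p_estimates, float) -
                          np.asarray(p_true, float), axis=1)
    return -float(errs.mean())
```

A better geometry yields smaller estimation errors and therefore a reward closer to zero.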
For each agent, both the actor and the critic are designed as fully connected neural networks with two hidden layers of 256 units, each followed by a tanh activation layer. The actor maps the observation into actions. Specifically, it takes the observation as input and generates a two-dimensional Gaussian distribution, from which the action is sampled. The structure of the critic is similar to that of the actor, but its output is the value function, which is used to optimize the actor according to policy gradient theory.
The diffusion processes are completed one by one among the agents but in a random order, and each agent updates its global state estimation from its neighbors only once per time step. Agents within 2 km of each other are called neighbors. As shown in Figure 1, agent 2 is a neighbor of agent 1. Under the setting of Figure 1, agent 1 is able to access the observation and last action of agent 2 but cannot obtain this information from agent 3.
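The 2 km neighbor rule that drives the time-varying graph can be written as a simple distance threshold on the agents' current positions; the positions in the usage below are illustrative.

```python
import numpy as np

def neighbor_adjacency(positions, radius=2.0):
    """0/1 adjacency for the time-varying graph G_t: agents within
    `radius` km of each other are neighbors (self-pairs excluded)."""
    P = np.asarray(positions, dtype=float)
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    return (D <= radius) & ~np.eye(len(P), dtype=bool)
```

As the agents move, recomputing this adjacency each step yields the sequence of graphs $G_t$ over which messages diffuse.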

Ablation Study.
To understand the influence of the number of agents on the performance, we performed a series of experiments with different agent numbers (N = 3, 5, 7, 10) on the passive location task described above. Concretely, we focus on indices that reflect the training process across all the agents: averaged episode return, averaged episode length, policy entropy, and averaged number of neighbors. The maximum number of steps in an epoch is 800, with each episode executing no more than 100 steps. Figure 2 shows the training curves after 1000 epochs across six random seeds (0, 100, 200, 300, 400, 500) that initialize the agents. The optimizer is Adam with a learning rate of 1 × 10⁻⁴. The averaged episode return is a key indicator of the agents' ability to accomplish the given task. From Figure 2(a), it can be seen that when N = 3, the agents cannot master useful skills to obtain a higher episode return, but the performance improves as the number of agents increases (N = 5, 7, 10). This is consistent with Figure 2(b), in which the averaged number of neighbors increases as the scale of agents becomes larger. The averaged number of neighbors reflects the connectivity of the agents, which determines the diffusion efficiency of messages between them. When the agent number is N = 3, agents may be scattered at distances beyond their ability to exchange information. In that case, our message diffusion-based method reduces to independent agents, which cannot handle the cooperative task in a partially observed environment. Figures 2(c) and 2(d) show the trends of mean policy entropy and averaged episode length over agents across 1000 epochs of training. Both indicators drop as more agents take part in the task. The decrease of episode length indicates that agents can accomplish the task with fewer steps and collaboratively find more elegant paths to optimal geometries.
The decline of policy entropy suggests that the agents become more confident in decision-making.

Comparison Study.
In the comparison study, we investigate the advantages of our proposed method over two kinds of algorithms that are popular in the field of decentralized multiagent reinforcement learning. One is independent agents trained by algorithms such as IQL [16]. Since the action space in the passive location task is continuous, we let every agent learn with its own actor-critic structure to handle partial observation, called independent actor-critic (IAC). The other decentralized algorithm we compare with is proposed in [20], where the agents update their neural networks by directly combining the parameters of their neighbors. We refer to this method as weight sharing (WS).
The proposed method in this paper facilitates the collaboration of agents by message diffusion, which makes each agent's individual estimation of the global state more accurate. We ran our method (Diffu for short) and the baseline methods (IAC and WS) with different numbers of agents (N = 3, 5, 7, 10), and the results are shown in Figure 3. In general, our method has superior performance in the experiments. It can be seen that for the proposed method, the more agents there are, the greater the advantage obtained. IAC failed in all the experiments due to the inherent challenge of a multiagent environment: from the perspective of an individual agent, the actions of other agents are nonstationary. WS attempts to address this challenge by introducing weight sharing among agents. It obtains the information of neighbors indirectly by combining their neural network parameters, but this is not very effective in the experiments. In the passive location task, message diffusion becomes easier with more agents, so that each agent has a more accurate estimation of the global state, which helps tackle the nonstationarity problem.
The learned agents are able to perform passive location tasks effectively. Figure 4 shows the trajectories that navigate to an optimal geometry with different numbers of agents (N = 7, 10). In these two scenarios, agents can adjust the geometry collaboratively to improve the positioning precision and even master skills such as taking a detour to avoid the low SNR and forbidden regions or sacrificing immediate reward for a better geometry configuration.

Conclusion
This paper investigated the multiagent problem in a time-varying network, where agents observe only partial information and exchange information with neighbors. A fully decentralized actor-critic multiagent reinforcement learning algorithm based on message diffusion was proposed. Each agent is trained to make decisions depending on its own local observations and the messages received from neighbors. This completely noncentral training and execution method overcomes the data collection challenges, especially when both the state space and the action space are massive. The convergence of the proposed algorithm with linear function approximation is guaranteed. Experimental results confirmed the convergence and effectiveness of the proposed algorithm. The decentralized method can be applied in many other areas, such as packet routing in computer and wireless communications. In future work, more general function approximations will be employed to analyze the convergence of the algorithm.

Data Availability
The data (experiment environment and source code) is developed by our research team, and we will open the source code at an appropriate time. Requests for access to these data should be made to Shengxiang Li, lishengxiangzz@163.com.

Conflicts of Interest
The authors declare that they have no conflicts of interest.