Distributed Policy Evaluation with Fractional Order Dynamics in Multiagent Reinforcement Learning

The main objective of multiagent reinforcement learning is to achieve a global optimal policy. It is difficult to evaluate the value function when the state space is high-dimensional. Therefore, we transform the problem of multiagent reinforcement learning into a distributed optimization problem with constraint terms. In this problem, all agents share the space of states and actions, but each agent obtains only its own local reward. Then, we propose a distributed optimization algorithm with fractional order dynamics to solve this problem. Moreover, we prove the convergence of the proposed algorithm and illustrate its effectiveness with a numerical example.


Introduction
In recent years, reinforcement learning [1] has received much attention and succeeded remarkably in many areas of machine learning and artificial intelligence [2]. In reinforcement learning, an agent determines the optimal strategy through the feedback of rewards obtained by constantly interacting with the environment. The policy is a function that maps states to actions. Although reinforcement learning has made great achievements in the single-agent setting, its application to multiagent systems remains challenging [3]. The goal of a multiagent system is to enable several agents, each with simple intelligence that is easy to manage and control, to realize complex intelligence through mutual cooperation. While reducing the complexity of system modeling, such cooperation can also improve the robustness, reliability, and flexibility of the system [4,5].
The objective of this paper is to investigate multiagent reinforcement learning (MARL), where each agent exchanges information with its neighbors in a networked system [6]. All agents share the state space and the joint action, but each agent observes only its own local reward. The purpose of MARL is to determine the global optimal policy. One feasible way is to construct a central controller with which every agent must exchange information [7] and which makes decisions for all of them. However, as the state dimension increases, the computation at the central controller becomes extremely heavy, and the whole system collapses if the central controller is attacked. Therefore, we replace the centralized scheme mentioned above with distributed control [8,9]. Consensus protocols enable all agents to reach the same state [10][11][12][13]. In [14], Zhang et al. proposed a continuous-time distributed version of the gradient algorithm. As far as we know, most gradient methods use integer order iterations. In fact, fractional order calculus has been developed for 300 years and applied to many kinds of problems in control applications and systems theory [15][16][17]. In comparison with traditional integer order algorithms, fractional order algorithms have more design freedom and the potential to obtain better convergence performance [18,19].
The contributions of this paper are as follows: (1) we transform the multiagent policy evaluation problem into a distributed optimization problem with a consensus constraint; (2) we construct the fractional order dynamics and prove the convergence of the algorithm; (3) we present a numerical example to verify the superiority of the proposed fractional order algorithm. The rest of this paper is organized as follows. Section 2 introduces the problem formulation of MARL and fractional order calculus. Section 3 transforms the multiagent policy evaluation problem into an optimization problem with a consensus constraint, proposes an algorithm with fractional order dynamics, and proves that the algorithm asymptotically converges to the exact solution. Section 4 presents a simulation example, and Section 5 summarizes the work.

Notations.
Let R, R^n, and R^{n×m} represent the set of real numbers, the set of n-dimensional real column vectors, and the set of n × m real matrices, respectively. A^T represents the transpose of A. The tuple (S, A, P, {R_i}_{i=1}^n, γ) represents a multiagent Markov decision process (MDP), where S is the state space and A is the joint action space. P^a is the probability of transition from s_t to s_{t+1} when the agents take the joint action a, and [P^π]_{s,s′} = E_{a∼π(·|s)}[P^a]_{s,s′}. R_i(s, a) is the local reward of agent i when joint action a is taken at state s, and γ ∈ (0, 1) is a discount parameter. π(a|s) represents the conditional probability that the agents take joint action a at state s. The reward function of agent i under a joint policy π at state s is defined as

R_i^π(s) = E_{a∼π(·|s)}[R_i(s, a)],

where the right-hand side averages over all possible choices of joint action a. The average of the local rewards is

R_c^π(s) = (1/n) Σ_{i=1}^n R_i^π(s).
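As an illustration of these definitions, the expected local reward R_i^π(s) and its average over agents can be computed directly for a small tabular MDP. The following sketch uses hypothetical sizes and randomly generated π and R_i; none of these numbers come from the paper's experiment:

```python
import numpy as np

# Hypothetical tabular MDP sizes (illustrative, not the paper's experiment).
n_agents, n_states, n_actions = 4, 3, 2
rng = np.random.default_rng(0)

# pi[s, a]: probability of joint action a at state s (rows sum to 1).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)
# R[i, s, a]: local reward of agent i for joint action a at state s.
R = rng.random((n_agents, n_states, n_actions))

# Expected local reward under pi:  R_i^pi(s) = E_{a ~ pi(.|s)}[R_i(s, a)].
R_pi = np.einsum("sa,isa->is", pi, R)

# Average of the local rewards:  R_c^pi(s) = (1/n) sum_i R_i^pi(s).
R_c = R_pi.mean(axis=0)
```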

Graph theory.
The graph is expressed as G(V, E), where G represents a graph, V is the set of vertices, and E is the set of edges in G. If every edge in the graph is undirected, the graph is called an undirected graph [20].
A = [a_ij] ∈ R^{n×n} is the adjacency matrix, D = diag(d_1, ..., d_n) is the degree matrix with d_i = Σ_{j=1}^n a_ij, and the Laplacian matrix is L = D − A. Moreover, if the graph is connected, L has the following two properties: (1) the Laplacian matrix is a positive semidefinite matrix; (2) its minimum eigenvalue is 0 because the sum of every row of the Laplacian matrix is 0. The minimum nonzero eigenvalue is defined as the algebraic connectivity of the graph.
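The two Laplacian properties are easy to verify numerically. A minimal sketch, using a 4-cycle as a stand-in communication graph:

```python
import numpy as np

# Adjacency matrix of an undirected 4-cycle (a stand-in communication graph).
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])

D = np.diag(A.sum(axis=1))  # degree matrix, d_i = sum_j a_ij
L = D - A                   # Laplacian matrix

eigvals = np.sort(np.linalg.eigvalsh(L))
assert eigvals[0] > -1e-10              # property (1): L is positive semidefinite
assert np.allclose(L.sum(axis=1), 0.0)  # property (2): rows sum to 0, so 0 is an eigenvalue
algebraic_connectivity = eigvals[1]     # smallest nonzero eigenvalue of a connected graph
```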

Assumption 1.
The undirected graph mentioned in the following text is connected.

Policy Evaluation.
To measure the benefit of the current state to the agents, we define the following value function, which represents the cumulative return obtained by the agents starting from state s and following a certain policy π:

V^π(s) = E[ Σ_{t=0}^∞ γ^t R_c^π(s_t) | s_0 = s ].

We construct the Bellman equation based on V^π ∈ R^{|S|} and R_c^π ∈ R^{|S|}:

V^π = R_c^π + γ P^π V^π. (6)

It is difficult to evaluate V^π directly if the dimension of the state space is very large. Therefore, we use V_θ(s) = φ^T(s)θ to approximate V^π, where θ ∈ R^d is the parameter vector and φ(s): S → R^d is a feature function of the state s. Indeed, solving equation (6) is equivalent to obtaining a vector θ with V_θ ≈ V^π, that is, minimizing the mean square error (1/2)‖V_θ − V^π‖²_D, where D = diag{μ^π(s), s ∈ S} ∈ R^{|S|×|S|} is a diagonal matrix determined by the stationary distribution μ^π. We construct the problem as follows:

min_θ (1/2)‖V_θ − Π_Φ(R_c^π + γ P^π V_θ)‖²_D + (ρ/2)‖θ‖², (7)

where ρ is a regularization parameter and Π_Φ is the projection operator onto the column subspace of Φ. It is not difficult to rewrite (7) as

min_θ (1/2)(Aθ − b)^T C^{-1}(Aθ − b) + (ρ/2)‖θ‖², (8)

where A = E[φ(s)(φ(s) − γφ(s′))^T], b = E[R_c^π(s)φ(s)], and C = E[φ(s)φ^T(s)]. The minimizer θ of equation (8) is unique if A is a full rank matrix and C is a positive definite matrix. In practice, it is difficult to obtain these expectations in compact form when the distribution is unknown. We replace the expectations with sample averages:

A = (1/p) Σ_{t=1}^p φ(s_t)(φ(s_t) − γφ(s_{t+1}))^T, C = (1/p) Σ_{t=1}^p φ(s_t)φ^T(s_t), b = (1/p) Σ_{t=1}^p R_c^π(s_t)φ(s_t), (9)

where p is the sample size. We assume that p approaches infinity to guarantee the confidence level; in these sequences, each state is visited at least once. Then, we reconstruct equation (8) with the sample averages in place of the expectations:

min_θ (1/2)(Aθ − b)^T C^{-1}(Aθ − b) + (ρ/2)‖θ‖². (10)

Noteworthily, in a shared space, each agent observes the states and actions of its neighbors but observes only its own local rewards. In other words, each agent obtains A and C but not b. So, we define the local quantity

b_i = (1/p) Σ_{t=1}^p R_i(s_t)φ(s_t), so that b = (1/n) Σ_{i=1}^n b_i.

Then, we rewrite equation (10) as follows:

min_θ (1/n) Σ_{i=1}^n f_i(θ), f_i(θ) = (1/2)(Aθ − b_i)^T C^{-1}(Aθ − b_i) + (ρ/2)‖θ‖². (11)
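A minimal sketch of the sample-average construction and the resulting regularized minimizer follows; the trajectory, cosine features, and rewards are synthetic placeholders, and the closed-form solve stands in for the distributed dynamics developed in Section 3:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, gamma, rho = 5, 1000, 0.5, 0.1  # assumed sizes and parameters

# Synthetic state trajectory and cosine features phi(s_t) (placeholders,
# loosely following the experiment's description).
states = rng.random((p + 1, d))
phi = np.cos(states)
# Synthetic average rewards R(s_t) along the trajectory.
r = rng.random(p)

# Sample averages replacing the unknown expectations:
#   A = (1/p) sum_t phi(s_t) (phi(s_t) - gamma * phi(s_{t+1}))^T
#   C = (1/p) sum_t phi(s_t) phi(s_t)^T
#   b = (1/p) sum_t R(s_t) phi(s_t)
A = phi[:-1].T @ (phi[:-1] - gamma * phi[1:]) / p
C = phi[:-1].T @ phi[:-1] / p
b = phi[:-1].T @ r / p

# Unique minimizer of 1/2 (A th - b)^T C^{-1} (A th - b) + (rho/2) ||th||^2,
# from the first-order condition (A^T C^{-1} A + rho I) th = A^T C^{-1} b.
Cinv = np.linalg.inv(C)
theta = np.linalg.solve(A.T @ Cinv @ A + rho * np.eye(d), A.T @ Cinv @ b)
```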

Fractional Order Dynamics for Policy Evaluation
Hereinbefore, the aim of policy evaluation becomes to minimize the objective function. Now, we rewrite (11) in distributed form:

min_{θ_1,...,θ_n} Σ_{i=1}^n f_i(θ_i), subject to θ_i = θ_j for all i, j. (12)

We define θ ∈ R^{nd} as the vector concatenating all θ_i, θ = [θ_1^T, θ_2^T, ..., θ_n^T]^T ∈ R^{nd}, and the aggregate function f as f(θ) = Σ_{i=1}^n f_i(θ_i). As is well known, the consensus constraint in (12) can be expressed as

L̄θ = 0, (13)

where L̄ = L ⊗ I_d ∈ R^{nd×nd} with L ∈ R^{n×n} the Laplacian of the communication graph, Ā = I_n ⊗ A ∈ R^{nd×nd}, and C̄ = I_n ⊗ C ∈ R^{nd×nd}. Based on (13), we formulate the following augmented Lagrangian:

ℒ(θ, λ) = f(θ) + λ^T L̄θ + (1/2)θ^T L̄θ, (14)

where λ ∈ R^{nd} is the Lagrange multiplier. It is feasible to design a fractional order continuous-time optimization algorithm from the primal-dual viewpoint via (14): gradient descent for the primal variable θ and gradient ascent for the dual variable λ. Both of them are updated according to the fractional order law:

D^{α_1} θ(t) = −∇_θ ℒ(θ, λ) = −(∇f(θ) + L̄λ + L̄θ),
D^{α_2} λ(t) = ∇_λ ℒ(θ, λ) = L̄θ. (15)
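The fractional order primal-dual law above can be simulated with a Grünwald-Letnikov discretization of the fractional derivative. The sketch below uses scalar quadratic stand-ins f_i(x) = (a_i/2)x² − b_i x for the local objectives, equal orders α_1 = α_2 = α, and assumed values of α, the step size h, and the horizon; it illustrates the update law and is not the paper's experiment:

```python
import numpy as np

# Assumed fractional order, step size, and horizon for the sketch.
alpha, h, T = 0.9, 0.01, 4000
# Scalar quadratic stand-ins f_i(x) = (a_i/2) x^2 - b_i x for the local objectives.
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0])
# Laplacian of a 3-node path graph (the communication network).
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])

# Grunwald-Letnikov binomial weights: w_0 = 1, w_j = w_{j-1} * (1 - (alpha + 1)/j).
w = np.ones(T + 1)
for j in range(1, T + 1):
    w[j] = w[j - 1] * (1 - (alpha + 1) / j)

th = np.zeros((T + 1, 3))   # primal variable trajectory
lam = np.zeros((T + 1, 3))  # dual variable trajectory
for k in range(1, T + 1):
    grad = a * th[k - 1] - b                           # gradient of the local objectives
    rhs_th = -(grad + L @ lam[k - 1] + L @ th[k - 1])  # primal descent direction
    rhs_lam = L @ th[k - 1]                            # dual ascent direction
    # D^alpha x(t_k) ~ h^{-alpha} sum_{j=0..k} w_j x_{k-j}; solve for x_k.
    th[k] = h**alpha * rhs_th - (w[1:k + 1, None] * th[k - 1::-1]).sum(axis=0)
    lam[k] = h**alpha * rhs_lam - (w[1:k + 1, None] * lam[k - 1::-1]).sum(axis=0)

theta_star = b.sum() / a.sum()  # consensus minimizer of sum_i f_i
```

With these stand-in costs the agents' states approach the common minimizer theta_star, matching the consensus constraint.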
Proof. We obtain the detailed dynamics of θ(t) and λ(t) by substituting the gradient of f into the update law with α_1 = α_2 = α:

D^α θ(t) = −((Ā^T C̄^{-1} Ā + ρI)θ − Ā^T C̄^{-1} b̄ + L̄λ + L̄θ),
D^α λ(t) = L̄θ, (16)

where I is an identity matrix, Ā = I_n ⊗ A, C̄ = I_n ⊗ C, b̄ = [b_1^T, ..., b_n^T]^T, and L̄ = L ⊗ I_d. We consider the equilibrium (θ*, λ*) of (16):

(Ā^T C̄^{-1} Ā + ρI)θ* − Ā^T C̄^{-1} b̄ + L̄λ* + L̄θ* = 0, L̄θ* = 0. (17)

Then, we combine (16) and (17) and, through Lemma 1, reconstruct the error dynamics with θ̃ = θ − θ* and λ̃ = λ − λ* as the frequency distributed model:

∂z_θ(ω, t)/∂t = −ω z_θ(ω, t) − ((Ā^T C̄^{-1} Ā + ρI + L̄)θ̃ + L̄λ̃), θ̃(t) = ∫_0^∞ μ_α(ω) z_θ(ω, t) dω, (18)

and

∂z_λ(ω, t)/∂t = −ω z_λ(ω, t) + L̄θ̃, λ̃(t) = ∫_0^∞ μ_α(ω) z_λ(ω, t) dω. (19)

We construct the Lyapunov function as follows:

V(t) = (1/2) ∫_0^∞ μ_α(ω) (z_θ^T z_θ + z_λ^T z_λ) dω. (20)

Then,

V̇(t) = −∫_0^∞ ω μ_α(ω) (z_θ^T z_θ + z_λ^T z_λ) dω − θ̃^T (Ā^T C̄^{-1} Ā + ρI + L̄) θ̃ ≤ 0, (21)

where the cross terms −θ̃^T L̄λ̃ and λ̃^T L̄θ̃ cancel because L̄ is symmetric. We obtain the result according to the LaSalle invariance principle.
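The frequency distributed (diffusive) representation used in the proofs replaces the fractional-order system with an infinite-dimensional integer-order one; a sketch of the standard form such a lemma takes, for 0 < α < 1:

```latex
\[
D^{\alpha} x(t) = v(t), \quad 0 < \alpha < 1
\;\Longleftrightarrow\;
\begin{cases}
\dfrac{\partial z(\omega, t)}{\partial t} = -\omega\, z(\omega, t) + v(t),\\[4pt]
x(t) = \displaystyle\int_{0}^{\infty} \mu_{\alpha}(\omega)\, z(\omega, t)\, d\omega,
\end{cases}
\qquad
\mu_{\alpha}(\omega) = \frac{\sin(\alpha\pi)}{\pi}\, \omega^{-\alpha}.
\]
```

The Lyapunov functions in the proofs are then built on the distributed state z(ω, t) weighted by μ_α(ω).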
Proof. Under the condition α_1 + α_2 = 2 with α_1 ≠ α_2, we rewrite the dynamics under the conditions of Theorem 1 as follows:

D^{α_1} θ(t) = −(∇f(θ) + L̄λ + L̄θ), (23)
D^{α_2} λ(t) = L̄θ. (24)

Due to α_2 = 2 − α_1, under the conditions of (23) and (24), we obtain the frequency distributed model by Lemma 1. We construct the Lyapunov function in the same manner as in the proof of Theorem 1. Then, through the LaSalle invariance principle, we obtain the result.

Experimental Simulation
In this section, we provide an example to illustrate the effectiveness of the proposed algorithm.
There are 20 states in the multiagent reinforcement learning environment. We set d = 5, the regularization parameter ρ = 0.1, and the discount parameter γ = 0.5.
There are 4 agents in the connected network in Figure 1. State s is a randomly generated 5-dimensional column vector, φ(s) is a cosine feature function of the state, and P is a randomly generated transition matrix. Then, we randomly generate the matrices A, C, and b_i. Before the simulation, it is necessary to obtain the exact solution of the multiagent reinforcement learning problem. We compare the fractional order algorithm with the conventional integer order one. In Figures 2 and 3, the curve for α = 0.995 illustrates almost the same convergence performance as the conventional integer order algorithm. In Figures 4 and 5, the fractional order algorithm achieves a faster convergence rate than the integer order algorithm. The simulation results illustrate the convergence of both the integer order and the fractional order algorithms. Furthermore, the proposed distributed algorithm with fractional order dynamics has more design freedom to achieve better performance than the conventional first-order algorithm.

Conclusion
In this paper, the value function evaluation problem of multiagent reinforcement learning was transformed into a distributed optimization problem with a consensus constraint. Then, we proposed a distributed algorithm with fractional order dynamics to solve this problem. In addition, we proved the asymptotic convergence of the algorithm by Lyapunov functions and illustrated the effectiveness of the proposed algorithm with an example. In the future, we will consider applying reinforcement learning to recommendation systems, so as to obtain better results [23].

Data Availability
The .m and .slx data used to support the findings of this study have been deposited in the GitHub repository (97weiD/data_DPEFOD).