Intelligent L2-L∞ Consensus of Multiagent Systems under Switching Topologies via Fuzzy Deep Q Learning

The problem of intelligent L2-L∞ consensus design for leader-follower multiagent systems (MASs) under switching topologies is investigated based on switched control theory and fuzzy deep Q learning. The communication topologies are assumed to be time-varying, and the model of MASs under switching topologies is constructed based on switched systems. By employing a linear transformation, the consensus problem of MASs is converted into an L2-L∞ control problem. The consensus protocol is composed of a dynamics-based protocol and a learning-based protocol, where robust control theory and deep Q learning are applied to the two parts to guarantee the prescribed performance and improve the transient performance, respectively. The multiple Lyapunov function (MLF) method and the mode-dependent average dwell time (MDADT) method are combined to give the scheduling interval, which ensures stability and the prescribed attenuation performance. Sufficient conditions for the existence of the consensus protocol are given, and the solutions of the dynamics-based protocol are derived based on linear matrix inequalities (LMIs). Then, the online design of the learning-based protocol is formulated as a Markov decision process, where fuzzy deep Q learning is utilized to compensate for the uncertainties and achieve optimal performance. The variation of the learning-based protocol is modeled as an external compensation on the dynamics-based protocol. Therefore, the convergence of the proposed protocol can be guaranteed by employing nonfragile control theory. In the end, a numerical example is given to validate the effectiveness and superiority of the proposed method.


Introduction
In recent years, the coordination control of MASs has attracted considerable attention for its broad applications in many fields [1,2], such as formation control, cooperative attack, and attitude alignment. An MAS consists of a series of agents, which can communicate and interact with each other to realize multiple missions and adapt to complex environments [3,4]. In particular, much attention has been paid to the consensus problem of MASs because of its great potential applications in both economic and military fields. The purpose of consensus is to construct a relationship between the agents so that they reach an agreement on the state/output. In the past decades, fruitful research studies have emerged to contribute to the development of theory and applications. To mention a few, the problem of distributed formation control for MASs is studied in [5], a time-varying formation design for MASs with disturbances is proposed in [6], and the problem of finite-time consensus for switched nonlinear MASs is investigated in [7].
In practical applications, it is well known that the communication topology among the agents may change dramatically over time to adjust to multiple missions and complex environments [8,9]; for example, MASs can realize obstacle avoidance and higher flight efficiency by formation transformation [10,11]. The design flexibility, security, and convergence performance can thereby be improved, which has motivated studies on the switching topologies of MASs [1,12]. Recently, because of the broad potential applications of switching topologies, considerable significant research has been reported by scholars worldwide. The communication topologies among interacting agents change according to the flight conditions and missions, which can be modeled as switched systems. A switched system consists of a series of continuous-time (or discrete-time) subsystems and a switching signal, which determines the switching strategy between subsystems. It provides an efficient approach to deal with fast time-varying conditions. Therefore, the switching of topologies can be viewed as switching between subsystems, and it is essential to study the design of consensus protocols that make the state/output converge to the given value. In [13], the problem of time-varying formation control of MASs is investigated. The communication topologies switch among given connected topologies, and the switching signal depends on a Markovian process.
The Lyapunov function method is utilized to analyze the convergence. In the work of [14], the event-triggered leader-following consensus problem for multiagent systems with external disturbances is addressed under switching topologies. A novel distributed event-triggered protocol is proposed to realize disturbance rejection based on an extended state observer. The average dwell time (ADT) method is utilized to ensure the stability of the event-triggered protocol. In [15], the time-varying practical formation problem is studied for spacecraft, where switching topologies and time delays are taken into consideration. Sufficient conditions are provided to ensure that the error system is convergent, which are derived based on the ADT method. The research studies mentioned above deal with the problem of switching topologies; however, convergence is guaranteed based on the ADT method, in which common parameters are applied to all subsystems, leading to conservativeness. To obtain tighter bounds on dwell time and improve the design flexibility of the algorithm, MDADT has been applied in recent decades. In [16], the MDADT method and the multiple discontinuous Lyapunov function (MDLF) method are combined to analyze the stability of switched systems with unstable modes. Sufficient conditions are established, and the results in the existing literature are covered as a special case. Fast switching and slow switching in the framework of MDADT are applied to unstable modes and stable modes, respectively. In [17], a global adaptive control algorithm for switched systems is proposed based on the MDADT method. The different properties of the subsystems are taken into consideration. Then, the adaptive tracking controller is applied to nonlinear switched systems with external disturbance and unmodeled dynamics, which illustrates the effectiveness and superiority of the MDADT method.
In the work of [18], an event-triggered sliding mode controller is proposed. By employing the MDADT method and an event-triggered strategy, less conservative and more practical results are obtained. Sufficient conditions are given to ensure stochastically exponential stability with the aid of the LMI technique. The literature mentioned above has provided fruitful results on consensus protocol design for MASs under switching topologies. However, stability and convergence are ensured by the traditional ADT method, which cannot account for the different properties of subsystems and thus leads to conservativeness. Therefore, how to obtain less restrictive results is still an open and challenging problem, which has not been fully investigated and has important value and potential applications in practice.
Moreover, in practical environments, there always exist uncertainties and disturbances, which lead to performance degradation and even instability [19,20]. Therefore, it is essential to investigate the robust consensus problem to improve performance in uncertain environments [21][22][23]. In the work of [24], the problem of distributed H∞ containment control for MASs with switching topologies is studied. An observer-based containment control scheme is proposed. The external disturbance and time delay in the environment are taken into consideration, which makes the scheme more applicable than the traditional method. By employing the Lyapunov function method and the LMI technique, the sufficient existence conditions and solutions of the control protocol are given in the form of LMIs. In [25], the problem of time-varying formation of second-order discrete-time MASs under switching topologies and time delay is investigated. Sufficient conditions are given to ensure that the MASs accomplish the time-varying formation mission based on the state transformation method. Time delay and uncertainties are considered. Compared with the existing literature, the proposed method can overcome the undesirable response caused by time delay and improve the transient performance. In the work of [26], the problem of formation control for tail-sitters in flight mode transitions is studied. Nonlinear dynamics and uncertainties are considered, and a robust time-varying formation control protocol is proposed. It is proven that the tracking errors converge to the origin in finite time. The problem of an L2-gain robust protocol for time-varying output formation-containment of MASs is addressed in [27]. A PID-based output-feedback control protocol is provided to ensure that all followers can track a time-varying formation reference, where communication delays and external disturbance are taken into consideration. The asymptotic stability of the MASs is proved by the Lyapunov function method.
However, as is well known, transient performance and robustness cannot be achieved simultaneously. Therefore, a compromise between transient performance and robustness is needed, which remains an open and challenging problem.
In addition, with the development of computing ability, intelligent techniques have attracted considerable attention during the last decades [28][29][30]. They are widely applied in the areas of target recognition, machine vision, robotic systems, and controller design [31,32], and provide an efficient way to improve the autonomy and design flexibility of a system [33]. The most widely used methods are deep learning and reinforcement learning. Deep reinforcement learning, as a combination of the two, inherits the advantages of both, including the characteristics of self-fitting and self-learning. In the work of [34], the automatic completion of multiple peg-in-hole assembly tasks is realized. Because the traditional method requires an accurate contact model and complex analysis, an intelligent control method is formulated by constructing the task as a Markov decision process. The deep deterministic policy gradient (DDPG) algorithm is proposed to accomplish the task, achieving the optimal policy while avoiding risky actions. In [35], a noninteger PID controller is proposed based on the DDPG algorithm. Measurement noises and external disturbances are taken into consideration. A kinematic controller and a dynamic controller are proposed to achieve optimal performance. The DDPG algorithm is given to compensate for the uncertainties and disturbances in the actor-critic framework. A numerical example is given to illustrate the effectiveness of the proposed method. Cheng et al. [36] proposed a real-time controller for the problem of fuel-optimal moon landing. Because the traditional method cannot meet the high requirements of real-time performance and autonomy, a deep reinforcement learning algorithm is proposed for real-time optimal control based on an actor-indirect method architecture. Deep neural networks are applied for initial guesses, and the efficiency of the training data is guaranteed.
The literature mentioned above has provided considerable meaningful results in the area of machine learning. However, to the best of the authors' knowledge, intelligent consensus design for MASs with considerations of stability, robustness, and optimal transient performance has not been fully studied yet. It is essential and important to achieve an optimal compromise between robustness and transient performance.
Based on the statement above, it can be inferred that the improvement of autonomy and design flexibility for such systems needs to be studied; the problem of consensus protocol design for MASs under switching topologies has not been fully investigated yet. Design flexibility can be improved by employing tighter bounds on dwell time, because less conservative results can be obtained, which leaves more room to ensure that the switching logic stays in the subsystems with better performance for long enough. Moreover, it is of great importance to combine the advantages of the traditional method and the intelligent technique, which can ensure convergence, robustness, and transient performance simultaneously. Therefore, the problem of intelligent L2-L∞ consensus design of MASs under switching topologies is investigated. The convergence and robustness are guaranteed by the Lyapunov function method and the MDADT method, which are more applicable. The transient performance is improved by fuzzy deep Q learning, in which a fuzzy reward function is proposed for the complex scheduling process.
The main contributions of this study can be summarized as follows: (1) The L2-L∞ consensus protocol of MASs under switching topologies is designed. The problem of L2-L∞ consensus of MASs is converted into the problem of stability analysis for switched systems, which is more applicable than the traditional method. The MDADT method and the multiple Lyapunov function method are combined to guarantee stability and the prescribed attenuation performance index, which yields tighter bounds on dwell time and less conservative results.
(2) The consensus protocol is composed of a dynamics-based consensus protocol and a learning-based consensus protocol. Compared with the traditional method, the proposed strategy can ensure stability, robustness, and transient performance simultaneously. (3) A fuzzy reward function is utilized to improve the efficiency of the deep reinforcement learning algorithm. The design of the reward function in the traditional method mainly depends on the designer's experience, which leads to complexity. The fuzzy reward function can improve the data efficiency and ensure optimal performance. The rest of the study is organized as follows: the preliminaries and problem statement are provided in Section 2; in Section 3, the main results of the study are given; the numerical example is given in Section 4, which is followed by the conclusion in Section 5.

Preliminaries and Problem Statement
In this study, it is supposed that MASs are composed of a leader labelled as 0 and n followers labelled as 1, 2, . . ., n.
The connection topology among the n followers can be described as a time-varying model with N_f topologies. We define G_{σ(k)} ∈ {G_1, G_2, . . ., G_{N_f}}, each an undirected connected graph. H = {1, 2, . . ., n}, n > 1, represents the set of finite nodes. σ(k): [0, ∞) → R = {1, 2, . . ., N_f} denotes the switching signal, which is a piecewise continuous function of time and takes values in the finite set. A_{σ(k)} = (a^{σ(k)}_{ij})_{n×n} and L_{σ(k)} = (l^{σ(k)}_{ij})_{n×n} are the adjacency matrix of the undirected graph G_{σ(k)} and the Laplacian matrix at time instant k, where a^{σ(k)}_{ij} stands for the element of the adjacency matrix, a^{σ(k)}_{ij} = 1 represents that node i can obtain information from node j, and l^{σ(k)}_{ij} is defined in the following equation.
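As a concrete illustration of the adjacency and Laplacian definitions above, here is a minimal sketch assuming the standard construction l_ii = Σ_j a_ij and l_ij = −a_ij for i ≠ j; the ring topology used is hypothetical, not one of the paper's graphs.

```python
import numpy as np

def laplacian(adj):
    """Graph Laplacian L = D - A for an undirected graph:
    l_ij = -a_ij for i != j, and l_ii = sum_j a_ij."""
    adj = np.asarray(adj, dtype=float)
    degree = np.diag(adj.sum(axis=1))
    return degree - adj

# Ring of 4 followers (hypothetical topology for illustration).
A1 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]])
L1 = laplacian(A1)
# Row sums of a Laplacian are always zero.
assert np.allclose(L1.sum(axis=1), 0)
```

For an undirected graph the adjacency matrix is symmetric, so the resulting Laplacian is symmetric as well, which is what Lemma 1 later relies on.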
Then, for a given node i ∈ H, the neighbors of node i are defined accordingly. Define Θ_{σ(k)} = diag{θ^{σ(k)}_1, . . ., θ^{σ(k)}_n} to indicate the information transmission between the leader and the followers with n nodes, where θ^{σ(k)}_i = 1 stands for that node i ∈ H can obtain information from the leader; otherwise, θ^{σ(k)}_i = 0. Therefore, MASs with a leader and followers can be described as in the following equations:
Computational Intelligence and Neuroscience
where A, B, C, and D are the system matrices with appropriate dimensions, z_i(k) stands for the output of the ith follower, and ω_i(k) ∈ R^m denotes the external disturbance belonging to L_2[0, ∞). It is supposed that agent i can obtain information from its neighbors and the leader. Therefore, we define υ_i(k) as the relative state measurement of the ith agent, which can be described as follows: In this study, the control input of the ith agent to ensure leader-follower consensus is proposed as follows.
where K_{σ(k)} is the control parameter to be determined by robust control theory, and K_{c,σ(k)} is the compensated parameter obtained by deep Q learning. In this study, the gain parameters K_{c,σ(k)} are supposed to vary in a finite set with given bounds. K_{c,σ(k)} can be viewed as an additional perturbation on K_{σ(k)}, which can be described as follows: where M_{σ(k)} ∈ R^{l×l_Δ} and N_{σ(k)} ∈ R^{q_Δ×q} are known matrices with appropriate dimensions, and F_{σ(k)} ∈ R^{l_Δ×q_Δ} are unknown matrices with F^T_{σ(k)} F_{σ(k)} ≤ I. For the ith agent, the state error is defined as e_i(k) = x_i(k) − x_0(k). Then, the closed-loop system can be rewritten accordingly. To facilitate the proof, the following definitions and lemmas are given.
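Before turning to the definitions, the protocol structure just described — a nominal gain K perturbed by a bounded learned term ΔK = MFN, acting on the relative measurement υ_i — can be sketched numerically. The aggregation υ_i = Σ_j a_ij(x_i − x_j) + θ_i(x_i − x_0) and all matrix values below are assumptions for illustration, since the paper's equations are not reproduced in full.

```python
import numpy as np

def relative_measurement(i, x, x0, adj, theta):
    """v_i = sum_j a_ij (x_i - x_j) + theta_i (x_i - x_0).
    Sign convention assumed; the paper's equation is not shown in full."""
    v = theta[i] * (x[i] - x0)
    for j in range(len(x)):
        v = v + adj[i, j] * (x[i] - x[j])
    return v

def control_input(v_i, K, M, F, N):
    """u_i = (K + M F N) v_i, with the learned variation bounded by
    F^T F <= I as required by the nonfragile analysis."""
    assert np.all(np.linalg.eigvalsh(F.T @ F) <= 1.0 + 1e-9)
    return (K + M @ F @ N) @ v_i
```

The assertion makes the norm bound on F explicit; in the paper this bound is what lets the learned compensation be absorbed into a nonfragile robustness argument.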
Definition 1 (see [37]). For a given switching signal σ(k) and k_1 > 0, define N_{σ,s}(0, k_1) as the number of switching instants of mode s over the time interval (0, k_1), and let T_{σ,s}(0, k_1) be the activated time of the undirected graph G_s during (0, k_1). If there exist constant scalars N_0 ≥ 0 and τ_{as} > 0 such that N_{σ,s}(0, k_1) ≤ N_0 + T_{σ,s}(0, k_1)/τ_{as}, then τ_{as} is called the mode-dependent average dwell time and N_0 the mode-dependent chatter bound. In this study, we set N_0 = 0.
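Definition 1 can be checked mechanically for a given switching signal. The sketch below simplifies the bookkeeping by counting each activation interval of a mode as one switching instant; signal encoding and names are assumptions.

```python
def satisfies_mdadt(switch_times, modes, horizon, tau_as, N0=0.0):
    """Check N_s(0,k1) <= N0 + T_s(0,k1)/tau_as for every mode s.

    switch_times: sorted activation instants; modes: active mode on each
    interval; horizon: end of the observation window; tau_as: dict of
    mode-dependent dwell-time bounds.
    """
    intervals = list(zip(switch_times, switch_times[1:] + [horizon], modes))
    ok = True
    for s, tau in tau_as.items():
        N_s = sum(1 for (_, _, m) in intervals if m == s)      # activations
        T_s = sum(b - a for (a, b, m) in intervals if m == s)  # active time
        ok = ok and (N_s <= N0 + T_s / tau)
    return ok
```

With N0 = 0, as the paper assumes, the condition reduces to requiring the average activation length of each mode to be at least its τ_as.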
Definition 2 (see [37]). Consensus is achieved if, under the control protocol in equation (5), all agents asymptotically track the state trajectory of the leader, i.e., lim_{k→∞} ||x_i(k) − x_0(k)|| = 0 for all i ∈ H. Definition 3 (see [38]). For given constant scalars 0 < δ < 1 and γ > 0, the prescribed L_2-L_∞ attenuation performance γ is satisfied if (1) the MASs in equations (2)-(3) are asymptotically stable when ω(k) = 0, and (2) under the zero initial condition, sup_k z^T(k)z(k) ≤ γ^2 Σ_k ω^T(k)ω(k) holds for all nonzero ω(k) ∈ L_2[0, ∞). Lemma 1 (see [35]). The matrices L_{σ(k)} + Θ_{σ(k)} are symmetric and positive definite if and only if the graphs G_{σ(k)} are connected for k ≥ 0. Moreover, there exists a transformation matrix T_{σ(k)} such that the following equation holds.
where λ^{σ(k)}_i, i ∈ H, are the nonzero eigenvalues of the matrices L_{σ(k)} + Θ_{σ(k)}. Lemma 2 (see [39]). For a given constant a > 0 and real matrices Θ, U, V, W, it can be concluded that equation (12) is equivalent to equation (13).
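Lemma 1 can be verified numerically for a sample connected topology; the path graph and leader access pattern below are assumptions for illustration.

```python
import numpy as np

# Path graph of 3 followers; only follower 1 receives the leader's state.
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])
Theta = np.diag([1., 0., 0.])

M = L + Theta
w, V = np.linalg.eigh(M)     # M = V diag(w) V^T since M is symmetric
T = V.T                      # an orthogonal transformation as in Lemma 1
assert np.all(w > 0)                         # positive definite
assert np.allclose(T @ M @ T.T, np.diag(w))  # diagonalized form
```

Disconnecting the graph or removing leader access (Theta = 0) would make the smallest eigenvalue drop to zero, which is exactly the failure mode Lemma 1 rules out.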
Lemma 3 (see [39]). For a given symmetric matrix T and matrices M, N, if there exists a constant scalar ε > 0 such that T + εMM^T + ε^{−1}N^T N < 0, then T + MFN + N^T F^T M^T < 0 holds for any appropriately dimensioned F with F^T F ≤ I.

L 2 -L ∞ Consensus Protocol Design.
In this section, the L 2 -L ∞ consensus protocol is proposed, and the stability and prescribed performance are guaranteed.

Lemma 4.
For given constant scalars 0 < δ < 1 and γ > 0, the system in (7) with the control input in (5) can be transformed into the decoupled form in (16), where the transformation is defined in (17). Proof. Substituting equation (17) into (7), one can obtain equation (16). It can be inferred that the transformation matrix T_{σ(k)} is unique; therefore, we have the following equations.
It is obvious that the problem of robust consensus protocol design can be converted to the controller design of (16). □ Remark 1.
The system in equation (16) consists of the n independent subsystems in equation (20). Therefore, the stability of equation (7) is equivalent to the stability of the n subsystems in equation (20), and ensuring the prescribed attenuation performance of (7) can be converted to guaranteeing the attenuation performance of (16).
In Theorem 1, the sufficient conditions to guarantee stability and the prescribed attenuation performance index are presented.

Theorem 1. For given constant scalars μ_s > 1, 0 < δ_s < 1, γ > 0, and class K functions κ_1, κ_2, the switched systems in equation (20) with MDADT satisfying equation (25) are globally uniformly asymptotically stable with prescribed L_2-L_∞ attenuation performance γ, provided that the conditions in (21)-(24) hold. Proof. The entire proof can be divided into two steps.
Together with (22), we can conclude the following. Based on equations (26)-(27), the next equation can be obtained by iteration.
Combining with Definition 1, we have the following. Then, we can obtain (29) based on (21).

Together with equations (22)-(23), one has
Then, one can obtain the following equation by iteration.
Proof. The Lyapunov function V_{i,s}(e_i(k)), i ∈ H, is defined as follows: According to (20) and (39), we can conclude that (38) is equivalent to (24). Along the trajectory of V_{i,s}(e_i(k)), one has the following. Together with equations (40)-(41), we have the result. According to Theorem 1, we can conclude that the system in (20) with MDADT satisfying (25) is globally uniformly asymptotically stable with prescribed L_2-L_∞ attenuation performance γ.
Based on Theorem 1 and Corollary 1, the solutions of the consensus protocol are given in Theorem 2. □ Theorem 2. For given constant scalars μ_s > 1, 0 < δ_s < 1, γ > 0, a_s > 0, and ε_s > 0, if there exist positive-definite matrices P_{i,s} ∈ R^{p×p} and matrices X_s ∈ R^{l×l}, Y_s ∈ R^{l×q}, then the MASs in (2)-(3) with the control input in equation (5) are asymptotically stable with prescribed L_2-L_∞ attenuation performance γ such that equation (43) holds. The parameters of the control protocol can be derived from (43), where Proof. According to the Schur complement, it is obvious that equation (43) is equivalent to equation (45).
Together with Lemma 2, we have the following. Moreover, based on Lemma 3, one has the next inequality. According to the Schur complement, it is obvious that (46) is equivalent to (37), which completes the proof.
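The LMIs of Theorem 2 would normally be handed to an SDP solver. Independently of that machinery, once a gain has been fixed, the resulting closed-loop matrix can be sanity-checked with a discrete-time Lyapunov certificate. A minimal sketch with an illustrative matrix (not the paper's system):

```python
import numpy as np

def dlyap(A, Q):
    """Solve A^T P A - P = -Q by vectorization (fine for small systems)."""
    n = A.shape[0]
    lhs = np.eye(n * n) - np.kron(A.T, A.T)
    return np.linalg.solve(lhs, Q.flatten()).reshape(n, n)

# Illustrative Schur-stable closed loop (spectral radius < 1).
A_cl = np.array([[0.5, 0.1],
                 [0.0, 0.6]])
P = dlyap(A_cl, np.eye(2))
assert np.all(np.linalg.eigvalsh(P) > 0)               # P > 0: certificate
assert np.allclose(A_cl.T @ P @ A_cl - P, -np.eye(2))  # residual check
```

Existence of a positive-definite P here certifies plain asymptotic stability only; the full L_2-L_∞ level γ of Theorem 2 additionally needs the output and disturbance blocks of the LMI, which an SDP solver would handle.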

Compensated Consensus Protocol Design Based on FDQL
In this section, the learning-based consensus protocol is proposed based on deep reinforcement learning, where fuzzy deep Q learning is utilized. The stability and prescribed attenuation performance are guaranteed by robust control theory, and the learning-based control protocol is introduced to improve the transient performance and realize the optimal control policy. The output of the learning-based control protocol can be viewed as an additional variation of the robust consensus protocol. The online scheduling of the control protocol is established as a Markov decision process. Therefore, the advantages of robust control theory and deep reinforcement learning are combined.
It is well known that reinforcement learning is composed of state, action, agent, and environment. The state of the kth step is defined as s_k, and the chosen action is supposed to be a_k; then, the reward r_k and the state s_{k+1} are generated based on the interaction with the environment. Therefore, the optimal control policy can be obtained by maximizing the cumulative reward.
To improve the convergence of the consensus protocol, the state is defined as s_k = [e_i(k), z_i(k)] and the action is defined as a_k = [K_{c,σ(k)}].
In deep Q learning, a deep neural network is utilized to approximate the action-value function Q*(s_k, a_k), which can be described as follows, where f(s_k, a_k, ω) denotes the function realized by the deep neural network. The action is chosen based on the maximum Q value. There exist two neural networks in the deep Q learning algorithm, with identical structures, called the critic network and the target network. The parameters of the critic network are updated based on temporal-difference learning. The output of the critic network is defined as Q(s_k, a_k, ω) and the output of the target network is defined as Q(s_k, a_k, ω^−). Therefore, the parameters of the critic network are updated according to the following equation, where L_r is the learning rate, c_s denotes the discount factor, R represents the reward of the state transition from s_k to s′ through action a_k, and max_{a′} Q(s′, a′, ω^−) stands for the maximum Q value of the target network. It can be inferred that the reward function has an important influence on the final performance. The design of the reward in traditional deep Q learning mainly depends on the experience of designers, which cannot achieve optimal performance and increases the computational complexity. In this study, a fuzzy system is applied to design the reward function. The input of the fuzzy reward function is divided into five categories, described as VB, B, N, G, and VG, which represent very bad, bad, normal, good, and very good. In this study, it is supposed that there are four followers. Therefore, the inputs of the fuzzy reward system are set to be |e_1|, |e_2|, |e_3|, and |e_4|. It can be inferred that each fuzzy set includes 25 rules, and the total number of the fuzzy rules is 75.
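The temporal-difference update just described can be written compactly. The sketch below uses a tabular Q as a stand-in for the critic network; names and shapes are illustrative only.

```python
import numpy as np

def td_target(reward, q_next_target, gamma):
    """TD target R + gamma * max_a' Q(s', a', w^-), per the update rule."""
    return reward + gamma * np.max(q_next_target)

def q_update(q, a, target, lr):
    """One gradient step on the squared TD error for a tabular Q;
    a stand-in for the deep-network update described in the text."""
    q = q.copy()
    q[a] += lr * (target - q[a])
    return q
```

In the deep case the same target drives a gradient step on the critic weights ω, while the target network weights ω^− are held fixed between periodic synchronizations.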
The output of the fuzzy reward function is limited to the interval [−1, 0), and the defuzzifier of the fuzzy reward function is defined as follows. Based on the statement above, the learning-based consensus protocol design algorithm can be summarized as follows. The FDQL algorithm proposed in this study can improve the transient convergence performance of MASs. The output of the deep Q network is taken as the variation of the parameters of the consensus protocol. As is well known, the design of the reward function in the traditional method depends on the experience of the designers. To overcome this problem, the fuzzy reward function is developed to improve the learning efficiency in this study.
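A center-average defuzzifier of the kind described can be sketched as follows. The membership functions, their centers, and the output singletons are assumptions, since the paper does not reproduce them; only the five linguistic categories and the [−1, 0) output range come from the text.

```python
# Triangular membership functions over a normalized |e| in [0, 1];
# centers and output singletons below are assumed, not from the paper.
CENTERS = {"VG": 0.0, "G": 0.25, "N": 0.5, "B": 0.75, "VB": 1.0}
OUTPUT = {"VG": -0.0, "G": -0.25, "N": -0.5, "B": -0.75, "VB": -1.0}

def tri(x, c, w=0.25):
    """Triangular membership centered at c with half-width w."""
    return max(0.0, 1.0 - abs(x - c) / w)

def fuzzy_reward(err):
    """Center-average defuzzifier mapping |e| to a reward in [-1, 0]."""
    x = min(abs(err), 1.0)
    mu = {k: tri(x, c) for k, c in CENTERS.items()}
    s = sum(mu.values())
    return sum(mu[k] * OUTPUT[k] for k in mu) / s if s else -1.0
```

Small errors fire the "very good" set and give a reward near zero, large errors fire "very bad" and give −1, and the defuzzifier interpolates smoothly in between, which is the data-efficiency property the text attributes to the fuzzy reward.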

Numerical Example
In this section, an example is provided to illustrate the effectiveness of the method. The model of the MASs is constructed as follows. The external disturbance is specified below, and the switching topologies are shown in Figure 1. Then, we can obtain the Laplacian matrices as follows. The parameters of the switching topologies are given as follows. Therefore, we can obtain the MDADT according to (25).
It is well known that the ADT method can be viewed as a special case of the MDADT method. Therefore, it can be inferred that τ_a = max_s τ_{as} = 0.4266. It is obvious that tighter bounds on dwell time and less conservative results can be obtained. Then, we set the attenuation performance index γ = 0.9, and we can obtain the parameters of the consensus protocol based on Theorem 2. The switching logic is shown in Figure 2. In order to illustrate the effectiveness and superiority of the proposed method, the traditional ADT method and the MDADT method are compared. From the statement above, we have seen that MDADT can obtain tighter bounds and less conservative results. Moreover, the comparisons of the state responses under the ADT method and the MDADT method are shown in Figures 3-6: the state responses of the MASs with ADT switching topologies are shown in Figures 3 and 4, and those with MDADT switching topologies in Figures 5 and 6. We can see that the transient performance of the MDADT method is better than that of the ADT method because the different characteristics of the subsystems are taken into consideration, which improves the design flexibility and makes the method more applicable to practical conditions.
To further validate the superiority of the proposed method, its state responses are shown in Figures 7-11. The state responses of the proposed method are shown in Figures 7 and 8. We can conclude that the transient performance can be improved with the aid of fuzzy deep Q learning: the advantages of the traditional method and the intelligent method are combined. Compared with the traditional method, the transient performance is improved; compared with the intelligent method, stability and training efficiency are guaranteed. The attenuation performance index is shown in Figure 9, from which we can see that the robustness of the proposed method is ensured. The episode reward response is shown in Figure 10, and we can see that the reward of the fuzzy deep Q learning algorithm converges to a neighborhood of the origin, which demonstrates the effectiveness of the algorithm in this study. In addition, the response of the action is shown in Figure 11, from which we can see that the learning-based consensus protocol compensates for the additional input caused by the uncertainties.
Based on the statement above, we can conclude that the convergence, robustness, and prescribed attenuation performance index are guaranteed. Less conservative results and tighter bounds on dwell time can be obtained by the MDADT method. The transient performance of the system can be improved by the fuzzy deep Q learning algorithm. It is worth mentioning that the traditional robust method cannot achieve a compromise between robustness and transient performance, and the intelligent method alone cannot always guarantee convergence. By employing the proposed method, convergence, robustness, and transient performance are guaranteed simultaneously.
(1) Design the dynamics-based consensus protocol according to Theorem 2
(2) Define the bounds of the learning-based consensus protocol parameters
(3) Initialize the weights of the Q value network
(4) Initialize the weights of the target Q value network
(5) Initialize the replay buffer R, episode = 0
(6) for episode = 1 to M do
(7) Initialize a random state s_1 and receive the initial observation
(8) for t = 1 to K do
(9) Select an action based on the state and reward function
(10) Execute the action a_k; then, one can obtain the reward r_k and the state s_{k+1}
(11) Store the pair (s_k, a_k, r_k, s_{k+1}) in the replay buffer
(12) Sample a random minibatch of transitions (s_m, a_m, r_m, s_{m+1}) from the replay buffer
(13) Update the target Q value function
(14) Update the weights of the target Q value network
(15) end for
(16) end for

Figure 4: The state response of x_2 under the ADT method.
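The listing above can be skeletonized in code. In the sketch below, the environment and network objects are placeholders standing in for the MAS simulation and the Q networks; all names and the interface are assumptions, not the paper's implementation.

```python
import random
import collections

# Minimal deep-Q training skeleton following the listed steps;
# env/network details are placeholders for the MAS closed loop.
Transition = collections.namedtuple("Transition", "s a r s_next")

def train(env, q_net, target_net, episodes=10, steps=50,
          gamma=0.9, buffer_size=1000, batch=8):
    buffer = collections.deque(maxlen=buffer_size)   # replay buffer R
    for _ in range(episodes):
        s = env.reset()                              # random initial state
        for _ in range(steps):
            a = q_net.best_action(s)                 # greedy w.r.t. Q
            r, s_next = env.step(a)                  # fuzzy reward r_k
            buffer.append(Transition(s, a, r, s_next))
            if len(buffer) >= batch:
                for t in random.sample(list(buffer), batch):
                    y = t.r + gamma * target_net.max_q(t.s_next)
                    q_net.update(t.s, t.a, y)        # TD step on critic
            s = s_next
        target_net.copy_from(q_net)                  # sync target network
```

A production version would add epsilon-greedy exploration and a synchronization period longer than one episode; both are omitted here to keep the correspondence with the listed steps visible.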
Figure 7: The state response of x_1 with the proposed method.
Figure 11: The response of the action.

Conclusions
The problem of intelligent L2-L∞ consensus design for MASs under switching topologies is investigated in this study. The switching topologies of the MASs are modeled within switched system theory by employing a linear transformation. Then, the problem of consensus protocol design is converted to an L2-L∞ control problem. To ensure convergence, robustness, and transient performance simultaneously, the proposed consensus protocol is composed of a dynamics-based consensus protocol and a learning-based consensus protocol, which provide the baseline and the compensation of uncertainties, respectively. The baseline consensus protocol is obtained from the dynamics-based design, which is derived based on the MDADT method and the MLF method. The scheduling interval of the learning-based protocol is given by nonfragile control theory. Then, the learning-based consensus protocol is proposed based on the fuzzy deep Q learning algorithm to improve the transient performance and achieve the optimal policy, where the fuzzy reward function is introduced to improve the learning efficiency.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.