A Stable Distributed Neural Controller for Physically Coupled Networked Discrete-Time System via Online Reinforcement Learning

The large scale, time variation, and diversity of physically coupled networked infrastructures, such as power grids and transportation systems, complicate the design, implementation, and expansion of their controllers. To tackle these challenges, we propose an online distributed reinforcement learning control algorithm in which each subsystem (also called an agent) uses one-layer neural networks to adapt to variations of the networked infrastructure. Each controller includes a critic network and an action network, which approximate the strategy utility function and the desired control law, respectively. To avoid a large number of trials and to improve stability, the training of the action network incorporates a supervised learning mechanism into the reduction of the long-term cost. The stability of the control system with the learning algorithm is analyzed, and upper bounds on the tracking error and the neural network weights are estimated. The effectiveness of the proposed controller is illustrated in simulation; the results also indicate stability under communication delay and disturbances.


Introduction
The increasing interconnection of physical systems through cybernetworks or physical networks has been observed in many infrastructures, such as power grids [1,2], transportation networks, and unmanned systems. One critical issue of these so-called cyberphysical systems is their complexity when they grow very large, especially with respect to the control problem. Consequently, distributed schemes have been suggested for reducing the communication and computational cost compared with centralized control schemes [3]. However, the coupling of subsystems and the nonstatic environment in both the cybernetworks and the physical networks bring many challenges, such as physical interference among subsystems, time-varying plant parameters, communication delay, and expansibility of the cyberphysical system.
To increase the expansibility of a cyberphysical system, the multiagent concept is usually introduced. The cyberphysical system is divided into many agents, and each agent has its own control policy and a unified framework for pursuing its target [4]. Expanding the cyberphysical system then reduces to duplicating agents without redesigning the control policy. To deal with the physical coupling of the networked system, one common approach is to decouple the subsystems in the control design [5][6][7][8]. Each subsystem may utilize state information of neighboring subsystems to mitigate their physical interference, or the designer may treat the physical interference as a random disturbance [9,10]. On the other hand, to address a nonstatic environment with time-varying plants, online supervised learning, adaptive control, and reinforcement learning algorithms have been suggested. They all adjust their control parameters adaptively online, and the combination of a neural network with reinforcement learning usually leads to better control performance than conventional supervised learning and adaptive control schemes [11]. Reinforcement learning constructs a long-run cost-to-go function to predict the consequent cost, so that each control action takes the estimated future result into account [12]. By contrast, the adaptive ability of adaptive control is limited by the number of time-varying parameters, and the number of time-varying parameters of a plant model may be very large in practice.
Recently, much research has focused on reinforcement learning with neural networks. This research can be classified into two categories. The first category simply utilizes a neural network to approximate the unknown part of the system model or control strategy, such as the cost-to-go function and the optimal control law. Prokhorov and Wunsch discussed three families of reinforcement learning control design [13], namely heuristic dynamic programming (HDP), dual heuristic programming (DHP), and globalized dual heuristic programming (GDHP), and their application in optimal control. Xu et al. focused on experimental studies of real-time online learning control for nonlinear systems using kernel-based ADP methods [14]. Lee et al. focused on a class of reinforcement learning (RL) algorithms, named integral RL (I-RL), that solve continuous-time (CT) nonlinear optimal control problems with input-affine system dynamics [15]. The second category combines the approach of the first category with a supervised learning algorithm to guarantee convergence of the learning system; supervised reinforcement learning also avoids a large number of trials by employing an error signal built from domain knowledge [16][17][18], which generates immediate feedback for correcting the control actions. Xu et al. investigated a novel adaptive-critic-based neural network (NN) controller for nonlinear pure-feedback systems [19]. Liu et al. were concerned with a reinforcement learning-based adaptive tracking control technique that tolerates faults for a class of unknown multiple-input multiple-output nonlinear discrete-time systems with fewer learning parameters [20]. Besides these, researchers have tried to employ multilayer or deep neural networks for approximating the functions in control, so that the precision of the model is enhanced and the performance is improved as a consequence [21,22]. However, it is hard to analyze the stability of such learning algorithms. Moreover, learning may be slow because the number of tuned parameters in a deep neural network is very large [23].
In this paper, we propose a distributed neural controller for physically coupled networked discrete-time systems via online reinforcement learning. We model each subsystem as an agent; each agent can obtain its own state and the state information of some physically neighboring subsystems to determine the optimal control action. A one-layer adaptive critic neural network and a one-layer action neural network are proposed for modeling the cost function and the optimal action law. With a deterministic learning algorithm, we incorporate supervised learning into the reinforcement learning algorithm to accelerate convergence. The stability of the learning algorithm is analyzed, and the bound of each parameter is estimated. The contribution of this paper is twofold.
(1) We propose a distributed online reinforcement learning algorithm for controlling physically coupled networked discrete-time systems.
(2) Sufficient conditions for guaranteeing the stability of the learning algorithm and of the system are derived, and the upper bounds of the parameters are estimated.
The rest of the paper is organized as follows. Section 2 models the physically coupled networked system and the control system as mathematical dynamic equations and makes some assumptions to simplify the analysis. Section 3 describes the control system design via the online reinforcement learning algorithm. Section 4 discusses the stability analysis in detail. Section 5 presents simulation results that illustrate the effectiveness and advantages of our algorithm. Section 6 concludes the paper.

Figure 1: A physical-coupling networked system structure (legend: physical connection, cyberconnection).

Physically Coupled Networked Control System and Problem Statement
In a physically coupled networked system, each subsystem may physically interfere with its neighboring subsystems and change their state trajectories or dynamics. The structure is shown in Figure 1. To improve the control system performance, cyberconnections over a communication infrastructure are installed for exchanging the states of neighboring subsystems [3]. The topologies of the cyberconnections and the physical connections may differ because of practical constraints on cyber resources.
max is a positive real number, which means that the magnitude of the disturbances is bounded. Assumption 2 is made to simplify the analysis of the action network, which is discussed in the next section.
The control objective is to track the state target vector, which gives the error equation (2) between each state element and its target. Therefore, the subsystem dynamics can be written in error form as (3).

Distributed Control System and Control Objective.
A distributed control system is more flexible and scalable than centralized control. Moreover, it divides a large system controller into many small subsystem controllers, which reduces the state dimension handled by each controller, so that considerable computational resources and time can be saved [24]. The control objective is to decrease the error vector as fast as possible and to keep it bounded in a small region for a given bounded disturbance. For a subsystem controller, an exponential damping rate of the error is usually expected, governed by a gain Γ_i with ‖Γ_i‖ < 1. Therefore, the desired control input of subsystem i can be written in the form (5), which involves the cyberconnected neighbor set of subsystem i; that is, the controller of subsystem i utilizes the state information received from neighboring subsystems via the communication network. However, the system functions on which this control law depends are unknown. A reinforcement learning scheme with neural networks is therefore proposed for approximating the desired control strategy and the strategy utility function that describes the long-term cost.
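The damping requirement can be illustrated with a short sketch. It assumes the desired closed-loop error follows e(k+1) = Γ e(k) with a diagonal Γ whose entries have magnitude below 1; this form, the function name `damped_errors`, and the numeric values are illustrative assumptions, not the paper's own equation.

```python
def damped_errors(error, gamma_diag, steps):
    """Iterate the assumed target law e(k+1) = Gamma e(k) with a diagonal Gamma."""
    history = [list(error)]
    for _ in range(steps):
        # Each error element is scaled by its own damping gain.
        error = [g * e for g, e in zip(gamma_diag, error)]
        history.append(error)
    return history


# Hypothetical two-state error with damping gains 0.5 and 0.8 (both below 1).
trajectory = damped_errors([1.0, -0.5], [0.5, 0.8], steps=3)
```

With gains below 1 in magnitude, every error element shrinks geometrically, which is the behavior the desired control input is meant to enforce.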

Control System Design by Reinforcement Learning and Neural Network
The proposed distributed control scheme with reinforcement learning consists of three parts: the first part introduces a strategy utility function (also called the long-term cost function); the second part describes the critic neural network and its online training algorithm; the last part elaborates the action neural network and its parameter updating algorithm.

Strategy Utility Function.
The utility function defined for subsystem i is based on the current filtered state error. It is formulated elementwise: for each element l of the state error vector of subsystem i, the utility takes the value 0 or 1 according to whether the lth error element lies within a given constant positive scalar threshold. The utility is also an indicator of the current tracking performance. A value of 1 means that the control system is in a bad state and the state deviates considerably from the desired value. A value of 0 indicates good tracking performance, with the lth state error inside a small bounded region.
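A minimal sketch of such a binary utility is given here. It assumes the comparison is made on the absolute value of each error element and that an error exactly on the threshold counts as good tracking; both choices, and the name `utility`, are assumptions for illustration.

```python
def utility(error, thresholds):
    """Return one flag per state-error element: 0 inside the threshold, 1 outside."""
    # Assumed rule: |e_l| <= c_l means good tracking for element l.
    return [0 if abs(e) <= c else 1 for e, c in zip(error, thresholds)]


# Hypothetical error vector and thresholds for one subsystem.
flags = utility([0.05, -0.3], [0.1, 0.1])
```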
The long-term cost is the sum of the utility function over the sampling times. Based on the utility function, the strategy utility function is defined with a weighting factor between 0 and 1, where N is the stage number. If N is infinite or very large, the strategy utility function is defined over a rolling horizon with a fixed number of stages. The control objective is therefore to minimize the strategy utility function, which improves the control performance.
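The rolling-horizon idea can be sketched as a weighted sum of future utilities for one error element. The exponent ordering used here (weight alpha**j on the utility j steps ahead) is an assumption, since the exact weighting is not recoverable from the text; `strategy_utility` and `alpha` are hypothetical names.

```python
def strategy_utility(future_utilities, alpha):
    """Weighted sum over a fixed horizon: sum of alpha**j * p(k+j) for j = 1..N."""
    # The horizon length N is the number of utilities supplied.
    return sum(alpha ** j * p for j, p in enumerate(future_utilities, start=1))


# Hypothetical horizon of three stages with utilities 1, 0, 1 and alpha = 0.5.
cost = strategy_utility([1, 0, 1], 0.5)
```

In online operation the future utilities are not available, which is why the scheme approximates this quantity with a critic network instead of evaluating the sum directly.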

Critic Network Design.
In the proposed scheme, a one-layer neural network is used to approximate the strategy utility function. To simplify the stability analysis, only the output-layer weights of the network are adjustable in online training. The basis function of the network is a Gaussian vector function (9) whose input is the state delayed by the communication latency. The Gaussian center vectors should cover the operating region of the system state as much as possible, and each Gaussian function has a given width. The approximation error would be very small if the dimension of the basis function is large enough [11]. The relation between the kth and (k+1)th optimal control actions is expressed in terms of the control action of subsystem i. The strategy utility function is estimated from the weight estimate Ŵ and the basis function, and the prediction error of the approximated strategy utility function Ĵ is defined for the critic NN. The objective function (13) of the critic NN, to be minimized at the kth sampling instant, is defined from this prediction error. One common way to decrease the objective function is to update the critic NN parameters along the negative gradient direction. Applying the chain rule gives the partial derivative of the objective function (13) with respect to Ŵ(k), which leads to the updating law (15) for the critic NN of subsystem i, with a given scalar updating step size. The choice of the step size is very important: if it is too large, the online learning may diverge.
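The critic design can be sketched for a scalar critic output as follows. The Gaussian width convention, the temporal-difference form of the prediction error, and all names (`gaussian_basis`, `critic_step`, `alpha`, `step`) are assumptions borrowed from common adaptive-critic designs; this is not a reproduction of the paper's equations (9) to (15).

```python
import math


def gaussian_basis(x, centers, width):
    """Gaussian features exp(-||x - c_h||^2 / width^2), one per center c_h."""
    return [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / width ** 2)
        for c in centers
    ]


def critic_step(weights, phi_now, j_prev, p_now, alpha, step):
    """One gradient-descent step on E = 0.5 * e_c**2 for a scalar critic output."""
    j_now = sum(w * f for w, f in zip(weights, phi_now))
    # Assumed prediction error: alpha * J(k) - (J(k-1) - p(k)).
    e_c = alpha * j_now - (j_prev - p_now)
    # dE/dW = e_c * alpha * phi, so move against that gradient.
    new_weights = [w - step * alpha * e_c * f for w, f in zip(weights, phi_now)]
    return new_weights, e_c


# Hypothetical one-dimensional state with two Gaussian centers.
phi = gaussian_basis([0.0], [[0.0], [1.0]], width=1.0)
w1, e1 = critic_step([0.0, 0.0], phi, j_prev=0.0, p_now=1.0, alpha=0.5, step=0.1)
w2, e2 = critic_step(w1, phi, j_prev=0.0, p_now=1.0, alpha=0.5, step=0.1)
```

Repeating the step on the same sample reduces the magnitude of the prediction error when the step size is small, whereas a large step size can overshoot, in line with the remark on divergence.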

Action Neural Network Design.
Our control objective is to minimize the tracking error and also to minimize the long-term cost function (strategy utility function). Both depend on the control action at each step. The desired control action (5) is an expected strategy for approaching this objective, and an action neural network is suggested for approximating it. The desired control action can be expressed as the product of an optimal weighting matrix and a basis function plus a residual, where the optimal weighting matrix for the neural output minimizes the residual and the basis function has the same form as (9). The residual would be very small if the dimension of the basis function is very large. However, the optimal weighting matrix and the residual are unknown, so the desired control action is estimated by replacing the optimal weighting matrix with its estimate Ŵ. This gives the estimation error ũ of the desired control action, where W̃ = Ŵ − W denotes the weight estimation error.
Substituting this error, expressed as the product of W̃ and the delayed basis function, into the error dynamics (3) yields closed-loop dynamics that involve two neighbor sets of subsystem i: the subsystems connected to it physically and those connected to it through the cybernetwork. In our proposed scheme, supervised learning is incorporated into the action neural network training to accelerate the convergence of online updating. The objective of the policy is not only to minimize the long-term cost but also to approximate the desired control output through supervised learning. Thus, the error vector of the action network combines both terms, where the desired utility function value for subsystem i can be set to 0 [20] and the square root that appears in the definition is the principal square root.
The cost function (21) is defined for each step. The partial derivative of (21) with respect to the action network weight matrix is then obtained by the chain rule.
Therefore, following the gradient descent principle, the action NN weight matrix is updated with a given updating step size for online learning of the action neural network. The choice of this step size, which is associated with the stability of the online learning algorithm, is discussed in the next section.
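One possible implementable form of such an update, for a scalar control input, is sketched here. It assumes that the supervised term is measured as the deviation from the desired damping law, e(k+1) − Γ e(k), that the control gain is 1 (absorbed into the step size), and that the critic output is added to this term; these are assumptions in the style of related adaptive-critic controllers, and the names are hypothetical.

```python
def action_step(weights, phi, e_next, e_now, gamma, j_hat, j_desired, step):
    """One gradient-style update of the action weights for a scalar control input."""
    # Supervised part: how far the error left the desired law e(k+1) = gamma * e(k).
    supervised = e_next - gamma * e_now
    # Reinforcement part: gap between the critic output and its desired value.
    cost_term = j_hat - j_desired
    e_a = supervised + cost_term
    # Assumed unit control gain, so the gradient of 0.5 * e_a**2 is e_a * phi.
    new_weights = [w - step * e_a * f for w, f in zip(weights, phi)]
    return new_weights, e_a


# Hypothetical values: two basis functions, desired utility set to 0.
w_new, e_a = action_step(
    [0.0, 0.0], [1.0, 0.5],
    e_next=0.4, e_now=0.5, gamma=0.5,
    j_hat=0.2, j_desired=0.0, step=0.1,
)
```

Setting `j_hat` equal to `j_desired` reduces the step to a purely supervised correction, which shows how the two objectives share one update.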

Stability Analysis
This section discusses the stability of the online learning algorithm and the tracking performance, which is necessary for control design. The upper bounds of the error and of the neural network weight parameters are analyzed. First, a theorem about the stability of this scheme is proposed.
Proof of Theorem 3. For the dynamic system described by (3), (15), and (23), we first define a Lyapunov function consisting of quadratic terms in the tracking error, the action network weight error, and the critic neural network error.

Simulation Results
This simulation illustrates the effectiveness and advantages of the proposed control scheme in four aspects: (1) its effectiveness for a physically coupled networked control system tracking a sine wave signal with disturbances; (2) its effectiveness under communication delay; (3) its advantages compared with conventional reinforcement learning; (4) its effectiveness for a system with multiple control inputs. The first simulation considers a networked system called system I, shown in Figure 2. System I consists of four subsystems, each physically coupled with other subsystems. Each subsystem is a nonlinear system; the details of its functions and variables are listed in Table 1. Figure 2 illustrates both the physical connections and the cyberconnections of system I. The communication network can send state information from subsystem 1 to 2, 2 to 3, 3 to 4, and 4 to 1. The parameters of the proposed controller are listed in Table 2.
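The communication topology described for system I can be written down as a directed edge list. The sketch below only encodes the four links named in the text; the names `CYBER_LINKS` and `cyber_neighbors` are illustrative.

```python
# Directed cyber links (sender, receiver) of system I as stated in the text.
CYBER_LINKS = [(1, 2), (2, 3), (3, 4), (4, 1)]


def cyber_neighbors(links, node):
    """Return the subsystems whose state information `node` receives."""
    return sorted(src for src, dst in links if dst == node)


received_by_1 = cyber_neighbors(CYBER_LINKS, 1)
```

Each subsystem in this ring receives the state of exactly one other subsystem, so every controller works with its own state plus one neighbor state.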
The simulation results are shown in Figures 3 and 4. Figure 3 shows that all of the subsystems converge to the target state with small errors. The curves converge to the target curves at about the 125th control action, which means that the online learning algorithm successfully obtained the desired action network and critic network. Figure 4 shows that the fluctuation of the control output decreases over time during the online learning process. These results illustrate the effectiveness of our proposed control scheme.

Entries of Table 1 (system I):
Interference function on subsystem 2 from subsystem 1: 0.1 x_{1,2}
Interference function on subsystem 3 from subsystem 2: 0.01 x_{2,1} + 0.05 x_{2,2}
Interference function on subsystem 4 from subsystem 3: 0.03 x_{3,1} + 0.1 x_{3,2}
Interference function on subsystem 1 from subsystem 4: 0.1 x_{4,1} + 0.002 x_{4,2}
Disturbance on subsystem 1: Gaussian noise with magnitude of 0.01
Disturbance on subsystem 2: Gaussian noise with magnitude of 0.01
Disturbance on subsystem 3: Gaussian noise with magnitude of 0.01
Disturbance on subsystem 4: Gaussian noise with magnitude of 0.01
To demonstrate the advantage of the proposed control scheme, we compare it with conventional reinforcement learning without the supervised learning scheme, in which the updating of the action network depends solely on backpropagation through the critic network with the objective of minimizing the critic output [12]. The result is shown in Figure 5. It clearly indicates the divergence of the learning algorithm caused by the fast-changing target signal. Conventional reinforcement learning may also need off-line learning in advance. The results illustrate that the proposed control scheme is more stable and has a more powerful online learning ability than the conventional method.
In practice, the controller usually encounters action delay or communication delay, which is also included in our model. To illustrate the effectiveness of the proposed control scheme under communication delay, we chose three communication delay values, 3, 5, and 10, for the simulation. The simulation results are shown in Figures 6-8. These results show that the proposed control scheme is stable under communication delay. However, the static error increases with the communication delay: the error is smallest for a delay of 3 and largest for a delay of 10.
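A communication delay of a fixed number of sampling steps can be emulated in simulation with a first-in, first-out buffer. The sketch below is a generic delay line under that assumption; the class name `DelayLine` and the choice of an initial fill value are hypothetical and not taken from the paper.

```python
from collections import deque


class DelayLine:
    """Deliver each pushed value `delay` steps later; earlier outputs are `initial`."""

    def __init__(self, delay, initial):
        # Pre-fill the buffer so the first `delay` outputs are the initial value.
        self.buffer = deque([initial] * delay)

    def push(self, value):
        self.buffer.append(value)
        return self.buffer.popleft()


# Hypothetical neighbor state sequence sent over a link with a delay of 3 steps.
link = DelayLine(delay=3, initial=0.0)
received = [link.push(state) for state in [1.0, 2.0, 3.0, 4.0, 5.0]]
```

A controller that feeds `received` instead of the current neighbor state into its basis functions sees exactly the delayed information discussed above.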
To further demonstrate the effectiveness of the proposed scheme with multiple control inputs, we choose another system, called system II, for simulation (39). Other model parameters are listed in Table 3. The controller parameters are set as shown in Table 4. The simulation results are shown in Figure 9. They show that all the subsystem states converge to the target signals within a small number of time steps (about 120). The tracking errors are small, and each state variable converges to its corresponding target signal. This illustrates the effectiveness of the proposed control scheme for systems with multiple control inputs and a larger dimension than in the previous simulation.

Conclusion
This paper proposes online reinforcement learning with one-layer neural networks for controlling physically coupled networked systems. It is a distributed learning control scheme. The networked system is divided into many subsystems; each subsystem is an individual agent with its own controller and reinforcement learning algorithm. The reinforcement learning algorithm consists of the learning of a critic network and an action network. The critic network approximates the strategy utility function, and the action network approximates the defined desired optimal controller. The action network weight update decreases the long-term cost with a supervised learning mechanism by combining the desired control error with the long-term cost function. The effectiveness of the proposed controller is illustrated in the simulation part. The simulation results also indicate that the proposed control scheme improves the tracking performance compared with conventional reinforcement learning.

Entries of Table 3 (system II):
Disturbance on subsystem 2: Gaussian noise with magnitude of 0.05
Disturbance on subsystem 3: Gaussian noise with magnitude of 0.05
Disturbance on subsystem 4: Gaussian noise with magnitude of 0.05

Figure 9: The state curves of system II with the proposed control scheme.

Figure 3: The state curves of system I with the proposed control scheme.

Figure 4: Control outputs of subsystem neural controllers with the proposed control scheme.

Figure 5: The state curve of subsystem with conventional reinforcement learning.

Figure 8: The state curves of subsystems with a control delay of 10 under the proposed control scheme.
There exist upper bounds, as k → +∞, for the squared norms of the tracking error, the action network weight error term, and the strategy utility estimation error. The stability of this system depends on the control parameters, such as Γ_i, on the system functions, and on the communication network, which affects one of the parameters in the bounds. If a subsystem can obtain all state information from its physically connected neighbors, this parameter becomes smaller, which improves the system performance: the absolute values of the coefficients indexed 1 and 2 become larger, and the upper bounds decrease. Moreover, the signs of the coefficients indexed 5 and 6 are not necessarily definite, as they are not coefficients of the estimated variables in the Lyapunov function variation expression (34).

Table 1: Functions and parameters in networked system I.

Table 2: The parameters of controllers.

Table 3: Model parameters of system II.