Neural Network-Based Intelligent Computing Algorithms for Discrete-Time Optimal Control with the Application to a Cyberphysical Power System

Adaptive dynamic programming (ADP), which belongs to the field of computational intelligence, is a powerful tool for addressing optimal control problems. To overcome the bottleneck of solving Hamilton–Jacobi–Bellman equations, several state-of-the-art ADP approaches are reviewed in this paper. First, two model-based offline iterative ADP methods, policy iteration (PI) and value iteration (VI), are given, and their respective advantages and shortcomings are discussed in detail. Second, the multistep heuristic dynamic programming (HDP) method is introduced, which avoids the requirement of an initial admissible control and achieves fast convergence. This method utilizes the advantages of PI and VI while overcoming their drawbacks. Finally, the discrete-time optimal control strategy is tested on a power system.


Introduction
Adaptive dynamic programming (ADP) [1][2][3][4], which integrates the advantages of reinforcement learning (RL) [5][6][7][8] and adaptive control, has become a powerful tool for solving optimal control problems. With decades of development, ADP has also provided many approaches for other control problems, such as robust control [9,10], optimal control with input constraints [11,12], optimal tracking control [13,14], zero-sum games [15], and non-zero-sum games [16]. Furthermore, ADP methods have been widely applied to real-world systems, such as the water-gas shift reaction [17], battery management [18], microgrid systems [19,20], and the Quanser helicopter [21]. These aforementioned papers were all inspired and developed from the basic works of ADP-based optimal control; i.e., optimal control is the core research topic of ADP. The bottleneck of solving nonlinear optimal control problems is obtaining the solutions of Hamilton-Jacobi-Bellman (HJB) equations. However, these equations are generally difficult or even impossible to solve analytically. To overcome this difficulty, ADP has provided several important iterative learning frameworks, such as policy iteration (PI) [2,22,23] and value iteration (VI) [24][25][26]. The PI algorithm starts from an initial admissible control policy and then performs the policy evaluation step and the policy improvement step successively until convergence. The main advantage of PI is that it ensures all the iterative control policies are admissible and achieves fast convergence. The drawback of PI is also obvious: the requirement of an initial admissible control is a strict condition in practice, which seriously limits its applications. Different from PI, VI can start from an arbitrary positive semidefinite value function, which is an easy-to-realize initial condition. Although the easier initial condition makes VI more practical, it also leads to a longer iterative learning process; that is, VI converges much more slowly than PI.
Thus, it is desirable to develop a new method that avoids the requirement of an initial admissible control and converges faster than the VI algorithm. To this end, the multistep heuristic dynamic programming (HDP) approach [27] is presented to integrate the merits of the PI and VI algorithms and overcome their drawbacks. This paper reviews the state-of-the-art ADP algorithms for the optimal control of discrete-time (DT) systems. The rest of this paper is arranged as follows. In Section 2, the problem formulation is derived. Three iterative model-based offline learning algorithms, along with comprehensive comparisons, are presented in Sections 3 and 4. The proposed DT optimal control strategy is tested on a power system in Section 5. Finally, a brief conclusion is drawn in Section 6.

Problem Formulation
In this paper, we consider the general nonlinear DT system

x(k + 1) = f(x(k)) + g(x(k))u(k), (1)

where x(k) ∈ R^n represents the system state, u(k) ∈ R^m denotes the control input, and f(x) ∈ R^n and g(x) ∈ R^(n×m) are the system functions. The purpose of the optimal control problem is to find a state feedback control policy u(x(k)) that not only stabilizes system (1) but also minimizes the following performance index function:

J(x(0)) = Σ_{k=0}^{∞} r(x(k), u(k)), (2)

where r(x, u) = x^T Q x + u^T R u. The matrices Q and R weight the performance of the system states and control inputs, respectively. Given an admissible control policy u(x(k)), the value function can be described by

V(x(k)) = Σ_{j=k}^{∞} r(x(j), u(x(j))). (3)

According to the definition of optimal control, the optimal value function can be defined by

V*(x(k)) = min_u Σ_{j=k}^{∞} r(x(j), u(x(j))). (4)

By using the stationarity condition [28], the optimal control policy can be derived as

u*(x(k)) = −(1/2) R^(−1) g^T(x(k)) ∇V*(x(k + 1)), (5)

where ∇V*(x(k + 1)) = ∂V*(x(k + 1))/∂x(k + 1).
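As a concrete illustration, the setup above can be sketched in a few lines of Python. The system functions f and g below are illustrative placeholders (not a model from this paper), and the infinite-horizon performance index is approximated by a finite rollout.

```python
import numpy as np

# Sketch of the problem setup: a nonlinear DT system
# x(k+1) = f(x(k)) + g(x(k)) u(k) with stage cost r(x, u) = x^T Q x + u^T R u.
# f and g are illustrative placeholders, not the paper's benchmark model.

def f(x):
    return np.array([0.9 * x[0] + 0.1 * x[1], -0.2 * np.sin(x[0]) + 0.8 * x[1]])

def g(x):
    return np.array([[0.0], [0.1]])

Q = np.eye(2)
R = np.eye(1)

def stage_cost(x, u):
    return float(x @ Q @ x + u @ R @ u)

def rollout_cost(x0, policy, horizon=200):
    """Finite-horizon approximation of the performance index J(x0)."""
    x, J = np.array(x0, dtype=float), 0.0
    for _ in range(horizon):
        u = policy(x)
        J += stage_cost(x, u)
        x = f(x) + (g(x) @ u).ravel()
    return J
```

For example, `rollout_cost([1.0, -0.5], lambda x: np.zeros(1))` evaluates the cost accrued by the uncontrolled system from that initial state.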
The key to obtaining the optimal control policy u*(x(k)) is to solve the following DT HJB equation [27]:

V*(x(k)) = min_{u(k)} { x^T(k) Q x(k) + u^T(k) R u(k) + V*(x(k + 1)) }. (6)

Remark 1. Figure 1 illustrates the relationship and difference between discrete-time and continuous-time optimal control. Real-world systems generally exist in continuous-time form. After mathematical modeling, they are formulated as continuous-time system models. Through sampling and discretization, the continuous-time system models are converted into discrete-time ones. Therefore, the associated performance indexes and HJB equations of discrete-time systems are in discretized forms compared with their continuous-time counterparts. The key to solving the discrete-time optimal control problem is the discrete-time HJB equation, which is a nonlinear partial difference equation.
The existing works on continuous-time systems far outnumber those on discrete-time systems. To help close this gap, several ADP learning algorithms, along with their neural network (NN) implementations, will be introduced.

Model-Based PI Algorithm for the Optimal Control Problem of DT Systems
In this section, the model-based PI algorithm and its NN implementation are introduced in detail. The model-based PI algorithm [2,23] is shown in Algorithm 1. The actor-critic dual-network structure with a gradient-descent updating law is employed to implement Algorithm 1. First, construct the critic NN to approximate the iterative value function:

V^{l,q}(x(k)) = (W_c^{l,q})^T φ_c(x(k)), (7)

where W_c^{l,q} and φ_c(x) denote the NN weights and NN activation functions of the critic network, and q is the iteration index of the following gradient-descent method.
Define the error function for the critic NN:

e_c^{l,q}(k) = V^{l,q}(x(k)) − [r(x(k), u^l(x(k))) + V^{l,q}(x(k + 1))]. (8)

In order to minimize the error performance E_c^{l,q}(k) = (e_c^{l,q}(k))^2/2, the gradient-descent-based updating law for the critic NN is given by

W_c^{l,q+1} = W_c^{l,q} − β_c e_c^{l,q}(k) φ_c(x(k)), (9)

where β_c is the learning rate of the critic NN.
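The critic update can be sketched as follows. The quadratic feature vector `phi_c` is a hand-picked assumption (the paper leaves the activation functions generic), and the Bellman target is treated as fixed during each gradient step.

```python
import numpy as np

# Sketch of the gradient-descent critic update.  The critic approximates
# V(x) = W_c^T phi_c(x); the target comes from the policy-evaluation
# Bellman equation V(x) = r(x, u) + V(x').

def phi_c(x):
    # Illustrative quadratic features for a 2-state system (an assumption).
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2])

def critic_update(W_c, x, x_next, r, beta_c=0.05):
    """One gradient step on E = e^2/2 with
    e = W_c^T phi_c(x) - (r + W_c^T phi_c(x_next))."""
    target = r + W_c @ phi_c(x_next)       # held fixed for this step
    e = W_c @ phi_c(x) - target
    return W_c - beta_c * e * phi_c(x)     # W <- W - beta_c * dE/dW
```

Each call performs one step of (9); in practice the step is repeated over the index q until the error is below the chosen precision.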
Similar to the design of the critic NN, the actor network, which is used to approximate the iterative control policy, is expressed as

u^{l,q}(x(k)) = (W_a^{l,q})^T φ_a(x(k)), (10)

where W_a^{l,q} and φ_a(x) denote the NN weights and activation functions of the actor network. The error function for the actor NN is defined as

e_a^{l,q}(k) = u^{l,q}(x(k)) − u^l(x(k)). (11)

Figure 1: Relationship and difference between discrete-time and continuous-time optimal control.
Step 1: (Initialization) Let the iteration index l = 0. Select an initial admissible control policy u^0(x). Choose a small enough computation precision ε.
Step 2: (Policy Evaluation) Solve V^l(x(k)) = r(x(k), u^l(x(k))) + V^l(x(k + 1)).
Step 3: (Policy Improvement) Compute u^{l+1}(x(k)) = −(1/2) R^(−1) g^T(x(k)) ∇V^l(x(k + 1)).
Step 4: If ‖V^l − V^{l−1}‖ ≤ ε, stop; the optimal control policy u^{l+1}(x) is acquired. Else, let l = l + 1 and go back to Step 2.

Here, u^l(x(k)) can be attained according to Algorithm 1. To minimize the error performance E_a^{l,q}(k) = (e_a^{l,q}(k))^T e_a^{l,q}(k)/2, using the chain rule, the updating law for the actor NN is designed as

W_a^{l,q+1} = W_a^{l,q} − β_a φ_a(x(k)) (e_a^{l,q}(k))^T, (12)

where β_a is the learning rate of the actor NN.
Remark 2. Figure 2 displays the NN implementation diagram of the PI algorithm. First, the NN weights of the actor network should be chosen to generate an admissible control. Second, the critic and actor networks are updated via the gradient-descent-based learning law to realize the policy evaluation and policy improvement steps, respectively. After the iterations, the critic and actor networks converge, and the NN-based approximate optimal control is obtained. Many stability proofs of the NN implementation procedure have been given in the existing works. Here, we introduce the following rigorous proof to demonstrate the optimality and convergence.
Theorem 1. Let the target iterative value function and control policy be described by V^l(x(k)) = (W_c^l)^T φ_c(x(k)) and u^l(x(k)) = (W_a^l)^T φ_a(x(k)), respectively, and let the critic and actor NNs be updated via (9) and (12). Define the weight estimation errors W̃_c^{l,q} = W_c^{l,q} − W_c^l and W̃_a^{l,q} = W_a^{l,q} − W_a^l. Then, from (9) and (12), it can be acquired that

W̃_c^{l,q+1} = W̃_c^{l,q} − β_c e_c^{l,q}(k) φ_c(x(k)),
W̃_a^{l,q+1} = W̃_a^{l,q} − β_a φ_a(x(k)) (e_a^{l,q}(k))^T. (13)

Construct the following Lyapunov function candidate:

P(W̃_c^{l,q}, W̃_a^{l,q}) = (1/β_c) tr{(W̃_c^{l,q})^T W̃_c^{l,q}} + (1/β_a) tr{(W̃_a^{l,q})^T W̃_a^{l,q}}. (14)

The difference of the Lyapunov function (14) along (13) can be derived accordingly. If the learning rates are selected to satisfy β_c ≤ 2/‖φ_c(x(k))‖^2 and β_a ≤ 2/‖φ_a(x(k))‖^2, then one has ΔP(W̃_c^{l,q}, W̃_a^{l,q}) ≤ 0; hence the weight estimation errors are nonincreasing and the critic and actor NNs converge to their target values.
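The PI iteration above can be checked on a linear-quadratic special case, x(k + 1) = A x(k) + B u(k), where policy evaluation reduces to a discrete Lyapunov equation and policy improvement to the LQR gain update. The matrices A, B and the initial gain below are illustrative, and the fixed-point Lyapunov solver is a simple stand-in for a library routine.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

def dlyap(Acl, S, iters=500):
    """Solve P = Acl^T P Acl + S by fixed-point iteration (Acl must be stable)."""
    P = np.zeros_like(S)
    for _ in range(iters):
        P = Acl.T @ P @ Acl + S
    return P

K = np.array([[1.0, 3.0]])                # initial admissible (stabilizing) gain
for l in range(20):
    Acl = A - B @ K
    # Policy evaluation: V^l(x) = x^T P x solves the Lyapunov equation
    P = dlyap(Acl, Q + K.T @ R @ K)
    # Policy improvement: LQR gain update
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

At convergence, P satisfies the discrete-time algebraic Riccati equation, which is exactly the DT HJB equation (6) for this special case.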

Model-Based VI Algorithm and Multistep HDP Algorithm
With the help of the initial admissible control, the PI algorithm achieves fast convergence. However, the weakness of the PI algorithm is obvious: it requires the initial control policy to be admissible, which is a strict condition. How to find an initial admissible control policy is still an open problem, which limits the real-world applications of the PI algorithm. To relax this strict condition, the model-based VI algorithm [24][25][26] is shown in Algorithm 2, where the initial condition becomes much easier to satisfy.
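On the same linear-quadratic special case used above (with illustrative A and B), VI becomes the familiar Riccati-type value update started from V^0(x) = 0, which illustrates both the easy initial condition and the slower, linear convergence.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros((2, 2))                      # V^0(x) = 0: easy-to-realize start
for l in range(5000):
    # Greedy policy with respect to the current value function
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # Value update: one Riccati-type sweep
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ K
    if np.abs(P_next - P).max() < 1e-10:
        break
    P = P_next
```

No admissible initial policy is needed, but many sweeps are typically required, in contrast with PI's fast convergence from a stabilizing gain.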
Remark 3. Different from the PI algorithm, the VI algorithm does not require an initial admissible control; one only needs to provide a specific initial value function, which makes the VI algorithm more practical in real-world applications. However, without the help of an initial admissible control, the VI algorithm generally suffers from a low convergence speed. From the aforementioned content, it can be observed that the PI and VI algorithms have their own advantages and disadvantages. The PI algorithm achieves fast convergence, but it requires an initial admissible control policy. The VI algorithm can start from an easy-to-realize initial condition, but it generally suffers from a low convergence speed. Thus, it is expected that a new approach can make a trade-off between the PI algorithm and the VI algorithm.
That is, it is desired to develop an algorithm that converges faster than the VI algorithm and does not require an initial admissible control policy. To realize this goal, the multistep HDP method [27] will be introduced in Algorithm 3.

Construct the critic and actor NNs to approximate the iterative value function and control policy as follows:

V^l(x(k)) = (W_c^l)^T φ_c(x(k)),  u^l(x(k)) = (W_a^l)^T φ_a(x(k)),

where W_c^l and W_a^l are the NN weights and φ_c(x) and φ_a(x) are the associated NN activation functions.
According to Algorithm 3, using the NNs to estimate the solutions will yield the following error:

e_c^l(k) = (W_c^{l+1})^T φ_c(x(k)) − [r(x(k), u^l(x(k))) + V^l(x(k + 1))].

Let the critic NN be trained by gradient descent to minimize (e_c^l(k))^2/2. To minimize r(x(k), u^l(x(k))) + V^l(x(k + 1)), the gradient-descent-based updating law for the actor NN is given by

W_a^{q+1} = W_a^q − β_a φ_a(x(k)) [2R u^l(x(k)) + g^T(x(k)) ∇V^l(x(k + 1))]^T.

Step 1: (Initialization) Let the iteration index l = 0. Select an initial value function V^0(x). Choose a small enough computation precision ε.
Step 2: (Policy Improvement) u^l(x(k)) = arg min_u { r(x(k), u) + V^l(x(k + 1)) }.
Step 3: (Value Update) V^{l+1}(x(k)) = r(x(k), u^l(x(k))) + V^l(x(k + 1)).
Step 4: If ‖V^{l+1} − V^l‖ ≤ ε, stop; the optimal control policy u^l(x) is acquired. Else, let l = l + 1 and go back to Step 2.

Remark 4. Table 1 and Figure 3 show the performance comparison and relationship among the PI algorithm, the VI algorithm, and multistep HDP. Due to the existence of an initial admissible control, the PI algorithm achieves fast convergence. However, the condition of an initial admissible control is difficult to realize. Different from the PI algorithm, the initial condition of the VI algorithm is easy to realize. However, this initial condition may not be admissible, which may degrade stability during learning. Multistep HDP follows the initial condition of the VI algorithm and develops the multistep policy evaluation step to exploit more history data. Therefore, multistep HDP is easy to realize and achieves fast convergence at the same time; that is, multistep HDP successfully combines the advantages of the PI and VI algorithms.
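The trade-off in Table 1 can be illustrated on the linear-quadratic case: each sweep performs one greedy policy improvement followed by an N-step policy evaluation, so N = 1 recovers VI, while larger N moves toward PI's exact evaluation. This is a simplified reading of the multistep idea (with illustrative A and B), not the exact neural-network form of [27].

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

def multistep_sweep(P, N):
    """One iteration: greedy policy improvement, then N-step evaluation."""
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # greedy policy
    Acl, S = A - B @ K, Q + K.T @ R @ K
    for _ in range(N):                                   # N evaluation steps
        P = Acl.T @ P @ Acl + S
    return P

def sweeps_to_converge(N, tol=1e-9, max_sweeps=5000):
    """Sweeps until successive value matrices stop changing."""
    P = np.zeros((2, 2))                                 # VI-style easy start
    for l in range(1, max_sweeps + 1):
        P_next = multistep_sweep(P, N)
        if np.abs(P_next - P).max() < tol:
            return l
        P = P_next
    return max_sweeps
```

Comparing `sweeps_to_converge(1)` (pure VI) with a larger N shows the multistep evaluation reaching convergence in fewer sweeps while still starting from the zero value function.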

Application to a Benchmark Power System
The benchmark power system investigated in this paper is illustrated in Figure 4. This power system can be regarded as a microgrid, which is composed of nonpolluting energy sources (subsystems I and II), load demand sides (subsystem III), and regular generation (subsystem IV). The core control unit is the management center, which maintains frequency stability against load variations. As shown in Figure 5, the real-world power system is first formulated as a state-space model via mathematical modeling. After sampling and discretization, the system model can be controlled by computers.

System Model and Application.
Through iterative ADP learning, the approximate optimal control can be obtained. Substituting the approximate optimal control into the system model yields the simulation results. To test the effectiveness of the proposed DT optimal control strategy, let us consider the following power system [19,20]:

Δξ̇_f = −(1/T_p) Δξ_f + (α_p/T_p) Δξ_t,
Δξ̇_t = −(1/T_t) Δξ_t + (1/T_t) Δξ_g, (21)
Δξ̇_g = −(1/(α_s T_g)) Δξ_f − (1/T_g) Δξ_g + (1/T_g) u,

where Δξ_f is the frequency deviation; Δξ_t denotes the turbine power; Δξ_g represents the governor position value; T_t, T_g, and T_p denote the time constants of the turbine, governor, and power system, respectively; α_p represents the gain of the power system; α_s is the speed regulation coefficient; u denotes the control input; and x is the state variable. Let x = [Δξ_f, Δξ_t, Δξ_g]^T, where x_1 = Δξ_f, x_2 = Δξ_t, and x_3 = Δξ_g. Then, system (21) can be discretized into the form of (1). Set the matrices in the performance index function as Q = 2I_3 and R = 1.
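The discretization step can be sketched as follows. The three-state matrix form follows the standard load-frequency control model; the parameter values and sampling period are typical illustrative numbers (an assumption, since this section does not list them).

```python
import numpy as np

# Sketch of discretizing the continuous-time load-frequency model into
# the form of (1).  Parameter values are typical textbook numbers, not
# the paper's (an assumption).

T_p, T_t, T_g = 20.0, 0.3, 0.08      # power system, turbine, governor time constants
alpha_p, alpha_s = 120.0, 2.4        # power system gain, speed regulation coefficient
dt = 0.01                            # sampling period (assumed)

A_c = np.array([
    [-1.0 / T_p,             alpha_p / T_p, 0.0        ],
    [0.0,                    -1.0 / T_t,    1.0 / T_t  ],
    [-1.0 / (alpha_s * T_g), 0.0,           -1.0 / T_g ],
])
B_c = np.array([[0.0], [0.0], [1.0 / T_g]])

# Forward-Euler discretization: x(k+1) = (I + dt*A_c) x(k) + dt*B_c u(k)
A_d = np.eye(3) + dt * A_c
B_d = dt * B_c

Q, R = 2.0 * np.eye(3), np.eye(1)    # weights from the performance index
```

The resulting pair (A_d, B_d) plays the role of the discretized system (1) on which the iterative ADP learning is run.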

Simulation Results.
Simulation results are shown in Figure 6. Figure 6(a) shows that the system states cannot be stabilized without control. Then, we apply the optimal control strategy to the system. Figure 6(b) indicates that the system states are stabilized after 8 time steps under the optimal control. Comparing the trajectories of the system states, the superior performance of the optimal control strategy can be observed. Figure 6(c) shows the 2D plot of the convergence trajectory in detail. Figure 6(d) provides the evolution of the control input. These simulation results demonstrate the high stability, fast convergence, and low control cost of the DT optimal control strategy.

Conclusions
In this paper, several state-of-the-art ADP-based methods have been reviewed for addressing the optimal control problem of DT systems. A comprehensive comparison has been made between PI and VI. A multistep HDP method has been introduced to integrate the advantages of the PI and VI algorithms without either the strict requirement of an initial admissible control or the long iterative learning process. The simulation results have demonstrated the effectiveness of the reviewed schemes.

Data Availability
Data are available upon request to the corresponding author.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.