Adaptive dynamic programming (ADP), which belongs to the field of computational intelligence, is a powerful tool for addressing optimal control problems. To overcome the bottleneck of solving Hamilton–Jacobi–Bellman equations, several state-of-the-art ADP approaches are reviewed in this paper. First, two model-based offline iterative ADP methods, policy iteration (PI) and value iteration (VI), are presented, and their respective advantages and shortcomings are discussed in detail. Second, the multistep heuristic dynamic programming (HDP) method is introduced, which avoids the requirement of an initial admissible control and achieves fast convergence, thereby combining the advantages of PI and VI while overcoming their drawbacks. Finally, the discrete-time optimal control strategy is tested on a benchmark power system.

Adaptive dynamic programming (ADP) [

The bottleneck in solving nonlinear optimal control problems is obtaining solutions of the Hamilton–Jacobi–Bellman (HJB) equations, which are generally difficult or even impossible to solve analytically. To overcome this difficulty, ADP provides several important iterative learning frameworks, such as policy iteration (PI) [

This paper reviews the state-of-the-art ADP algorithms for the optimal control of discrete-time (DT) systems. The rest of this paper is arranged as follows. In Section

In this paper, we consider the general nonlinear DT system:

The purpose of the optimal control problem is to find a state feedback control policy

According to the definition of optimal control, the optimal value function can be defined by
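In generic notation (assumed here for concreteness; the symbols may differ from those used elsewhere in the paper), for a system $x_{k+1} = F(x_k, u_k)$ with stage cost $U(x_k, u_k) \ge 0$, the standard DT formulation of these quantities is:

```latex
% Optimal value function (infinite-horizon, undiscounted):
V^{*}(x_k) = \min_{\underline{u}_k} \sum_{t=k}^{\infty} U\bigl(x_t, u_t\bigr)

% DT Bellman optimality (HJB) equation:
V^{*}(x_k) = \min_{u_k} \bigl\{ U(x_k, u_k) + V^{*}\bigl(F(x_k, u_k)\bigr) \bigr\}

% Corresponding optimal control policy:
u^{*}(x_k) = \arg\min_{u_k} \bigl\{ U(x_k, u_k) + V^{*}\bigl(F(x_k, u_k)\bigr) \bigr\}
```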

By using the stationarity condition [

The key to obtaining the optimal control policy

Figure

Relationship and difference between discrete-time and continuous-time optimal control.

In this section, the model-based PI algorithm along with its NN implementation will be introduced in detail. The model-based PI algorithm [

Let the iteration index

Select an initial admissible control policy

Choose a small enough computation precision

With

With

Else, let
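The steps above can be sketched on a scalar linear-quadratic problem, where both PI steps have closed forms. The system $x_{k+1} = a x_k + b u_k$, the cost weights, and the parameter values below are illustrative assumptions standing in for the general nonlinear setting:

```python
import numpy as np

# Model-based policy iteration (PI) on a scalar LQ problem.
# System: x_{k+1} = a*x_k + b*u_k; cost: sum of q*x^2 + r*u^2.
a, b, q, r = 0.9, 1.0, 1.0, 1.0

# Initial ADMISSIBLE (stabilizing) gain for u = -K*x; K = 0 works here
# only because the open-loop system is stable (|a| < 1).
K = 0.0
eps = 1e-10                  # computation precision
P_old = np.inf
for i in range(100):
    # Policy evaluation: solve P = (q + r*K^2) + (a - b*K)^2 * P
    # (a scalar Lyapunov equation) for the current policy.
    acl = a - b * K
    P = (q + r * K**2) / (1.0 - acl**2)
    # Policy improvement: greedy gain w.r.t. the evaluated value function.
    K = a * b * P / (r + b**2 * P)
    if abs(P - P_old) < eps:  # stopping criterion
        break
    P_old = P

# At convergence, P satisfies the discrete-time algebraic Riccati equation.
residual = abs(P - (q + a**2 * P - (a * b * P)**2 / (r + b**2 * P)))
print(i, P, residual)
```

Thanks to the admissible start and the exact evaluation step, only a handful of iterations are needed here.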

The actor-critic dual-network structure with the gradient-descent updating law is employed to implement Algorithm

Define the error function for the critic NN:

In order to minimize the error performance

Similar to the design of critic NN, the actor network, which is used to approximate the iterative control policy, is expressed as

The error function for the actor NN is defined as

To minimize the error performance
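The gradient-descent weight update can be illustrated for a linear-in-weights critic $\hat V(x) = W_c^\top \phi(x)$; the basis $\phi$, target value, and learning rate below are assumptions for illustration, not the paper's exact networks:

```python
import numpy as np

# Gradient-descent training of a critic network V_hat(x) = Wc @ phi(x),
# minimizing the error performance E = 0.5 * e^2 with e = V_hat(x) - target.
def phi(x):
    # simple polynomial activation vector (illustrative choice)
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

x = np.array([0.5, -0.3])
target = 0.8          # target value supplied by the value-update step
Wc = np.zeros(3)      # critic weights
lr = 1.0              # critic learning rate

for _ in range(300):
    e = Wc @ phi(x) - target       # critic error
    Wc -= lr * e * phi(x)          # gradient step: dE/dWc = e * phi(x)

err = abs(Wc @ phi(x) - target)
print(err)
```

The actor network is trained with the same pattern, with its error defined through the improved control policy rather than the value target.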

Figure

NN implementation diagram of PI algorithm.

Let the target iterative value function and control policy be described by

Let

Construct the following Lyapunov function candidate:

The difference of the Lyapunov function (

If the learning rates are selected to satisfy

This completes the proof.

With the help of the initial admissible control, the PI algorithm achieves fast convergence. However, its weakness is also obvious: the PI algorithm requires the initial control policy to be admissible, which is a strict condition. How to find an initial admissible control policy remains an open problem, which limits real-world applications of the PI algorithm. To relax this strict condition, the model-based VI algorithm [

Let the iteration index

Select an initial value function

Choose a small enough computation precision

With

With

Else, let
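For comparison with the PI sketch, the VI recursion can be written for the same scalar LQ stand-in (the system and parameter values are the same illustrative assumptions as before):

```python
import numpy as np

# Model-based value iteration (VI) on the scalar LQ problem
# x_{k+1} = a*x_k + b*u_k with cost q*x^2 + r*u^2.
a, b, q, r = 0.9, 1.0, 1.0, 1.0

P = 0.0                      # easy-to-realize initial value function
eps = 1e-10
for j in range(10_000):
    # One VI backup: minimize q*x^2 + r*u^2 + P*(a*x + b*u)^2 over u,
    # which for the LQ case reduces to a Riccati-difference recursion.
    P_new = q + a**2 * P - (a * b * P)**2 / (r + b**2 * P)
    if abs(P_new - P) < eps:
        break
    P = P_new

# VI converges linearly from P = 0, while PI's full evaluation step
# typically reaches the same precision in fewer iterations.
print(j, P)
```

Note that no stabilizing gain is needed to start: the recursion is well defined from $P_0 = 0$.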

Different from the PI algorithm, the VI algorithm does not require an initial admissible control; one only needs to provide a specific initial value function, which makes the VI algorithm more practical in real-world applications. However, without the help of an initial admissible control, the VI algorithm generally suffers from a low convergence speed. From the above, it can be observed that the PI and VI algorithms have their own advantages and disadvantages. The PI algorithm achieves fast convergence but requires an initial admissible control policy. The VI algorithm can start from an easy-to-realize initial condition but generally converges slowly. Thus, it is desirable to design a new approach that strikes a trade-off between the PI and VI algorithms.

That is, it is desired to develop an algorithm that converges faster than the VI algorithm and does not require an initial admissible control policy. To realize this goal, the multistep HDP method [

Construct the critic and actor NNs to approximate the iterative value function and control policy as follows:

According to Algorithm

Let

To minimize

To minimize

Let the iteration index

Select an initial value function

Choose a small enough computation precision

With

With

Else, let
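The interpolation between VI and PI can be made concrete on the same scalar LQ stand-in: the value update bootstraps an $h$-step return, so $h = 1$ recovers VI and $h \to \infty$ approaches PI's full policy evaluation. The LQ reduction, the step count $h$, and the parameter values are illustrative assumptions:

```python
import numpy as np

# Multistep HDP sketch on the scalar LQ problem
# x_{k+1} = a*x_k + b*u_k with cost q*x^2 + r*u^2.
a, b, q, r = 0.9, 1.0, 1.0, 1.0
h = 5                        # number of lookahead steps

P = 0.0                      # easy initial value function, as in VI
eps = 1e-10
for i in range(1000):
    K = a * b * P / (r + b**2 * P)   # greedy policy from current P
    acl = a - b * K                   # closed-loop dynamics under u = -K*x
    c = q + r * K**2                  # per-step cost under the greedy policy
    # h-step evaluation of the greedy policy, bootstrapped with P; the sum
    # is finite, so no admissibility of the policy is required.
    P_new = sum(c * acl ** (2 * t) for t in range(h)) + acl ** (2 * h) * P
    if abs(P_new - P) < eps:
        break
    P = P_new

print(i, P)
```

All three sketches converge to the same Riccati solution; the multistep version needs no stabilizing initial gain yet contracts faster per iteration than the one-step VI backup.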

From Table

Performance comparison among PI algorithm, VI algorithm, and multistep HDP.

Methods and performance | Computation stability | Initial difficulty | Convergence speed |
---|---|---|---|
PI algorithm | High | Difficult | High |
VI algorithm | Low | Easy | Low |
Multistep HDP | Medium | Easy | Medium |

Relationship and difference among PI algorithm, VI algorithm, and multistep HDP.

The benchmark power system investigated in this paper is illustrated in Figure

The benchmark power system considered in this paper.

In Figure

Application to the benchmark power system.

Simulation results are shown in Figure

Simulation results: (a) system states without control; (b) system states with optimal control; (c) 2D plot of

In this paper, several state-of-the-art ADP-based methods have been reviewed for the optimal control of DT systems. A comprehensive comparison has been made between PI and VI. A novel multistep HDP method has been introduced to integrate the advantages of the PI and VI algorithms while avoiding their respective drawbacks, namely the strict requirement of an initial admissible control and the longer learning process. The simulation results have demonstrated the effectiveness of the presented schemes.

Data are available upon request to the corresponding author.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This work was supported by the Science and Technology Foundation of SGCC (Grant no. SGLNDK00DWJS1900036).