Optimal Tracking Control for a Discrete Time Nonlinear Nuclear Power System

Recently, increasing attention has been paid to nuclear power control, driven by the appeal of clean energy and the demand for power regulation when integrating into the power grid. However, a nuclear power system is a discrete-time (DT) nonlinear and complicated system whose parameters are entangled with the intrinsic states. Furthermore, the heavy computational burden caused by the high order of the nuclear reactor model creates many difficulties for power control in the nuclear industry. In this study, a new optimal tracking control scheme for DT nonlinear nuclear power systems is provided to accomplish the power control of a 2500-MW pressurized water reactor (PWR) nuclear power plant. The proposed approach, based on the value iteration method, is a novel algorithm in the computational intelligence community with a basic actor-critic structure built on neural networks (NNs). The approach introduces some modifications, where the cost function is redefined by leveraging higher-order polynomials to substitute for neural networks in the entire actor-critic architecture. Simulation results for the 2500-MW PWR nuclear power plant are given to demonstrate the effectiveness of the developed method.


Introduction
Considering the issues of environmental deterioration, e.g., air pollution due to excessive fossil fuel consumption, it is significant for humans to develop clean energy technology to ease this situation. Nuclear energy is among the most rapidly developing clean power sources supplying the power grid. However, currently adopted control strategies face problems such as the intrinsic nonlinearity of nuclear reactor systems and parameters that vary with the power level. In fact, over decades of development of nuclear power industry technology and control policy, many outstanding researchers have made excellent progress in this field.
Since the last century, one mature control strategy, the PID control policy, has deeply affected power control in the nuclear industry [1, 2]. With the advancement of control technology, model predictive control (MPC) and multimodel adaptive control theories, which rely on local linearization to approximate the nonlinearity of nuclear power systems, have also been applied in this area [3-5]. The control algorithm most extensively applied in nuclear power control is fuzzy control, or its combinations with other control theories, to address different demands. Wu leveraged the parallel distribution compensation (PDC) method-based T-S fuzzy control to restrain the nonlinearity of a nuclear power system [6], and Eliasi designed an appropriate controller for the UTSG water level in nuclear power plants using a fuzzy control policy and an MPC algorithm [7]. Many researchers have used different methods to tackle various problems. For example, Gang employed a radial basis function neural network (RBFN) to guarantee correctness in identifying the nuclear steam generator process dynamics [8]. Wang applied the adaptive control method and a guaranteed cost control method to nuclear power control problems [9]. The aforementioned approaches mostly rely on a linearization procedure, which largely ignores the numerical error with respect to the nonlinear model; the intrinsic nonlinearity of the high-order nuclear model is thus neglected. To better satisfy the demands of tracking control problems, it is necessary to propose a new control strategy in this area. With developments in the intelligent control community, researchers are pursuing reinforcement learning (RL) algorithms to solve nonlinear problems in practice. Adaptive dynamic programming (ADP), which was proposed by Werbos, plays an important role in reinforcement learning-based control policy [10-13], and it is well known as a self-learning optimal control policy.
The well-studied iteration methods are the policy iteration (PI) algorithm and the value iteration (VI) algorithm. Among them, the value iteration algorithm is one of the most crucial iterative ADP algorithms and has been studied in much research [14-16]. To find the optimal control policy of discrete-time affine nonlinear systems, Al-Tamimi and Lewis used heuristic dynamic programming (HDP) to fulfill the design purpose [17]. Wei [18] proposed a new value iteration method, which mainly focuses on optimal control for DT nonlinear systems. That study also provided a detailed proof for the iterative control policy and illustrated that the value function is monotonically nondecreasing, which implies that it converges to the extremum.
To satisfy the demands of industrial systems, optimal tracking control ADP methods have been deeply investigated.
There are also ADP techniques [19-21] for obtaining solutions to optimal tracking problems under various system dynamics, such as partially unknown or completely unknown system models. Related optimal tracking control techniques have been applied to many industrial plants in recent years [22-26].
In this study, a value-iteration optimal tracking control method is developed for DT nonlinear systems. The main contributions of this study are summarized as follows: (1) compared with the traditional control methods dealing with DT nonlinear models of 2500-MW pressurized water reactor (PWR) systems [1, 2, 4, 5], a self-learning optimal tracking controller is designed to handle the complex nonlinear behaviors of the 2500-MW PWR nuclear system; (2) the developed value-iteration method guarantees that the control law converges to a near-optimal control solution, and the admissibility of the iterative control laws is analyzed. In this study, our major work is to design an optimal tracking controller for a 2500-MW PWR nuclear power plant by combining the properties of the value iteration and actor-critic algorithms. The 2500-MW PWR nuclear power plant is introduced in Section 2, and its discrete definition is given. In Section 3, the details of the proposed algorithm are thoroughly described. The implementation of the proposed method and simulation work are provided in Section 4. Finally, the conclusions are drawn in Section 5.

Nonlinear 2500-MW PWR Nuclear Power Plant

The famous nuclear system model is based on Mann's model without xenon poisoning, which consists of one core fuel lump and two coolant lumps. The discrete version and its transformation are also given in this section.

Nonlinear 2500-MW PWR Nuclear Power Plant.
This fifth-order nonlinear PWR model includes the point kinetics equations with six delayed neutron groups, two equations for the lumped coolant outlet temperature and average fuel temperature, and the reactivity equation of the control rod [27, 28]. The point reactor dynamic equations with multiple delayed neutron groups are given in [27, 28]. To reduce the computational work caused by the six-group point reactor dynamic equations, a simple method is to use single-group delayed neutron point kinetics equations to approximate them [29]. Thus, the entire PWR model is summarized in terms of the following states: n_r is the neutron density relative to the initial equilibrium density (%), c_r is the delayed neutron precursor density relative to its initial equilibrium density (%), T_f is the average fuel temperature (°C), T_l is the coolant temperature at the core outlet (°C), ρ_r is the reactivity contributed by the control rod movement, and Z_r is the speed of the control rod. The remaining specifications are listed in the Nomenclature [30]. T_e always approximates T_e0 in the PWR model. The state n_r can be described as a percentage of the full power level, since the reactor power is expressed as P(t) = P_0 n_r(t).
In addition, five parameters vary with n_r, which causes severe instability of the nuclear reactor power model and increases the control complexity. The remaining parameters and their specific relations are shown in Tables 1 and 2. During load lifting (lowering) of the PWR model, the varying parameters lead to sharp differences in the solutions of the dynamic model. Thus, the model may become uncontrollable, and the solutions of the dynamic model may diverge. Linearizing nuclear systems to realize various control goals has been a common approach in traditional control policies over the past few years. However, a new approach for nonlinear systems is needed to solve these problems.
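To make the model structure concrete, the following sketch implements a simplified fifth-order PWR model with one effective delayed-neutron group, as described above. All numerical parameter values are hypothetical placeholders, not the paper's parameter set (which lives in Tables 1 and 2 and the Nomenclature).

```python
import numpy as np

# Hypothetical parameter values for illustration only.
BETA, LAMBDA_GEN, LAMBDA_DECAY = 0.0065, 1e-4, 0.1   # beta, Lambda, lambda
ALPHA_F, ALPHA_C = -2e-5, -1e-5                      # fuel/coolant reactivity feedback
G_R = 0.01                                           # control-rod worth per unit speed

def pwr_dynamics(x, z_r):
    """Time derivative of x = [n_r, c_r, T_f, T_l, rho_r]; z_r is the rod speed."""
    n_r, c_r, T_f, T_l, rho_r = x
    rho = rho_r + ALPHA_F * T_f + ALPHA_C * T_l      # total reactivity (sketch)
    dn_r = (rho - BETA) / LAMBDA_GEN * n_r + BETA / LAMBDA_GEN * c_r
    dc_r = LAMBDA_DECAY * (n_r - c_r)
    dT_f = 0.05 * n_r - 0.02 * (T_f - T_l)           # lumped fuel heat balance (placeholder)
    dT_l = 0.01 * (T_f - T_l) - 0.005 * T_l          # lumped coolant heat balance (placeholder)
    drho_r = G_R * z_r                               # rod reactivity driven by rod speed
    return np.array([dn_r, dc_r, dT_f, dT_l, drho_r])
```

Note how the single controllable input, the rod speed z_r, enters only through the reactivity equation, which is the structure the later control-oriented form relies on.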

System Discretization and Transformation.
The optimal tracking control problem can be considered as minimizing the deviation of the real dynamic trajectory from the desired trajectory. Based on the model, we let the state vector be x = [n_r, c_r, T_f, T_l, ρ_r]^T; thus, the control-oriented nuclear power system can be written in the affine form ẋ = f(x) + g(x)u, where f(x) and g(x) are derived from the PWR model equations. As the above description shows, there is only one controllable signal, Z_r, i.e., the speed of the control rod. We let the control variable be u = Z_r.
According to the definition ẋ_k ≈ (x_{k+1} − x_k)/ΔT, we have the discretized version of the PWR power model: x_{k+1} = x_k + ΔT[f(x_k) + g(x_k)u_k]. Basically, we define the optimal tracking control problem as obtaining an optimal control strategy such that the system tracks the reference state along a desired trajectory x_{d,k}. The tracking error of the state is defined as ε_k = x_k − x_{d,k}. Additionally, we must determine an initial steady-state control policy u_d, and the error of the controller is defined as u_{ε,k} = u_k − u_{d,k}, so the dynamic tracking error system can be written as ε_{k+1} = f_ε(ε_k) + g_ε(ε_k)u_{ε,k}. We define the utility function as U(ε_k, u_{ε,k}) = ε_k^T Q ε_k + u_{ε,k}^T R u_{ε,k}; thus, the tracking error cost function is written as I(ε_k) = Σ_{j=k}^{∞} U(ε_j, u_{ε,j}). From the principle of optimality, the DT Hamiltonian function is derived, and the HJB equation can be written as I*(ε_k) = min_{u_{ε,k}} [U(ε_k, u_{ε,k}) + I*(ε_{k+1})] (13). Then, the optimal tracking control law for the error system is derived as u*_{ε,k} = arg min_{u_{ε,k}} [U(ε_k, u_{ε,k}) + I*(ε_{k+1})]. Finally, we obtain the standard optimal control law as u*_k = u*_{ε,k} + u_{d,k}. For linear systems, the HJB equation reduces to the Riccati equation. However, due to the nonlinearity of the nuclear power system, it is extremely intractable to solve the HJB equation (13) for the nonlinear system. Thus, the value iteration (VI) method based on the actor-critic NN algorithm is adopted to find an approximate optimal solution of the HJB equation (13), which implies that no knowledge of the model drift or the command generator is required.
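The Euler discretization and tracking-error construction described above can be sketched as follows. The drift f, gain g, and reference trajectory here are toy stand-ins, not the PWR model itself.

```python
import numpy as np

DT = 0.1  # sampling period (illustrative)

def f(x):
    """Toy drift dynamics standing in for the PWR drift term."""
    return -0.5 * x

def g(x):
    """Toy input gain (scalar control), standing in for the rod-speed channel."""
    return np.ones_like(x)

def step(x, u):
    """Euler-discretized step: x_{k+1} = x_k + dT * (f(x_k) + g(x_k) u_k)."""
    return x + DT * (f(x) + g(x) * u)

def tracking_error_rollout(x0, x_d, controls):
    """Roll the system forward and return the errors eps_k = x_k - x_{d,k}."""
    x, errors = x0, []
    for k, u in enumerate(controls):
        errors.append(x - x_d[k])
        x = step(x, u)
    return np.array(errors)
```

With the zero control and a zero reference, the toy error decays geometrically, mirroring how the error system inherits the discretized dynamics.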
Mathematical Problems in Engineering

Algorithm Analysis
The detailed convergence properties of the proposed value iteration algorithm are illustrated in this section, and the details of the actor-critic NN are also discussed.

Analysis of the Value Iteration Algorithm for Tracking Control of Nonlinear Systems.
Considering the nonaffine nonlinear system, for an infinite-horizon optimal tracking problem, the goal is to obtain an optimal controller such that the state x_k tracks the specified reference trajectory x_{d,k}.
Remark 1. For many nonlinear systems, there is a feedback control u_{d,k} that makes (9) hold. For example, for DT affine nonlinear systems (7) with invertible g(x_{d,k}), the desired control u_{d,k} can be derived as u_{d,k} = g^{−1}(x_{d,k})[x_{d,k+1} − f(x_{d,k})]. From equation (11) of Section 2.2, the quadratic cost function of the tracking errors ε_k is defined over the control sequence u_{ε,0} = (u_{ε,0}, u_{ε,1}, . . .) as I(ε_0, u_{ε,0}) = Σ_{k=0}^{∞} [Q(ε_k) + R(u_{ε,k})], where Q(·) and R(·) are positive definite functions.
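The feedforward control of Remark 1 can be sketched directly: for a DT affine system x_{k+1} = f(x_k) + g(x_k)u_k with invertible g(x_{d,k}), solve g(x_{d,k})u = x_{d,k+1} − f(x_{d,k}). The f and g used in the test are hypothetical toys, not the PWR dynamics.

```python
import numpy as np

def desired_control(f, g, x_d_k, x_d_next):
    """Solve g(x_d_k) u = x_d_next - f(x_d_k) for the desired control u_{d,k}."""
    return np.linalg.solve(g(x_d_k), x_d_next - f(x_d_k))
```

For a constant reference, this control exactly compensates the drift so the reference is an equilibrium of the closed loop.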
To obtain an optimal tracking control law that tracks the reference state x_{d,k} and minimizes the tracking error cost function (18), we redefine the optimal tracking error cost function as I*(ε_k) = min_{u_{ε,k}} I(ε_k, u_{ε,k}). Based on Bellman's principle of optimality, I*(ε_k) satisfies the Bellman equation I*(ε_k) = min_{u_{ε,k}} [U(ε_k, u_{ε,k}) + I*(ε_{k+1})]. Then, the optimal tracking control law is obtained by minimizing the right-hand side. Given the above formulation, we can derive the tracking error performance index function as follows. Let φ(ε_k) be a positive definite function for ε_k ∈ R^5, and let the initial tracking value function be I_0(ε_k) = φ(ε_k). The initial control law v_0(ε_k) can be obtained by v_0(ε_k) = arg min_{u_{ε,k}} [U(ε_k, u_{ε,k}) + I_0(ε_{k+1})]. For i = 1, 2, 3, . . ., in this iterative value function algorithm, the value function is updated through I_i(ε_k) = min_{u_{ε,k}} [U(ε_k, u_{ε,k}) + I_{i−1}(ε_{k+1})], and the control policy is improved by π_i(ε_k) = arg min_{u_{ε,k}} [U(ε_k, u_{ε,k}) + I_i(ε_{k+1})].
Theorem 1. For the tracking error cost function I_i(ε_k) and control law π_i(ε_k) obtained by (22)-(25), suppose there exist α, β, γ, and η satisfying 0 < η ≤ γ < ∞ and 0 ≤ α ≤ β < 1, respectively, such that the associated boundedness conditions on I* and I_0 are satisfied uniformly; then, the iterative value function I_i(ε_k) converges toward the optimal cost function I*(ε_k).
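The value update and policy improvement described above can be sketched numerically. The following minimal example runs the iteration on a scalar toy error system eps' = a·eps + b·u over a state grid; the grid sizes and the values of a, b, Q, and R are illustrative choices, not the paper's, and the PWR error dynamics would replace the toy system.

```python
import numpy as np

A, B, Q, R = 0.95, 0.1, 1.0, 0.1            # toy error dynamics and cost weights
eps_grid = np.linspace(-1.0, 1.0, 201)       # discretized error state
u_grid = np.linspace(-2.0, 2.0, 81)          # discretized control candidates

def q_values(I):
    """Stage cost plus interpolated next-step value for every (eps, u) pair."""
    eps_next = A * eps_grid[:, None] + B * u_grid[None, :]
    cost = Q * eps_grid[:, None] ** 2 + R * u_grid[None, :] ** 2
    return cost + np.interp(eps_next, eps_grid, I)

def value_iteration(n_iters=300):
    I = eps_grid ** 2                        # I_0 = phi(eps), positive definite
    for _ in range(n_iters):
        I = np.min(q_values(I), axis=1)      # value update
    policy = u_grid[np.argmin(q_values(I), axis=1)]  # policy improvement
    return I, policy
```

The iterated value function stays zero at the origin and the resulting policy pushes positive errors down and negative errors up, matching the admissibility discussion that follows.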
According to the Lyapunov stability principle, I_i(ε_k) is a Lyapunov function. Since the utility function U(ε_k, π_k) is positive definite and I_i(0) = 0, it should be noted that I_i(ε_k) is a positive definite function as well. We let the error tracking control law π_k be admissible.

Actor-Critic NN Implementation of the Value Iteration Algorithm for DT Nonlinear Systems.
The actor-critic NN has been employed in various fields to approximate the cost function and the optimal controller. For example, optimal tracking was applied to partially unknown DT nonlinear systems in [31], where rigorous proof for the method is provided. The actor-critic structure also performs well in the tracking control problem for continuous-time nonlinear systems [32, 33].
With regard to optimal tracking control algorithms for DT nonlinear systems, it is quite difficult to directly obtain solutions by solving (13).
Thus, the actor-critic network structure, together with the flowchart of the nuclear power system, is given in Figure 1, which describes the inner procedures of the method.
While the tracking error system is fed with a specific initial state and desired trajectory, the error is evaluated by the utility function. Simultaneously, the critic network is trained to minimize this quantity. Through the training process, the actor network comes to behave like an optimal controller. To avoid overfitting, a specified threshold is given at the beginning of the training procedure so that training can be stopped in time.
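The threshold-based early stopping described above can be sketched as a simple loop. Here `train_step` is a hypothetical stand-in for one critic/actor update returning its current loss; the threshold value is illustrative.

```python
def train_until_converged(train_step, threshold=1e-6, max_iters=10_000):
    """Run training steps until the loss change drops below the threshold."""
    prev_loss = float("inf")
    for i in range(max_iters):
        loss = train_step(i)
        if abs(prev_loss - loss) < threshold:  # change below threshold: converged
            return i, loss                     # stop training in time
        prev_loss = loss
    return max_iters, loss
```

Stopping on the change in loss, rather than its absolute value, matches the idea of halting once the performance index has settled.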
Inspired by Abu-Khalaf [34], an optimal control algorithm with a high-order polynomial was proposed to substitute for the neural unit. This technique is introduced into the actor-critic NN structure to obtain a better approximation effect.
To solve (24) and (25), we let the tracking error cost function V_i be approximated by a critic NN: V_i(ε) = W_{Vi}^T Θ(ε), where Θ(ε) ≜ [Θ_1(ε_k), Θ_2(ε_k), . . . , Θ_L(ε_k)]^T is the vector of activation functions, W_{Vi} = [w_c^1, w_c^2, . . . , w_c^L]^T is the weight vector, and L is the number of neural units in the hidden layer of the critic NN. Then, the iterative formulation can be obtained accordingly. For each sample ε_k related to x_k, we formulate the regression pair Z_i = Θ(ε) and ζ_i = U(ε_k, u_{ε,k}) + W_{Vi}^T Θ(ε_{k+1}), from which the weights of the critic network are obtained. We let u_i(ε) be approximated by an actor NN: u_i(ε) = W_{ui}^T δ(ε), where δ(ε) ≜ [δ_1(ε_k), δ_2(ε_k), . . . , δ_M(ε_k)]^T is the activation function vector, W_{ui} = [w_a^1, w_a^2, . . . , w_a^M] is the weight vector, and M is the number of neural units in the actor NN.
According to (24) and (30), we tune the weights of the critic NN at each iteration of the VI algorithm; the goal is to minimize the residual error between successive V_i(ε) to obtain a new target function. Similarly, the actor NN is applied to evaluate and approximate the optimal tracking control policy. We tune the weights of the actor NN to solve (20) at each iteration of the VI algorithm. According to u_i(ε_k, W_{ui}) from (33), we can rewrite (14) with ε_{k+1} = f_ε(ε_k) + g_ε(ε_k)u_i(ε_k, δ), where δ is updated by the same method as the weights of the critic NN. To obtain the approximate weights of the actor-critic NNs, the least squares (LS) method is utilized.
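The least-squares weight solve for a polynomial critic of the form V_i(ε) = W^T Θ(ε) can be sketched as follows. The even-power basis below is a hypothetical choice for a scalar error; in practice, the targets ζ_k would come from the value update U(ε_k, u_{ε,k}) + V_{i−1}(ε_{k+1}).

```python
import numpy as np

def theta(eps):
    """Polynomial activation vector Theta(eps) for a scalar error (toy basis)."""
    return np.array([eps ** 2, eps ** 4, eps ** 6])

def fit_critic_weights(eps_samples, zeta_targets):
    """Solve min_W sum_k (W^T Theta(eps_k) - zeta_k)^2 via least squares."""
    Z = np.stack([theta(e) for e in eps_samples])   # one regression row per sample
    W, *_ = np.linalg.lstsq(Z, np.asarray(zeta_targets), rcond=None)
    return W
```

Since the basis is polynomial, a target that is itself a polynomial in the basis span is recovered exactly, which is the property the higher-order polynomial substitution exploits.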

Numerical Simulations
In this section, numerical results are given to demonstrate the validity of the value iteration optimal tracking method. Experimental simulations of the performance index and the weights of the actor-critic NNs are provided. The developed method is an offline policy starting from an initial random control policy.

Actor-Critic NN Implementation of the Value Iteration Algorithm.
It is generally known that NNs can be leveraged to approximate arbitrary functions on prescribed compact sets. We choose an error compact set on which to train the actor-critic NN to obtain an offline tracking policy. As predefined in Section 3, the critic NN is approximated as I(ε) = W_c^T ϕ(ε) with 15 neurons (L = 15). At the beginning of the algorithm execution, the matrices are set as Q = 10I_{5×5} and R = 0.1I_{1×1}, where I is the identity matrix. The tracking error compact sets are randomly set as the difference between the initial state and the desired trajectory. We chose the control period as 1 s. As shown in Figure 2, the tracking error performance index function is monotonically nondecreasing and converges to I*(ε), which is consistent with our analysis above.
Additionally, when the tracking error performance index function converges, the training process of the actor-critic NN stops. As shown in Figures 3 and 4, the weights of the actor-critic NN converge to a steady solution, which implies that a good approximation of the optimal tracking controller is obtained.

Application to the PWR Nuclear Power System.
In this section, we apply the calculated control law to the DT nonlinear PWR power model, and the implementation results are shown in Figures 5-10.
Generally, the optimal tracking control of PWR power plants focuses on power-level adjustment, but there is strong coupling among the different states in this nuclear power model. Thus, it is necessary to track all states in the PWR power model, and the tracking objectives should guarantee the stability of each state and the safety of nuclear plants.
As shown in Figures 5-9, we apply a 20% step increase signal to the PWR power system. These five figures demonstrate that the five states catch the desired trajectories in less than 50 s. We use a fifth-order DT nonlinear PWR power system in this study, which implies that xenon poisoning is not considered. Although the PWR power model is quite simplified, all states of this model are difficult to track.
As shown in Figure 5, the power level tracks the desired states without overshoot or oscillation. The average temperatures of the reactor core and coolant outlet approximate the desired temperatures; compared with the reference temperatures, the maximum deviations are 0.031°C and 0.002°C, respectively. In addition, the reactivity of the control rod tracks the desired curve in less than 40 s. Based on the aforementioned results, we can also see that the tracking method applied to this nuclear system exhibits no steady-state error and a short regulation time. As shown in Figure 10, the tracking errors progressively converge to 0 under the proposed value iteration optimal tracking method.

Conclusion
It is well known that optimal tracking power-level control for DT nonlinear nuclear power systems is crucial for both regular operation and safety, and manual control is inefficient. However, the intrinsic nonlinearity and the parameters that vary with the states cause difficulties in power-level control and give rise to tracking issues.
In this study, a value iteration-based actor-critic NN algorithm is designed to obtain an optimal tracking control policy for a DT nonlinear nuclear power plant. The tracking control problem is formulated as an HJB equation, which the proposed algorithm solves approximately. The proposed algorithm performs well in tracking the states, as shown in the simulation results, and can also swiftly calculate the optimal control law.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.