Optimal Control of Complex Systems Based on Improved Dual Heuristic Dynamic Programming Algorithm

When applied to the data modeling and optimal control of complex systems, the dual heuristic dynamic programming (DHP) technique based on the BP neural network algorithm (BP-DHP) suffers from low prediction accuracy, slow convergence, and poor stability. In this paper, a DHP technique based on the Extreme Learning Machine (ELM) algorithm (ELM-DHP) is proposed. By constructing three kinds of network structures, the paper gives the detailed realization process of the DHP technique with ELM. A controller designed with the ELM-DHP algorithm is used to control a molecular distillation system with complex features such as multivariability, strong coupling, and nonlinearity. Finally, the effectiveness of the algorithm is verified by a simulation that compares the HDP and DHP algorithms based on ELM and on the BP neural network. The algorithm can also be applied to the data modeling and optimal control problems of similar complex systems.


Introduction
As controlled objects grow in dynamic complexity and find ever wider application, the dynamic models of some of them can still be obtained by mathematical methods. Others, such as robot systems and chemical production processes, are too complex for an accurate mathematical model to be established. Even when a mathematical model of a complex system has been established, it is usually a high-order, nonlinear, time-varying differential equation that cannot describe the system accurately and is difficult to analyze, and hence the unknown model must be learned from observational data [1]. Owing to their fault tolerance, self-adaptation, self-organization, and learning and memory abilities, neural networks provide a new method for modeling complex systems [2][3][4][5]. However, BP networks, RBF networks, and SVMs have defects such as slow convergence, which leave a large gap between their approximation ability and the actual demands of complex systems [6,7]. To solve this problem, Huang proposed a training method for single-hidden-layer feedforward networks (SLFNs), the Extreme Learning Machine (ELM), in 2006 [8]. The ELM assigns the input weights and hidden layer thresholds randomly and then calculates the output weights by the regularization principle, and it can still approximate any continuous system [6,7]. It has been proved that randomly assigning the hidden layer node parameters of the SLFN does not affect its convergence ability, and it makes ELM learning up to thousands of times faster than the traditional BP network and SVM.
Dynamic programming has long been used to solve optimal control problems, but it suffers from the "curse of dimensionality," so an optimal solution often cannot be obtained in practice [9]. In 1977, Werbos proposed the concepts of heuristic dynamic programming (HDP) and dual heuristic programming (DHP) and introduced approximate dynamic programming (also called adaptive dynamic programming, ADP) to overcome the "curse of dimensionality" [10]. Werbos defines "intelligence" as the brain's ability to learn to maximize a utility function in a complex, unknown, nonlinear environment [11]. ADP is a general scheme for learning approximately optimal action strategies. Therefore, ADP can be regarded as a key method for designing brain-like intelligent systems. Lewis and others summarized the basic principle, realization structures, and current development of the ADP method, gave an outlook on the research, and pointed out that ADP is an effective data-driven method [12][13][14][15]. ADP can realize the optimal control of a nonlinear system without a mathematical model of the system by using neural networks, driven by online data and control information, to approximate the performance index function and the optimal control law [16]. To develop neural dynamic programming results and relax the requirements on the system dynamics, Zhong et al. proposed a new goal representation ADP structure for the online optimal control of nonlinear systems. However, because that structure is based on HDP, the network cannot directly output the derivative of the cost function, and its control performance leaves room for improvement [17]. In fact, some studies show that, within the ADP family, DHP and GDHP can achieve better control performance than HDP to some extent [18,19].
In general, research on ADP-based optimal control of nonlinear systems using traditional neural networks has made great progress, but ADP still suffers from slow response and poor stability. In this paper, the ELM algorithm assigns the input weights and thresholds randomly to improve the response speed of the DHP algorithm, and it calculates the output weights by the regularization principle to improve the stability of the DHP algorithm. To verify the validity of the algorithm, an ELM-DHP controller is designed to control a molecular distillation system characterized by multivariability, nonlinearity, strong coupling, and large delay.

Algorithm Principle
2.1. DHP Algorithm Principle. The discrete-time nonlinear dynamic system is described as follows:

$$x(k+1) = F[x(k), u(k)], \quad k = 0, 1, 2, \ldots \tag{1}$$

In formula (1), $x(k) \in \mathbb{R}^{n}$ represents the state vector of the system, $u(k) \in \mathbb{R}^{m}$ represents the control variables, and $F$ represents the system function.
The performance index function (also called the cost function) corresponding to the system is

$$J[x(i), i] = \sum_{k=i}^{\infty} \gamma^{k-i}\, U[x(k), u(k)], \tag{2}$$

where $U$ is the utility function, $\gamma$ is the discount factor ($0 < \gamma \le 1$), $J$ is the cost function of state $x(i)$, and $J$ depends on the initial time $i$ and the initial state $x(i)$. For DHP, the purpose of dynamic programming is to select a control sequence $u(k)$, $k = i, i+1, \ldots$, which minimizes the function $\partial J[x(k)]/\partial x(k)$.
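As a small illustration of the performance index, the following sketch evaluates a finite-horizon truncation of the discounted sum of utilities along a given trajectory. The quadratic utility, the trajectory values, and the names `discounted_cost` and `quadratic_utility` are assumptions made for the example and are not taken from the paper.

```python
import numpy as np


def discounted_cost(states, controls, utility, gamma=0.9):
    """Finite-horizon truncation of J = sum_k gamma^(k-i) * U[x(k), u(k)]."""
    cost = 0.0
    for step, (x, u) in enumerate(zip(states, controls)):
        cost += gamma ** step * utility(x, u)
    return cost


def quadratic_utility(x, u):
    # Hypothetical utility U = x^T x + u^T u (an assumption for this sketch).
    return float(np.dot(x, x) + np.dot(u, u))


# A short, made-up trajectory starting at the initial time i.
states = [np.array([1.0, 0.0]), np.array([0.5, 0.0]), np.array([0.0, 0.0])]
controls = [np.array([1.0]), np.array([0.0]), np.array([0.0])]

total = discounted_cost(states, controls, quadratic_utility, gamma=0.5)
print(total)  # 2.0 + 0.5 * 0.25 + 0.25 * 0.0 = 2.125
```

With a discount factor below 1, later utilities contribute progressively less, which is why a finite truncation can approximate the infinite sum for a trajectory that settles.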
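The DHP structure is shown in Figure 1. It contains three neural networks: the model network, the critic network, and the action network. Because neural networks are universal approximators, the model network can be used to model unknown or complex nonlinear systems, which makes the DHP method widely applicable. The input of the critic network is the state variable. Its output is an approximation of the derivative of the performance index function $J$ with respect to the state $x$, which is also known as the costate. The action network, also known as the "Actor," represents the mapping between the system state variables and the control variables [20][21][22].
The costate $\partial J[x(k)]/\partial x(k)$ is obtained by iterating on the derivatives of the performance index function and the utility function with respect to the state:

$$\frac{\partial J[x(k)]}{\partial x(k)} = \frac{\partial U[x(k), u[x(k)]]}{\partial x(k)} + \gamma\, \frac{\partial J[x(k+1)]}{\partial x(k)}. \tag{3}$$

In (3), $u[x(k)]$ is a feedback control variable, and the costates $\partial J[x(k)]/\partial x(k)$ and $\partial J[x(k+1)]/\partial x(k)$ are obtained from the outputs of the critic network. If the weight of the critic network is set to $\omega$, the right-hand side of formula (3) is set to

$$d[x(k), \omega] = \frac{\partial U[x(k), u[x(k)]]}{\partial x(k)} + \gamma\, \frac{\partial \hat{J}[x(k+1), \omega]}{\partial x(k)}. \tag{4}$$

At the same time, the left-hand side of formula (3) can be written as $\partial \hat{J}[x(k), \omega]/\partial x(k)$. The weights $\omega$ of the critic network are adjusted to minimize the mean-square error:

$$\omega^{*} = \arg\min_{\omega} \left\| \frac{\partial \hat{J}[x(k), \omega]}{\partial x(k)} - d[x(k), \omega] \right\|^{2}. \tag{5}$$

So, the optimal control quantity is obtained:

$$u^{*}(k) = \arg\min_{u(k)} \left\| \frac{\partial J[x(k)]}{\partial x(k)} - \frac{\partial U[x(k), u(k)]}{\partial x(k)} - \gamma \left[\frac{\partial x(k+1)}{\partial x(k)}\right]^{T} \frac{\partial J^{*}[x(k+1)]}{\partial x(k+1)} \right\|. \tag{7}$$

In formula (7), $\partial J^{*}[x(k+1)]/\partial x(k+1)$ is the optimal costate, satisfying formula (5).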
From (1) to (7), we can conclude that the optimal control quantity of the DHP method is obtained directly from the costate. Compared with the HDP method, which obtains the optimal control from the relationship between the critic network weights ($\omega$) and the network input and output, the DHP method requires more computation but achieves a better control effect [9].
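To make the costate iteration concrete, the following sketch applies it to a deliberately simple case: a scalar linear system under a fixed linear feedback with a quadratic utility, where the costate is linear in the state and a closed-form answer exists for comparison. The system, the gain, the utility weights, and the linear critic $\lambda(x) = c\,x$ are all assumptions for illustration. The sketch only evaluates the costate of a given feedback law; it does not perform the control optimization.

```python
# All numbers below are illustrative assumptions, not values from the paper.
A_SYS, B_SYS = 0.9, 0.5      # hypothetical scalar system x(k+1) = a*x(k) + b*u(k)
K_GAIN = 0.4                 # hypothetical fixed feedback u(k) = -K*x(k)
Q, R, GAMMA = 1.0, 0.1, 0.9  # utility U = q*x^2 + r*u^2 and discount factor


def costate_target(x, critic_slope):
    """Right-hand side of the costate recursion for a linear critic lambda(x) = c*x."""
    u = -K_GAIN * x
    x_next = A_SYS * x + B_SYS * u
    # Total derivative of U with respect to x, including the path through u(x).
    dU_dx = 2.0 * Q * x + 2.0 * R * u * (-K_GAIN)
    # Total derivative of x(k+1) with respect to x(k) under the feedback law.
    dxnext_dx = A_SYS - B_SYS * K_GAIN
    lam_next = critic_slope * x_next
    return dU_dx + GAMMA * dxnext_dx * lam_next


def fit_critic_slope(iterations=200):
    """Iterate the recursion at x = 1, where the target equals the updated slope."""
    slope = 0.0
    for _ in range(iterations):
        slope = costate_target(1.0, slope)
    return slope


slope = fit_critic_slope()
closed_form = 2.0 * (Q + R * K_GAIN**2) / (1.0 - GAMMA * (A_SYS - B_SYS * K_GAIN) ** 2)
print(slope, closed_form)
```

The iteration converges because the discounted closed-loop factor $\gamma (a - bK)^2$ is below 1 for these assumed values.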

ELM Algorithm Principle.
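For a standard SLFN with $L$ hidden layer neurons learning $N$ arbitrary distinct samples $(x_j, t_j)$, where $x_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^{T} \in \mathbb{R}^{n}$ and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^{T} \in \mathbb{R}^{m}$, and with activation function $g(\cdot)$, the network is mathematically modeled as [23]

$$\sum_{i=1}^{L} \beta_i\, g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N, \tag{8}$$

where $o_j = [o_{j1}, o_{j2}, \ldots, o_{jm}]^{T} \in \mathbb{R}^{m}$ is the model output of the network, $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^{T}$ is the input weight vector between the input layer neurons and the $i$th hidden layer neuron, $\beta_i$ is the output weight vector between the $i$th hidden layer neuron and the output layer neurons, $b_i$ is the threshold of the $i$th neuron in the hidden layer, and $w_i \cdot x_j$ is the inner product of $w_i$ and $x_j$.
The learning objective of the SLFN is to minimize the output error, which can be expressed as

$$\sum_{j=1}^{N} \| o_j - t_j \| = 0. \tag{9}$$

That is, there exist $\beta_i$, $w_i$, and $b_i$ such that

$$\sum_{i=1}^{L} \beta_i\, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N. \tag{10}$$

So, (10) can be written as

$$H\beta = T, \tag{11}$$

where $H$ is the hidden layer output matrix of ELM, whose entry in row $j$ and column $i$ is $g(w_i \cdot x_j + b_i)$, $\beta = [\beta_1, \ldots, \beta_L]^{T}$, and $T = [t_1, \ldots, t_N]^{T}$. So, the training of ELM is equivalent to finding the least-squares solution $\hat{\beta}$ of the linear system $H\beta = T$:

$$\| H\hat{\beta} - T \| = \min_{\beta} \| H\beta - T \|. \tag{12}$$

Expression (12) is equivalent to minimizing the loss function

$$E = \sum_{j=1}^{N} \Big\| \sum_{i=1}^{L} \beta_i\, g(w_i \cdot x_j + b_i) - t_j \Big\|^{2}. \tag{13}$$

Huang et al. [23] proved that the minimum-norm least-squares solution of the linear system satisfies the following.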
(1) Minimum Training Error. The special solution $\hat{\beta} = H^{\dagger} T$ is one of the least-squares solutions of the general linear system $H\beta = T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$.
(2) Smallest Norm of Weights and Best Generalization Capability. Further, the special solution $\hat{\beta} = H^{\dagger} T$ has the smallest norm among all of the least-squares solutions of $H\beta = T$: $\|\hat{\beta}\| = \|H^{\dagger} T\| \le \|\beta\|$, $\forall \beta \in \{\beta : \|H\beta - T\| \le \|Hz - T\|, \forall z \in \mathbb{R}^{L \times m}\}$. The generalization ability of an SLFN depends on the size of its weights rather than on the number of its parameters [24]: the smaller the weights, the stronger the generalization ability.
(3) Special Solution. The minimum-norm least-squares solution of $H\beta = T$ is unique, and it is $\hat{\beta} = H^{\dagger} T$.
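The ELM procedure described in this subsection can be sketched in a few lines of NumPy: draw the input weights and thresholds at random, form the hidden layer output matrix, and obtain the output weights from the generalized inverse. This is a minimal sketch under assumed settings (a `tanh` activation, weights drawn uniformly from [-1, 1], a made-up toy target, and the hypothetical helper names `train_elm` and `predict_elm`), not the authors' implementation.

```python
import numpy as np


def train_elm(X, T, n_hidden, rng):
    """Random hidden layer, then output weights from the generalized inverse."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random thresholds
    H = np.tanh(X @ W + b)                                   # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                             # special solution H^dagger T
    return W, b, beta


def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = np.sin(np.pi * X[:, :1]) * X[:, 1:]      # hypothetical toy target, shape (200, 1)

W, b, beta = train_elm(X, T, n_hidden=20, rng=rng)
H = np.tanh(X @ W + b)
residual = np.linalg.norm(H @ beta - T)
perturbed = np.linalg.norm(H @ (beta + 0.01) - T)  # any other weights fit no better
print(residual, perturbed)
```

Only `beta` is computed from data; `W` and `b` are never updated, which is the source of the speed advantage claimed for ELM over gradient-based BP training.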

Proof of the Stability of ELM-DHP.
The stability of the ELM-DHP algorithm is proved here in the sense that the output error of the system is 0. The discrete nonlinear system is controlled by the ELM-DHP algorithm, and the three networks of the ELM-DHP algorithm are all based on the ELM implementation. Therefore, it only needs to be proved that ELM can approximate the discrete nonlinear system with zero error.
ELM generally learns from a large number of samples, and the number $L$ of neurons in the hidden layer is far smaller than the number $N$ of samples, $L \ll N$. So, we only need to prove that the learning error of ELM can be driven to 0 when $L \le N$. Huang et al. [7,23,25] proved in detail that an SLFN with $L$ hidden neurons can approximate any $N$ arbitrary distinct samples $(x_j, t_j)$ with arbitrarily small error; that is, for any $\varepsilon > 0$ there exists $L \le N$ such that

$$\| H\beta - T \| < \varepsilon. \tag{14}$$

The work above shows that the learning error of ELM can be made arbitrarily close to 0, which is the sense in which the ELM-DHP algorithm is stable.
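The approximation result can be checked numerically on a tiny example: when the number of hidden neurons equals the number of distinct samples, the hidden layer output matrix is square and, for randomly drawn parameters, invertible with probability 1, so the training error vanishes up to rounding. The sample count, weight ranges, and the helper name `elm_training_error` are assumptions chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8                                    # number of distinct samples
X = rng.uniform(-1.0, 1.0, size=(N, 2))
T = rng.uniform(-1.0, 1.0, size=(N, 1))  # arbitrary targets


def elm_training_error(n_hidden):
    """Largest absolute training error of an ELM with n_hidden random neurons."""
    W = rng.uniform(-4.0, 4.0, size=(2, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W + b)
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)
    return float(np.max(np.abs(H @ beta - T)))


error_few = elm_training_error(3)    # far fewer neurons than samples: a residual remains
error_full = elm_training_error(N)   # as many neurons as samples: H is square
print(error_few, error_full)
```

This only illustrates interpolation of the training samples. It says nothing about the error on unseen data, which is why the hidden layer is normally kept much smaller than the sample count.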

Implementation of the ELM-DHP Algorithm
The ELM-DHP algorithm includes three networks: the model network, the critic network, and the action network. In all three networks, the hidden layer uses the bipolar sigmoid activation function and the output layer uses the linear (purelin) function. The realization process of the ELM-DHP algorithm is presented for a discrete-time nonlinear dynamic system with an $n$-dimensional state vector and an $m$-dimensional control vector.

Network Model.
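The model network adopts an $(n+m)$-$L_m$-$n$ structure. The $n+m$ inputs are the $n$ components of the state vector $x(k)$ at time $k$ and the $m$ components of the control vector $\hat{u}(k)$ that the action network outputs for the state $x(k)$. The $n$ outputs are the $n$ components of the prediction $\hat{x}(k+1)$ of the state vector $x(k+1)$ at time $k+1$. The model network has $L_m$ hidden layer neurons. The structure of the model network is shown in Figure 2.
The model network is trained offline, and the calculation process is as follows.
The weight matrix $W_{m1}$ from the input layer to the hidden layer and the hidden layer threshold matrix $B = [b_1, b_2, \ldots, b_{L_m}]$ are randomly generated. Define the input vector $M(k)$ and the expected output vector $x(k+1)$ of the model network at time $k$:

$$M(k) = [x_1(k), \ldots, x_n(k), \hat{u}_1(k), \ldots, \hat{u}_m(k)], \quad x(k+1) = [x_1(k+1), \ldots, x_n(k+1)]. \tag{15}$$

Calculate the output matrix $m_{h2}(k)$ of the hidden layer in the model network:

$$m_{h1}(k) = M(k)\, W_{m1} + B, \quad m_{h2j}(k) = \frac{1 - e^{-m_{h1j}(k)}}{1 + e^{-m_{h1j}(k)}}, \quad j = 1, \ldots, L_m, \tag{16}$$

where $m_{h1j}(k)$ is the input of the $j$th node in the model network hidden layer, $m_{h2j}(k)$ is the output of the $j$th node in the model network hidden layer, and $m_{h2}(k) = [m_{h21}(k), \ldots, m_{h2L_m}(k)]$. Calculate the weights $W_{m2}(k)$ from the hidden layer to the output layer, which give the network output

$$\hat{x}(k+1) = m_{h2}(k)\, W_{m2}(k). \tag{17}$$

According to the idea of the ELM, the error is minimized as

$$\| E_m \| = \sum_{j=1}^{n} \| \hat{x}_j(k+1) - x_j(k+1) \|. \tag{18}$$

In equality (18), $x_j(k+1)$ is the expected output of the $j$th output layer neuron of the model network.
Finding $W_{m2}(k)$ is equivalent to finding the least-squares solution $\hat{W}_{m2}(k)$ of the linear system $m_{h2}(k) \times W_{m2}(k) = x(k+1)$:

$$\| m_{h2}(k)\, \hat{W}_{m2}(k) - x(k+1) \| = \min_{W_{m2}} \| m_{h2}(k)\, W_{m2}(k) - x(k+1) \|. \tag{19}$$

The special solution $\hat{W}_{m2}(k)$ of the weight matrix between the hidden layer and the output layer in the model network is as follows:

$$\hat{W}_{m2}(k) = m_{h2}^{\dagger}(k)\, x(k+1), \tag{20}$$

where $m_{h2}^{\dagger}(k)$ is the generalized inverse matrix of $m_{h2}(k)$ at time $k$.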
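The offline training of the model network can be sketched as follows. The sketch stacks state and control into the network input, draws the first-layer weights and thresholds at random, applies the bipolar sigmoid, and solves for the output weights with the generalized inverse. The plant `toy_plant`, the network size, the weight range, and the 600/150 split of a 750-sample data set are assumptions for illustration; they stand in for the molecular distillation data, which is not available here.

```python
import numpy as np


def bipolar_sigmoid(z):
    # (1 - exp(-z)) / (1 + exp(-z)) equals tanh(z / 2).
    return np.tanh(z / 2.0)


def toy_plant(x, u):
    """Hypothetical two-state, one-input plant standing in for the real system."""
    x1 = 0.9 * x[:, 0] + 0.1 * x[:, 1]
    x2 = 0.8 * x[:, 1] + 0.2 * np.sin(u[:, 0])
    return np.column_stack([x1, x2])


rng = np.random.default_rng(2)
n, m, hidden = 2, 1, 20                       # an (n+m)-hidden-n structure
x = rng.uniform(-1.0, 1.0, size=(750, n))
u = rng.uniform(-1.0, 1.0, size=(750, m))
M = np.hstack([x, u])                         # model network input: state and control
x_next = toy_plant(x, u)                      # expected output: next state

W1 = rng.uniform(-1.0, 1.0, size=(n + m, hidden))   # random, never trained
B = rng.uniform(-1.0, 1.0, size=hidden)             # random thresholds

train, test = slice(0, 600), slice(600, 750)  # assumed split: 600 training, 150 test
m_h2 = bipolar_sigmoid(M[train] @ W1 + B)     # hidden layer output matrix
W2 = np.linalg.pinv(m_h2) @ x_next[train]     # special solution via generalized inverse

prediction = bipolar_sigmoid(M[test] @ W1 + B) @ W2
test_rmse = float(np.sqrt(np.mean((prediction - x_next[test]) ** 2)))
print(test_rmse)
```

After this step `W1`, `B`, and `W2` stay fixed, which matches the strategy of training the model network once offline and then reusing it while the critic and action networks are trained.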

Critic Network.
The critic network has an $n$-$L_c$-$n$ structure. The $n$ inputs are the $n$ components of the state vector $x(k)$, and the output is the estimate of the costate $\hat{\lambda}(k) = \partial \hat{J}(k)/\partial x(k)$, where $\hat{J}(k) = \gamma \hat{J}(k+1) + U(k)$. $L_c$ is the number of hidden layer neurons in the critic network. In the critic network, the weight matrix from the input layer to the hidden layer, the weight matrix from the hidden layer to the output layer, and the hidden layer threshold matrix at time $k$ are denoted $W_{c1}(k)$, $W_{c2}(k)$, and $B_c(k)$, respectively. Figure 3 shows the structure of the critic network.
The critic network uses the least-squares method of ELM, whose forward calculation process is

$$c_{h1}(k) = x(k)\, W_{c1}(k) + B_c(k), \quad c_{h2j}(k) = \frac{1 - e^{-c_{h1j}(k)}}{1 + e^{-c_{h1j}(k)}}, \quad \hat{\lambda}(k) = c_{h2}(k)\, W_{c2}(k),$$

where $c_{h1j}$ is the input of the $j$th node in the critic network hidden layer, $c_{h2j}$ is the output of the $j$th node in the critic network hidden layer, $c_{h2}(k) = [c_{h21}, c_{h22}, \ldots, c_{h2L_c}]$, and $\hat{\lambda}(k)$ is the output of the critic network output layer. The inputs $x(k)$ of the critic network come from the output of the model network, and the outputs of the critic network are the costate function $\partial \hat{J}(k)/\partial x(k)$ in the DHP. Let $\lambda_d(k)$ denote the expected output of the critic network. The training error of the DHP critic network is minimized based on the idea of ELM:

$$e_c(k) = \lambda_d(k) - \hat{\lambda}(k), \quad \| E_c \| = \sum_{k} \| e_c(k) \|,$$

where $e_c(k)$ is the error of the critic network at time $k$ and $\| E_c \|$ is the error accumulated over all time points in the critic network.
According to the DHP structure and the definition of the expected outputs $\lambda_d(k)$ of the critic network, we can obtain

$$\lambda_d(k) = \frac{\partial^{+} U(k)}{\partial x(k)} + \gamma\, \frac{\partial^{+} \hat{J}(k+1)}{\partial x(k)}. \tag{24}$$

In formula (24), $\partial^{+} U(k)/\partial x(k)$ and $\partial^{+} \hat{J}(k+1)/\partial x(k)$ denote the derivatives of $U(k)$ and $\hat{J}(k+1)$ with respect to $x(k)$, taken as composite functions of $x(k)$.
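The derivative of a composite function that appears in the critic target can be expanded by the chain rule: the state influences the utility and the next state both directly and through the control. The sketch below writes this expansion with explicit Jacobians and checks it against a finite-difference gradient. The linear plant, feedback gain, quadratic utility, and quadratic stand-in for the next-step cost are assumptions used only to make the check possible; in the algorithm itself these Jacobians would come from the model and action networks.

```python
import numpy as np


def critic_target(dU_dx, dU_du, du_dx, dxn_dx, dxn_du, lam_next, gamma):
    """Expected critic output: total dU/dx plus gamma times total dJ(k+1)/dx."""
    total_dU = dU_dx + du_dx.T @ dU_du            # direct path plus path through u(k)
    total_dxn = dxn_dx + dxn_du @ du_dx           # total Jacobian of x(k+1) w.r.t. x(k)
    return total_dU + gamma * total_dxn.T @ lam_next


# Hypothetical linear plant, feedback law, and quadratic costs used only as a check.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.5]])
K = np.array([[0.2, 0.3]])
P = np.array([[2.0, 0.3], [0.3, 1.0]])
gamma = 0.9


def one_step_cost(x):
    """U(k) + gamma * J(k+1) with U = x^T x + u^T u and J(k+1) = x_next^T P x_next."""
    u = -K @ x
    x_next = A @ x + B @ u
    return x @ x + u @ u + gamma * x_next @ P @ x_next


x = np.array([1.0, -0.5])
u = -K @ x
x_next = A @ x + B @ u
lam_d = critic_target(2 * x, 2 * u, -K, A, B, 2 * P @ x_next, gamma)

# Central finite differences of the same scalar function, for comparison.
h = 1e-5
numeric = np.array([
    (one_step_cost(x + h * e) - one_step_cost(x - h * e)) / (2 * h)
    for e in np.eye(2)
])
print(lam_d, numeric)
```

The agreement between `lam_d` and `numeric` confirms that the two chain-rule paths, one direct and one through the control, are both needed for the target.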


Action Network.
The action network uses an $n$-$L_a$-$m$ structure. The $n$ inputs are the $n$ components of the state vector $x(k)$ of the system at time $k$. The $m$ outputs are the $m$ components of the control vector $u(k)$ corresponding to the input state vector $x(k)$. $L_a$ represents the number of neurons in the action network hidden layer. $W_{a1}$ and $W_{a2}$ are, respectively, the weight matrix from the input layer to the hidden layer and the weight matrix from the hidden layer to the output layer in the action network. $d(k) = [d_1(k), \ldots, d_{L_a}(k)]$ is the hidden layer threshold matrix of the action network. Figure 4 shows the structure of the action network. The calculation process of the action network is as follows:

$$a_{h1}(k) = x(k)\, W_{a1} + d(k), \quad a_{h2j}(k) = \frac{1 - e^{-a_{h1j}(k)}}{1 + e^{-a_{h1j}(k)}}, \quad \hat{u}(k) = a_{h2}(k)\, W_{a2},$$

where $a_{h1j}(k)$ is the input of the $j$th node and $a_{h2j}(k)$ is the output of the $j$th node in the action network hidden layer and $a_{h2}(k) = [a_{h21}, a_{h22}, \ldots, a_{h2L_a}]$. According to the idea of weight adjustment of ELM, the weight matrix $W_{a2}$ from the hidden layer to the output layer is obtained:

$$W_{a2} = a_{h2}^{\dagger}(k)\, u(k). \tag{30}$$

In (30), $a_{h2}^{\dagger}(k)$ is the generalized inverse matrix of $a_{h2}(k)$ and $u(k)$ is the expected output of the action network. The weights of the network can be corrected once $u(k)$ is available. The inverse sigmoidal function is defined as $\theta(\cdot)$ and is used in the calculation of $u(k)$ in (31)-(33). In (33), $u(k)$ is the first $m$ rows of the resulting matrix. So, $W_{a2}$ can be obtained from (30).
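The forward pass of the action network and the least-squares computation of its output weights in (30) can be sketched as follows. The procedure by which the paper derives the expected control $u(k)$ is not reproduced here, so the sketch substitutes placeholder targets from a hypothetical linear feedback. The 2-5-4 structure matches the one reported later for the simulation, while the weight range and all variable names are assumptions.

```python
import numpy as np


def bipolar_sigmoid(z):
    return np.tanh(z / 2.0)  # identical to (1 - exp(-z)) / (1 + exp(-z))


rng = np.random.default_rng(3)
n, hidden, m = 2, 5, 4                       # a 2-5-4 structure
W_a1 = rng.uniform(-1.0, 1.0, size=(n, hidden))   # random input weights
d = rng.uniform(-1.0, 1.0, size=hidden)           # random hidden thresholds

states = rng.uniform(-1.0, 1.0, size=(50, n))
# Placeholder targets from a hypothetical linear feedback, NOT the paper's procedure.
gain = rng.uniform(-1.0, 1.0, size=(n, m))
u_target = states @ gain

a_h2 = bipolar_sigmoid(states @ W_a1 + d)    # hidden layer outputs
W_a2 = np.linalg.pinv(a_h2) @ u_target       # output weights, as in (30)
u_hat = a_h2 @ W_a2                          # action network output
fit_gap = a_h2.T @ (u_hat - u_target)        # vanishes for a least-squares solution
print(W_a2.shape, float(np.abs(fit_gap).max()))
```

Because only `W_a2` is solved for, each correction of the action network reduces to one generalized-inverse computation rather than an iterative gradient descent.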

Training Strategy.
In this paper, the model network of the DHP algorithm is first trained offline to obtain its weight matrix.
Then, the action network and the critic network are trained simultaneously. The training strategy is as follows:
(1) First, the model network is trained offline and its weight matrix is obtained.
(2) Feeding $x(k)$ into the action network yields $u(k)$.
(7) Next, calculate and update the weights of the critic network.

Simulation Example Analysis.
Molecular distillation is also called short-path distillation. It exploits the difference between the mean free paths of light and heavy molecules that escape from the liquid surface once they have obtained enough energy, which achieves a nonequilibrium liquid-liquid separation under high vacuum conditions [26]. Molecular distillation has the advantages of low distillation temperature, short heating time, and high separation efficiency, and it is well suited to separating materials with a high boiling point, heat sensitivity, or high viscosity. This technology is widely used in the food, medicine, oil processing, and petrochemical industries [27][28][29]. Molecular distillation equipment can be divided into four types: stationary, falling film, wiped film (scraped film), and centrifugal [30]. At present, wiped film molecular distillation is the most widely used type in scientific research and industrial production. The evaporation effect of the molecular distillation system is related not only to equipment factors, such as the size and shape of the evaporator, the distance between the evaporation and condensation surfaces, and the manufacturing process, but also to process parameters, such as the internal pressure, the feed flow rate, the evaporator temperature, and the wiper motor speed [31]. To enhance the purification effect of molecular distillation, Wang et al. studied the head wave and found that it affects the separation efficiency [32]. Micov et al. studied the separation factors of the wiped film molecular distillation process and established a one-dimensional mathematical model [30]. Cvengros and Tkac used the DSMC method to establish a one-dimensional analytical equation for the movement velocity of micro units in distillation equipment and summarized the effects of evaporation temperature, distance, vacuum degree, and other related factors on the separation results [33]. Wu simulated the effects of temperature, pressure, and reflux ratio on yield and purity by using the central response surface method combined with thin film evaporation and rectification coupling technology [34]. Although much research has been done, many problems remain in the control of molecular distillation systems, which exhibit multivariability, nonlinearity, strong coupling, and large delay. Therefore, the effectiveness of the ELM-DHP algorithm is verified by controlling a wiped film molecular distillation system.
The current state variables of the molecular distillation system are determined by the state variables and the control variables of the previous stage. So, the distillation temperature, evaporation pressure, wiper motor speed, feeding speed, and the Schisandra yield and purity of the previous stage are used as the inputs of the ELM-DHP controller, and the current Schisandra yield and purity are used as its outputs.

Simulation Comparison.
The structures of the model network, critic network, and action network were set to 6-20-2, 2-14-2, and 2-5-4, respectively, through experiment. In the process of system identification, the weights of the three networks between the input layer and the hidden layer are selected in the range [−0.1, 0.1]. 600 groups of data were collected for training, and 150 groups of data were used as the test set. First, the model network is trained offline, with the least-squares solution taken as the weight matrix between the hidden layer and the output layer. Once the training of the model network is complete, its weights are kept unchanged. The results of the model network over 50 time steps are shown in Figures 5 and 6.
Figures 5 and 6 show that the predicted values of both the BP network and the ELM algorithm are in good agreement with the expected values. Figures 7 and 8 show that the maximum error of the BP network in state prediction is 0.4, whereas the maximum error of ELM in state prediction is about 0.06. Thus, it can be concluded that ELM has higher prediction accuracy and better generalization ability.
Parameter settings affect the convergence speed of the algorithm to a certain extent. After experiment, the discount factor was chosen as $\gamma = 0.9$. Next, the weights from the hidden layer to the output layer of the critic network and the action network are calculated. The training of the critic network and the action network is set to 150 steps with 100 training epochs for each step.
In addition, to compare with the HDP and DHP techniques based on the BP neural network, controllers were designed with the BP-HDP, BP-DHP, ELM-HDP, and ELM-DHP algorithms. The four controllers are each used to control the wiped film molecular distillation system, and the simulation results over 50 time steps are shown in Figures 9-14. Figures 9-12 show that the control quantities of the BP-HDP, BP-DHP, ELM-HDP, and ELM-DHP controllers reach stable control in 45, 35, 18, and 7 steps, respectively. Thus, it can be concluded that the HDP and DHP algorithms based on ELM achieve faster response. The controlled variables of the HDP controllers fluctuate more strongly before they settle, so it can be concluded that the DHP algorithm has higher stability. The results of Figures 13 and 14 are summarized in Table 1. The purification effect increases with yield and purity; the ideal purification effect is 100%, which cannot be achieved in practice. Table 1 shows that the optimal state quantities obtained by ELM-HDP and ELM-DHP are 5% higher than those of BP-HDP and BP-DHP, and the optimal state of ELM-DHP is slightly higher than that of ELM-HDP. The above analysis demonstrates the superiority and effectiveness of the ELM algorithm.
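The step counts quoted above depend on how "stable control" is judged, and the paper does not state its criterion. One plausible reading, sketched below, counts the first step after which a control signal stays within a relative tolerance band around its final value. The 2% band, the helper name `settling_step`, and the sample signal are assumptions for illustration only.

```python
import numpy as np


def settling_step(signal, tol=0.02):
    """First step index from which the signal stays within tol of its final value."""
    signal = np.asarray(signal, dtype=float)
    final = signal[-1]
    band = tol * max(abs(final), 1e-12)           # relative band around the final value
    outside = np.flatnonzero(np.abs(signal - final) > band)
    return 0 if outside.size == 0 else int(outside[-1]) + 1


# Made-up control trajectory with a small overshoot before it settles.
example = [0.0, 0.5, 0.9, 1.05, 1.0, 1.0, 1.0]
print(settling_step(example))  # the last sample outside the band is index 3, so 4
```

Applying one fixed criterion of this kind to all four controllers keeps the comparison of response speeds consistent, whatever tolerance is chosen.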

Summary
To address the problems of the BP-DHP algorithm, such as poor prediction accuracy, slow convergence speed, and poor stability, this paper studied the ELM-DHP algorithm and applied it to the data modeling and optimal control of a wiped film molecular distillation system, which has complex features such as multivariability, strong coupling, nonlinearity, and large time delay. The ELM-DHP controller was designed to control the molecular distillation system and was verified by simulation. In the comparison with the ELM-HDP, BP-HDP, and BP-DHP algorithms, the prediction accuracy of ELM is higher than that of the BP neural network, and the response speed and stability of the ELM-HDP and ELM-DHP algorithms are better than those of their BP-based counterparts, which shows the superiority of ELM. The response speed of ELM-DHP is more than twice that of the other algorithms, and the optimal state achieved by ELM-DHP is closer to the ideal result. Thus, the ELM-DHP algorithm is better than the BP-HDP, BP-DHP, and ELM-HDP algorithms. The ELM-DHP algorithm does not depend on a specific mechanism model and relies only on the relevant experimental data, so the algorithm can also solve the optimal control problem of similar complex systems.

Figure 1: The structural diagram of the DHP algorithm.

Figure 2: The structure of the model network.

Figure 3: The structure of the critic network.

Figure 4: The structure of the action network.

Figure 9: The molecular distillation temperature of optimum control quantity.

Figure 10: The variable feed rate of optimum control quantity.

Figure 12: The molecular distillation pressure of optimum control quantity.

Table 1: Optimal state of controller.