Research Article A Direct Reinforcement Learning Approach for Nonautonomous Thermoacoustic Generator

For nonautonomous nonlinear systems, the optimal control design is affected by the partial derivative terms. If a reinforcement learning (RL) strategy is developed to approximate the optimal control scheme for nonautonomous nonlinear systems, the closed-loop control system might become unstable. Therefore, in this article, a direct RL law for a nonautonomous thermoacoustic generator (TAG) is investigated. We establish the mathematical model of the TAG by partial differential equations (PDEs) and then transform it into a time varying nonlinear system. The direct RL technique with the Newton–Leibniz formula is implemented to handle the partial derivative term absent from the classical policy iteration (PI) method, by modifying the computation to use data collected between two sampling times. Finally, several simulation studies with comparisons are conducted to validate the theoretical analyses.


Introduction
As a key problem of the energy mission, the thermoacoustic generator (TAG) has attracted many scholars [1]. The transformation from a high-temperature source to a highly efficient heat engine process, as well as the process of converting heat to electricity, was investigated in [1]. However, the main shortcoming is that the control design problem of the TAG has not been much discussed. In actual engineering systems, optimal control is a remarkable technique, essential for balancing the tracking problem and performance. Regarding the theoretical problem, to achieve an optimal control design, one needs to solve the Hamilton-Jacobi-Bellman (HJB) equation for nonlinear systems or the Riccati equation for linear systems subject to a user-defined performance index [2]. However, due to the challenge of analytically solving these equations as well as the existence of uncertainties, adaptive reinforcement learning (RL)-based techniques have been employed to approximate the optimal control solution [2][3][4]. On the other hand, it should be noted that some different methods can be developed without the RL technique, such as the approach in [5] that solves robust LQR in the presence of constraints by Min-Max optimization.
As an important approach in modern control, the RL-based control method aims to minimize the performance index while achieving stability of the closed-loop system. Up to now, much important research has been carried out for linear systems using the Kronecker product [6] to conveniently compute the quadratic form, as well as for nonlinear systems by approximating the Bellman function with neural networks [7]. However, due to the time varying description of closed-loop systems, it is challenging to investigate RL-based control strategies for practical systems such as robots and generators. To overcome this challenge, the model transformation method [8] and the direct RL technique for nonautonomous systems [9][10][11] have been introduced. The technique of transforming closed-loop systems is implemented by considering the desired trajectory as new state variables and expressing the cost function in a different form [8]. After obtaining the autonomous system, the online actor-critic strategy was discussed in [8] by generating the adaptation law of the weights in the actor/critic neural networks, which approximate the optimal control and the Bellman function. Thanks to the property of the Hamiltonian, training is achieved by minimizing the square of the Hamiltonian term. The classical actor-critic scheme can also be extended to nonlinear continuous time-delayed dynamical systems [12] by adding a time delay-based integral term to the optimal value functional. This leads to a modification of the integral temporal difference error (ITDE) depending on the time delay [12]. However, the implementation of the traditional actor-critic strategy [8] usually requires a known model in the computation. For nonlinear systems containing unmodeled dynamics, Yang et al. [13] proposed a robust online actor-critic strategy with an additional robustifying term combined with a fuzzy logic systems-based approximator.
Likewise, the RL method has been developed for the general HJB problem with an additional variable [3], the HJI problem under disturbance influence [6, 14-16], and the modified Hamiltonian [7]. Unlike the traditional online actor/critic RL method with simultaneous tuning, sequential learning using the value iteration (VI) algorithm was discussed in [3], with the optimal function computed directly from the previous steps without solving a Lyapunov equation. Furthermore, it should be noted that the VI algorithm [3] does not require an admissible control in the first step, as the policy iteration (PI) algorithm does. Regarding the modified Hamiltonian [7], this method was extended to deal with input constraints by an equivalent map using a special function, and actor/critic learning was developed for the modified value function and the equivalent modified Hamiltonian. For perturbed systems in the presence of disturbance, an RL algorithm was developed under a generalized disturbance attenuation criterion [14]. A remarkable Q-learning approach was introduced to study completely uncertain systems using a two-variable optimal value function [17]. However, the Q-learning technique is only appropriate for linear systems with a quadratic-form Bellman function [17]. Recently, adaptive dynamic programming (ADP) control with an event-triggered mechanism (ETM) has been proposed for complicated systems, such as discrete-time boiler-turbine systems [18] and a roller kiln temperature field described by partial differential equations (PDEs) [19]. However, the proposed RL control is only implemented for time invariant systems without considering the effect of a time-dependent desired trajectory [18]. It can be concluded that the above results on RL control design rely on autonomous or time invariant models.
To deal with nonautonomous closed-loop systems using RL control, time varying references are either avoided or transformation techniques to autonomous systems are employed. There have been only very few studies considering a direct RL control solution for time varying systems [9][10][11], due to the existence of the term ∂V(x, t)/∂t. The authors in [9] improved the conventional policy iteration (PI) technique by adding the partial derivative with respect to time in each step. Because of the nonlinear property, two neural networks are utilized, an actor NN and a critic NN, with weights trained using the additional partial derivative with respect to the states [9]. However, the direct RL techniques in [9][10][11] are purely mathematical analyses. A different approach to direct RL control is the off-policy technique [20,21]. Owing to the property of keeping the input control fixed while computing the RL algorithm, the off-policy technique is able to address completely uncertain linear systems [20] and nonlinear systems [21]. Therefore, direct RL-based controllers for time varying closed-loop practical systems remain a challenging issue, which motivates us to study this problem in TAG systems.
Inspired by the above analysis, we investigate the application of a direct RL procedure in the control system of time varying TAG systems, which are described by PDEs. We use appropriate modifications to transform the PDEs of TAG systems into a time varying dynamic equation. The major contributions of this article are given as follows: (1) First, unlike [2,22], which realize the RL procedure for robotic systems described by ordinary differential equations, a time varying RL structure is presented to obtain the optimal control for time varying TAG systems expressed by PDEs. (2) Second, unlike existing research on TAG systems [1], an optimal control strategy is investigated, with the related comparison discussed through theoretical analysis and simulation studies.
The remainder of this article is organized as follows. Section 2 gives the problem statement. Section 3 focuses on the mathematical modelling of TAG systems. The direct RL procedure-based control scheme for TAGs is discussed in Section 4. Simulation studies of a TAG control system are presented in Section 5, and concluding remarks are given in Section 6.

Preliminaries and Problem Statement
The thermoacoustic generator (TAG) is a device that can generate thermal energy or consume acoustic energy to transfer heat from low-temperature to high-temperature sources; from there, electricity can be obtained through electromechanical converters. In this paper, we investigate a fundamental structure of the thermoacoustic generator (TAG), as shown in Figure 1, which includes 5 parts: regenerator (REG), heat exchangers (HHX, CHX), alternator (ALT), stub (STUB), and feedback pipe (FBP). According to thermoacoustic theory and the partial element equivalent circuit (PEEC) method [23], we construct a theoretical model for the thermoacoustic generator from the equivalent parts model. Then, we apply the adaptive dynamic programming method to the TAG based on the obtained mathematical model. Moreover, based on the physical phenomena of TAGs, some assumptions are required to represent TAGs by several partial differential equations (PDEs) as well as equivalent circuits in the next sections.

Mathematical Problems in Engineering

The necessary assumptions for applying linear thermoacoustic theory to modelling the thermoacoustic generator are as follows: (1) The material's surface is smooth, the heat radiation is negligible, and the plates are rigid and stationary. (2) The acoustic pressure varies in the x direction only, and the viscosity is independent of temperature. (3) The length of the plates is small compared to the size of the resonator. The control objective is to establish the control input u of the TAG system that minimizes the performance index under the consideration of time varying systems. The control design is implemented by approximation with the RL technique after obtaining the model of the TAG.

Remark 1.
It is worth emphasizing that, unlike the work in [2,22] studying the RL technique for the time varying closed-loop control system of robots by indirect methods, the method proposed in this article develops a direct RL method for time varying TAG systems.

Mathematical Modelling of Thermoacoustic Generator Systems
In this section, we model the thermoacoustic generator based on its operating principles combined with the partial element equivalent circuit (PEEC) method [23], which gives the analogous circuit structure of the object. Then, we provide the mathematical model based on assumptions about the thermoacoustic generator's operating conditions and mathematical transformations.

Regenerator.
According to the linear thermoacoustic theory [1], the interaction between the acoustic and temperature fields can be given by equation (1), where p, U, T are the pressure (Pa), volumetric flow rate (m^3/s), and temperature of the gas (K), respectively; ρ, c are the density (kg/m^3) and the adiabatic ratio of the gas; ω, a are the angular frequency (s^-1) and the speed of sound (m/s); and f_k, f_v are the spatially averaged thermal function and the viscous function, respectively. For convenience of analysis, equation (1) can be separated into the two dynamic equations of [24]. From the PEEC method and mathematical transformations, we obtain the formulas for the equivalent current source and the resistance of the regenerator, where T_h, T_c are the temperatures of the hot and cold heat sources (K); r_0, l_stack are the radius (m) and length of the stack; and μ is the dynamic viscosity coefficient of the gas (Pa·s).
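The display equations referenced in this subsection did not survive extraction. As a sketch only, a standard form of the linear thermoacoustic (Rott) equations consistent with the symbols defined above is

```latex
\frac{\mathrm{d}p}{\mathrm{d}x} = -\frac{j\omega\rho}{A(1-f_v)}\,U, \qquad
\frac{\mathrm{d}U}{\mathrm{d}x} = -\frac{j\omega A}{\rho a^{2}}\bigl[1+(c-1)f_k\bigr]\,p
  + \frac{f_k-f_v}{(1-f_v)(1-\sigma)}\,\frac{1}{T}\frac{\mathrm{d}T}{\mathrm{d}x}\,U,
```

where A is the cross-sectional area and σ the Prandtl number; both symbols are assumptions not defined in the text, and the exact variant used by the authors may differ.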

Alternator.
A simple linear model [25] characterizing the loudspeaker as a linear alternator is shown in Figure 2. The acoustic wave imposes an oscillating pressure on the diaphragm, which has an effective area S, as shown in Figure 2(a); Figure 2(b) shows the equivalent circuit of the physical model. The diaphragm and the coil, with a total mass M_m, are subjected to oscillatory motion. The loudspeaker has a mechanical stiffness K_m and a mechanical resistance R_m. The coil has an electrical inductance L_e and an electrical resistance R_e. The force factor is Bl. A pure electrical resistance R_L is connected as a load to extract electrical power in this model. The voltage on the load resistance is V_L, and the current is I_L.
Assume that all parameters are linear and frequency-independent. Ignoring hysteresis losses, the alternator's impedance in Figure 2 can be written as in [22].
Notice that the pressure loss is much less than in traditional systems, since the reactive component of the impedance is small even when the alternator is off resonance by a few Hz [26]. Therefore, we can rewrite the alternator's equivalent impedance accordingly.
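The impedance expression itself is missing from the extracted text. As a hedged sketch, the standard acoustic impedance of the linear alternator model in Figure 2, written with only the parameters named above, is

```latex
Z_{\mathrm{alt}} = \frac{1}{S^{2}}\left[R_m + j\omega M_m + \frac{K_m}{j\omega}
  + \frac{(Bl)^{2}}{R_e + R_L + j\omega L_e}\right],
```

whose reactive part becomes negligible near resonance, which is the simplification invoked in the text; the authors' exact expression may differ.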

Stub.
The stub is a piston placed perpendicular to the central conduit, as shown in Figure 1. The principal objective of the stub is to accommodate the pressure and flow phase shift after passing through the alternator. Considering a straight acoustic duct, the relationship between the input impedance Z_1 and the output acoustic impedance Z_2 can be expressed as in [26]. Here, l is the length of the duct (m), k = 2π/λ is the acoustic propagation coefficient, and Z_0 = ρ_M a/A is the characteristic impedance of the duct. The stub is basically a closed-end portion of an acoustic duct, so Z_2 = ∞. Therefore, the input acoustic impedance of the stub can be written approximately.
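The duct impedance relation did not survive extraction; as a sketch, the standard transmission-line form consistent with the symbols l, k, and Z_0 defined above is

```latex
Z_1 = Z_0\,\frac{Z_2\cos(kl) + jZ_0\sin(kl)}{Z_0\cos(kl) + jZ_2\sin(kl)},
\qquad
Z_1\Big|_{Z_2\to\infty} = -\,jZ_0\cot(kl),
```

where the second expression is the closed-end (stub) limit invoked in the text.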

Heat Exchanger and Feedback Pipe.
The heat exchangers are utilized as heat recovery equipment and are suitable for intake and return air systems. Two heat exchangers are employed in the thermoacoustic generator construction to recover and maintain the temperatures of the hot and cold heat sources. The heat exchanger has a very low porosity while having a relatively long length. Compared to the thermoacoustic core's large cross-sectional area, the heat exchanger effectively inserts a long but narrow cross-section channel into the loop locally. As a result, from an acoustic standpoint, the heat exchanger exhibits a considerable inertance effect and an average acoustic resistance (Figure 3). The remainder of the feedback pipe is just a lossy acoustic waveguide. Each unit section can be modelled as a mixture of resistance R_fb, inductance L_fb, and capacitance C_fb, all of which can be determined using equations (8) and (9).
After modelling the five elements of the thermoacoustic generator, we merge the results into the closed circuit shown in Figure 4, under two assumptions: (1) The gas movement is ideal, neglecting all friction with the duct; as a result, the energy of the gas flow is conserved, allowing us to ignore the influence of the feedback pipe (FBP) component throughout the circuit survey. (2) In the heat exchanger, the equivalent inductance is very small compared to the impedance; at the same time, the impedance value can be adjusted through the solvent flow in the device. Therefore, the heat exchanger can be treated as a rheostat in the analogous circuit for modelling convenience.

Step 1:
Step 2: Solving the admissible control v_i(·) from the optimization problem.
Step 3: Solving a positive definite value function V_{i+1}(·) from the admissible control v_i(·) in Step 2.
Step 4: If ‖V_{i+1}(·) − V_i(·)‖ ≥ ϵ or ‖v_{i+1}(·) − v_i(·)‖ ≥ ϵ, go to Step 2; else, go to Step 5.

ALGORITHM 1: Classical policy iteration (PI) for time varying systems.
Step 1: Initializing the starting control signal u_0 and the starting state x_0; let i = 0 and ϵ > 0.
Step 2: Solving the optimization problem to find l_i and then obtaining the admissible control v_i(·) [4], subject to the constraint l^T ψ(x) > 0, ∀x, t.
ALGORITHM 2: Data collection-based RL for time varying systems.

Denote u = (T_h − T_c)/T_c, let x be the pressure of the gas (Pa), and let y be the output power on the alternator (W); the thermoacoustic generator's mathematical model then follows, with the phase difference angles defined accordingly.

Remark 2.
It is worth emphasizing that the thermoacoustic generator system can be enabled only when the temperature differential between the hot and cold sources is positive. This means that the input control signal is always positive, which constitutes the system's input constraint. Furthermore, the pressure of the gas (the state) is always positive while the system operates.

Direct Reinforcement Learning for Nonautonomous TAG Systems
In this section, we consider the extension of the classical RL algorithm [2,22,27] to a time varying RL scheme for the general class of time varying nonlinear systems, with the initial value x(t_0) = x_0 and the associated cost function (14), where x(τ) ∈ R^n is the state vector and u(τ) ∈ R^m is the vector of input signals. Additionally, f: R^n × R^m × [t_0, ∞) → R^n is a continuously differentiable and positive definite map, and the partial derivatives of L with respect to x, u, and t exist. The optimal control objective is to minimize the cost function (14) by the input signal u(t), which belongs to the class of admissible signals defined as follows.
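For orientation, the nonautonomous HJB equation underlying this formulation can be written, under the definitions above, as

```latex
0 = \min_{u}\left[\frac{\partial V^{*}(x,t)}{\partial t}
  + \left(\frac{\partial V^{*}(x,t)}{\partial x}\right)^{T} f(x,u,t)
  + L(x,u,t)\right],
```

where the explicit term ∂V*(x, t)/∂t is exactly what distinguishes the nonautonomous case from the autonomous one and what the direct RL scheme must account for.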
Definition 1. (see [4]) The control signal u: R^n → R^m is considered an admissible control signal if and only if (1) u(x(t)) is a piecewise continuous function with respect to time.
In the light of [10], the improvement of traditional Algorithm 1 for the case of uncertain nonlinear systems can be completed after computing the deviation of the time varying value function V_i(x(t), t) between the two sampling times t_k and t_{k+1}, approximated with the Newton–Leibniz formula and combined with function approximation theory. Therefore, Step 3 in classical Algorithm 1 is transformed into a computation using collected data, in two steps. First, the deviation of the time varying value function is calculated using the Newton–Leibniz formula under the control input u_0 on the time interval [t_k, t_{k+1}]. Second, since f(·), V_i(·), V_i,x(·), V_i,t(·), and v_{i−1}(·) (i = 1, 2, . . .) are unknown functions, by using the basis function method [29], Step 3 in traditional Algorithm 1 can be realized with the update law of the weights given in [10], where e^{i−1}_1(x(t)), e^i_2(x(t), t), e^i_3(·), e^i_4(·) are approximation errors, which converge to zero as the iteration step tends to infinity. Based on the above analysis and transformation, a numerical RL algorithm can be proposed.
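Since the paper's update law is data-driven, a minimal numerical sketch may clarify the idea. The snippet below assumes a hypothetical scalar state, a toy basis phi(x, t) for V_i(x, t) ≈ w^T φ(x, t), and a quadratic running cost; none of these choices come from the paper, which leaves the basis functions and the model unspecified. The weights follow by least squares from the Newton–Leibniz relation V_i(x(t_{k+1}), t_{k+1}) − V_i(x(t_k), t_k) = −∫ L dτ over each sampling interval.

```python
import numpy as np

# Hypothetical time varying basis for V_i(x, t) ~= w^T phi(x, t).
def phi(x, t):
    return np.array([x**2, x**2 * np.cos(t), x**4])

# Illustrative running cost L(x, u) in the cost functional.
def running_cost(x, u):
    return x**2 + u**2

def pi_weight_update(xs, ts, us, dt):
    """One least-squares policy-evaluation step from collected data.

    Newton-Leibniz gives V(x_{k+1}, t_{k+1}) - V(x_k, t_k)
    = -integral of L over [t_k, t_{k+1}], so for each interval
    (phi_{k+1} - phi_k)^T w ~= -L(x_k, u_k) * dt.
    """
    A, b = [], []
    for k in range(len(ts) - 1):
        A.append(phi(xs[k + 1], ts[k + 1]) - phi(xs[k], ts[k]))
        b.append(-running_cost(xs[k], us[k]) * dt)
    # Solve the overdetermined linear system for the weights.
    w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return w

# Usage with synthetic data from a decaying stand-in trajectory.
dt = 0.05
ts = np.arange(0.0, 1.0, dt)
xs = np.exp(-ts)        # stand-in state samples
us = -0.5 * xs          # stand-in control samples
w = pi_weight_update(xs, ts, us, dt)
```

In the paper's scheme this least-squares step replaces Step 3 of Algorithm 1; the constrained optimization for l_i in Algorithm 2 plays the analogous role with the additional positivity constraint l^T ψ(x) > 0.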
Following that, we apply Algorithm 2 to the mathematical model of the thermoacoustic generator (11), with the associated cost function, where x_0, u_0, y_0 are the initial state, input, and output signals, respectively. From the HJB equation, we employ the basis function method to approximate the unknown functions as in (18). Furthermore, the terms P_{k,1}, Q^i_{k,1}, R_{k,1} are obtained from the collected data. Afterward, the optimal weight parameter l_i can be obtained by solving the problem in Algorithm 2 and calculating the control signal v_i via (18).

Remark 3.
The convergence of the control policy v_i(x(t)) as well as the value function V_i(x(t), t) in Algorithm 2 is proved as described in [10] with the following steps. First, it is pointed out that the admissibility of the control input [4] v_i(x(t)) and the positive definiteness of the optimal function V_i(x(t), t) are preserved after each iteration of Algorithm 1. Second, the decrease of the optimal function V_i(x(t), t) is established, and the convergence of the control policy v_i(x(t)) and the value function V_i(x(t), t) in Algorithm 2 is guaranteed after estimating the errors between the control policies and between the optimal functions obtained in Algorithms 1 and 2. However, unlike the work in [10], the time varying RL here is fully developed for a practical TAG system (19). Furthermore, unlike the time varying RL method in [9], which develops the actor/critic technique of [22], the RL procedure is implemented here using data collection in Algorithm 2. Additionally, to fulfill the partial derivative term in the HJB equation of nonautonomous systems, and unlike indirect RL methods that study equivalent systems by adding more state variables [8,22], the direct RL control for TAGs is able to keep the dynamic model without transforming the system.

Simulation Results
In this section, a TAG system with the following parameters is considered for the control system: T_c = 300, t_0 = 0, the sampling time t_s = 0.05 s, the initial state and initial control signal x_0 = 1085000 (Pa) and u_0 = 0.2, respectively, the approximation error ϵ = 0.01, the desired value of the state x_m = 1.1 × 10^5 (Pa), and the desired output y_m = 65 (W). To validate the effectiveness of the proposed Algorithm 2 with time varying RL, we implement two simulations using Algorithm 2 (Figures 4-6) and the traditional PID control scheme (Figures 7-9). It is seen that high performance of the control input, state variable, and output signal is achieved by the time varying RL control for the TAG in comparison with the traditional PID controller. The control signal of the TAG (Figure 7) oscillates with high frequency under the traditional PID controller. This disadvantage induces oscillations in the output and state signals of the TAG (Figures 8 and 9). In contrast to the traditional PID control scheme, the control signal and state are stable under the RL algorithm (Figures 4 and 5), and the average of the output signal converges to the desired value (Figure 6).

Conclusion
This article proposes an application of a direct time varying RL strategy to solve the optimal control problem for nonautonomous TAG systems subject to unknown system model parameters. The PDE-based mathematical model of TAGs is transformed into a time varying nonlinear dynamical system. After that, by collecting data between two sampling times under the control signals and using the Newton–Leibniz approximation, the conventional RL technique is modified to handle nonautonomous TAG systems. Numerical simulations and a comparison with the traditional TAG control method verify the high performance of the proposed method.

Data Availability
This publication is supported by multiple datasets, which are available at locations cited in the reference section.

Conflicts of Interest
The authors declare that they have no conflicts of interest.