Identification of LTI Time-Delay Systems with Missing Output Data Using GEM Algorithm

This paper considers the parameter estimation for linear time-invariant (LTI) systems in an input-output setting with output error (OE) time-delay model structure. The problem of missing data is commonly experienced in industry due to irregular sampling, sensor failure, data deletion in data preprocessing, network transmission fault, and so forth; to deal with the identification of LTI systems with time-delay in incomplete-data problem, the generalized expectation-maximization (GEM) algorithm is adopted to estimate the model parameters and the time-delay simultaneously. Numerical examples are provided to demonstrate the effectiveness of the proposed method.


Introduction
The advanced process control theories have enjoyed rapid development in the past several decades to meet the growing demands of closed-loop system performances, such as improved process safety and efficiency of plant operation, consistent product quality, and economic optimization [1].These control strategies have improved the process automation and stability through providing control solutions for the process operated under the abnormal working conditions, such as process fault [2][3][4][5], network transmission delay [6], data packet dropouts [7], and modeling error.Generally, the implementation of these control strategies relies on the understanding of the process dynamics and the availability of an accurate mathematical model of the process.In view of the difficulties and complexities imposed by modeling using first principle method, the data-driven modeling method, in which the process model is retrieved from the process data, has become a main modeling method.
Typically, the process data used in process modeling are generated by performing an identification experiment, in which a testing signal is designed and utilized to excite the process.Most of the conventional parameter estimation methods, such as prediction error method (PEM), instrumental variable (IV) method, and subspace method, assume that the identification data are sampled regularly and recorded properly.However, this is not always true in practical industry.For example, in the development of an inferential model for the sulfur content in the gas oil product, the sulfur concentration cannot be measured directly and the lab analysis is required which takes a long time.The process variable can be sampled in every minute, but the sulfur concentration is only available in every twelve hours.Another example is the industrial process with data transmission through the network.The recorded process data are corrupted by many network-induced problems, such as transmission delay and packet dropout or missing.Therefore, parameter estimation with irregular data has not been extensively investigated in the literature.
Time-delays are commonly encountered in various engineering systems, such as chemical processes, mechanical systems, network control systems, transmission line, and economic systems [8].Since the existence of time-delay usually causes performance degradation of the inferential model and is frequently a source of instability of the closedloop system, it should be handled carefully in the modeling process.Common methods to estimate the time-delays are nonparametric methods (e.g., step test or correlation analysis) and grid searching method.For example, Wang and Zhang [9] considered the robust identification problems of linear continuous time-delay systems from step responses.A linear regression equation was derived from the solution of the output time response and its various-order integrals and solved by using IV-least squares (LS) method.The parameters of the transfer function were then recovered from the LS solution.Weyer [10] considered to build a model for the open water channel.The model parameters were estimated using the grid searching method in which one model was established for each time-delay in a range.The final model was selected as the model which gave the best prediction performance for a validation data set.The time-delay estimation methods mentioned above are to estimate the time-delay and model parameters in a separate way.
Missing data problem is very common in process industry.A special example is the irregularly sampling system.Many critical parameters, such as the product concentration, steam quality, and 90% boiling point, cannot be measured directly by using the sensors.These parameters are measured through lab analysis, so only the slow rate data are available.However, the process variables, such as the temperature, pressure, and flow rate, can be measured on-line in fast rate by using the sensors.Therefore, we can treat the data samples between the slow rate data as missing data.Another example is the network control system in which data transmit via the wireless network or internet.Data missing occurs due to data packet dropout or missing.Other reasons for missing data are sensor fault, data recording system malfunction, and so forth.Several methods have been reported in the literature to handle missing data problem in system identification.For example, F. Ding and J. Ding [11] proposed an auxiliary model-based approach to cope with the problems of parameter estimation and output estimation with irregularly missing output data using the PEM method.The outputs of the auxiliary model were used in the identification process.Zhu et al. [12] considered the identification of systems with slowly and irregularly sampled output data.The output error method was employed to estimate the fast rate model based on the fast input and slow output data.However, the methods mentioned above just used part of the process data, which may lead to information missing.Moreover, the statistical properties of the model parameters and the process noise cannot be given in these methods.
The work introduced in this paper aims at handling the identification problem of the LTI systems with missing output data in the presence of time-delay.The identification problem is formulated under the scheme of the generalized expectation-maximization (GEM) algorithm and the timedelay and missing output data are handled simultaneously.The GEM algorithm consists of expectation step (E-step) and maximization step (M-step).In the M-step, the maximization problem is transformed into an equivalent minimization problem and this problem is solved by using a general numerical optimization algorithm.
The rest of this paper is organized as follows.The problem statement is presented in Section 2. A brief revisit of the GEM algorithm and the mathematical formulation of the identification of LTI time-delay systems with incomplete data set are given in Section 3. Numerical examples are presented in Section 4 to show the effectiveness of the proposed method.The conclusions are given in Section 5.

Problem Statement
Consider the LTI system described by the following output error (OE) time-delay model: where  is the time-delay which is assumed to be integer multiples of the sampling period,   is the Gaussian white noise with zero mean and variance  2 , and   and   are the output and input, respectively.The transfer function ( −1 ) has the following form: Here, we assume that the model orders   and   are known a priori and the time-delay  is uniformly distributed in a known range of The identification data {  ,   } =1,2,..., are collected.We denote {  } =1,2,..., as  and {  } =1,2,..., as .Since part of the output data are missing completely at random (MCAR), the output data set  can be divided into  obs = {  } = 1 ,...,  and  mis = {  } = 1 ,...,  .Therefore, the identification problem is to estimate the parameters  = { 1 , . . .,    ,  1 , . . .,    }, the noise variance  2 , and the time-delay  based on the identification data  obs and .

Parameter Estimation Using the GEM Algorithm
3.1.GEM Algorithm Revisit.The GEM algorithm is a generalpurpose iterative optimization algorithm to derive the maximum likelihood (ML) estimate and it has attracted great attentions of the researcher due to its flexibility in handling the missing data or hidden state [13].Denote the missing data set by  mis and the observed data set by  obs .The main idea of the GEM algorithm is that, instead of optimizing the likelihood of the observed  obs , the conditional expectation of the complete data likelihood function with respect to the missing data set is calculated in the E-step and the maximization problem is solved in the M-step.The procedures of the GEM algorithm to calculate the ML estimate can be described as follows [13]: E-step: given the  obs and the parameter estimate Θ  in previous iteration, the -function can be calculated by M-step: find the Θ +1 to increase (Θ | Θ  ) over its value at Θ  ; that is, The E-step and M-step alternate until the relative change of the parameter estimate between neighboring iterations is smaller than a prespecified arbitrary small constant or the maximal iteration number is achieved.

LTI Time-Delay System Identification with Missing Output
Using GEM Algorithm.Here, we treat the time-delay  as a hidden state variable.The observed data set  obs is constructed as  obs = { obs , } and the missing data set  mis is constructed as  mis = { mis , }.The parameter vector Θ is constructed as Θ = {,  2 }.
Based on the Bayesian property, the likelihood function of the complete data set can be decomposed into The Based on (1) and ( 2),   depends only on the previous input sequence  −1 : 1 = { −1 , . . .,  1 }, the time-delay , and the parameter vector Θ.Therefore, (6) can be rewritten as Since the time-delay  is uniformly distributed in the range [ 1 ,  2 ], the probability of  taking any value in this range is a constant.Since the input  is measurable data and it is independent of the parameter vector Θ, the term ( | Θ) is a constant.Therefore, the last two terms of (5) will not play a role in the following derivations.The complete data likelihood function can be further written as where Therefore, the conditional expectation of the log complete data density (Θ | Θ  ) in (3) can be written as The expectation is firstly taken with respect to the discrete variable ; then we have The expectation is then taken with respect to the continuous variable  miss , so we have

Mathematical Problems in Engineering
In order to calculate (Θ | Θ  ), the unknown terms should be calculated firstly.Consider Therefore, the -function can be rewritten as In the M-step of the GEM algorithm, the unknown parameters should be estimated to increase the -function by solving an optimization problem.Taking the gradient of the -function (13) with respect to the  2 and setting it to zeros, we have Substituting σ2 into the -function (13), we get Based on the monotonicity of the log function, the problem is transformed into optimizing the following cost function: Here, we introduce the variable    denoting the noise-free output with time-delay .Based on (1) and ( 2), we have where

. , 𝑢 𝑡−𝜏−𝑛 𝑏 ]
.Therefore, the cost function (18) with   − substituted by φ − can be optimized by using the damped Newton algorithm, where where  is a constant.The time-delay  can be selected as the delay in the range with maximal posterior probability.That is, The E-step and M-step alternate until the convergence condition of the GEM algorithm is met.

Simulation Examples
The input data and output data are generated by simulation and the noise   with zero mean and variance 0.01 is added to the output.The input and output data are shown in Figure 1.In the simulation, 12.5% output data are randomly missing.The parameter range of the time-delay is set to [1,5].The method proposed in this paper is used to estimate the parameters and the time-delay.The parameter estimate trajectories of the model parameters and the noise variance are shown in Figures 2 and 3, respectively.The estimated time-delay is 3 which is consistent with the true timedelay.To further verify the effectiveness of the proposed method, the simulations are also performed with 25% output data missing and 50% output data missing.The estimated parameters after 13 iterations are summarized in Table 1.It can be seen from these figures and the table that the proposed GEM algorithm has a good identification performance.algorithms and the first principle model of the CSTR is described as [14]

The Continuous
where the product concentration   and the temperature  are output variables and the coolant flow rate   is the input variable.The steady state values of the process variables can be found in Gopaluni [14].In this simulation, the CSTR is operated at a steady state working point which is at   = 97 L/min,   = 0.0795 mol/L, and  = 443.4566K.The task here is to build a first-order model between   and   .The input and output data are generated through simulation and the noise with zero mean and variance 1 × 10 −7 is added to the output data.Since time is needed to measure the concentration   , so the measurement delay with 1.5 minutes is also added to the output data.The input and output data is shown in Figure 4.In this simulation, 25% output data are randomly missing and the parameter range of the time-delay is set to [0.9, 2.1].The proposed GEM algorithm is used to estimate the unknown parameters.The estimated parameters are â = −0.468,b = 0.0015, σ2 = 1.5 × 10 −7 , and τ = 1.5.The self-validation and the cross-validation results are shown in Figures 5 and 6.It can be seen from these results that the proposed method has a good identification performance and the estimated model can capture the dynamic behavior of the CSTR.

Conclusion
This paper considers the identification problem of LTI systems with irregular data set.The time-delay and the missing data are commonly encountered problems in process industry and the existence of these problems makes the process modeling a challenging task.The identification problem with incomplete data set in the presence of time-delay is formulated under the scheme of the GEM algorithm and the model parameters and the time-delay are estimated simultaneously in this algorithm.Numerical examples are presented to demonstrate the efficacy of the proposed method.

Figure 1 :
Figure 1: The input and output data.

Figure 3 :
Figure 3: The estimated noise variance in each iteration.

Figure 4 :Figure 5 :
Figure 4: The input and output data.

Figure 6 :
Figure 6: The cross-validation results.The blue line is the real process data and the red line is the simulated output of the estimated model.
s Figure 2: The estimated model parameters in each iteration.