Hidden Semi-Markov Models for Predictive Maintenance

Realistic approaches are essential for condition monitoring and predictive maintenance of industrial machines. In this work, we propose Hidden Semi-Markov Models (HSMMs) with (i) no constraints on the state duration density function and (ii) the ability to handle continuous or discrete observations. To deal with this type of HSMM, we also propose modifications to the learning, inference, and prediction algorithms. Finally, automatic model selection has been made possible using the Akaike Information Criterion. This paper describes the theoretical formalization of the model as well as several experiments performed on simulated and real data with the aim of validating the methodology. In all performed experiments, the model is able to correctly estimate the current state and to effectively predict the time to a predefined event with a low overall average absolute error. As a consequence, its applicability to real-world settings can be beneficial, especially where the Remaining Useful Lifetime (RUL) of the machine is calculated in real time.


Introduction
Predictive models that are able to estimate the current condition and the Remaining Useful Lifetime of industrial equipment are of high interest, especially for manufacturing companies, which can use them to optimize their maintenance strategies. If we consider that the costs derived from maintenance are one of the largest parts of the operational costs [1] and that often the maintenance and operations departments comprise about 30% of the manpower [2,3], it is not difficult to estimate the economic advantages that such innovative techniques can bring to industry. Moreover, predictive maintenance, where the Remaining Useful Lifetime (RUL) of the machine is calculated in real time, has been proven to significantly outperform other maintenance strategies, such as corrective maintenance [4]. In this work, RUL is defined as the time, from the current moment, until the system fails [5]. Failure, in this context, is defined as a deviation of the delivered output of a machine from the specified service requirements [6] that necessitates maintenance.
Models like Support Vector Machines [7], Dynamic Bayesian Networks [8], clustering techniques [9], and data mining approaches [10] have been successfully applied to condition monitoring, RUL estimation, and predictive maintenance problems [11,12]. State space models, like Hidden Markov Models (HMMs) [13], are particularly suitable for industrial applications, due to their ability to model the latent state which represents the health condition of the machine.
Classical HMMs have been applied to condition assessment [14,15]; however, their usage in predictive maintenance has not been effective due to their intrinsic modeling of the state duration as a geometric distribution.
To overcome this drawback, a modified version of the HMM, which takes into account an estimate of the duration in each state, has been proposed in the works of Tobon-Mejia et al. [16-19]. Thanks to the explicit state sojourn time modeling, it has been shown that it is possible to effectively estimate the RUL for industrial equipment. However, the drawback of their proposed HMM model is that the state duration is always assumed to be Gaussian distributed and the duration parameters are estimated empirically from the Viterbi path of the HMM.
A complete specification of a duration model together with a set of learning and inference algorithms was first given by Ferguson [20]. In his work, Ferguson allowed the underlying stochastic process of the state to be a semi-Markov chain, instead of the simple Markov chain of an HMM. Such a model is referred to as a Hidden Semi-Markov Model (HSMM) [21]. HSMMs and explicit duration models have proven beneficial for many applications [22-25]. A complete overview of different duration model classes has been given by Yu [26]. Most state duration models used in the literature are nonparametric discrete distributions [27-29]. As a consequence, the number of parameters that describe the model and that have to be estimated is high, and the learning procedure can therefore be computationally expensive for real complex applications. Moreover, it is necessary to specify a priori the maximum duration allowed in each state.
To alleviate the high dimensionality of the parameter space, parametric duration models have been proposed. For example, Salfner [6] proposed a generic parametric continuous distribution to model the state sojourn time. However, in that model, the observations are assumed to be discrete, and the model is applied to recognize failure-prone observation sequences. Using continuous observations, Azimi et al. [30-32] specified an HSMM with a parametric duration distribution belonging to the Gamma family and modeled the observation process with a Gaussian.
Inspired by the latter two approaches, in this work we propose a generic specification of a parametric HSMM, in which no constraints are placed on the model of the state duration and on the observation processes. In our approach, the state duration is modeled as a generic parametric density function. The observations, on the other hand, can be modeled either as a discrete stochastic process or as a continuous mixture of Gaussians. The latter has been shown to approximate, arbitrarily closely, any finite, continuous density function [33]. The proposed model can thus be used in a wide range of applications and types of data. Moreover, in this paper we introduce a new and more effective estimator of the time spent by the system in a given state prior to the current time. To the best of our knowledge, apart from the above-cited works, the literature on HSMMs applied to prognosis and predictive maintenance for industrial machines is limited [34]. Hence, the present work aims to show the effectiveness of the proposed duration model in solving condition monitoring and RUL estimation problems.
When dealing with state space models, and in particular with HSMMs, one should define the number of states, the correct family of duration densities, and, in the case of continuous observations, the adequate number of Gaussian mixture components. Such parameters play a prominent role, since the right model configuration is essential to enable an accurate modeling of the dynamic pattern and the covariance structure of the observed time series. The estimation of a satisfactory model configuration is referred to in the literature as model selection.
While several state-of-the-art approaches use expert knowledge to get insight into the model structure [15,35,36], an automated methodology for model selection is often required. In the literature, model selection has been deeply studied for a wide range of models. Among the existing methodologies, information-based techniques have been extensively analyzed, with satisfactory results.
Although the Bayesian Information Criterion (BIC) is particularly appropriate for finite mixture models [37,38], the Akaike Information Criterion (AIC) has been demonstrated to outperform BIC when applied to more complex models and when the sample size is limited [39,40], which is the case for the target application of this paper.
In this work AIC is used to estimate the correct model configuration, with the final goal of automated HSMM model selection, which exploits only the information available in the input data. While model selection techniques have been extensively used in the framework of Hidden Markov Models [41-43], to the best of our knowledge, the present work is the first that proposes their application to duration models and in particular to HSMMs.
In summary, the present work contributes to condition monitoring, predictive maintenance, and RUL estimation problems by (i) proposing a general Hidden Semi-Markov Model applicable to continuous or discrete observations and with no constraints on the density function used to model the state duration; (ii) proposing a more effective estimator of the state duration variable $\hat d_i(t)$, that is, the time spent by the system in the $i$th state prior to the current time $t$; (iii) adapting the learning, inference, and prediction algorithms to the defined HSMM parameters and the proposed $\hat d_i(t)$ estimator; (iv) using the Akaike Information Criterion for automatic model selection.
The rest of the paper is organized as follows: in Section 2 we introduce the theory of the proposed HSMM together with its learning, inference, and prediction algorithms. Section 3 gives a short theoretical overview of the Akaike Information Criterion. Section 4 presents the methodology used to estimate the Remaining Useful Lifetime using the proposed HSMM. In Section 5 experimental results are discussed. The conclusion and future research directions are given in Section 6.

Hidden Semi-Markov Models
Hidden Semi-Markov Models (HSMMs) introduce the concept of variable duration, which results in a more accurate modeling power if the system being modeled shows a dependence on time.
In this section we give the specification of the proposed HSMM, for which we model the state duration with a parametric state-dependent distribution. Compared to nonparametric modeling, this approach has two main advantages: (i) the model is specified by a limited number of parameters; as a consequence, the learning procedure is computationally less expensive; (ii) the model does not require a priori knowledge of the maximum sojourn time allowed in each state, since this is inherently learnt through the duration distribution parameters.

Model Specification.
A Hidden Semi-Markov Model is a doubly embedded stochastic model with an underlying stochastic process that is not observable (hidden) but can only be observed through another set of stochastic processes that produce the sequence of observations. An HSMM allows the underlying process to be a semi-Markov chain with a variable duration, or sojourn time, for each state. The key concept of HSMMs is that the semi-Markov property holds for this model: while in HMMs the Markov property implies that the value of the hidden state at time $t$ depends exclusively on its value at time $t-1$, in HSMMs the probability of transition from state $S_i$ to state $S_j$ at time $t$ depends on the duration spent in state $S_i$ prior to time $t$.
In the following we denote the number of states in the model as $N$, the individual states as $S = \{S_1, \ldots, S_N\}$, and the state at time $t$ as $s_t$. The semi-Markov property can be written as
$$P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_1) = P(s_{t+1} \mid s_t, d_{s_t}(t)),$$
where the duration variable $d_i(t)$ is defined as the time spent in state $S_i$ prior to time $t$.
Although the state duration is inherently discrete, in many studies [44,45] it has been modeled with a continuous parametric density function. Similarly to the work of Azimi et al. [30-32], in this paper we use the discrete counterpart of the chosen parametric probability density function (pdf). With this approximation, if we denote the pdf of the sojourn time in state $S_i$ as $p(d, \theta_i)$, where $\theta_i$ represents the set of parameters of the pdf relative to the $i$th state, the probability that the system stays in state $S_i$ for exactly $d$ time steps can be calculated as $\int_{d-1}^{d} p(u, \theta_i)\, du$. Considering the HSMM formulation, we can generally denote the state-dependent duration distributions by the set of their parameters relative to each state as $\Theta = \{\theta_1, \ldots, \theta_N\}$.
Many related works on HSMMs [31,32,44,45] consider $p(d, \theta_i)$ within the exponential family. In particular, Gamma distributions are often used in speech processing applications. In this work, we do not impose a type of distribution function to model the duration. The only requirement is that the duration should be modeled by a distribution with positive support, negative durations being physically meaningless.
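As a concrete illustration of the discretization described above, the probability of staying exactly $d$ steps, $\int_{d-1}^{d} p(u, \theta_i)\,du$, equals $D(d, \theta_i) - D(d-1, \theta_i)$, where $D$ is the duration CDF. The sketch below uses a Weibull sojourn-time density (one of the families considered in this paper) because its CDF has a closed form; the parameter values are illustrative only.

```python
import math

def weibull_cdf(d, shape, scale):
    """Closed-form CDF of a Weibull sojourn-time distribution."""
    if d <= 0:
        return 0.0
    return 1.0 - math.exp(-((d / scale) ** shape))

def discrete_duration_pmf(shape, scale, max_d):
    """Discrete counterpart of the duration pdf:
    P(stay exactly d steps) = D(d) - D(d - 1)."""
    return [weibull_cdf(d, shape, scale) - weibull_cdf(d - 1, shape, scale)
            for d in range(1, max_d + 1)]

# Illustrative parameters: mean sojourn around 9 time steps.
pmf = discrete_duration_pmf(shape=2.0, scale=10.0, max_d=200)
```

Note that no maximum duration needs to be fixed in the model itself; `max_d` here only truncates the printout, since the parametric CDF is defined for every positive duration.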
HSMMs also require the definition of a "dynamic" transition matrix, as a consequence of the semi-Markov property. Differently from HMMs, in which a constant transition probability leads to a geometrically distributed state sojourn time, HSMMs explicitly define a transition matrix which, depending on the duration variable, has an increasing probability of changing state as time goes on. For convenience, we collect the state duration variables in a vector $\mathbf d_t = [d_i(t)]$ with dimensions $N \times 1$. The quantity $d_i(t)$ can be easily calculated by induction from $d_i(t-1)$ as
$$d_i(t) = \delta_i(t)\,\bigl(d_i(t-1) + 1\bigr),$$
where $\delta_i(t)$ is 1 if $s_t = S_i$, 0 otherwise. If we assume that at time $t$ the system is in state $S_i$, we can formally define the duration-dependent transition matrix as $\mathbf A_{\mathbf d_t} = [a_{ij}(\mathbf d_t)]$, where $a_{ij}(\mathbf d_t) = P(s_{t+1} = S_j \mid s_t = S_i, d_i(t))$. The specification of the model can be further simplified by observing that, at each time $t$, the matrix $\mathbf A_{\mathbf d_t}$ can be decomposed into two terms: the recurrent and the nonrecurrent state transition probabilities.
The recurrent transition probabilities $\mathbf P(\mathbf d_t) = [p_{ii}(\mathbf d_t)]$, which depend only on the duration vector $\mathbf d_t$ and the parameters $\Theta$, take into account the dynamics of the self-transition probabilities. Each entry is defined as the probability of remaining in the current state at the next time step, given the duration spent in the current state prior to time $t$:
$$p_{ii}(\mathbf d_t) = P\bigl(s_{t+1} = S_i \mid s_t = S_i, \ldots, s_{t-d_i(t)+1} = S_i, s_{t-d_i(t)} \neq S_i\bigr). \quad (5)$$
Writing (5) as a ratio of probabilities, the denominator can be expressed as $\sum_{\tau=1}^{\infty} P(s_{t+\tau} \neq S_i, s_{t+\tau-1} = S_i, \ldots, s_{t-d_i(t)+2} = S_i \mid s_{t-d_i(t)+1} = S_i, s_{t-d_i(t)} \neq S_i)$, which is the probability that the system, at time $t$, has been staying in state $S_i$ for at least $d_i(t) - 1$ time units. The above expression is equivalent to $1 - D(d_i(t) - 1, \theta_i)$, where $D(\cdot, \theta_i)$ is the duration cumulative distribution function relative to the state $S_i$, that is, $D(d, \theta_i) = \int_{-\infty}^{d} p(u, \theta_i)\, du$. As a consequence, from (5) we can define the recurrent transition probabilities as a diagonal matrix with dimensions $N \times N$:
$$\mathbf P(\mathbf d_t) = \mathrm{diag}\!\left(\frac{1 - D(d_i(t), \theta_i)}{1 - D(d_i(t) - 1, \theta_i)}\right). \quad (6)$$
The usage of the cumulative distribution functions in (6), which tend to 1 as the duration tends to infinity, implies that the probability of self-transition decreases as the sojourn time increases, leading the model to always leave the current state as time approaches infinity. The nonrecurrent state transition probabilities $\mathbf A_0 = [a^0_{ij}]$ rule the transitions between two different states. They are represented by an $N \times N$ matrix with the diagonal elements equal to zero, defined as
$$a^0_{ij} = P(s_{t+1} = S_j \mid s_t = S_i, s_{t+1} \neq S_i), \qquad a^0_{ii} = 0. \quad (7)$$
$\mathbf A_0$ must be specified as a stochastic matrix; that is, its elements have to satisfy the constraint $\sum_{j=1}^{N} a^0_{ij} = 1$ for all $i$. As a consequence of the above decomposition, the dynamics of the underlying semi-Markov chain can be defined by specifying only the state-dependent duration parameters $\Theta$ and the nonrecurrent matrix $\mathbf A_0$, since the model transition matrix can be calculated, at each time $t$, using (6) and (7):
$$\mathbf A_{\mathbf d_t} = \mathbf P(\mathbf d_t) + \bigl(\mathbf I - \mathbf P(\mathbf d_t)\bigr)\, \mathbf A_0, \quad (8)$$
where $\mathbf I$ is the identity matrix. If we denote the elements of the dynamic transition matrix $\mathbf A_{\mathbf d_t}$ as $a_{ij}(\mathbf d_t)$, the stochastic constraint $\sum_{j=1}^{N} a_{ij}(\mathbf d_t) = 1$ for all $i$ and $t$ is guaranteed by the fact that $\mathbf P(\mathbf d_t)$ is a diagonal matrix and $\mathbf A_0$ is a stochastic matrix.
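The decomposition above can be sketched in code. This is a minimal example assuming Weibull duration densities with illustrative parameters; the self-transition entries follow the survival-function ratio described for the recurrent matrix.

```python
import math

def weibull_cdf(d, shape, scale):
    """Closed-form Weibull duration CDF."""
    return 0.0 if d <= 0 else 1.0 - math.exp(-((d / scale) ** shape))

def dynamic_transition_matrix(A0, durations, theta):
    """A_{d_t} = P(d_t) + (I - P(d_t)) A0, where each self-transition is the
    ratio of survival probabilities of the state's duration distribution."""
    N = len(A0)
    A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        shape, scale = theta[i]
        surv_prev = 1.0 - weibull_cdf(durations[i] - 1, shape, scale)
        surv_now = 1.0 - weibull_cdf(durations[i], shape, scale)
        p_ii = surv_now / surv_prev if surv_prev > 0 else 0.0
        for j in range(N):
            A[i][j] = (p_ii if i == j else 0.0) + (1.0 - p_ii) * A0[i][j]
    return A

A0 = [[0.0, 1.0], [1.0, 0.0]]        # nonrecurrent matrix, zero diagonal
theta = [(2.0, 10.0), (1.5, 5.0)]    # illustrative Weibull (shape, scale) per state
A = dynamic_transition_matrix(A0, [3, 1], theta)
```

With shape parameters greater than 1 the Weibull hazard increases with time, so the self-transition entry shrinks as the estimated sojourn grows, which is exactly the behavior described for (6): the model eventually leaves every nonabsorbing state.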
For several applications it is necessary to model an absorbing state which, in the case of industrial equipment, corresponds to the "broken" or "failure" state. If we denote the absorbing state as $S_f$ with $f \in [1, N]$, we must fix the $f$th row of the nonrecurrent matrix $\mathbf A_0$ to $a^0_{ff} = 1$ and $a^0_{fj} = 0$ for all $1 \le j \le N$ with $j \neq f$. By substituting such an $\mathbf A_0$ matrix in (8), it is easy to show that the element $a_{ff}(\mathbf d_t) = 1$ and remains constant for all $t$, while the duration parameters $\theta_f$ have no influence on the absorbing state $S_f$. An example of absorbing state specification will be given in Section 5.
With respect to the input observation signals, in this work we consider both continuous and discrete data, by adopting a suitable observation model depending on the nature of the observations. In particular, for the continuous case, we model the observations with a multivariate mixture of Gaussian distributions. This choice presents two main advantages: (i) a multivariate model makes it possible to deal with multiple observations at the same time, which is often the case when modeling industrial equipment since, at each time, multiple sensors' measurements are available, and (ii) a mixture of Gaussians has been proven to closely approximate any finite, continuous density function [33]. Formally, if we denote by $\mathbf x_t$ the observation vector at time $t$ and the generic observation vector being modeled as $\mathbf x$, the observation density for the $i$th state is represented by a finite mixture of $M$ Gaussians:
$$b_i(\mathbf x) = \sum_{m=1}^{M} c_{im}\, \mathcal N(\mathbf x;\, \boldsymbol\mu_{im}, \mathbf U_{im}), \quad (9)$$
where $c_{im}$ is the mixture coefficient for the $m$th mixture in state $S_i$, which satisfies the stochastic constraints $\sum_{m=1}^{M} c_{im} = 1$ for $1 \le i \le N$ and $c_{im} \ge 0$ for $1 \le i \le N$ and $1 \le m \le M$, while $\mathcal N$ is the Gaussian density with mean vector $\boldsymbol\mu_{im}$ and covariance matrix $\mathbf U_{im}$ for the $m$th mixture component in state $S_i$.
In the case of discrete data, we model the observations within each state with a nonparametric discrete probability distribution. In particular, if $K$ is the number of distinct observation symbols per state and if we denote the symbols as $V = \{v_1, \ldots, v_K\}$ and the observation at time $t$ as $o_t$, the observation symbol probability distribution can be defined as a matrix $\mathbf B = [b_i(k)]$ of dimensions $N \times K$, where
$$b_i(k) = P(o_t = v_k \mid s_t = S_i). \quad (10)$$
Since the system in each state at each time step can emit one of the $K$ possible symbols, the matrix $\mathbf B$ is stochastic; that is, it is constrained to $\sum_{k=1}^{K} b_i(k) = 1$ for all $1 \le i \le N$. Finally, as in the case of HMMs, we specify the initial state distribution $\pi = \{\pi_i\}$, which defines the probability of the starting state as $\pi_i = P(s_1 = S_i)$. From the above considerations, two different HSMM models can be considered. In the case of continuous observations, $\lambda = (\mathbf A_0, \Theta, \pi, c, \boldsymbol\mu, \mathbf U)$, and in the case of discrete observations the HSMM is characterized by $\lambda = (\mathbf A_0, \Theta, \pi, \mathbf B)$. An example of a continuous HSMM with 3 states is shown in Figure 1.
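As a toy illustration of the mixture observation density $b_i(\mathbf x)$ in (9), the univariate sketch below evaluates a two-component Gaussian mixture; the weights and moments are illustrative, not values from the paper.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """b_i(x) = sum_m c_im * N(x; mu_im, U_im), univariate for brevity."""
    return sum(c * normal_pdf(x, m, v)
               for c, m, v in zip(weights, means, variances))

# Illustrative two-component mixture for one state.
b = mixture_density(0.5, weights=[0.6, 0.4], means=[0.0, 3.0], variances=[1.0, 2.0])
```

The multivariate case used in the paper replaces `normal_pdf` with the multivariate Gaussian density, but the mixture structure is identical.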

Learning and Inference Algorithms.
Let us denote the generic sequence of observations, whether continuous vectors or discrete symbols, as $\mathbf x = \mathbf x_1 \mathbf x_2 \cdots \mathbf x_T$. In order to use the defined HSMM model in practice, similarly to the HMM, we need to solve three basic problems: (i) evaluating the likelihood of an observation sequence given the model, (ii) finding the best state sequence corresponding to an observation sequence, and (iii) learning the model parameters from data. As in the case of HMMs, solving the above problems requires the forward-backward [13], decoding (Viterbi [46] and Forney [47]), and Expectation-Maximization [48] algorithms, which will be adapted to the HSMM introduced in Section 2.1. In the following, we also propose a more effective estimator of the state duration variable $\hat d_i(t)$ defined in (2).

The Forward-Backward Algorithm.
Given a generic sequence of observations $\mathbf x = \mathbf x_1 \mathbf x_2 \cdots \mathbf x_T$, the goal is to calculate the model likelihood, that is, $P(\mathbf x \mid \lambda)$. This quantity is useful for the training procedure, where the parameters that locally maximize the model likelihood are chosen, as well as for classification problems. The latter is the case in which the observation sequence $\mathbf x$ has to be mapped to one of a finite set of $C$ classes, represented by a set of HSMM parameters $\Lambda = \{\lambda_1, \ldots, \lambda_C\}$. The class of $\mathbf x$ is chosen such that $c(\mathbf x) = \arg\max_{\lambda \in \Lambda} P(\mathbf x \mid \lambda)$.
To calculate the model likelihood, we first define the forward variable at each time $t$ as
$$\alpha_t(i) = P(\mathbf x_1 \mathbf x_2 \cdots \mathbf x_t, s_t = S_i \mid \lambda).$$
Contrarily to HMMs, for HSMMs the state duration must be taken into account in the forward variable calculation. Consequently, Yu [26] proposed the following inductive formula:
$$\alpha_t(j) = \sum_{\tau=1}^{D} \sum_{i \neq j} \alpha_{t-\tau}(i)\, a_{ij}\, p_j(\tau) \prod_{s=t-\tau+1}^{t} b_j(\mathbf x_s), \quad (13)$$
that is, the sum of the probabilities of being in the current state $S_j$ (at time $t$) for the past $\tau$ time units (with $1 \le \tau \le D$ and $D$ the maximum allowed duration for each state), coming from all the possible previous states $S_i$, $1 \le i \le N$, and $i \neq j$. The disadvantage of the above formulation is that, as discussed in the Introduction, the specification of the maximum duration $D$ represents a limitation to the modeling generalization. Moreover, from (13), it is clear that the computation and memory complexities drastically increase with $D$, which can be very large in many applications, in particular for online failure prediction.
To alleviate this problem, Azimi et al. [30-32] introduced a new forward algorithm for HSMMs that, by keeping track of the estimated average state duration at each iteration, has a computational complexity comparable to that of the forward algorithm for HMMs [13]. However, the average state duration represents an approximation. Consequently, the forward algorithm of Azimi, compared with (13), pays the price of lower precision in favor of an (indispensable) better computational efficiency.
To calculate the forward variable $\alpha_t(i)$ using Azimi's approach, the duration-dependent transition matrix, defined in (8), is taken into consideration in the induction formula of (13), which becomes [30]
$$\alpha_{t+1}(j) = \Bigl[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}(\hat{\mathbf d}_t)\Bigr] b_j(\mathbf x_{t+1}). \quad (14)$$
To calculate the above formula, the average state duration of (2) must be estimated, for each time $t$, by means of the variable $\hat{\mathbf d}_t = [\hat d_i(t)]$, defined as
$$\hat d_i(t) = E\bigl[d_i(t) \mid \mathbf x_1 \cdots \mathbf x_t\bigr],$$
where $E$ denotes the expected value. To calculate the above quantity, Azimi et al. [30-32] use the following formula:
$$\hat{\mathbf d}_{t+1} = \bigl(\hat{\mathbf d}_t + \mathbf 1\bigr) \odot \boldsymbol\gamma_t, \quad (16)$$
where $\odot$ represents the element-by-element product between two matrices/vectors and the vector $\boldsymbol\gamma_t = [\gamma_t(i)]$ (the probability of being in state $S_i$ at time $t$ given the observation sequence and the model parameters), with dimensions $N \times 1$, is calculated in terms of $\alpha_t(i)$ as
$$\gamma_t(i) = \frac{\alpha_t(i)}{\sum_{j=1}^{N} \alpha_t(j)}. \quad (17)$$
Equation (16) is based on the following induction formula [30-32] that rules the dynamics of the duration vector when the system's state is known:
$$d_i(t+1) = \bigl(d_i(t) + 1\bigr)\, \delta_i(t), \quad (18)$$
where, for each $t$, $\delta_i(t)$ is 1 if $s_t = S_i$, 0 otherwise.
A simple example shows that (18) is incorrect: assuming an HSMM with three states and considering the state sequence $(S_1, S_1, S_2, \ldots)$, the correct sequence of duration vectors is
$$\mathbf d_1 = (1, 0, 0)^T, \quad \mathbf d_2 = (2, 0, 0)^T, \quad \mathbf d_3 = (0, 1, 0)^T,$$
where the superscript $T$ denotes vector transpose. If we apply (18), we obtain instead $\mathbf d_3 = (3, 0, 0)^T$, which is in contradiction with the definition of the state duration vector given in (2).
To calculate the average state duration variable $\hat d_i(t)$ we propose a new induction formula that estimates, for each time $t$, the time spent in the $i$th state prior to $t$ as
$$\hat d_i(t) = \gamma_{t-1}(i)\, \hat d_i(t-1) + 1. \quad (20)$$
The derivation of (20) is given in the Appendix. The intuition behind (20) is that the current average duration is the previous average duration, weighted by the "amount" of the current state that was already in state $S_i$ in the previous step, plus one.
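The behavior of the proposed estimator can be checked on the known-state counterexample discussed above. The toy sketch below encodes the state sequence as one-hot probability vectors; the indexing convention (previous-step weights) follows our reading of the recursion.

```python
def update_duration(d_prev, gamma_prev):
    """Proposed estimator: d_i(t) = gamma_{t-1}(i) * d_i(t-1) + 1."""
    return [g * d + 1.0 for g, d in zip(gamma_prev, d_prev)]

# Known state sequence (S1, S1, S2) encoded as one-hot state probabilities.
gammas = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
d = [1.0, 1.0, 1.0]          # at t = 1 every freshly entered state has duration 1
history = [d]
for t in range(1, len(gammas)):
    d = update_duration(d, gammas[t - 1])
    history.append(d)

# Read off the duration of the state actually occupied at each t.
occupied = [h[g.index(1)] for h, g in zip(history, gammas)]
```

For the occupied state the recursion yields durations 1, 2, 1, matching the true run lengths of the sequence, whereas an update of the form $(\hat d_i(t) + 1)\,\delta_i(t)$ would report 0 for the freshly entered state $S_2$.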
Using the proposed (20), the forward algorithm can be specified as follows:
(1) initialization, with $1 \le i \le N$: $\alpha_1(i) = \pi_i\, b_i(\mathbf x_1)$ and $\hat d_i(1) = 1$, where $\mathbf P(\hat{\mathbf d}_1)$ is estimated using (6);
(2) induction, with $1 \le j \le N$ and $1 \le t \le T-1$: $\alpha_{t+1}(j) = \bigl[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}(\hat{\mathbf d}_t)\bigr] b_j(\mathbf x_{t+1})$, where $a_{ij}(\hat{\mathbf d}_t)$ are the coefficients of the matrix $\mathbf A_{\hat{\mathbf d}_t}$;
(3) termination: $P(\mathbf x \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.
Similar considerations as for the forward procedure can be made for the backward algorithm, which is implemented by defining the variable $\beta_t(i)$ as
$$\beta_t(i) = P(\mathbf x_{t+1} \mathbf x_{t+2} \cdots \mathbf x_T \mid s_t = S_i, \lambda).$$
Having estimated the dynamic transition matrix $\mathbf A_{\hat{\mathbf d}_t}$ for each $1 \le t \le T$ using (24), the backward variable can be calculated inductively as follows.
(1) Initialization, with $1 \le i \le N$: $\beta_T(i) = 1$.
(2) Induction, with $1 \le i \le N$ and $t = T-1, T-2, \ldots, 1$: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}(\hat{\mathbf d}_t)\, b_j(\mathbf x_{t+1})\, \beta_{t+1}(j)$.
Although the variable $\beta_t(i)$ is not necessary for the calculation of the model likelihood, it will be useful in the parameter reestimation procedure, as will be explained in Section 2.2.3.
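The duration-dependent forward pass can be sketched end to end. The example below combines the dynamic transition matrix with the duration update; the discrete emission matrix and the Weibull duration parameters are illustrative assumptions, not values from the paper.

```python
import math

def weibull_cdf(d, shape, scale):
    return 0.0 if d <= 0 else 1.0 - math.exp(-((d / scale) ** shape))

def self_transition(d, shape, scale):
    """Self-transition probability as a ratio of duration survival functions."""
    surv_prev = 1.0 - weibull_cdf(d - 1, shape, scale)
    return (1.0 - weibull_cdf(d, shape, scale)) / surv_prev if surv_prev > 0 else 0.0

def forward(obs, pi, A0, B, theta):
    """Forward pass where, at each step, the transition matrix is rebuilt from
    the estimated average state durations (Eq. (20)-style update)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    d_hat = [1.0] * N
    for o in obs[1:]:
        # Build A_{d_t} = P(d_t) + (I - P(d_t)) A0 from the current durations.
        A = [[0.0] * N for _ in range(N)]
        for i in range(N):
            p_ii = self_transition(d_hat[i], *theta[i])
            for j in range(N):
                A[i][j] = (p_ii if i == j else 0.0) + (1.0 - p_ii) * A0[i][j]
        # gamma_t is the normalized forward variable at time t.
        total = sum(alpha)
        gamma = [a / total for a in alpha]
        # Standard induction, but with the dynamic transition matrix.
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o] for j in range(N)]
        # Duration update: d_i <- gamma_i * d_i + 1.
        d_hat = [g * d + 1.0 for g, d in zip(gamma, d_hat)]
    return sum(alpha)  # P(x | lambda)

pi = [1.0, 0.0]
A0 = [[0.0, 1.0], [1.0, 0.0]]
B = [[0.8, 0.2], [0.3, 0.7]]         # illustrative discrete emission matrix
theta = [(2.0, 6.0), (2.0, 4.0)]     # illustrative Weibull duration parameters
likelihood = forward([0, 0, 1, 1], pi, A0, B, theta)
```

Note that, unlike (13), no maximum duration appears anywhere: the cost per time step is the same as for a plain HMM forward pass.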

The Viterbi Algorithm.
The Viterbi algorithm [46,47] (also known as decoding) allows determining the best state sequence corresponding to a given observation sequence.
Formally, given a sequence of observations $\mathbf x = \mathbf x_1 \mathbf x_2 \cdots \mathbf x_T$, the best state sequence $s^* = s^*_1 s^*_2 \cdots s^*_T$ corresponding to $\mathbf x$ is calculated by defining the variable
$$\delta_t(i) = \max_{s_1, \ldots, s_{t-1}} P(s_1 \cdots s_{t-1}, s_t = S_i, \mathbf x_1 \cdots \mathbf x_t \mid \lambda). \quad (29)$$
The procedure to recursively calculate the variable $\delta_t(i)$ and to retrieve the target state sequence (i.e., the arguments which maximize the $\delta_t(i)$'s) for the proposed HSMM is a straightforward extension of the Viterbi algorithm for HMMs [13]. The only change is the usage, in the recursive calculation of $\delta_t(i)$, of the dynamic transition matrix $\mathbf A_{\hat{\mathbf d}_t} = [a_{ij}(\hat{\mathbf d}_t)]$, calculated through (24). The Viterbi algorithm for the introduced parametric HSMMs can be summarized as follows:
(1) initialization, with $1 \le i \le N$: $\delta_1(i) = \pi_i\, b_i(\mathbf x_1)$, $\psi_1(i) = 0$;
(2) recursion, with $1 \le j \le N$ and $2 \le t \le T$: $\delta_t(j) = \max_{1 \le i \le N} \bigl[\delta_{t-1}(i)\, a_{ij}(\hat{\mathbf d}_{t-1})\bigr] b_j(\mathbf x_t)$, where we keep track of the maximizing argument using the vector $\psi_t(j) = \arg\max_{1 \le i \le N} \bigl[\delta_{t-1}(i)\, a_{ij}(\hat{\mathbf d}_{t-1})\bigr]$;
(3) termination: $P^* = \max_{1 \le i \le N} \delta_T(i)$ and $s^*_T = \arg\max_{1 \le i \le N} \delta_T(i)$, which, tracked back through $s^*_t = \psi_{t+1}(s^*_{t+1})$ for $t = T-1, \ldots, 1$, gives the desired best state sequence.

Parameter Reestimation.
We use the modified Baum-Welch algorithm of Azimi et al. [30-32]. However, in our implementation we make no assumption on the density function used to model the state duration, and we consider both continuous and discrete observations.
Being a variant of the more general Expectation-Maximization (EM) algorithm, Baum-Welch is an iterative procedure which consists of two steps: (i) the expectation step, in which the forward and backward variables are calculated and the model likelihood is estimated, and (ii) the maximization step, in which the model parameters are updated and used in the next iteration. This process usually starts from a random guess of the model parameters $\lambda_0$ and is iterated until the likelihood function no longer improves between two consecutive iterations.
Similarly to HMMs, the reestimation formulas are derived by firstly introducing the variable $\xi_t(i,j)$, which represents the probability of being in state $S_i$ at time $t$ and in state $S_j$ at time $t+1$, given the model and the observation sequence. However, in the HSMM case, the variable $\xi_t(i,j)$ considers the duration estimation performed in the forward algorithm (see (24)). Formulated in terms of the forward and backward variables, it is given by
$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}(\hat{\mathbf d}_t)\, b_j(\mathbf x_{t+1})\, \beta_{t+1}(j)}{P(\mathbf x \mid \lambda)}.$$
From $\xi_t(i,j)$ we can derive the quantity $\gamma_t(i)$ (already defined in (17)), representing the probability of being in state $S_i$ at time $t$ given the observation sequence and the model parameters: $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$. Finally, the reestimation formulas for the parameters $\pi$ and $\mathbf A_0$ are given by
$$\bar\pi_i = \gamma_1(i), \qquad \bar a^0_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)\, z_{ij}}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i,k)\, z_{ik}},$$
where $\mathbf Z = [z_{ij}]$ is a square matrix of dimensions $N \times N$ with $z_{ij} = 0$ for $i = j$ and $z_{ij} = 1$ for $i \neq j$, which masks the diagonal through the element-by-element product $\odot$. For the matrix $\mathbf A_0$, being normalized, the stochastic constraints are satisfied at each iteration, that is, $\sum_{j=1}^{N} \bar a^0_{ij} = 1$ for each $1 \le i \le N$, while the estimation of the prior probability $\bar\pi_i$ inherently sums up to 1 at each iteration, since it represents the expected frequency of state $S_i$ at time $t = 1$.
With respect to the reestimation of the state duration parameters $\Theta$, we first estimate the mean $\mu_{d,i}$ and the variance $\sigma^2_{d,i}$ of the $i$th state duration, for each $1 \le i \le N$, from the forward and backward variables and the estimation of the state duration variable:
$$\mu_{d,i} = \frac{\sum_{t=1}^{T-1} \sum_{j \neq i} \xi_t(i,j)\, \hat d_i(t)}{\sum_{t=1}^{T-1} \sum_{j \neq i} \xi_t(i,j)}, \qquad \sigma^2_{d,i} = \frac{\sum_{t=1}^{T-1} \sum_{j \neq i} \xi_t(i,j)\, \bigl(\hat d_i(t) - \mu_{d,i}\bigr)^2}{\sum_{t=1}^{T-1} \sum_{j \neq i} \xi_t(i,j)}.$$
The first formula can be interpreted as the probability of transition from state $S_i$ to $S_j$ with $j \neq i$ at time $t$, weighted by the duration of state $S_i$ at $t$, giving the desired expected value, while in the second the same quantity is weighted by the squared distance of the duration at time $t$ from its mean, giving the estimate of the variance.
Then, the parameters of the desired duration distribution can be estimated from $\mu_{d,i}$ and $\sigma^2_{d,i}$. For example, if a Gamma distribution with shape parameter $\nu$ and scale parameter $\eta$ is chosen to model the state duration, the parameters $\nu_i$ and $\eta_i$ for each $1 \le i \le N$ can be calculated as $\nu_i = \mu^2_{d,i} / \sigma^2_{d,i}$ and $\eta_i = \sigma^2_{d,i} / \mu_{d,i}$.
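The moment-matching step for the Gamma family is a one-liner; a quick sketch with illustrative numbers:

```python
def gamma_params_from_moments(mean, variance):
    """Method of moments for a Gamma distribution:
    shape nu = mean^2 / variance, scale eta = variance / mean."""
    nu = mean ** 2 / variance
    eta = variance / mean
    return nu, eta

nu, eta = gamma_params_from_moments(10.0, 5.0)  # -> (20.0, 0.5)
```

Since a Gamma distribution with shape $\nu$ and scale $\eta$ has mean $\nu\eta$ and variance $\nu\eta^2$, the original moments are recovered exactly.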

Concerning the observation parameters, once the modified forward and backward variables, accounting for the state duration, are defined as in ( 22) and ( 28), the reestimation formulas are the same as for Hidden Markov Models [13].
In particular, for continuous observations, the parameters of the Gaussian mixture defined in (9) are reestimated by firstly defining the probability of being in state $S_i$ at time $t$ with the observation vector $\mathbf x_t$ evaluated by the $m$th mixture component, as
$$\gamma_t(i,m) = \gamma_t(i)\, \frac{c_{im}\, \mathcal N(\mathbf x_t;\, \boldsymbol\mu_{im}, \mathbf U_{im})}{\sum_{m'=1}^{M} c_{im'}\, \mathcal N(\mathbf x_t;\, \boldsymbol\mu_{im'}, \mathbf U_{im'})}.$$
By using the former quantity, the parameters $c_{im}$, $\boldsymbol\mu_{im}$, and $\mathbf U_{im}$ are reestimated through the following formulas:
$$\bar c_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i,m)}{\sum_{t=1}^{T} \sum_{m'=1}^{M} \gamma_t(i,m')}, \qquad \bar{\boldsymbol\mu}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i,m)\, \mathbf x_t}{\sum_{t=1}^{T} \gamma_t(i,m)}, \qquad \bar{\mathbf U}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i,m)\, (\mathbf x_t - \boldsymbol\mu_{im})(\mathbf x_t - \boldsymbol\mu_{im})^T}{\sum_{t=1}^{T} \gamma_t(i,m)},$$
where the superscript $T$ denotes vector transpose. For discrete observations, the reestimation formula for the observation matrix $b_i(k)$ is
$$\bar b_i(k) = \frac{\sum_{t\,:\,o_t = v_k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)},$$
where the quantity $\gamma_t(i)$, which takes into account the duration-dependent forward variable $\alpha_t(i)$, is calculated through (17).
The reader is referred to Rabiner's work [13] for the interpretation of the observation parameter reestimation formulas.

AIC-Based Model Selection
In the framework of the proposed parametric HSMMs, the model selection procedure aims to select the optimal number of hidden states $N$, the right duration distribution family, and, in the case of mixture observation modeling, the number of Gaussian mixture components $M$ to be used. In this work, we make use of the Akaike Information Criterion (AIC). Indeed, it has been shown that, in the case of complex models and in the presence of a limited number of training observations, AIC represents a satisfactory methodology for model selection, outperforming other approaches such as the Bayesian Information Criterion.
In general, information criteria are represented as a two-term structure. They account for a compromise between a measure of model fitness, which is based on the likelihood of the model, and a penalty term which takes into account the model complexity. Usually, the model complexity is measured in terms of the number of parameters that have to be estimated and in terms of the number of observations. The Akaike Information Criterion is an estimate of the asymptotic value of the expected distance between the unknown true likelihood function of the data and the fitted likelihood function of the model. In particular, the AIC can be expressed as
$$\mathrm{AIC} = -2 \ln L(\hat\lambda) + 2k, \quad (46)$$
where $L(\hat\lambda)$ is the likelihood of the model with the estimated parameters $\hat\lambda$, as defined in (25), $k$ is the number of model parameters, and $T$ is the length of the observed sequence over which the likelihood is evaluated. The best model is the one minimizing (46). Concerning $k$, the number of parameters to be estimated for a parametric HSMM with $N$ states is $k = k_h + k_o$, where $k_h$ counts the parameters of the hidden-state layer, while $k_o$ counts those of the observation layer.
Concerning $k_o$, a distinction must be made between discrete and continuous observations: (i) in the case of discrete observations with $K$ possible observable values, $k_o = (K-1) \cdot N$, which accounts for the elements of the observation matrix $\mathbf B$; (ii) if the observations are continuous and a multivariate mixture of $M$ Gaussians with $v$ variates is used as observation model,
$$k_o = N M v + N M \frac{v(v+1)}{2} + N(M-1),$$
where each term accounts, respectively, for the mean vectors $\boldsymbol\mu$, the covariance matrices $\mathbf U$, and the mixture coefficients $c$.
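A minimal sketch of the AIC computation and of the observation-layer parameter counts. The counting formulas mirror the ones given above; the continuous-case count (symmetric covariances, weights summing to 1) is our reconstruction.

```python
def aic(log_likelihood, k):
    """AIC = -2 ln L + 2k; the best configuration minimizes this value."""
    return -2.0 * log_likelihood + 2.0 * k

def n_observation_params(N, discrete_symbols=None, mixtures=None, variates=None):
    """Observation-layer parameter count k_o for the two observation models."""
    if discrete_symbols is not None:
        # Each of the N rows of B has K entries constrained to sum to 1.
        return (discrete_symbols - 1) * N
    M, v = mixtures, variates
    # Means + symmetric covariances + mixture weights (one constraint per state).
    return N * M * v + N * M * v * (v + 1) // 2 + N * (M - 1)
```

In a model-selection loop one would train each candidate configuration, evaluate its log-likelihood with the forward algorithm, and keep the configuration with the lowest AIC.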

Remaining Useful Lifetime Estimation
One of the most important advantages of the time modeling of HSMMs is the possibility to effectively face the prediction problem. The knowledge of the state duration distributions allows the estimation of the remaining time in a certain state and, in general, the prediction of the expected time $\hat{\mathcal D}$ before entering a given state.
As already mentioned, an interesting application of the prediction problem is the Remaining Useful Lifetime (RUL) estimation of industrial equipment. Indeed, if each state of an HSMM is mapped to a different condition of an industrial machine and if the state $S_f$ that represents the failure condition is identified, at each moment the RUL can be defined as the expected time $\hat{\mathcal D}$ to reach the failure state $S_f$. If we assume that the time to failure is a random variable $\mathcal D$ following a given probability density, we define the RUL at the current time $t$ as $\mathrm{RUL}(t) = E[\mathcal D]$, where $E$ denotes the expected value.
Having fixed the failure state, the estimation of the RUL is performed in two steps every time a new observation is acquired (online): (1) estimation of the current state; (2) projection of the future state transitions until the failure state is reached, together with the estimation of the expected sojourn times.
The estimation of the current state is performed via the Viterbi path, that is, via the variable $\boldsymbol\delta_t = [\delta_t(i)]_{1 \le i \le N}$ defined in (29). To correctly model the uncertainty of the current state estimation, we use the normalized variable $\tilde\gamma_t(i)$ obtained as
$$\tilde\gamma_t(i) = \frac{\delta_t(i)}{\sum_{j=1}^{N} \delta_t(j)},$$
that is, an estimate of the probability of being in state $S_i$ at time $t$.
Together with the normalized variable $\tilde\gamma_t(i)$, the maximum a posteriori estimate of the current state $\hat s^*_t$ is taken into account according to (34). If $\hat s^*_t$ coincides with the failure state, the desired event is detected by the model and the time to this event is obviously zero. Otherwise, an estimate of the average remaining time in the current state, $\hat d_{\mathrm{avg}}(\hat s^*_t)$, is calculated as
$$\hat d_{\mathrm{avg}}(\hat s^*_t) = \sum_{i=1}^{N} \tilde\gamma_t(i)\, \bigl(\mu_{\theta_i} - \hat d_i(t)\bigr), \quad (49)$$
where with $\mu_{\theta_i}$ we denote the expected value of the duration variable in state $S_i$ according to the duration distribution specified by the parameters $\theta_i$. Equation (49) thus estimates the remaining time in the current state by subtracting the estimated state duration $\hat d_i(t)$ at time $t$ from the expected sojourn time of state $S_i$, weighting the result using the uncertainty about the current state, $\tilde\gamma_t(i)$, and, finally, summing up all the contributions from each state.
In addition to the average remaining time, a lower and an upper bound can be calculated based on the standard deviation, σ_i, of the duration distribution for state s_i. Once the remaining time in the current state is estimated, the probability of the next state is calculated by multiplying the transpose of the nonrecurrent transition matrix by the current-state probability estimate, while the maximum a posteriori estimate of the next state, ŝ*_next, is the state maximizing this probability. Again, if ŝ*_next coincides with the failure state, the failure will happen after the remaining time in the current state is over, and the average estimate of the failure time is D_avg = d_avg(ŝ*_t) calculated at the previous step, with the bound values D_low = d_low(ŝ*_t) and D_up = d_up(ŝ*_t). Otherwise, the expected sojourn time of the next state is calculated in the same way, and this procedure is repeated until the failure state is encountered in the prediction of the next state. The RUL is then simply obtained by summing all the estimated remaining times in each intermediate state before encountering the failure state. Algorithm 1 details the above described RUL estimation procedure.
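The two-step procedure above (current-state estimate, then forward projection through the nonrecurrent transition matrix until the failure state is predicted) can be sketched as follows. All names are illustrative, and the per-state duration expectations are assumed to be precomputed from the fitted duration parameters; this is a minimal sketch of Algorithm 1, not a verbatim implementation.

```python
import numpy as np

def estimate_rul(state_probs, expected_dur, elapsed_dur, A_nr, failure_state):
    """Sketch of the online RUL estimation loop.

    state_probs   : current-state probability estimate (sums to 1)
    expected_dur  : expected sojourn time E[d | lambda_i] of each state
    elapsed_dur   : estimated time already spent in each state at time t
    A_nr          : nonrecurrent transition matrix (no self-transitions)
    failure_state : index of the absorbing failure state
    """
    cur = int(np.argmax(state_probs))          # MAP estimate of the current state
    if cur == failure_state:                   # failure already reached: time to event is 0
        return 0.0
    # Remaining time in the current state, weighted by state uncertainty
    rul = float(np.sum(state_probs * np.maximum(expected_dur - elapsed_dur, 0.0)))
    probs = state_probs.copy()
    while True:
        probs = A_nr.T @ probs                 # next-state probability
        nxt = int(np.argmax(probs))
        if nxt == failure_state:               # stop when failure is the MAP next state
            return rul
        rul += float(np.sum(probs * expected_dur))  # add expected sojourn of next state
```

The clamping with `np.maximum` is an added safeguard for the case where the elapsed duration already exceeds the expected sojourn time; the paper's equations do not state this explicitly.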

Experimental Results
To demonstrate the effectiveness of the proposed HSMM models, we performed a series of experiments on both simulated and real data.
The simulated data were generated by considering a left-right HSMM and adapting the parameters of the artificial example reported in the work of Lee et al. [15]. The real-case data are monitoring data covering the entire operational life of bearings, made available for the IEEE PHM 2012 data challenge (http://www.femto-st.fr/en/Research-departments/AS2M/Research-groups/PHM/IEEE-PHM-2012-Data-challenge.php).
For each of the continuous and the discrete cases, three data sets have been generated by considering the following duration models: Gaussian, Gamma, and Weibull densities. For each of the three data sets, 30 series of data are used as training set and 10 series as testing set. Each time series contains T = 650 observations. The parameters used to generate the data are taken from the work of Lee et al. [15] and adapted to obtain an equivalent left-right parametric HSMM. It must be noticed that, as explained in Section 2.1, since state s_5 is the absorbing state, its duration parameters λ_5 have no influence on the data: once state s_5 is reached, the system remains there forever.
Concerning the continuous observation modeling, a bivariate Gaussian distribution has been used with the parameters of [15]. An example of simulated data for both the continuous and the discrete cases is shown in Figure 2, where a Gaussian duration model has been used.
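A left-right HSMM data generator of the kind described above can be sketched as follows. The Gamma duration parameters and Gaussian emission means below are invented placeholders, not the values of Lee et al. [15], and a univariate emission is used for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_left_right_hsmm(n_states=5, T=650):
    """Illustrative left-right HSMM generator with parametric (Gamma) durations."""
    shapes = np.full(n_states, 9.0)            # Gamma shape per state (assumed)
    scales = np.full(n_states, 15.0)           # Gamma scale per state (assumed)
    means = np.arange(n_states, dtype=float)   # Gaussian emission mean per state (assumed)
    states, t, s = [], 0, 0
    while t < T:
        if s == n_states - 1:
            d = T - t                          # absorbing state: stay until the end
        else:
            d = max(1, int(rng.gamma(shapes[s], scales[s])))  # sampled sojourn time
        states.extend([s] * min(d, T - t))
        t += d
        s = min(s + 1, n_states - 1)           # left-right: only forward jumps
    states = np.array(states[:T])
    obs = means[states] + 0.1 * rng.standard_normal(T)
    return states, obs

states, obs = simulate_left_right_hsmm()
```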

Training and Model Selection.
The goal of this experimental phase is to test the effectiveness of the AIC in solving the automatic model selection problem. For this purpose, the training sets of the 6 data sets (continuous/discrete observations with Gaussian, Gamma, and Weibull duration models) have been taken individually and, for each of them, a series of learning procedures has been run, each with a different HSMM structure. In particular, we took into account all the combinations of the duration distribution families (Gaussian, Gamma, and Weibull), an increasing number of states, N, from 2 to 8, and, for the continuous observation cases, an increasing number of Gaussian mixtures, M, in the observation distribution from 1 to 4.
As accurate parameter initialization is crucial for obtaining a good model fit [14], a series of 40 learning procedures, corresponding to 40 random initializations of the initial parameters λ_0, has been executed for each of the considered HSMM structures. For each model structure, the AIC value, as defined in (46), has been evaluated. The final trained set of parameters λ* corresponding to the minimum AIC value has been retained, resulting in 7 HSMMs with a number of states from 2 to 8.
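The selection rule just described (best log-likelihood over random restarts per structure, then minimum AIC across structures) can be sketched as follows; candidate names and values are illustrative.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (smaller is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def select_model(candidates):
    """candidates: (name, best log-likelihood over random restarts, number of
    free parameters k). Returns the name minimizing AIC and all AIC scores."""
    scores = {name: aic(ll, k) for name, ll, k in candidates}
    return min(scores, key=scores.get), scores
```

Note how the 2k penalty term discourages the larger structures: a model with more states only wins if its likelihood gain outweighs the extra parameters, which is exactly the behavior observed in Figure 3.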
The obtained results are shown in Figure 3 for both the continuous and the discrete observation data. As can be noticed, for all the 6 test cases of Figure 3, the AIC values do not improve much for a number of states higher than 5, meaning that adding more states does not add considerable information to the HSMM modeling power. Hence 5 states can be considered the optimal number of states. Moreover, as shown in the zoomed sections of Figure 3, for the HSMMs with 5 states, the minimum AIC values are obtained for the duration distributions corresponding to the ones used to generate the data. As a consequence, AIC can be considered an effective approach to perform model selection for HSMMs, as well as to select the appropriate parametric distribution family for the state duration modeling.

Condition Monitoring.
The optimal parameters λ* obtained in the previous phase have been used to perform the online condition monitoring experiment on the 10 testing cases for all the 6 considered HSMM configurations. In this experiment, we simulate online monitoring by considering all the testing observations up to the current time, that is, {x_1, x_2, ..., x_t}. Each time a new data point is acquired, the Viterbi algorithm is used to estimate the current state ŝ*_t = arg max_{1≤i≤N} [δ_t(i)], as specified in (34). An example of execution of the condition monitoring experiment is shown in Figure 4 for continuous and discrete observations, respectively. In Figure 4(a) the state duration has been modeled with a Gamma distribution, while in Figure 4(b) with a Gaussian distribution. In Figures 4(a) and 4(b), the first display represents the true state of the HSMM, the second display represents the state estimated by the Viterbi algorithm, and the third display represents the observed time series.
Knowing the true state sequence, we calculated the accuracy, defined as the percentage of correctly estimated states over the total length of the state sequence, for each of the testing cases. The results are summarized in Table 1(a) for the continuous observations and in Table 1(b) for the discrete observations. The high percentage of correctly classified states shows that HSMMs can be effectively used to solve condition monitoring problems for applications in which the system shows a strong time and state duration dependency.
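The accuracy metric used in Table 1 is simply the fraction of matching positions between the true and the estimated state sequences, expressed as a percentage:

```python
import numpy as np

def state_accuracy(true_states, estimated_states):
    """Percentage of correctly estimated states over the sequence length."""
    true_states = np.asarray(true_states)
    estimated_states = np.asarray(estimated_states)
    return 100.0 * np.mean(true_states == estimated_states)
```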

Remaining Useful Lifetime Estimation.
In this experimental phase we considered the state s_5 as the failure state and the trained parameters λ* of Section 5.1.2 for each HSMM configuration. As the online RUL estimation procedure is intended to be used in real time, we simulated a condition monitoring experiment where we progressively consider the observations {x_1, x_2, ..., x_t} up to time t. When a new observation is acquired, after the current-state probability δ̄_t(i) is estimated (Equation (48)), the calculation of the average, upper, and lower RUL ((57), (58), and (59)) is performed.
Examples of RUL estimation are illustrated in Figure 5. In particular, Figure 5(a) represents the case of continuous observations and duration modeled by a Weibull distribution, while Figure 5(b) shows the case of discrete observations and duration modeled by a Gamma distribution. From the figures one can notice that the average as well as the lower and upper bound estimates converge to the real RUL as the failure time approaches. Moreover, as expected, the uncertainty of the estimate decreases with time, since predictions performed at an early stage are more imprecise. As a consequence, the upper and lower bounds become narrower as the failure state approaches, and the estimate becomes more precise until it converges to the actual RUL value, with the prediction error tending to zero at the end of the evaluation.
To quantitatively assess the performance of our methodology for RUL estimation, we considered at each time t the absolute prediction error (APE) between the real and the predicted RUL, defined as APE(t) = |RUL_real(t) − RUL(t)|, where RUL_real(t) is the (known) value of the RUL at time t, while RUL(t) is the RUL predicted by the model. To evaluate the overall performance of our methodology, we considered the average absolute prediction error of the RUL estimation, defined as the mean of APE(t) over the length T of the testing signal. Being an average of prediction errors, values of (64) close to zero correspond to good predictive performance.
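The overall metric of (64) is the mean absolute deviation between the real and the predicted RUL trajectories:

```python
import numpy as np

def average_absolute_prediction_error(rul_real, rul_pred):
    """Mean of APE(t) = |RUL_real(t) - RUL(t)| over the testing signal."""
    rul_real = np.asarray(rul_real, dtype=float)
    rul_pred = np.asarray(rul_pred, dtype=float)
    return float(np.mean(np.abs(rul_real - rul_pred)))
```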
The results for each of the 10 testing cases and the different HSMM configurations are reported in Tables 2(a) and 2(b), for the continuous and the discrete observation cases, respectively. As can be noticed, the online Remaining Useful Lifetime estimation, and in general the online prediction of the time to a certain event, can be effectively addressed with HSMMs, which achieve reliable estimation power with a small prediction error.
Finally, we tested our RUL estimation methodology using the state duration estimation of (16) introduced by Azimi et al. [30-32]. The results are shown in Tables 3(a) and 3(b), which report the prediction errors obtained for continuous and discrete observations, respectively.
Comparing Tables 2 and 3, one can notice that the proposed RUL method outperforms that of Azimi. This is mainly due to the proposed average state duration of (20), compared to that of Azimi given by (16).

Real Data.
In this section we apply the proposed HSMM-based approach for RUL estimation to a real case study, using bearing monitoring data recorded during experiments carried out on the Pronostia experimental platform and made available for the IEEE Prognostics and Health Management (PHM) 2012 challenge [49]. The data correspond to naturally degraded bearings, leading to cases which closely correspond to industrial reality. The choice of testing the proposed methodology on bearings derives from two facts: (i) bearings are the components most critical to failures of rotating machines [50] and (ii) their monotonically increasing degradation pattern justifies the usage of left-right HSMM models.
The Pronostia platform allows performing run-to-failure tests under constant or variable operating conditions. The operating conditions are determined by two factors that can be controlled online: (i) the rotation speed and (ii) the radial force load. During each experiment, temperature and vibration measurements are gathered online through two types of sensors placed on the bearing: a temperature probe and two accelerometers (one on the vertical and one on the horizontal axis). The platform is composed of three main parts: a rotating part, a load profile generation part, and a measurement part, as illustrated in Figure 6.
The rotating part is composed of an asynchronous motor, which develops a power of 250 W, two shafts, and a gearbox, which allows the motor to reach its rated speed of 2830 rpm. The motor's rotation speed and direction are set through a human-machine interface.
The load profile generation part issues a radial force on the external ring of the test bearing through a pneumatic jack connected to a lever arm, which indirectly transmits the load through a clamping ring. The goal of the applied radial force is to accelerate the bearing's degradation process. The measurement part consists of a data acquisition card connected to the monitoring sensors, which provides the user with the measured temperature and vibration data. The vibration measurements are provided in snapshots of 0.1 s collected every 10 seconds at a sampling frequency of 25.6 kHz (2560 samples per snapshot), while the temperature has been continuously recorded at a sampling frequency of 10 Hz (600 samples collected every minute).
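The acquisition figures quoted above are mutually consistent, as a quick arithmetic check shows (values copied from the text):

```python
# Vibration channel: 0.1 s snapshots at 25.6 kHz
fs_vib = 25_600            # sampling frequency, Hz
snapshot_s = 0.1           # snapshot length, s
samples_per_snapshot = int(fs_vib * snapshot_s)   # -> 2560

# Temperature channel: continuous recording at 10 Hz
fs_temp = 10               # sampling frequency, Hz
samples_per_minute = fs_temp * 60                 # -> 600
```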
Further details on the Pronostia test rig can be found in the data presentation paper [49] and on the web page of the data challenge (http://www.femto-st.fr/en/Research-departments/AS2M/Research-groups/PHM/IEEE-PHM-2012-Datachallenge.php). Regarding the data provided for the PHM 2012 challenge, 3 different operating conditions were considered: (i) first operating condition: speed of 1800 rpm and load of 4000 N; (ii) second operating condition: speed of 1650 rpm and load of 4200 N; (iii) third operating condition: speed of 1500 rpm and load of 5000 N.
Under the above operating conditions, a total of 17 accelerated life tests were performed on bearings of type NSK 6804 DD, which can operate at a maximum speed of 13000 rpm and a load limit of 4000 N. The tests were stopped when the amplitude of the vibration signal exceeded 20 g; this moment was defined as the bearing failure time. An example of a bearing before and after the experiment is shown in Figure 7, together with the corresponding vibration signal collected during the whole test.
Table 4 reports how the 17 tested bearings were divided among the three operating conditions. Moreover, the duration of each experiment, being the RUL to be predicted for each bearing, is also given. We performed two sets of experiments by considering, respectively, the bearings relative to the first and the second operating conditions (i.e., Bearing1 1, Bearing1 2, ..., Bearing1 7 and Bearing2 1, Bearing2 2, ..., Bearing2 7).
As already mentioned, the available data correspond to naturally degraded bearings, meaning that the defects were not initially induced and that each degraded bearing contains almost all types of defects (balls, rings, and cage), faithfully resembling a common industrial situation. Moreover, no assumption about the type of failure to occur is provided with the data and, since the variability in experiment durations is high (from 1 h to 7 h), performing good estimates of the RUL is a difficult task [49].
In our experiments we considered, as input to our model, the horizontal channel of the accelerometer. We preprocessed the raw signals by extracting two time-domain features, namely, root mean square (RMS) and kurtosis, within windows of the same length as the given snapshots (L = 2560); that is, for each window we computed the RMS and kurtosis of the raw signal. To assess the performance of the proposed HSMM, after the model selection procedure, we implemented a leave-one-out cross validation scheme: considering conditions 1 and 2 separately, we performed the online RUL estimation for each of the 7 bearings, using an HSMM trained on the remaining 6 bearing histories. Similarly to the simulated case, we considered the average absolute prediction error, defined in (64), to quantitatively evaluate our method.
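The two-feature preprocessing step can be sketched as follows; the Pearson (non-excess) kurtosis convention is an assumption, as the paper does not state which convention was used.

```python
import numpy as np

def window_features(signal, window=2560):
    """RMS and Pearson kurtosis per non-overlapping window of the raw signal
    (window length chosen to match the 2560-sample vibration snapshots)."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal) // window
    feats = np.empty((n, 2))
    for i in range(n):
        w = signal[i * window:(i + 1) * window]
        rms = np.sqrt(np.mean(w ** 2))
        m, s = w.mean(), w.std()
        kurt = np.mean((w - m) ** 4) / (s ** 4)   # Pearson kurtosis (Gaussian ~ 3)
        feats[i] = (rms, kurt)
    return feats
```

For a healthy bearing the vibration signal is close to Gaussian (kurtosis near 3); impacting defects raise the kurtosis, which is why this feature is a common degradation indicator.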

Bearings RUL Estimation.
We performed our experiments in two steps: first, we applied model selection in order to determine an optimal model structure; second, we estimated the RUL of the bearings. The full procedure is detailed in the following.
(A) HSMM Structure. To determine an appropriate HSMM structure for effectively modeling the considered data, we considered several HSMM structures, characterized by (i) the duration distribution family (Gaussian, Gamma, or Weibull), (ii) an increasing number of states, N, from 2 to 6, and (iii) an increasing number of Gaussian mixtures, M, in the observation density from 1 to 4. For each model structure, obtained by systematically considering all the combinations of (i) to (iii), we ran 120 parameter learnings, corresponding to 120 random initializations λ_0, on the training data sets; the structure achieving the minimum AIC value was retained (see Figure 9). (B) RUL Estimation. Using the above obtained optimal HSMM structure, we trained it via a leave-one-out cross validation scheme by using for condition 1, at each iteration, Bearing1 i, 1 ≤ i ≤ 7, as the testing bearing, while the remaining six bearings were used for training. Once the trained parameters λ*_i were estimated for the ith testing bearing, we progressively collected the observations of the tested Bearing1 i to calculate, at each time t, the average, lower, and upper RUL, as specified in (57), (58), and (59), respectively, considering the state s_4 as the failure state. The same procedure has been performed for the bearings in condition 2.
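The leave-one-out scheme of step (B) can be sketched generically as follows; `train_fn` and `evaluate_fn` stand in for the HSMM training and online RUL evaluation routines, which are not reproduced here.

```python
def leave_one_out(bearings, train_fn, evaluate_fn):
    """For each bearing, train on all the others and evaluate on the held-out
    one. Returns a dict mapping the held-out index to its evaluation result."""
    results = {}
    for i, test_bearing in enumerate(bearings):
        train_set = bearings[:i] + bearings[i + 1:]   # all histories but one
        model = train_fn(train_set)
        results[i] = evaluate_fn(model, test_bearing)
    return results
```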
Examples of RUL estimation for Bearing1 7 and Bearing2 6 are shown in Figures 10(a) and 10(b), respectively, where the black solid line represents the real RUL, which goes to zero as time goes on. As can be seen, the average as well as the lower and upper bound estimates converge to the real RUL as the real failure time approaches, and the uncertainty of the estimate decreases with time.
Concerning the quantitative estimation of the predictive performance, we report in Table 5 the average absolute prediction error of the RUL estimation (see Equation (64)), expressed in seconds. As can be noticed, the average absolute prediction error of the average RUL is, respectively, 1 hour and 15 minutes for condition 1 and 1 hour and 5 minutes for condition 2, which are good values, considering the high variability of the training set durations and the fact that the performance metric also takes into account the less accurate predictions performed at an early stage of the bearings' life. Moreover, for both conditions, the average prediction errors of 5 tests out of 7 are below the overall average, while the best average error of the mean RUL is only 23 minutes for condition 1 and further decreases to 14 minutes for condition 2.

Conclusion and Future Work
In this paper, we introduced an approach based on Hidden Semi-Markov Models (HSMMs) and the Akaike Information Criterion (AIC) to perform (i) automatic model selection, (ii) online condition monitoring, and (iii) online time-to-event estimation.
The proposed HSMM models the state duration distribution with a parametric density, allowing a less computationally expensive learning procedure due to the few parameters to estimate. Together with the provided general model specification, the modified learning, inference, and prediction algorithms allow the usage of any parametric distribution to model the state duration, as well as continuous or discrete observations. As a consequence, a wide class of different applications can be modeled with the proposed methodology. This paper highlights, through experiments performed on simulated data, that the proposed approach is effective in (i) automatically selecting the correct configuration of the HSMM in terms of number of states and duration distribution family, (ii) performing online state estimation, and (iii) correctly predicting the time to a given event, identified as the entrance of the model into a target state. As a consequence, the proposed parametric HSMM combined with AIC can be used in practice for condition monitoring and Remaining Useful Lifetime applications.
As the targeted application of the proposed methodology is failure prognosis in industrial machines, combining the proposed HSMM with an online learning procedure, capable of adapting the model parameters to new conditions, will be considered in a future extension.

(1) Given the observation sequence x and a model λ, calculate the probability that the sequence x has been generated by the model λ, that is, P(x | λ).
(2) Given the observation sequence x and a model λ, calculate the state sequence s = s_1 s_2 ⋯ s_T which has most probably generated the sequence x.
(3) Given the observation sequence x, find the parameters of the model λ which maximize P(x | λ).

Figure 2 :
Figure 2: The data generated with the parameters described in Section 5.1.1, both for the continuous and the discrete case.
In the discrete case, 7 distinct observation symbols have been taken into consideration with the following observation probability distribution. (Panel titles of Figure 3: AIC values for discrete data and Gamma duration distribution; AIC values for discrete data and Weibull duration distribution.)

Figure 3 :
Figure 3: Akaike Information Criterion (AIC) applied to continuous and discrete observations data.AIC is effective for automatic model selection, since its minimum value provides the same number of states and duration model used to generate the data.

Figure 4 :
Figure 4: Condition monitoring using the Viterbi path.HSMMs can be effective to solve condition monitoring problems in time-dependent applications due to their high accuracy in hidden state recognition.
Remaining Useful Lifetime estimation for continuous data and Weibull duration distribution Remaining Useful Lifetime estimation for discrete data and Gamma duration distribution

Figure 5 :
Figure 5: HSMMs effectively solve RUL estimation problems.The prediction converges to the actual RUL value and its uncertainty decreases as the real failure time approaches.
Table 2 headers: (a) APE of the RUL estimation for the continuous observation test cases; (b) APE of the RUL estimation for the discrete observation test cases. For each test case and duration distribution (Gaussian, Gamma, and Weibull), the columns report APE avg, APE up, and APE low.

Figure 7:
Figure 8:
Figure 7: A tested bearing before and after the experiment with its recorded vibration signal [49].

Figure 9:
Figure 9: In both cases the minimum AIC value corresponds to an HSMM with  = 4 states, a Weibull duration model, and  = 1 mixture in the observation density.

Figure 10 :
Figure 10: By obtaining a low average absolute prediction error, the proposed parametric HSMM is effective for estimating the Remaining Useful Lifetime of bearings.
2.2.3. The Training Algorithm. The training algorithm consists of estimating the model parameters from the observation data. As discussed in Section 2.1, a parametric HSMM is defined by a parameter set λ comprising A_0, Θ, and the continuous observation distribution parameters if the observations are continuous, or A_0, Θ, and the discrete observation probabilities if the observations are discrete. Given a generic observation sequence x = x_1 x_2 ⋯ x_T, referred to as the training set in the following, the training procedure consists of finding the model parameter set λ* which locally maximizes the model likelihood P(x | λ).
∑_{t=1}^{T−1} γ_t(i) is the expected number of transitions from state s_i, and ∑_{t=1}^{T−1} ξ_t(i, j) is the expected number of transitions from state s_i to state s_j. Equation (39) represents the expected number of times that the model starts in state s_i, while (40) represents the expected number of transitions from state s_i to state s_j with j ≠ i over the total expected number of transitions from state s_i to any other state different from s_i.

Table 2 :
Average absolute prediction error (APE) of the RUL estimation using the proposed state duration estimator of (20).

Table 4 :
Time duration (in seconds) and operating conditions of the bearings tested in the IEEE PHM 2012 Prognostic Challenge dataset [49].

Table 5 :
Average absolute prediction error (APE) of the RUL estimation, expressed in seconds.