Risk-Based Predictive Maintenance for Safety-Critical Systems by Using Probabilistic Inference

Risk-based maintenance (RBM) aims to improve maintenance planning and decision making by reducing the probability and consequences of equipment failure. A new predictive maintenance strategy that integrates a dynamic evolution model with risk assessment is proposed; it can be used to calculate the optimal maintenance time at minimal cost under safety constraints. The dynamic evolution model provides quantified risks by using probabilistic inference with bucket elimination and gives the prospective degradation trend of a complex system. Based on the degradation trend, an optimal maintenance time can be determined by minimizing the expected maintenance cost per time unit. The effectiveness of the proposed method is validated and demonstrated on a collision accident of a high-speed train with an obstacle in the presence of safety and cost constraints.


Introduction
Safety-critical systems, such as chemical factories, nuclear plants, and train control systems, are those whose failure could result in loss of life, significant property damage, or damage to the environment. The loss caused by failures of safety-critical systems is becoming difficult to estimate, and efficient maintenance strategies play an increasingly important role in preventing such failures.
Over the last decade, reactive strategies (fixing or replacing equipment after it fails) or blindly proactive strategies (also known as preventive strategies) have been used for system maintenance. The main disadvantage of both approaches is that they are extremely wasteful. As condition-based maintenance (CBM) systems have been implemented to continuously output data reflecting the status and performance of the equipment, decision making in CBM focuses on predictive maintenance (PdM), which promises to reduce downtime, spare inventory, maintenance cost, and safety hazards. Much work has been carried out in the area of predictive maintenance in order to improve safety. Generally speaking, current prognostic approaches can be classified into three categories, namely, model-based, data-driven, and hybrid prognostics [1]. For example, in [2], a DBN-HAZOP model was proposed to derive opportunistic predictive maintenance for complex multicomponent systems. The key idea behind that model is reliability-based maintenance. Krishnasamy et al. [3] proposed the risk-based maintenance (RBM) methodology and illustrated it with a case study of a power-generating unit. Arunraj and Maiti [4] surveyed risk analysis and risk-based maintenance methodologies and classified them into suitable classes.
Although the aforementioned research has contributed to efficient maintenance strategies, to the best of our knowledge, the integration of dynamic system failure scenarios into a risk-based maintenance model and the adoption of efficient inference methods for optimal maintenance strategies have received little attention. Regarding the integration of dynamic system failure scenarios with a risk-based maintenance model, reference [2] proposed a dynamic component failure model for reliability-based maintenance. In that model, the key focus is on reliability rather than risk. Regarding the inference approaches used to manipulate the maintenance model, junction tree algorithms [5,6] are common and widely used. However, junction tree algorithms are comparatively complex, and explaining them demands long digressions on graph-theoretic concepts. Although there has been an effort to explain junction tree algorithms without resorting to graphical concepts [7], that effort has not produced a variable-elimination-like scheme for inference.
In order to tackle these problems, we propose a maintenance model that combines a 2-TBN (two-slice temporal Bayes net) with risk-based maintenance. By encoding the failure scenarios into the conditional probability tables (CPTs) of the risk-based maintenance model, the risk of each failure scenario is embedded in the model. In order to facilitate efficient inference, an ad hoc probabilistic inference procedure based on bucket elimination is presented. Compared with the complex junction-tree based inference, an attractive property of bucket elimination approaches is that they are relatively easy to understand and implement. Finally, by applying optimization theory, the optimal maintenance time interval with minimal cost under risk constraints can be obtained.
The rest of the paper is organized as follows. In Section 2, the principle of RBM and the proposed RBM methodology are introduced. In Section 3, a maintenance model for degradation and risk prediction is presented. Section 4 gives optimal predictive maintenance strategies. A case study of a collision between a high-speed train and an obstacle is discussed in Section 5. Section 6 draws the conclusion of the paper.

Risk-Based Maintenance (RBM) Methodology
Risk-based maintenance methodology provides a tool for maintenance planning and decision making to reduce the probability and consequences of equipment failure. The resulting maintenance program minimizes both the risk of the system and the maintenance cost. Figure 1 shows a general flow diagram of RBM. It consists of the following steps: (1) Identification of components, subsystems, the system, and their relationships: the system is divided into subsystems, and the components of each subsystem and their relationships are identified; in the following sections, we model the system structure by using a special case of the dynamic Bayesian network, the 2-TBN. (2) Collection of failure data, failure models, and failure rates: this information is encoded in the CPTs of the 2-TBN based maintenance model. (3) Risk assessment and evaluation: by using probabilistic inference with bucket elimination, a consequence analysis is carried out to quantify the effect of each failure scenario and to obtain a quantitative measure of its associated risk. The risk is used to study maintenance costs, including the costs incurred as a result of failure. (4) Optimal maintenance strategy: by defining the different maintenance costs, the optimal maintenance scheme can be derived by applying optimization theory to the quantitative risk measure computed in the preceding step.

Maintenance Model for Degradation and Risk Prediction
This section illustrates the first two steps of the RBM architecture discussed in Section 2. The main purpose is to encode the states of, and the dependency relations among, the components of each subsystem, the subsystems, and the system. In order to facilitate the understanding of the optimization of predictive maintenance, we first introduce some basic notions, including the dynamic Bayesian network (DBN) and the 2-TBN model. We then describe the maintenance model based on the 2-TBN and discuss the opportunistic predictive maintenance strategies.

Dynamic Bayesian Network and 2-TBN Model.
A Bayesian network (BN) is a directed acyclic graph (DAG), which is a probability-based knowledge representation method and appropriate for the modeling of causal processes with uncertainty. The formal notion is defined as follows.
Definition 1 (see [8]). A Bayesian network (BN) is a triple $(V, G, P)$, where $V$ is a set of variables, $G$ is a connected directed acyclic graph (DAG), and there is a one-to-one correspondence between nodes in $G$ and variables in $V$. $P$ is a set of probability distributions, $P = \{P(v \mid \pi(v)) \mid v \in V\}$, where $\pi(v)$ denotes the set of parents of $v$ in $G$.
The static Bayesian network can be extended to a dynamic Bayesian network (DBN) by introducing temporal dependencies that capture the dynamic behavior of the domain variables across different time instants of a static network. Definition 2 gives the formal definition of a DBN.
Definition 2. A dynamic Bayesian network (DBN) is a quadruplet $G = (\bigcup_{t=0}^{T-1} V_t, \bigcup_{t=0}^{T-1} E_t, \bigcup_{t=1}^{T-1} F_t^{\rightarrow}, \bigcup_{t=0}^{T-1} P_t)$. Each $V_t$ is a set of nodes labeled by variables, which represents the dynamic domain at time instant $t$ ($0 \le t < T$). Collectively, $\bigcup_{t=0}^{T-1} V_t$ represents the dynamic domain over $T$ instants. Each $E_t$ is a set of arcs among nodes in $V_t$, which represents dependencies among domain variables at time $t$. Each $F_t^{\rightarrow}$ is a set of temporal arcs, each of which is directed from a node in $V_{t-1}$ to a node in $V_t$ ($0 < t < T$). Each $P_t$ is a set of probability distributions; see [8] for details.
In this paper, we only consider a special class of DBNs, called the 2-slice temporal Bayesian network (2-TBN) [9]. A 2-TBN is a DBN that satisfies the Markov property of order 1; that is, the future is independent of the past given the present.
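The first-order Markov property means that a 2-TBN is fully specified by an initial slice and a single transition model that is reused for every later slice. The following is a minimal sketch of this unrolling for one two-state variable; the function name `unroll_marginals` and all numerical values are hypothetical and chosen only for illustration.

```python
def unroll_marginals(prior, transition, steps):
    """Propagate a marginal distribution through `steps` time slices.

    prior[i]         : Pr(X_0 = i)
    transition[i][j] : Pr(X_{t+1} = j | X_t = i), identical for every slice
    """
    marginals = [list(prior)]
    for _ in range(steps):
        prev = marginals[-1]
        nxt = [
            sum(prev[i] * transition[i][j] for i in range(len(prev)))
            for j in range(len(prev))
        ]
        marginals.append(nxt)
    return marginals


# Hypothetical two-state variable: index 0 = ok, index 1 = fail.
prior = [1.0, 0.0]                      # assumed new (ok) at t = 0
transition = [[0.9, 0.1],               # ok   -> ok / fail (illustrative values)
              [0.0, 1.0]]               # fail -> fail (no repair in this sketch)
marginals = unroll_marginals(prior, transition, steps=3)
```

Because only the previous slice is needed to compute the next one, the memory required by the propagation does not grow with the length of the horizon.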

2-TBN Based Maintenance Model.
2-TBNs are general tools for modeling dynamic complex systems. Moreover, using a 2-TBN to represent a variable that depends on its own past is equivalent to using a Markov chain to describe its local transition model. Consequently, we propose a 2-TBN based maintenance model capable of representing the dynamic degradation and risk level of subsystems. We treat system state, system failure, and accidents as random variables and model the dependencies among them by means of conditional probability tables (CPTs). In order to simplify calculation, all variables in our model are assumed to be discrete. In particular, any failure in one or both components of a subsystem leads to the failure of that subsystem.
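The rule that a subsystem fails whenever at least one of its components fails corresponds to a deterministic OR-type CPT. The sketch below builds such a CPT for a subsystem with two components and uses it to compute the subsystem failure probability. It assumes, for illustration only, that the two components are independent within a time slice; the function names and probabilities are hypothetical.

```python
from itertools import product

STATES = ("ok", "fail")


def or_gate_cpt():
    """CPT Pr(subsystem | component_a, component_b) for a deterministic OR."""
    cpt = {}
    for a, b in product(STATES, repeat=2):
        failed = (a == "fail") or (b == "fail")
        cpt[(a, b)] = {"ok": 0.0 if failed else 1.0,
                       "fail": 1.0 if failed else 0.0}
    return cpt


def subsystem_failure_probability(p_fail_a, p_fail_b):
    """Marginalize the components out of the CPT (independence assumed)."""
    cpt = or_gate_cpt()
    total = 0.0
    for a, b in product(STATES, repeat=2):
        pa = p_fail_a if a == "fail" else 1.0 - p_fail_a
        pb = p_fail_b if b == "fail" else 1.0 - p_fail_b
        total += pa * pb * cpt[(a, b)]["fail"]
    return total


# Hypothetical component failure probabilities.
p_subsystem_fail = subsystem_failure_probability(0.1, 0.2)
```

With independent components this reduces to one minus the product of the component survival probabilities, which gives a quick consistency check on the table.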
Similarly, under the assumption that the time to failure of a component follows an exponential distribution, so that all transition rates are constant, the transition relations between consecutive nodes of the component maintenance model follow directly from the failure rate $\lambda$ and the time interval $\Delta t$ between two successive trials, with the components assumed to be new at the initial trial $t = 0$. The conditional probabilities for state transitions, for example $\Pr(C^{t+1} = x \mid C^{t} = 0)$ for a component state variable $C$, where $x$ denotes the normal or failure state, can then be obtained directly. The corresponding temporal CPT for each component is shown in Table 2.
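As an illustration of how a temporal CPT can be derived from a constant failure rate, the sketch below uses the standard exponential relation: a component that is ok at one trial has failed by the next trial with probability $1 - e^{-\lambda \Delta t}$. Two points are assumptions made for this sketch and are not taken from Table 2: the failed state is treated as absorbing (no repair between trials), and the names `temporal_cpt` and `failure_probability` are hypothetical.

```python
import math


def temporal_cpt(failure_rate, dt):
    """Two-state temporal CPT Pr(state at t+1 | state at t).

    Assumes an exponential time to failure with constant `failure_rate`
    and, for this sketch only, that a failed component stays failed.
    """
    p_fail = 1.0 - math.exp(-failure_rate * dt)
    return {
        "ok":   {"ok": 1.0 - p_fail, "fail": p_fail},
        "fail": {"ok": 0.0,          "fail": 1.0},
    }


def failure_probability(failure_rate, dt, steps):
    """Probability that a component new at t = 0 has failed after `steps` trials."""
    cpt = temporal_cpt(failure_rate, dt)
    p_ok = 1.0
    for _ in range(steps):
        p_ok *= cpt["ok"]["ok"]
    return 1.0 - p_ok


# Hypothetical values: rate 0.01 per time unit, unit time step, 10 trials.
p_failed_after_10 = failure_probability(0.01, 1.0, 10)
```

Under these assumptions the discrete propagation reproduces the continuous-time result $1 - e^{-\lambda t}$ at the trial instants, which is why the choice of $\Delta t$ does not change the predicted failure probability at a given time.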
Finally, the consequences resulting from different subsystem failures can be classified as shown in Table 3. The specific consequence of each failure scenario is assigned manually. The risk can then be computed by combining the consequence and the probability of each failure scenario. Note that the probability of a subsystem failure scenario can be calculated by probabilistic inference on the 2-TBN based maintenance model. The detailed procedures are discussed in the following sections.
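A common way to combine probability and consequence is to take their product for each scenario and to sum over scenarios. The sketch below assumes this product form; the scenario names, probabilities, and consequence values are hypothetical placeholders and are not the entries of Table 3.

```python
def scenario_risk(probability, consequence):
    """Risk of one failure scenario, assumed here to be probability x consequence."""
    return probability * consequence


def total_risk(scenarios):
    """Sum of scenario risks; `scenarios` maps name -> (probability, consequence)."""
    return sum(scenario_risk(p, c) for p, c in scenarios.values())


# Hypothetical scenarios: (probability from inference, consequence in cost units).
scenarios = {
    "minor":        (0.05, 10.0),
    "major":        (0.01, 100.0),
    "catastrophic": (0.001, 1000.0),
}
risk = total_risk(scenarios)
```

In this form a rare but severe scenario can contribute as much risk as a frequent but mild one, which is the reason for weighting probabilities by consequences rather than ranking scenarios by probability alone.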

Optimal Predictive Maintenance Strategies
This section discusses the calculation of optimal predictive maintenance strategies, which consists of calculating the failure and accident probabilities of the maintenance model and the optimal maintenance time under repair cost constraints.

Calculation of the Failure Probability of a Component in the Maintenance Model.
The purpose of this subsection is to evaluate the probability of any failure scenario over a given time length. In other words, the underlying problem boils down to the calculation of a probability, denoted by $\Psi$. The following theorem gives a recursive characterization of $\Psi$ based on the derivation of the bucket elimination method presented in [10].
Then the computation of $\Psi$ can be simplified, and the theorem follows.
Remark 4. The bucket-elimination based inference approach presented for the risk-based maintenance model aims at efficiently computing the failure probability densities of components in a maintenance model represented as a dynamic Bayesian network. The construction of a bucket tree simplifies the presentation and produces an algorithm that is easy to grasp and implement. The algorithm relies only on independence relations and probability manipulation; it does not use graphical concepts such as triangulations and cliques, and it focuses solely on the probability densities, thereby avoiding complex digressions on graph-theoretic concepts.
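To make the idea of bucket elimination concrete, the following is a minimal sketch of the generic technique for binary variables: for each variable in a chosen order, the factors that mention it are collected in a bucket, multiplied, and the variable is summed out. It is an illustration of standard bucket (variable) elimination, not the specific recursive procedure of the theorem above; the factor representation, function names, and numbers are assumptions made for this example.

```python
from itertools import product

# A factor is a pair (variables, table): `variables` is a tuple of names and
# `table` maps each 0/1 assignment tuple (in that order) to a number.


def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    variables = tuple(dict.fromkeys(fv + gv))
    table = {}
    for assignment in product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, assignment))
        table[assignment] = (ft[tuple(env[v] for v in fv)]
                             * gt[tuple(env[v] for v in gv)])
    return variables, table


def sum_out(f, var):
    """Marginalize `var` out of factor `f`."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for assignment, value in ft.items():
        env = dict(zip(fv, assignment))
        key = tuple(env[v] for v in keep)
        table[key] = table.get(key, 0.0) + value
    return keep, table


def bucket_elimination(factors, order):
    """Eliminate the variables in `order`; return the remaining joint factor."""
    factors = list(factors)
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        if not bucket:
            continue
        factors = [f for f in factors if var not in f[0]]
        combined = bucket[0]
        for f in bucket[1:]:
            combined = multiply(combined, f)
        factors.append(sum_out(combined, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result


# Hypothetical network: components A and B feed subsystem S (0 = ok, 1 = fail).
p_a = (("A",), {(0,): 0.9, (1,): 0.1})
p_b = (("B",), {(0,): 0.8, (1,): 0.2})
p_s_given_ab = (("S", "A", "B"),
                {(s, a, b): 1.0 if s == (a or b) else 0.0
                 for s, a, b in product((0, 1), repeat=3)})
result = bucket_elimination([p_a, p_b, p_s_given_ab], order=["A", "B"])
```

After eliminating A and B, `result` is a factor over S alone whose entry for `(1,)` is the subsystem failure probability. Only products and sums of tables are involved, which is the property highlighted in Remark 4.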

Calculation of the Optimal Maintenance Time.
This subsection concerns the optimization of predictive maintenance under the criterion of minimizing lifetime operation and repair costs. Similar to [2], two types of costs need to be considered: (1) the cost of repairing component degradation or failure, termed the "repairing cost," and (2) production losses caused by shutting down the system to undertake repairs, which are related to the time lost in these tasks. There are two kinds of repairing costs: a corrective repairing cost is charged when a component fails before the proactively scheduled time, and a proactive repairing cost is charged when a component is repaired or replaced at the proactively scheduled time without having failed.
For the $i$th component, the specific corrective and proactive repair costs are denoted by $RC_i^{c}$ and $RC_i^{p}$, respectively. We consider the latter to be less than the former because the former includes production loss, personal injuries, and environmental contamination. For the $i$th component, the expected total cost per unit time of predictive maintenance depends on $T$, the time for a proactive repair of component $i$, and on $F_i(t)$, its failure probability distribution. $F_i(t)$ is the cumulative distribution function of the random variable "time to failure," which is the output of the 2-TBN based maintenance model. If the system contains $N$ components, indexed by $\{1, 2, \ldots, N\}$, the expected group repair cost rate is defined over this set of components. Unlike [2], the associated production loss depends on the failure scenario, and different scenarios have different severity. The production loss rate is therefore expressed in terms of AccTypeNum, which covers all kinds of failure types; $N$, the number of components; the loss due to an accident of each type; and the indices of the failed and normal components in that accident. The expected total cost per unit time of predictive maintenance for the system is the sum of the group repair cost rate and the production loss rate. Finding the optimal predictive maintenance time then reduces to an optimization problem over the 2-TBN based maintenance model (denoted 2TBNMM), which can be solved with numerical optimization tools such as Matlab.
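The search for an optimal proactive repair time can be sketched numerically. The example below assumes the classical age-replacement form of the cost rate, namely the expected cost per cycle, $RC^{c} F(T) + RC^{p} (1 - F(T))$, divided by the expected cycle length $\int_0^T (1 - F(t))\,dt$; this form is an assumption of the sketch and may differ from the exact expression used in this paper. The Weibull distribution stands in for the output of the 2-TBN based maintenance model, and all names and numbers are hypothetical.

```python
import math


def weibull_cdf(t, scale, shape):
    """Hypothetical stand-in for the time-to-failure CDF F(t)."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def cost_rate(T, cdf, rc_corrective, rc_proactive, dt=0.05):
    """Expected cost per unit time when proactive repair is scheduled at T."""
    n = int(round(T / dt))
    # Midpoint rule for the expected cycle length: integral of 1 - F(t) on [0, T].
    expected_cycle = sum((1.0 - cdf((k + 0.5) * dt)) * dt for k in range(n))
    expected_cost = rc_corrective * cdf(T) + rc_proactive * (1.0 - cdf(T))
    return expected_cost / expected_cycle


def optimal_time(cdf, rc_corrective, rc_proactive, horizon):
    """Grid search over integer maintenance times 1..horizon."""
    return min(range(1, horizon + 1),
               key=lambda T: cost_rate(T, cdf, rc_corrective, rc_proactive))


# Hypothetical inputs: wear-out failure behaviour and a 31-unit mission time.
cdf = lambda t: weibull_cdf(t, scale=20.0, shape=2.5)
best_T = optimal_time(cdf, rc_corrective=100.0, rc_proactive=20.0, horizon=31)
```

A grid search is used because the candidate maintenance times are discrete time units and the horizon is short. A continuous optimizer could replace it if a finer time resolution were required.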

Case Study
In this section, a collision of a high-speed train with an obstacle located on a rail segment is considered to demonstrate the feasibility and effectiveness of the proposed approach. Figure 3 shows the configuration of the accident, which involves the signal, the track circuit, the computer interlocking system, and the train control system. Signals are placed between track segments and show different aspects, which inform the train driver whether to go or to stop safely. The track circuit is monitored by electrical equipment to detect the presence of a train. It can also be used to send the allowable train velocity code to ensure that the train moves safely. The computer interlocking system (CI) is used to set the right route for a train to enter the station. If a route is successfully established, the CI informs the signal to display the green aspect; otherwise, the red aspect is displayed. The train control system receives the allowable train velocity code from the track circuit together with the signal aspect and then determines whether the train accelerates or decelerates by applying the braking system. The event tree analysis for train collision is shown in Figure 4.
Three barriers, namely, the signal, the deceleration code sent by the track circuit, and the brake system, have been established to decrease the risk of train collision. Each barrier has two possible states: ok or fail. As a result of the analysis, eight collision accidents/consequences are distinguished. For example, when an external obstacle occupies the track and the monitor system successfully detects the obstacle and sends the information to the CI via the track circuit (TC), the CI informs the signal to display the red aspect (i.e., signal is ok). If, at the same time, the TC sends the deceleration code to the train (i.e., TC is ok) and the braking system is normal (i.e., brake is ok), then the collision is prevented and the consequence is "safe". On the other hand, when an external obstacle occupies the track and the monitor system, TC, CI, signal, and brake system all fail, the collision is inevitable and the resulting consequence is catastrophic (d7). Given the failure rates of the different components, the "equivalent risk" of each accident is estimated from the numerical results of the probabilistic inference discussed in Section 4.
Figure 5 illustrates the maintenance model of the collision of a high-speed train with an obstacle. The model consists of three subsystems: track circuit (TC), Signal, and brake system (BrakSys). The reliability of a subsystem depends on its constituent components. For example, the reliability of the TC subsystem depends on the code sending module (InfSend) and the code receiving module (InfRev), that of the Signal subsystem on the monitor system (Monitor) and the CI, and that of the brake subsystem on the automatic train protection (ATP) and the brake equipment (Brake). The failure rates of the components, the corrective and proactive costs, and the production loss of the different accident levels are given in Tables 4, 5, and 6, respectively.
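The event tree with three two-state barriers described above can be enumerated mechanically: each combination of ok/fail states is one branch, which yields the eight outcomes. The sketch below computes branch probabilities under the simplifying assumption that the barriers fail independently, whereas in the maintenance model these probabilities come from inference on the 2-TBN. The barrier failure probabilities are hypothetical and are not taken from Table 4.

```python
from itertools import product


def event_tree(p_fail):
    """Enumerate all ok/fail branches and their probabilities.

    p_fail maps barrier name -> failure probability; barriers are assumed
    independent in this sketch.
    """
    names = list(p_fail)
    outcomes = {}
    for states in product(("ok", "fail"), repeat=len(names)):
        prob = 1.0
        for name, state in zip(names, states):
            prob *= p_fail[name] if state == "fail" else 1.0 - p_fail[name]
        outcomes[states] = prob
    return outcomes


# Hypothetical failure probabilities for the three barriers.
barriers = {"Signal": 0.02, "TC": 0.01, "Brake": 0.05}
outcomes = event_tree(barriers)
p_safe = outcomes[("ok", "ok", "ok")]           # all barriers work
p_worst = outcomes[("fail", "fail", "fail")]    # all barriers fail
```

Multiplying each branch probability by the loss assigned to its consequence level gives the per-accident risk that enters the production loss rate.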
The reliability distributions of the Signal, track circuit (TC), and brake system (BrakSys) components are shown in Figure 6. The total mission time is assumed to be 31 time units (i.e., months). Thirty-one months are sufficient for this purpose because, for a complex industrial system, predicting deterioration further into the future is inaccurate and of little use owing to operational regulation, environmental changes, and human activity. The mean values of the expected repair cost rate of the Signal, TC, and Brake components are shown in Figure 7. Figure 8 illustrates the mean values of the total repair cost rate, the total production loss rate, and the total cost rate. The latter is the sum of the total repair cost rate and the total production loss rate. From Figure 8, it can be seen that the optimal maintenance time is 10 time units. From (11), the corresponding reliabilities of the Signal, TC, and Brake components are 0.98711, 0.99856, and 0.94971, respectively.

Conclusions
The paper presents a methodology for the optimization of maintenance strategies. The approach not only increases the safety of equipment but also reduces the cost of maintenance, including the cost of failure. The work reported contributes to the "availability" of safety-critical systems. In order to calculate the failure probability and the consequence of each failure scenario, a maintenance model based on the 2-TBN has been created. An ad hoc inference procedure, along with its proof of correctness, is provided to efficiently compute component failure probabilities. The consequences of the different failure scenarios are encoded in conditional probability tables (CPTs) as part of the associated maintenance model. In the proposed approach, only the optimal maintenance time of the system was considered. However, the study can be extended so that the optimal maintenance time of each component is calculated in the same way.

Figure 2: 2-TBN based maintenance model: (a) gives the initial state of the maintenance model, while (b) depicts its transition model between time slices $t - 1$ and $t$.

The 2-TBN based maintenance model is depicted in Figure 2. The model consists of the following variables: the state of each component (e.g., failure or ok) in each subsystem at each time instant (for the sake of simplicity, only two components are shown in the figure); the state of each subsystem (e.g., failure or ok) at each time instant; and the accident probability of the system due to subsystem failure. RC represents the corresponding repair cost.

Figure 3: Configuration of collision of a high speed train with an obstacle.

Figure 5: Maintenance model for high-speed train.

Figure 6: Reliability probability distribution of Signal, TC, and Brake component.

Figure 7: Mean values of expected repair cost rate of Signal, TC, and Brake component.

Figure 8: Mean values of total repair cost rate, total production loss rate, and the total cost rate.

Table 2: Temporal CPT for a component.

In the 2-TBN based maintenance model, the conditional probabilities must be specified for (1) the state transitions of components between different time slices and (2) the dependency between the component outputs and the subsystem, system, and accident states. For example, assuming that the state of a component has only two values, ok or fail, the dependencies between the state of a subsystem and the states of its components can be illustrated by the CPT shown in Table 1.

Table 3: Consequences resulting from different subsystem failure scenarios.

Table 4: Failure rates of components.

Table 5: Corrective and proactive costs.

Table 6: Production loss of different accident levels.