Reliability Evaluation of Data Communication System Based on Dynamic Fault Tree under Epistemic Uncertainty

Fault tree analysis is a well-structured, precise, and powerful tool for system evaluation. However, the conventional approach has been found inadequate for dealing with the absence of fault data, failure dependency, and uncertainty. This paper presents a comprehensive study on the reliability evaluation of a data communication system (DCS) using a dynamic fault tree approach based on fuzzy sets. It combines the advantages of the dynamic fault tree for modelling, fuzzy set theory for handling uncertainty, and the Bayesian network (BN) for inference. Specifically, it adopts expert elicitation and fuzzy set theory to evaluate the failure rates of the basic events of DCS and uses a dynamic fault tree model to capture the dynamic failure mechanisms. Furthermore, reliability parameters are calculated by mapping the dynamic fault tree into an equivalent BN. The results show that the proposed method is more flexible and adaptive than conventional fault tree analysis for fault diagnosis and reliability estimation of DCS.


Introduction
Data communication system (DCS) is a key subsystem of urban rail transit, and its reliability has a direct impact on the stability and safety of the train operation system. With rapid technological innovation, the performance of key equipment in the DCS of urban mass transit has greatly improved; at the same time, the growing complexity of its technology and structure raises significant challenges for system reliability evaluation and maintenance. These challenges are as follows.
(1) Lack of sufficient fault data: the integrity of fault data has a significant influence on system reliability analysis. However, it is very difficult in practice to obtain the large number of fault samples that such analysis requires. One reason is the imprecise knowledge available at the early stage of new product design. The other is that changes in environmental conditions may mean that historical fault data cannot represent future failure behaviours.
(2) Failure dependency of components: DCS adopts many redundant units and fault tolerance techniques to improve its reliability. So, the behaviours of components in the system and their interactions, such as failure priority, sequentially dependent failures, functionally dependent failures, and dynamic redundancy management, should be taken into consideration.
(3) High levels of uncertainty: DCS usually operates in a dynamic environment and is greatly affected by technical, human, and operational malfunctions that may lead to hazardous incidents.
Fault tree analysis (FTA) has been widely used to calculate the reliability of complex systems. It is a logical and diagrammatic method for evaluating the possibility of an accident resulting from combinations of failure events. However, conventional FTA, which commonly assumes that the reliability characteristics of the components of a complex system are described by precise probability distributions, has been found inadequate for dealing with the challenges mentioned above. Therefore, fuzzy set theory has been introduced as a useful tool to handle challenges (1) and (3). The fuzzy fault tree analysis model employs fuzzy set and possibility theory and deals with ambiguous, qualitatively incomplete, and inaccurate information. Several researchers have successfully used the fuzzy fault tree technique in various areas, including nuclear safety assessment [1], risk analysis [2,3], and the reliability of gas power plants [4]. They treated basic event probabilities as fuzzy numbers and applied the fuzzy extension principle to compute the top event probability. However, these approaches use the static fault tree to model system fault behaviours and cannot cope with challenge (2).
Dynamic fault tree analysis has been introduced [5], which takes into account not only the combination of failure events but also the order in which they occur. Meshkat et al. analysed the dependability of systems with on-demand and active failure modes using a dynamic fault tree and solved it with a Markov chain (MC) model to obtain reliability results [6]. However, this method has two well-known problems. One is its ineffectiveness in solving large dynamic fault trees; that is, the MC-based approach suffers from the state space explosion problem. The other is its ineffectiveness in handling uncertainty in failure data; that is, the failure rates of the system components are treated as crisp values. Hence, Li et al. proposed a fuzzy dynamic fault tree to analyse the fuzzy reliability of a CNC machining centre [7]. Nevertheless, the solution of the fuzzy dynamic fault tree is still based on the MC model. In order to solve larger dynamic fault trees, a discrete-time Bayesian network (DTBN) was proposed for the reliability analysis of dynamic fault trees in [8,9]. These works converted dynamic logic gates to a DTBN and calculated the reliability results by a standard Bayesian network (BN) inference algorithm. However, this is an approximate solution and requires huge memory resources to obtain the joint probability distribution accurately. An innovative algorithm has been introduced to reduce the dimension of the conditional probability tables by an order of magnitude. However, this method cannot perform probability updating [10]. Montani et al. proposed a translation of the dynamic fault tree into a dynamic Bayesian network (DBN) [11]. The DBN model is essentially applicable to Markov processes, and the result of the calculation gives only approximate probabilities.
Motivated by the problems mentioned above, this paper presents a reliability evaluation method for DCS based on fuzzy sets and the dynamic fault tree. It pays special attention to meeting the above three challenges. We adopt expert elicitation and fuzzy set theory to deal with insufficient fault data and uncertainty by treating the failure rates as fuzzy numbers. In addition, we use a dynamic fault tree model to capture the dynamic behaviours of DCS failure mechanisms and calculate reliability results using a BN and an algebraic technique in order to avoid the aforementioned problems.
The objective of this paper is to evaluate the reliability of DCS using fuzzy sets and the dynamic fault tree. This paper is organized as follows. Section 2 provides a brief introduction to DCS and its dynamic fault tree model. Section 3 describes the estimation of failure rates for the basic events. Section 4 presents a novel dynamic fault tree solution which uses a BN and an algebraic technique. The outcomes of the research and recommendations for future research are presented in the final section.

Dynamic Fault Tree of DCS
2.1. DCS. DCS is one of the key components of the train control system and is the medium for transmitting data among the modules in the automatic train control system. It mainly includes the ground wire backbone communication networks and the train-ground communication networks shown in Figure 1. The ground wire backbone communication networks are mainly used to connect the zone controller, the computer-based interlocking system, the automatic train supervision system, the data storage unit, and so on. For these networks, a bidirectional self-healing loop industrial Ethernet is usually adopted, so that communication is not interrupted when one device fails. The train-ground communication networks have evolved from point-type electromagnetic induction communication through point-type wireless communication to continuous wireless communication. Wireless communication based train control can not only reduce the number of ground units but also satisfy the requirements of mass train-ground information transmission and secure communication and thus improve the operational capability of the urban rail transport system.
The train-ground communication networks consist of the train-ground access devices and the train-ground communication transmission system. The train-ground access devices are responsible for information acquisition, composition, decomposition, encoding, and decoding and for the information transmission security mechanism. This guarantees safe, reliable, and real-time information transmission. Specifically, the train-ground access devices include the following.
(i) Centralized Radio Control Unit (CRCU). CRCU, located in the control center, is primarily responsible for transmitting diagnostic information, passenger travel information, and speech information.
(ii) Decentralized Radio Control Unit (DRCU). DRCU, located in the decentralized control center, offers the interface between the decentralized control system and the traction power supply system. In addition, it performs the most important tasks, such as information acquisition, composition, decomposition, encoding, and decoding among the decentralized control system, the vehicle control system, the localization system, and the traction power supply system.
(iii) Mobile Radio Control Unit (MRCU). MRCU, located at opposite ends of the train, not only offers the interface between the vehicle control system and the localization system but also implements information processing among the vehicle control system, the localization system, the decentralized control system, and the traction power supply system.

Dynamic Fault Tree for DCS. The DCS of urban mass transit is a complex system and adopts redundancy techniques to ensure higher reliability. For example, hardware redundancy is adopted in the design of CRCU, DRCU, and MRCU. These modules are highly coupled and have complicated logical relationships. So, the behaviours of components in these modules and their interactions, such as failure priority, sequentially dependent failures, functionally dependent failures, and dynamic redundancy management, should be taken into consideration. The traditional static fault tree is unsuitable for modelling these dynamic fault behaviours. We therefore use the dynamic fault tree model to capture the dynamic behaviour of system failure mechanisms such as sequence-dependent events, spares and dynamic redundancy management, and priorities of failure events. Taking the decentralized traction control failure as the top event, the dynamic fault tree of DCS is established in Figure 2. The failure events and components of DCS are represented by the symbols presented in Table 1.

Estimation of Failure Rates for DCS
In order to evaluate the reliability of DCS, the failure rates of the basic events must be known. However, it is very difficult to estimate a precise failure rate because of insufficient data or the vague characteristics of the events, especially for new equipment. In this study, expert elicitation, through several interviews and questionnaires, and fuzzy set theory are used to determine the failure rates of the basic events.

Selecting Experts to Form Evaluation Committee.
Experts are selected from different fields, such as design, installation, maintenance, operation, and management of DCS, to judge the failure rates of the basic events. Based on their experience and knowledge of DCS, they are more comfortable expressing event failure likelihood in qualitative natural language, which captures uncertainty, than expressing judgments in a quantitative manner. The granularity of the set of linguistic values commonly used in engineering system safety is from four to seven terms. In this paper, the component failure rate is defined by seven linguistic values, that is, very high, high, reasonably high, moderate, reasonably low, low, and very low.

Table 1: The basic events of DCS.

Node symbol    Description
X1             Software failure
X2             Regional traction power supply 1
X3             Regional traction power supply 2
X4             Regional control system 1
X5             Regional control system 2
X6             Vehicle location system 1
X7             Vehicle

To eliminate the bias of any single expert, eleven experts are asked to judge how likely each basic event is to fail in the system under investigation. It is therefore necessary to aggregate these opinions into a single one. There are many methods to aggregate fuzzy numbers. An appealing approach is the linear opinion pool [12]:

$$M_i = \sum_{j=1}^{n} w_j A_{ij}, \quad i = 1, 2, \ldots, m, \tag{1}$$

where $m$ is the number of basic events; $A_{ij}$ is the linguistic expression of basic event $i$ given by expert $j$; $n$ is the number of experts; $w_j$ is the weighting factor of expert $j$; and $M_i$ represents the combined fuzzy number of basic event $i$. Usually, an $\alpha$-cut addition followed by the arithmetic averaging operation is used to aggregate the membership functions of several fuzzy numbers. The membership function $f(z)$ of the total fuzzy number is computed from the opinions of the $n$ experts, where $f_j(x)$ is the membership function of the trapezoidal fuzzy number from expert $j$.

The aggregated results are still fuzzy numbers and cannot be used directly for fault tree analysis because they are not crisp values. So, each fuzzy number must be converted to a crisp score, named the fuzzy possibility score (FPS), which represents the experts' most likely belief that a basic event occurs. This step is usually called defuzzification. There are several defuzzification techniques [13]: the area defuzzification technique, the left and right fuzzy ranking defuzzification technique, the centroid defuzzification technique, the defuzzification technique based on the area between the centroid point and the origin, and the centroid-based Euclidean distance defuzzification technique. In this paper, the area defuzzification technique is used to map the fuzzy numbers into FPS because it has the lowest relative error and the closest match with real data. For a trapezoidal fuzzy number $(a, b, c, d; 1)$, the FPS is computed by the area defuzzification technique.
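The aggregation and defuzzification steps can be illustrated with a short sketch. Everything numeric here is an assumption: the trapezoids in `LINGUISTIC_SCALE` are hypothetical stand-ins for the fuzzy numbers of Figure 3, the expert weights are made up, and the centroid technique (one of the techniques listed above) is swapped in for the area defuzzification technique, whose formula is not reproduced in this text.

```python
from typing import Dict, List, Tuple

Trapezoid = Tuple[float, float, float, float]  # (a, b, c, d) with a <= b <= c <= d

# Hypothetical trapezoidal fuzzy numbers on [0, 1] for the seven linguistic values.
# The shapes actually used in the paper are those of Figure 3, not these.
LINGUISTIC_SCALE: Dict[str, Trapezoid] = {
    "very low": (0.0, 0.0, 0.1, 0.2),
    "low": (0.1, 0.2, 0.2, 0.3),
    "reasonably low": (0.2, 0.3, 0.4, 0.5),
    "moderate": (0.4, 0.5, 0.5, 0.6),
    "reasonably high": (0.5, 0.6, 0.7, 0.8),
    "high": (0.7, 0.8, 0.8, 0.9),
    "very high": (0.8, 0.9, 1.0, 1.0),
}


def aggregate(opinions: List[str], weights: List[float]) -> Trapezoid:
    """Linear opinion pool: weighted sum of the experts' trapezoids, vertex by vertex."""
    if len(opinions) != len(weights) or not opinions:
        raise ValueError("need one weight per expert opinion")
    total = sum(weights)
    norm = [w / total for w in weights]  # normalise so the weights sum to 1
    shapes = [LINGUISTIC_SCALE[o] for o in opinions]
    a, b, c, d = (sum(w * s[k] for w, s in zip(norm, shapes)) for k in range(4))
    return (a, b, c, d)


def centroid(tz: Trapezoid) -> float:
    """Centroid defuzzification of a trapezoid of height 1 (an assumed substitute)."""
    a, b, c, d = tz
    denom = 3.0 * (d + c - a - b)
    if denom == 0.0:  # degenerate case: the fuzzy number is a single crisp value
        return a
    return (d * d + c * c + c * d - a * a - b * b - a * b) / denom


# Three hypothetical experts with hypothetical weights judging one basic event.
pooled = aggregate(["low", "moderate", "reasonably low"], [0.5, 0.3, 0.2])
fps = centroid(pooled)
```

With normalised weights the pooled trapezoid stays inside the range spanned by the individual opinions, so the resulting score remains in [0, 1].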

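Calculating Fuzzy Failure Rates.
The fuzzy possibility score of each event is then converted into the corresponding fuzzy failure rate (FFR), which is analogous to the failure rate. Based on the logarithmic function proposed by Onisawa [14], which utilizes the concept of error possibility and likely fault rate, the fuzzy failure rate can be obtained by (4):

$$\mathrm{FFR} = \begin{cases} \dfrac{1}{10^{K}}, & \mathrm{FPS} \neq 0, \\ 0, & \mathrm{FPS} = 0, \end{cases} \qquad K = \left[\frac{1 - \mathrm{FPS}}{\mathrm{FPS}}\right]^{1/3} \times 2.301. \tag{4}$$

Table 2 shows the fuzzy failure rates of the basic events of DCS.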
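As a minimal sketch of this conversion, the function below uses the form of Onisawa's function that is commonly quoted in the fuzzy fault tree literature, FFR = 1/10^K with K = [(1 - FPS)/FPS]^(1/3) x 2.301 and FFR = 0 when FPS = 0. The function name and the sample scores are assumptions for illustration only.

```python
import math


def fuzzy_failure_rate(fps: float) -> float:
    """Map a fuzzy possibility score in [0, 1] to a fuzzy failure rate.

    Assumes the commonly quoted form of Onisawa's function:
    FFR = 10**(-K), K = ((1 - FPS) / FPS)**(1/3) * 2.301, and FFR = 0 for FPS = 0.
    """
    if not 0.0 <= fps <= 1.0:
        raise ValueError("FPS must lie in [0, 1]")
    if fps == 0.0:
        return 0.0
    k = ((1.0 - fps) / fps) ** (1.0 / 3.0) * 2.301
    return 10.0 ** (-k)


# Hypothetical scores: a larger FPS maps to a larger failure rate.
rates = [fuzzy_failure_rate(s) for s in (0.1, 0.3, 0.5, 0.7)]
```

An FPS of 0.5 gives K = 2.301, that is, a rate of about 5e-3, and the mapping is monotonically increasing in FPS.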

Mapping Static Fault Tree into BN.
There is a clear correspondence between the static fault tree and the BN. The fault tree can be seen as a deterministic particular case of the BN. Conceptually, it is straightforward to map a fault tree into a BN: one only needs to "redraw" the nodes and connect them while correctly enumerating reliabilities. Figure 4 shows the conversion of OR and AND gates into equivalent nodes in a BN. The parent nodes are assigned prior probabilities, which coincide with the failure probabilities of the corresponding basic nodes in the fault tree, and the child node is assigned its conditional probability table (CPT). Since the OR and AND gates represent deterministic causal relationships, all the entries of the corresponding CPT are either 0 or 1. The detailed algorithm for converting a fault tree into a BN was proposed in [15,16].
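The mapping of a two-input gate can be sketched as follows. The CPTs contain only 0 and 1, as stated above, and the child's failure probability is obtained by summing over the parent states. The function name and the prior probabilities are illustrative assumptions, and the two parents are assumed to be independent.

```python
from itertools import product

# Deterministic CPTs: P(child fails | parent states), keyed by (a, b) with 1 = failed.
OR_CPT = {(a, b): float(a or b) for a, b in product((0, 1), repeat=2)}
AND_CPT = {(a, b): float(a and b) for a, b in product((0, 1), repeat=2)}


def child_failure_probability(p_a: float, p_b: float, cpt: dict) -> float:
    """Marginalise the child node over two independent parents with priors p_a, p_b."""
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        prior = (p_a if a else 1.0 - p_a) * (p_b if b else 1.0 - p_b)
        total += prior * cpt[(a, b)]
    return total


# Hypothetical failure probabilities of two basic events.
p_or = child_failure_probability(0.1, 0.2, OR_CPT)    # equals 1 - 0.9 * 0.8
p_and = child_failure_probability(0.1, 0.2, AND_CPT)  # equals 0.1 * 0.2
```

The same enumeration reproduces the usual fault tree gate formulas, which is why the static fault tree is a special case of the BN.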

Fault Probability of a Module with Sequence Dependence.
Let us consider an event sequence composed of $n$ events, $e_1, e_2, \ldots, e_n$, including several spare events. Each event in the sequence carries two indices: the order in which it fails in the sequence and the order of the event for which it is designated a spare. The second index is 0 for an event that was originally in active mode. A spare event has a dormancy factor $\alpha$ with $0 \le \alpha \le 1$. The sequence probability of $\langle e_1, e_2, \ldots, e_n \rangle$ can be calculated using the tuple integration (5) over the occurrence times of the events, in which the integrand combines the probability distribution function of each event with the survival function of each spare event in standby mode. The integration distinguishes the set of events that were originally in active mode from the sets of spare events that fail in active mode and in standby mode [17].
When the failure time of each event in active mode follows an exponential distribution, the sequence probability (6) can be expressed with the inverse Laplace transform operator $\mathcal{L}^{-1}$ (7).
If every parameter in that expression is distinct from the others, the sequence probability reduces to the closed form (8).
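To make the idea of a sequence probability concrete, the sketch below treats only the simplest case: two independent, always-active events A and B with exponential failure times, and the probability that A fails before B with both failures inside the mission time t. This is a direct integration under those assumptions, not the general expression referenced above; the function names and the rates are illustrative.

```python
import math


def seq_prob_closed_form(lam_a: float, lam_b: float, t: float) -> float:
    """P(A fails before B, both by time t) for independent exponential events."""
    s = lam_a + lam_b
    return (lam_a / s) * (1.0 - math.exp(-s * t)) - math.exp(-lam_b * t) * (
        1.0 - math.exp(-lam_a * t)
    )


def seq_prob_numeric(lam_a: float, lam_b: float, t: float, steps: int = 20000) -> float:
    """Midpoint-rule evaluation of the same integral over A's failure time x."""
    h = t / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        density_a = lam_a * math.exp(-lam_a * x)                  # A fails at time x
        b_in_window = math.exp(-lam_b * x) - math.exp(-lam_b * t)  # B fails in (x, t]
        total += density_a * b_in_window * h
    return total


# Hypothetical rates (per hour) and mission time (hours).
LAM_A, LAM_B, T = 2e-3, 1e-3, 1000.0
p_ab = seq_prob_closed_form(LAM_A, LAM_B, T)
p_ba = seq_prob_closed_form(LAM_B, LAM_A, T)
```

The two orderings are mutually exclusive and exhaust the event "both have failed by t", so `p_ab + p_ba` equals the product of the two individual failure probabilities.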

Mapping Dynamic Fault Tree into BN.
The dynamic fault tree extends the traditional fault tree by defining special gates to capture the components' sequential and functional dependencies. Currently there are six types of dynamic gates defined: the functional dependency gate (FDEP), the cold, hot, and warm spare gates (CSP, HSP, WSP), the priority AND gate (PAND), and the sequence enforcing gate (SEQ). Here, we briefly discuss the FDEP and the WSP gates as they will be used later in our examples.

WSP Gate.
The WSP gate has one primary input and one or more alternate inputs. The primary input is initially powered on and the alternate inputs are in standby mode. When the primary fails, it is replaced by an alternate input, and, in turn, when this alternate input fails, it is replaced by the next available alternate input, and so on. In standby mode, the component failure rate is reduced by a factor $\alpha$ called the dormancy factor, which is a number between 0 and 1. A cold spare has a dormancy factor $\alpha = 0$ and a hot spare has a dormancy factor $\alpha = 1$. The WSP gate output is true when the primary and all the alternate inputs fail. Figure 5 shows the WSP gate and its equivalent DTBN. Table 3 shows the CPT of a node in this DTBN. Suppose that the primary input $P$ and the alternate input $A$ follow the same exponential distribution with rate $\lambda$. Here, $P_1(t)$ and $P_2(t)$ in this table are derived from the sequence probabilities of $P$ and $A$, which are calculated by (8).
The output of node WSP is an AND gate whose CPT is shown in Figure 4.
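For a WSP gate with one primary and one alternate input that share the same exponential rate, the gate unreliability can be written as the sum of two sequence probabilities: the primary fails first and the spare then fails in active mode, or the spare fails first in standby and the primary fails afterwards. The sketch below integrates these two cases directly; it is an illustration under these assumptions, not the entries of Table 3, and the function names and parameter values are hypothetical.

```python
import math


def _one_minus_exp(x: float) -> float:
    """Return 1 - exp(-x) without cancellation for small x."""
    return -math.expm1(-x)


def seq_primary_then_spare(lam: float, alpha: float, t: float) -> float:
    """Primary fails first, then the activated spare fails, both by time t."""
    if alpha == 0.0:
        standby = lam * t  # limit of (1 - exp(-alpha*lam*t)) / alpha as alpha -> 0
    else:
        standby = _one_minus_exp(alpha * lam * t) / alpha
    joint = _one_minus_exp((1.0 + alpha) * lam * t) / (1.0 + alpha)
    return joint - math.exp(-lam * t) * standby


def seq_spare_then_primary(lam: float, alpha: float, t: float) -> float:
    """Spare fails in standby first (rate alpha*lam), then the primary fails by t."""
    joint = alpha / (1.0 + alpha) * _one_minus_exp((1.0 + alpha) * lam * t)
    return joint - math.exp(-lam * t) * _one_minus_exp(alpha * lam * t)


def wsp_unreliability(lam: float, alpha: float, t: float) -> float:
    """Probability that both inputs of the two-input WSP gate have failed by t."""
    return seq_primary_then_spare(lam, alpha, t) + seq_spare_then_primary(lam, alpha, t)


# Hypothetical rate (per hour), dormancy factor, and mission time (hours).
LAM, T = 1e-3, 1000.0
q_warm = wsp_unreliability(LAM, 0.5, T)
```

Setting `alpha = 0` recovers the cold spare result `1 - exp(-lam*t) * (1 + lam*t)`, and `alpha = 1` recovers the hot spare result `(1 - exp(-lam*t))**2`, matching the two limiting cases described above.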
FDEP Gate.
FDEP is used for modelling situations where one component's correct operation depends upon the correct operation of some other component. It has a single trigger input, which could be another basic event or the output of another gate, a nondependent output reflecting the status of the trigger, and one or more dependent basic events. Figure 6 shows the functional dependency gate and its equivalent BN. Table 4 shows the CPT of a node in this BN, expressed in terms of $P_3(t)$. The CPT of the output node FDEP is shown in Table 5.

Table 5: The CPT of node FDEP.

Reliability Analysis of DCS
5.1. Calculating Reliability. According to the dynamic fault tree shown in Figure 2 and the basic failure data shown in Table 1, we can map the dynamic fault tree into an equivalent BN using the proposed method. Once the structure of a BN is known and all the probability tables are filled, it is straightforward to compute the fault probability of DCS using an inference algorithm. Relatively mature exact and approximate inference algorithms already exist for BNs, such as the variable elimination algorithm, the search-based algorithm, the conditioning algorithm, the jointree algorithm, and the differential algorithm. Here, we use the jointree algorithm to calculate the reliability indices of DCS. Table 6 shows the unreliability of DCS at different mission times, computed with several methods for solving the dynamic fault tree. As we can see in Table 6, the accuracy of the DTBN method increases when the discretization parameter $n$ increases. Although the DTBN method ($n = 5$) is almost in agreement with the method proposed in this paper, it requires considerably more CPT memory and execution time.

Sensitivity Analysis.
Sensitivity analysis allows the designer to quantify the importance of each of the system's components and the impact that an improvement in component reliability will have on the overall system reliability. Here, we show how sensitivity analysis can be performed using the sensitivity index [18]. The sensitivity index of the $i$th basic event is defined in terms of $P(T)$, the probability of the top event failure, and $P(T \mid \bar{X}_i)$, the probability that the top event has occurred given that the basic event $X_i$ has not occurred. Table 7 shows the sensitivity index of all basic events of DCS. According to Table 7, the MRCU multiplexer board and the DRCU multiplexer board have the maximum sensitivity index, which means that they are the key components. So, their reliability should be improved at the product design stage in order to decrease the failure probability of DCS.
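The two probabilities named above can be computed by brute-force enumeration on a small example. In the sketch below, the three-event structure function, the failure probabilities, and in particular the normalised form of the index, (P(T) - P(T | X_i not occurred)) / P(T), are assumptions for illustration; the exact definition used in [18] is not reproduced here. Basic events are assumed independent, so conditioning on "X_i has not occurred" is done by setting its probability to zero.

```python
from itertools import product
from typing import List, Sequence


def top_event(state: Sequence[bool]) -> bool:
    """Hypothetical structure function: T = X1 OR (X2 AND X3)."""
    x1, x2, x3 = state
    return x1 or (x2 and x3)


def top_probability(p: Sequence[float]) -> float:
    """P(T) by enumerating all states of independent basic events."""
    total = 0.0
    for state in product((False, True), repeat=len(p)):
        weight = 1.0
        for failed, p_i in zip(state, p):
            weight *= p_i if failed else 1.0 - p_i
        if top_event(state):
            total += weight
    return total


def sensitivity_index(p: Sequence[float], i: int) -> float:
    """Assumed form: relative drop in P(T) when basic event i cannot occur."""
    p_top = top_probability(p)
    p_without: List[float] = [0.0 if k == i else p_k for k, p_k in enumerate(p)]
    return (p_top - top_probability(p_without)) / p_top


# Hypothetical basic event probabilities at one mission time.
P_BASIC = [0.01, 0.1, 0.2]
indices = [sensitivity_index(P_BASIC, i) for i in range(len(P_BASIC))]
```

In this toy example X2 and X3 receive the same index because removing either one disables the AND branch completely.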

Performing Diagnosis.
Diagnosis is an obvious capability of the framework due to the use of BN. We can conveniently calculate some importance parameters by BN and perform diagnosis to locate the system failure. The diagnostic importance factor (DIF) is the cornerstone of the reliability-based diagnosis methodology. DIF is defined conceptually as the probability that an event has occurred given that the top event has also occurred. This quantitative measure allows us to discriminate between components by their importance from a diagnostic point of view. Components with larger DIF are checked first. This ensures a reduced number of system checks while fixing the system. Consider

$$\mathrm{DIF}_C = P(C \mid S), \tag{13}$$

where $C$ is a component in system $S$.
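The DIF can be illustrated on a small example by computing P(C | S) as P(C and S) / P(S) through enumeration. The three-component structure function and the failure probabilities below are hypothetical, and components are assumed independent; in the paper the same quantity is obtained by entering evidence into the BN.

```python
from itertools import product
from typing import List, Sequence


def system_failed(state: Sequence[bool]) -> bool:
    """Hypothetical structure function: S fails if C1 fails or both C2 and C3 fail."""
    c1, c2, c3 = state
    return c1 or (c2 and c3)


def diagnostic_importance(p: Sequence[float]) -> List[float]:
    """Return DIF_C = P(C failed | S failed) for every component C."""
    p_system = 0.0
    joint = [0.0] * len(p)  # joint[i] accumulates P(C_i failed and S failed)
    for state in product((False, True), repeat=len(p)):
        weight = 1.0
        for failed, p_i in zip(state, p):
            weight *= p_i if failed else 1.0 - p_i
        if system_failed(state):
            p_system += weight
            for i, failed in enumerate(state):
                if failed:
                    joint[i] += weight
    return [j / p_system for j in joint]


# Hypothetical component failure probabilities.
dif = diagnostic_importance([0.01, 0.1, 0.2])
# Components are inspected in decreasing order of DIF.
check_order = sorted(range(len(dif)), key=lambda i: dif[i], reverse=True)
```

Here the third component has the largest DIF and would be checked first, even though the first component alone is sufficient to fail the system.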
Suppose the system has failed; we would like to know the most probable cause that took the system down. So, we enter the evidence that the DCS has failed.

Figure 1: A system block diagram of DCS.

Figure 2: A dynamic fault tree for decentralized traction control failure of DCS.

Figure 3: Fuzzy numbers used for representing linguistic value.

Figure 4: The equivalent BN of OR and AND gates.

Table 3: The CPT of a node in the DTBN of the WSP gate.

Table 4: The CPT of a node in the BN of the FDEP gate.