New Computing Technology in Reliability Engineering

. Reliabilityengineeringisrelativelynewscientificdisciplinewhichdevelopsincloseconnectionwithcomputers.Rapiddevelopment ofcomputertechnologyrecentlyrequiresadequatenoveltiesofsourcecodesandappropriatesoftware.Newparallelcomputing technologybasedonHPC(highperformancecomputing)foravailabilitycalculationwillbedemonstratedinthispaper.The technologyisparticularlyeffectiveincontextwithsimulationmethods;nevertheless,analyticalmethodsaretakenintoaccount aswell.Ingeneral,basicalgorithmsforreliabilitycalculationsmustbeappropriatelymodifiedandimprovedtoachievebetter computationefficiency.Parallelprocessingisexecutedbytwoways,firstlybytheuseoftheMATLABfunctionparforandsecondly bytheuseoftheCUDAtechnology.Thecomputationefficiencywassignificantlyimprovedwhichisclearlydemonstratedin numericalexperimentsperformedonselectedtestingexamplesaswellasonindustrialexample.Scalabilitygraphsareusedto demonstratereductionofcomputationtimecausedbyparallelcomputing.


Introduction
The concept of reliability has become recently a pervasive attribute worth of both qualitative and quantitative connotations.Quantitative treatment of the reliability of engineering systems and plants has led to the rise of reliability engineering as a scientific discipline [1].Reliability is a fundamental attribute for the safe operation of any modern technological system.
System is a set of components that work together and form a functional unit.Real engineering systems include technical machines, production lines, computer networks, and other devices.In practice, it is often necessary to model, compute, and optimize reliability characteristics of such systems.
In engineering applications which involve availability modeling, we frequently face the following subproblem: the logical structure of the system as well as availability of each component at a given time  ∈ R is known and using this knowledge availability of the whole system at the time  ∈ R must be computed.In some cases, the time evolution of the availability is required.This problem was solved in the past by many different algorithms that can be roughly divided into analytical and simulation ones.In general, the simulation approach is employed when analytical techniques have failed to provide a satisfactory mathematical model.The principle behind the simulation approach is relatively simple and easy to apply.However, the common simulation techniques are slow and take a lot of time to provide accurate results.Nevertheless, this technique is the only practical method of carrying out reliability or risk studies, particularly when system is maintained and arbitrary failure and repair distributions are used or some special repair or maintenance strategy is prescribed.
The Monte Carlo (MC) method allows complex systems to be modeled without the need to make unrealistic simplifying assumptions, as is inevitably done when using analytical methods.With the increasing availability of fast computers, MC methods become more and more powerful and feasible [2].Recent reliability analyses of complex systems based on the MC method bring very efficient estimators [3].Finding of an appropriate estimator must be necessarily connected with a variance-reduction technique.There exist efficient techniques that provide significant reduction in variance, when they are correctly applied [4].The application of these techniques gives other potentials for optimization of the simulation algorithms solving complex real reliability problems.The feasibility of application to realistic cases stems from the possibility of evaluating the model in reasonable computing times, for example, by biasing techniques [5].
Another problem connected with the application of the MC method is slow convergence.To mitigate this issue, several techniques have been developed such as those based on the importance sampling and other classical methods [6].Another possibility of improving the convergence is provided by the so-called conditioning methods which directly modify the modeling or the simulation procedure, as, for example, in the Petri net approach [7,8].In spite of all these advanced techniques, in most practical cases whenever reliability characteristics must be determined exactly and efficiently, the analyst faces computing problems, that is, slow convergence and large simulation time.That is because systems are more and more complex and demands on reliability are continually growing.For example, NEC producer [9] has offered system components for use in supercomputers in order to support the need for an extremely high performance.The improving reliability will be an essential feature of the products of the next generation.
The above mentioned computing problems are solvable by applying parallel processing of computing algorithms.This new computing technology is possible to put into practice just on assumption that an adequate computing infrastructure exists, which can be used for the HPC (high performance computing) calculations.In Technical University of Ostrava such a new supercomputing center is recently developed and built.
This paper is organized as follows.Section 2 brings simple method for system representation based on adjacency matrix.Basic methods (both analytical and simulation) for availability quantification derived from system state evaluation are introduced in Section 3. Section 4 demonstrates basic approach to the parallel computing of system reliability characteristics resulting from the MATLAB parfor loop on the one hand and from CUDA (Compute Unified Device Architecture) technology on the other hand.Both algorithms can be further optimized using bitwise operations.Both analytical and simulation methods are discussed in Section 5, which describes numerical experiments in context with computer efficiency related to the parallel computing technology.Section 6 describes availability quantification applied on industrial example from references.Section 7 brings conclusion.

System and Its Representation Using
Adjacency Matrix  components with just two possible states are considered, the state space has 2  elements [10].
The system state is determined by the states of its components.The function (B):  → {0, 1} defined as is called the system function.
The logical structure of systems is often represented using block diagrams.A block diagram is a directed graph with two special vertices (in Figure 1 marked IN and OUT); the remaining vertices symbolize components of the system.A directed edge connecting two vertices indicates that there is a direct path between corresponding system elements in the indicated direction.Le us say that the system is operational if there exists a path between the IN element and the OUT element passing only through functional components.For the system function, the following holds.
For implementation purposes, it is suitable to represent the block diagram equivalently using the adjacency matrix  of size (n + 2) × (n + 2).The first row and column of the matrix are used for the IN element; the last row and column are used for the OUT element.The remaining rows and columns represent the components of the system.If a directed edge between elements  and  exists, ()  = 1; otherwise, ()  = 0.

Methods of Availability Quantification
As mentioned above, the following problem is solved in this paper."A system with  independent components and an appropriate system function (B) is given.Further availabilities of all components are known.It is necessary to quantify the availability of the system." The state space of the given system includes 2  state vectors.By substituting each state vector into the system function, the state space can be divided into two disjoint subsets  0 = {B ∈  | (B) = 0} and  1 =  \  0 .Let us denote the availability of the th component by   .Further denote Therefore, the probability (B) that the system is in state defined by a specific state vector B = ( 1 ,  2 , . . .,   ) can be calculated as At this point both solving methods can be presented, the analytical one and the simulation one.

Analytical Method.
Availability  is the probability that the system is operational at a considered time, that is, that it is in a state belonging to  1 in the time.This probability can be calculated as the sum of probabilities (B  ) that the system is in state B  ∈  1 .Therefore, the availability  can be expressed as Using this formula, the algorithm for the analytical computation of system availability can be easily constructed.Notice that each state vector of the system is evaluated, so it can be expected that the computation time increases exponentially with respect to the number of components.

Simulation Method.
One of the ways to avoid evaluating the state of the system for each state vector is to solve the problem using the Monte Carlo method.The problem is simulated using appropriate random variable .Here the following random variable is chosen: It should be verified that the mean value of  equals the availability .For the mean value holds true the following formula: That is, the random variable  is selected properly.The experiment consists of generating a random state vector and determining the value of   .The vectors are generated according to following rule: the probability that a specific state vector B  is generated equals the probability (B  ).After  repetitions of this experiment a random sample  1 ,  2 , . . .,   is obtained.The sample mean  of this random sample estimates the mean value of the variable  and therefore the availability .

System State Evaluation.
The most important part of both approaches from computation point of view is determination of the system state (B).It is an algorithm which for a system given by an adjacency matrix and for a given state vector B determines the state of the system, that is, if the system is operational.The algorithm consists of two primary steps: (i) removal of failed components, (ii) deciding if the system is operational, and is defined as follows.
(1) Remove rows and columns that belong to failed components from the adjacency matrix.( 2 Note.In fact the word "move" in the steps above means relocation of components, that is, deletion from original set and saving in a new set.If some components already exist in the target set within the process, they remain there. Considering the estimation of the computation time, notice that steps (3) to ( 6) are executed at most  times, where  is the number of operational components of the system,  ≤ .This means that the state of the system is calculated in  iterations at most.

Parallelization
Both analytical and simulation algorithms described in the previous section were implemented in the MATLAB language, at first as a sequential application and then both algorithms were parallelized.The principle of the parallelization consists in dividing the task into multiple simultaneous threads, each of them executing a different part of the calculation.The aim of the process is to optimize the algorithms on the one hand and reduce their computing time on the other hand.
The Monte Carlo method is suitable for parallelization, because the experiments are independent and may be realized in different threads.Assume that  ⋅  experiments should be executed in  threads.This means that in the th thread  random state vectors are generated and for each of them the system state is evaluated.The output of the thread is the total number of generated vectors for which the system was operational, denoted by   .The estimated availability can be easily obtained from the intermediate results as follows: Similarly, in case of the analytical algorithm, the state space can be divided into several parts and the state vectors are evaluated in different threads.

The Parfor Loop.
MATLAB programming tool allows a simple parallelization of loops with fixed number of recurrences using the "parfor" keyword.In case of the Monte Carlo method, the loop over all the sample size is parallelized and in case of the analytical algorithm it is the loop over the state space.Number of threads in which the calculation is executed equals the number of used processor cores at most.

4.2.
The CUDA Technology.These algorithms can be efficiently parallelized using CUDA (Compute Unified Device Architecture).This technology allows using the NVIDIA graphics cards for parallel computing.The threads are organized in blocks and the blocks are organized in a grid (CUDA C Programming Guide 2013).In case the graphics card is used in this simulation, there can be at most 1024 threads per block and 65535 × 65535 blocks per grid.Furthermore, the maximum amount of available memory cannot be exceeded.
In each thread the so-called "CUDA kernel" is executed.This special function is written in a file with the ".cu" extension; syntax is similar to the C++ language.There is a possibility of working with these files directly within the MATLAB environment.But first creating a ".ptx" file (Parallel Thread Execution) using external compiler (NVIDIA CUDA Compiler) is needed.From these two files the kernel object can be created in MATLAB.
MATLAB also allows allocation of arrays on the graphics card.These arrays are used as input and output variables of the kernels.For the realization of random samples in MC experiments (generation of random state vectors), an array of length  ⋅  of random numbers from (0, 1) is preallocated on the graphics card ( is the number of components and  is the number of experiments).To generate the th state vector,  numbers from this array are used (beginning with the position  ⋅  when indexing from 0).

Bitwise Operation.
A further optimization of both algorithms can be achieved using bitwise operations.Their proper use results in a reduction of computing time as well as in lower memory requirements, which is especially important when using the CUDA technology.
The state indicator of a component is always zero or one.To save the state indicator, it is thus adequate to use one bit.In this context, simulation variables of data type "unsigned int" are used having the length of 32 bits.The binary code of such variable consists of 32 digits (0/1), and each of them represents the state of one component.
In the following numerical experiments the bitwise operations are used in the CUDA kernel code.Their use in the pure MATLAB implementation is not suitable, because the bitwise operations are extremely time-consuming.

Testing
For testing purposes the following well-scalable problem is used."A system of  component with a serial structure is given.Availability of each component equals 0.999.It is necessary to quantify the availability of the system."  The system with serial structure is represented by the adjacency matrix given by The problem was solved for various values of  using both analytical and simulation methods.All of the implementations were tested under equal conditions on two processors Intel Sandy Bridge E5-2665, 8 cores each.The CUDA parallelization was tested on the graphics card NVIDIA GT 555 M. The results of tests are shown in Tables 1-6 and Figures 2 and  3.
The analytical algorithm was tested on small systems with 24 components at most.Tables 1, 2, and 3 show that the analytical method is unsuitable for large systems, because (as expected) the computing time increases approximately exponentially with respect to .However, for small systems, its use is appropriate since exact results are obtained.
The parallel scalability graph in Figure 2 shows the computation time decreasing when the number of threads grows from 2 to 16 threads (note the logarithmic scale of both axes).In case of the simulation algorithm besides the computation time also the estimated availability  and PRSD (percentage relative standard deviation) were recorded, which is demonstrated in Tables 4, 5, and 6.Random samples of size 10 5 were used.The number of system components was significantly increased as well to  = 200.
Also in case of the Monte Carlo method, it has been shown that the computation can be significantly reduced by the use of parallel computing.This fact is illustrated in the scalability graph in Figure 3, for several sizes of the system.The scale of both axes is logarithmic.

Industrial Example
In this section an example of a concrete engineering problem is presented; for its solution the aforementioned methods are used.The task is to simulate the availability of a fire detector system with independent components; for basic motivation see Rausand and Hoyland [11].The block diagram is shown in Figure 4, and according to it, the corresponding adjacency matrix can be constructed.Assume that the system is in its useful life period; that is, the failure rate and the repair rate of each component are constant.Therefore, the availability   () of the th component is given by Dubi [10]: where   is the failure rate and   is the repair rate of the th component, exactly specified in Table 7.
Assume that in time  0 = 0 (h) all components are operational.The availability of the system is simulated in time; thus an appropriate time discretization should be chosen.Here, for example, the constant step of 6 hours is chosen.At this point, the simulation of the availability course in time can be computed.For each time step, availability of each component is first calculated using formula (10); then the availability of the whole system is calculated using one of the methods presented in Sections 3 and 4. It is convenient to plot the results for each time step into a graph; see Figure 5.
Table 8 compares the computation time needed to calculate the availability in 50 time steps (plotted in the graphs in Figures 5 and 6) by different types of the algorithm.

Conclusions
This paper was aimed at two specific computing methods (analytical versus simulation) which can be used in applications related to availability quantification of systems represented by the adjacency matrix.Both methods were parallelized; that is, their corresponding algorithms (see Section 3) were implemented in parallel processing, by two ways: (i) parallelization using the MATLAB parfor loop,  (ii) parallelization using the CUDA technology together with the use of the bitwise operations.
Using these two parallelization techniques, the computation efficiency of both algorithms was successfully enhanced.More significant reduction of the computing time was achieved using the second technique.For instance, the practical example presented in Section 6 was computed over a computing time which is 275 times shorter than the computing time used for the original sequential implementation.
The parallel scalability was illustrated in the first of these two techniques (the parfor loop).The scalability graphs in Figures 3 and 4 confirm that the use of parallel computing is always more efficient than the traditional approach using the sequential implementation.
On testing example we showed that the simulation methodology is particularly convenient when the number of system components increases.For instance, if the number of system components approaches  = 200, this method gives acceptable results, whereas using analytical method is impossible.

Figure 4 :
Figure 4: Block diagram of a fire detection system.

Figure 5 :Figure 6 :
Figure 5: Dependence of the availability on the time: analytical method.
Consider a system consisting of  components,  ∈ N. At a given time  ∈ R each component is in one of two disjoint states, operational or failed.To the th component a state indicator   belongs.If the component is operational,   = 1; otherwise,   = 0.The vector B = ( 1 ,  2 , . . .,   ) is called the state vector of the system.The set of all possible state vectors is called the state space of the system and is denoted by .Since ) Create three empty sets ACTUAL, AVAILABLE, and VISITED.Move system elements that are directly connected to the IN element to the AVAILABLE set.

Table 3 :
Analytical algorithm using CUDA and bitwise operations.

Table 6 :
Simulation algorithm using CUDA and bitwise operations.

Table 7 :
Failure and repair rates.