Availability Allocation of Networked Systems Using Markov Model and Heuristics Algorithm

It is common practice to allocate the system availability goal to reliability and maintainability goals of components in the early design phase. However, the availability of a networked system is difficult to allocate because of its complex topology and multiple down states. To solve these problems, a practical availability allocation method is proposed. Network reliability algebraic methods are used to derive the availability expression of the networked topology on the system level, and the Markov model is introduced to determine that on the component level. A heuristic algorithm is proposed to obtain the reliability and maintainability allocation values of components. The principles applied in the AGREE reliability allocation method, proposed by the Advisory Group on Reliability of Electronic Equipment, and in the failure rate-based maintainability allocation method are retained in our allocation method. A series system is used to verify the new algorithm, and the result shows that the allocation based on the heuristic algorithm agrees closely with the traditional one (RMSE of 0.00121). Moreover, our case study of a signaling system number 7 shows that the proposed allocation method is efficient for networked systems.


Introduction
Availability is the probability that a system or a component is performing its required function at a given point in time or over a stated period of time when operated and maintained in a prescribed manner [1]. If the system or component repair can be viewed as a renewal process, the steady-state availability exists. One type of steady-state availability, inherent availability, is based solely on the failure distribution and repair-time distribution as a design parameter and is defined as
$$A = \lim_{T \to \infty} \frac{1}{T} \int_0^T A(t)\,dt = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}, \tag{1}$$
where $T$ is the operating time, MTBF is the mean time between failures, and MTTR is the mean time to repair. In the early design phase, the system availability goal is specified and should be allocated to reliability requirements (e.g., failure rate, MTBF) and maintainability requirements (e.g., repair rate, MTTR) of components for further design and verification. The reliability and maintainability allocation results provide meaningful inputs to design (i.e., establishment of the right input design criteria at the proper level) and criteria for verification.
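As a small illustration of the inherent availability definition, the following sketch computes it both from MTBF and MTTR and from the equivalent failure and repair rates. The function names and the numerical values are hypothetical and are not taken from the paper.

```python
def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Inherent availability A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def availability_from_rates(failure_rate: float, repair_rate: float) -> float:
    """Same quantity written with rates: lambda = 1/MTBF, mu = 1/MTTR."""
    return repair_rate / (failure_rate + repair_rate)


# Hypothetical component: one failure per 1980 h on average, 20 h mean repair.
a_times = inherent_availability(1980.0, 20.0)
a_rates = availability_from_rates(1.0 / 1980.0, 1.0 / 20.0)
print(f"A from times: {a_times:.4f}, A from rates: {a_rates:.4f}")
```

Both forms give the same value, which is why the allocation can be stated either in terms of MTBF and MTTR or in terms of failure and repair rates.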
As Messer stated [2], availability allocation is extended from reliability allocation. Bouissou and Brizec summarized more than 20 availability allocation methods and generalized them into two categories: one is optimal availability allocation, which aims at finding the minimum cost under an availability goal or the maximal system availability under cost constraints, and the other is based on weighing factors, which considers the system structure [3]. However, these availability allocation methods are only suitable for simply structured systems. In recent years, researchers have made great efforts to improve the availability allocation methods. For example, Elegbede and Adjallah applied genetic algorithms to solve the NP-hard multiobjective optimal availability allocation problem for series-parallel systems [4]; Chiang and Chen proposed a simulated annealing based multiobjective genetic algorithm (saMOGA) to solve the optimal availability allocation problem for series-parallel systems [5]; Barabady and Kumar used the availability importance measure based on MTBF and MTTR to find optimal allocation results with the minimum cost based on a genetic algorithm for series, parallel, and series-parallel systems [6]; Juang et al. proposed a genetic algorithm based on a knowledge-based interactive decision support system to improve the availability allocation for series-parallel systems [7]; Liu studied the availability optimization problem for $n$-stage standby systems under different resource and design configuration constraints by applying a combined Tabu-genetic algorithm [8]; Xie et al. extended the optimal availability allocation to consider redundancy allocation and spare parts provisioning simultaneously for $k$-out-of-$n$: G systems [9]. However, the systems mentioned above are simply structured ones. Mayer considered the availability allocation problem for multipath networks, but the system availability was modeled using series-parallel relationships, and the networked structure was not included [10]. Nowadays, networked systems are common across the natural and man-made world, for example, networked communication systems, networked control systems, and networked power systems. For these systems, the system availability goal cannot be allocated using the methods above because of the networked structure. To the best of our knowledge, availability allocation is still not well studied for networked systems.
Moreover, some complex components of networked systems have multiple down states. As Ali stated in [11], several types of complex failures, for example, detection failure, coverage failure, diagnostic failure, and recovery failure, are common in digital switched systems. The availability of such components cannot be directly expressed by (1). The Markov model is widely used in complex system availability analysis. For example, Lazaroiu and Staicut applied the Markov model to derive the availability expression for telecommunication switching systems [12]; Lai et al. used the Markov model for hardware/software systems to cover both hardware and software failures [13]; Liu and Trivedi introduced the Markov model to derive the availability expression of telecommunications switching systems and combined it with the performance model [14]. Further, Hu et al. applied the Markov model to the optimal allocation problem for series-parallel systems [15].
In this paper, we study availability allocation based on weighing factors for networked systems. Traditionally, the system inherent availability goal is broken down into reliability and maintainability goals on the system level, and those system goals are allocated to subsystems or components using a reliability allocation method and a maintainability allocation method, respectively. However, as mentioned earlier, the traditional availability allocation methods are not practical for networked systems because of their complex structures and multiple down states. To solve these problems, we propose an availability allocation method based on the Markov model and a heuristic algorithm, in which the principles of both the AGREE reliability allocation method and the failure rate-based maintainability allocation method are retained.
The remainder of the paper is organized as follows. Section 2 introduces the availability models for networked systems based on the network reliability algebraic method and the Markov model. Section 3 proposes our availability allocation method, including goals, assumptions, principles, and procedures. In Section 4, the heuristic algorithm is verified on a series system by comparison with the traditional method. A case study of a signaling system number 7 (SS7) is presented in Section 5 to validate our availability allocation method on networked systems. Finally, concluding remarks are provided in Section 6.

Availability Models for Networked System
A simple structure of a networked system is illustrated in Figure 1.
Since availability is a probability, the network reliability algebraic methods summarized by Shier [16], for example, the inclusion-exclusion method, the sum of disjoint products method, and the factoring method, can be applied to compute the availability of a networked system from the availability of its nodes and links. Furthermore, the availability of links and nodes can be modeled using reliability block diagrams (RBD) and expressed as a function of the availability of the components that make them up. Therefore, the availability of such a networked system is given by
$$A_S = f(A_1, A_2, \ldots, A_m), \tag{2}$$
where $A_1, A_2, \ldots, A_m$ are the availability of the $m$ types of components.
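The inclusion-exclusion method mentioned above can be sketched as follows, assuming the minimal paths are already known and the components are statistically independent. The function name, the component labels, and the availability values are hypothetical; the example network is a single component in series with two redundant ones, not a system from the paper.

```python
from itertools import combinations


def network_availability(min_paths, availability):
    """Inclusion-exclusion over minimal paths of independent components."""
    total = 0.0
    for size in range(1, len(min_paths) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for group in combinations(min_paths, size):
            # All paths in the group work iff every component in their union works
            components = set().union(*group)
            term = 1.0
            for name in components:
                term *= availability[name]
            total += sign * term
    return total


# Hypothetical network: component "a" followed by "b" and "c" in parallel
paths = [{"a", "b"}, {"a", "c"}]
avail = {"a": 0.99, "b": 0.9, "c": 0.9}
a_system = network_availability(paths, avail)
print(f"system availability = {a_system:.4f}")
```

The number of terms grows as 2^p - 1 for p minimal paths, so this direct form is only practical for small path sets; the sum of disjoint products and factoring methods avoid part of that growth.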
According to Ali [11], several fault tolerance techniques are applied to the component design in the networked system, and they introduce some complex failures. For example, (1) a detection failure occurs when a component fails to detect a failure when it is supposed to; (2) a coverage failure occurs when a component fails during a switchover between active and standby modes; (3) a diagnostic failure occurs when a component's diagnostics cannot correctly identify failed units; and (4) a recovery failure occurs when a component's emergency recovery program cannot bring the component back to an operational mode.
For components with such complex failures, the availability cannot be calculated through (1). The Markov model is capable of solving this problem. After a state transition diagram is created for the component, its steady-state probabilities can be solved from the flow rate equations, and the component availability is obtained by summing the probabilities of all available states. Therefore, in addition to reliability and maintainability parameters, there are other variables in the component availability expressions, for example, detection frequency, coverage probability, diagnostic frequency, and recovery rate. The component availability can be expressed as
$$A_i = g_i(\lambda_i, \mu_i, x_{i1}, \ldots, x_{ip_i}), \tag{3}$$
where $\lambda_i$ and $\mu_i$ are the failure rate and repair rate of component $i$, and $x_{i1}, \ldots, x_{ip_i}$ represent the other $p_i$ variables in the availability expression of component $i$.
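To make the Markov step concrete, the sketch below solves the flow rate (balance) equations numerically for a hypothetical three-state component with imperfect failure detection. The state layout, the parameter names, and all rates are assumptions made for illustration; they are not the models of the paper's case study.

```python
def build_generator(rates, n_states):
    """Generator matrix Q from a {(from_state, to_state): rate} mapping."""
    q = [[0.0] * n_states for _ in range(n_states)]
    for (i, j), rate in rates.items():
        q[i][j] += rate
        q[i][i] -= rate
    return q


def steady_state(q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(q)
    # Row i holds the balance equation of state i (column i of Q)
    a = [[q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    a[-1] = [1.0] * n + [1.0]  # replace one equation by the normalization
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Hypothetical component: state 0 = up, 1 = down and detected (under repair),
# state 2 = down but not yet detected.
lam, mu, coverage, detect = 1e-3, 0.1, 0.9, 0.5
rates = {
    (0, 1): coverage * lam,          # failure detected immediately
    (0, 2): (1.0 - coverage) * lam,  # failure missed by detection
    (2, 1): detect,                  # latent failure found later
    (1, 0): mu,                      # repair completed
}
pi = steady_state(build_generator(rates, 3))
up_states = {0}
availability = sum(pi[s] for s in up_states)
print(f"availability = {availability:.6f}")
```

For this small model the result can be checked against the closed form 1 / (1 + lam/mu + (1 - coverage) * lam / detect), which shows how variables other than the failure and repair rates enter the component availability.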

Assumptions.
In this paper, we study the availability allocation problem based on the following assumptions.
(1) The nodes and links of the networked system only have two states, perfect functioning and complete failure.
(2) All nodes and links are independent physically and statistically.
(3) Upon completion of a maintenance function, a repaired unit is as good as a new one.
(4) All failure times and repair times of components in the lowest allocation level follow exponential distributions.
(5) The system maintainability goal is already specified as $\mathrm{MTTR}_S^*$, and the other variables in the component availability expression (see (3)) are also given.
(6) The operating time is the same for all components.

3.2. Principles.
The AGREE method and the failure rate-based method are two of the most widely used reliability and maintainability allocation methods. However, these two methods cannot be applied to networked systems directly because of their complex topology and multiple down states. The ideas behind these allocation methods, such as allocating reliability according to component importance and complexity and allocating maintainability according to component failure rate, can still be used as our allocation principles.
In the AGREE method, reliability allocation is applied to a series system composed of components with exponential lifetimes. It is realized by allocating the following failure rate to component $i$:
$$\lambda_i = \frac{n_i \left[ -\ln R_S^*(T) \right]}{N \omega_i t_i}, \tag{6}$$
where $R_S^*(T)$ is the system reliability goal at system operating time $T$, $n_i$ is the complexity number, for example, the number of modules within component $i$, $N = \sum_i n_i$ is the total number of modules in the system, $\omega_i$ is the probability that the system will fail if component $i$ fails, and $t_i$ is the operating time of component $i$ ($t_i \le T$).
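A minimal sketch of the AGREE allocation rule, assuming its usual textbook form in which the allocated failure rate is proportional to the module count and inversely proportional to importance and operating time. The function name, the reliability goal of 0.95, and the 100-hour operating time are hypothetical; the module counts are those of the series system used later for verification.

```python
import math


def agree_failure_rates(r_goal, modules, importance, op_times):
    """Allocated failure rates n_i * (-ln R*) / (N * w_i * t_i)."""
    if not 0.0 < r_goal < 1.0:
        raise ValueError("reliability goal must lie strictly between 0 and 1")
    total_modules = sum(modules)
    demand = -math.log(r_goal)
    return [
        n * demand / (total_modules * w * t)
        for n, w, t in zip(modules, importance, op_times)
    ]


modules = [10, 30, 20, 10]   # module counts of the four series components
importance = [1.0] * 4       # in a series system every failure fails the system
op_times = [100.0] * 4       # hypothetical common operating time in hours
rates = agree_failure_rates(0.95, modules, importance, op_times)

# With unit importance the allocated rates reproduce the goal exactly
achieved = math.exp(-sum(l * t for l, t in zip(rates, op_times)))
print([f"{l:.3e}" for l in rates], f"achieved R = {achieved:.4f}")
```

Components with more modules receive a larger failure rate budget, which is the first of the allocation principles listed in this section.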
In the failure rate-based method, for a system whose repair follows a renewal process, the maintainability allocation is implemented by allocating the following repair rate to component type $i$ [17]:
$$\mu_i = \frac{m\, k_i \lambda_i}{\mathrm{MTTR}_S^* \sum_{j=1}^{m} k_j \lambda_j}, \tag{7}$$
where $m$ is the number of component types and $k_i$ is the number of identical components of type $i$.
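The failure rate-based rule can be sketched as below, assuming the common form in which the allocated mean repair time of a component type is inversely proportional to its total failure rate and the allocation reproduces the system MTTR goal. The function name and the failure rates are hypothetical; the 20-hour MTTR goal is the one used in the verification example.

```python
def failure_rate_based_repair_rates(mttr_goal, failure_rates, counts):
    """Repair rates m * k_i * lam_i / (MTTR* * sum_j k_j * lam_j)."""
    m = len(failure_rates)
    weighted = sum(k * l for k, l in zip(counts, failure_rates))
    return [m * k * l / (mttr_goal * weighted)
            for k, l in zip(counts, failure_rates)]


failure_rates = [1e-4, 3e-4, 2e-4, 1e-4]  # hypothetical, per hour
counts = [1, 1, 1, 1]                     # one unit of each component type
mu = failure_rate_based_repair_rates(20.0, failure_rates, counts)

# Failure-frequency-weighted mean repair time of the system
weights = [k * l for k, l in zip(counts, failure_rates)]
system_mttr = sum(w / u for w, u in zip(weights, mu)) / sum(weights)
print([round(1.0 / u, 2) for u in mu], round(system_mttr, 6))
```

The component that fails most often is given the shortest allowed repair time, while the weighted system MTTR stays at the goal.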
As the structure and failures of a networked system are complex, the availability goal cannot be allocated through reliability allocation and maintainability allocation separately. Moreover, (6) cannot be applied directly to nonseries systems. Nevertheless, the AGREE method and the failure rate-based method establish the four basic principles of our availability allocation: (1) assign higher reliability goals to less complex components; (2) assign higher reliability goals to more important components; (3) assign higher reliability goals to components which operate longer; (4) assign higher maintainability goals to components with higher failure frequency.
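As in (6) and (7), the four basic availability allocation principles above are retained through the proportionality relations
$$\lambda_i \propto \frac{n_i}{\omega_i t_i}, \tag{8}$$
$$\mu_i \propto k_i \lambda_i. \tag{9}$$

3.3. Procedures.
Let $\varepsilon$ be the availability allocation accuracy requirement, and allocate the system availability goal $A_S^*$ to its component reliability and maintainability requirements using the following procedures.

Step 1. Determine the system reliability expression using the network reliability algebraic method on the network level and RBD on the lower level as
$$R_S(t) = h\big(R_1(t), R_2(t), \ldots, R_m(t)\big), \tag{10}$$
where $R_S(t)$ is the system reliability at time $t$ and $R_i(t)$ is the reliability of identical component type $i$.

Step 2. Determine the system availability expression from (2) and (3) as
$$A_S = f(A_1, A_2, \ldots, A_m), \qquad A_i = g_i(\lambda_i, \mu_i, x_{i1}, \ldots, x_{ip_i}). \tag{11}$$

Step 3. Let the initial reliability importance of each component type be equal to 1; that is,
$$\omega_i = 1, \quad i = 1, 2, \ldots, m. \tag{12}$$

Step 4. Calculate the failure rate coefficient for component type $i$ as
$$c_i = \frac{n_i}{\omega_i t_i}, \tag{13}$$
where $t_i$ is its longest operating time. To retain the allocation principle in (8), let the allocated failure rate for component type $i$ be
$$\lambda_i = c_i \theta, \tag{14}$$
where $\theta$ is a positive variable to be solved.

Step 5. Obtain the allocated repair rate expression of component type $i$ from (7) and (14) as
$$\mu_i = \frac{m\, k_i c_i}{\mathrm{MTTR}_S^* \sum_{j=1}^{m} k_j c_j}. \tag{15}$$

Step 6. By substituting (14) and (15) into (11), the system availability becomes a function of the variable $\theta$. Solve the following optimization problem using the bisection search method:
$$\max \theta \quad \text{s.t.} \quad A_S(\theta) \ge A_S^*, \quad \theta > 0, \tag{16}$$
where $\theta$ is the decision variable, so that the maximum allowable failure rate is obtained under the constraint of the system availability goal.

Step 7. From the optimal $\theta$, compute the allocated $\lambda_i$ and $\mu_i$ for component type $i$ using (14) and (15). Then, calculate the allocated reliability as
$$R_i(t_i) = e^{-\lambda_i t_i}, \tag{17}$$
compute the probability of completing a repair in less than $\tau$ hours as
$$M_i(\tau) = 1 - e^{-\mu_i \tau}, \tag{18}$$
and obtain the allocated availability from (3).

Step 8. Calculate the new reliability importance of components according to the Birnbaum importance from (10) and (17) as
$$\omega_i = \frac{\partial R_S(t)}{\partial R_i(t)}. \tag{19}$$

Step 9. Compare the allocation results with those of the previous iteration; if the difference exceeds the accuracy requirement $\varepsilon$, return to Step 4 with the new importance values; otherwise, stop.

As (16) expresses, the requirement of (4) is always met, so there is no need to verify whether the allocation results satisfy the system availability goal.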
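The procedure can be sketched for the simplest case, a series system with one unit per component type and two-state components whose availability is mu / (lambda + mu). The sketch assumes that the failure rate coefficient is the module count divided by importance and operating time, that repair rates follow the failure rate-based rule scaled to the system MTTR goal, and that iteration stops when the allocated reliabilities change by less than the accuracy requirement. The function name, the 100-hour operating time, and this stopping rule are illustrative assumptions rather than details confirmed by the paper.

```python
import math


def allocate_series(a_goal, mttr_goal, modules, op_time, eps=1e-6, max_iter=50):
    """Heuristic availability allocation for a series system (illustrative)."""
    m = len(modules)
    importance = [1.0] * m          # initial importance of every component
    previous = None
    for iteration in range(1, max_iter + 1):
        # Failure rate coefficients: complexity / (importance * operating time)
        coeff = [n / (w * op_time) for n, w in zip(modules, importance)]
        # Repair rates proportional to failure rates, scaled to the MTTR goal
        mu = [m * c / (mttr_goal * sum(coeff)) for c in coeff]

        def system_availability(theta):
            # Series system: product of component availabilities
            return math.prod(u / (c * theta + u) for c, u in zip(coeff, mu))

        # Bisection for the largest theta that still meets the goal
        low, high = 0.0, 1.0
        while system_availability(high) >= a_goal:
            high *= 2.0
        for _ in range(200):
            mid = 0.5 * (low + high)
            if system_availability(mid) >= a_goal:
                low = mid
            else:
                high = mid

        lam = [c * low for c in coeff]
        reliability = [math.exp(-l * op_time) for l in lam]
        # Birnbaum importance in a series system: product of the other R_j
        r_system = math.prod(reliability)
        importance = [r_system / r for r in reliability]

        if previous is not None and max(
            abs(x - y) for x, y in zip(reliability, previous)
        ) < eps:
            return lam, mu, iteration
        previous = reliability
    raise RuntimeError("allocation did not converge")


# Goals and module counts of the verification example; op_time is hypothetical
lam, mu, iterations = allocate_series(0.99, 20.0, [10, 30, 20, 10], op_time=100.0)
availability = math.prod(u / (l + u) for l, u in zip(lam, mu))
system_mttr = sum(l / u for l, u in zip(lam, mu)) / sum(lam)
print(iterations, round(availability, 6), round(system_mttr, 3))
```

Because the bisection keeps the lower bound on the feasible side of the constraint, the returned allocation meets the availability goal by construction, which mirrors the remark that no separate check of the goal is needed.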

Verification
Consider a series system with four components, as Figure 2 shows. This system is analyzed to verify the heuristic algorithm in Section 3.3.
Under the assumptions described in Section 2, the reliability and availability of component $i$ can be expressed as
$$R_i(t) = e^{-\lambda_i t}, \qquad A_i = \frac{\mu_i}{\lambda_i + \mu_i}, \tag{20}$$
respectively. The system reliability and availability can be obtained from the RBD as
$$R_S(t) = \prod_{i=1}^{4} R_i(t), \qquad A_S = \prod_{i=1}^{4} A_i.$$
Suppose that the system availability goal is $A_S^* = 0.99$, the system maintainability goal is $\mathrm{MTTR}_S^* = 20$ hours, the allocation accuracy requirement is $\varepsilon = 0.000001$, and the numbers of modules in components 1, 2, 3, and 4 are 10, 30, 20, and 10, respectively. Using the procedures in Section 3.3, the accuracy requirement was achieved after 4 iterations. The iteration process is illustrated in Table 1. The bold numbers indicate the allocation results that did not satisfy the accuracy requirement and needed more iterations. The data in the last 3 rows are the final allocation results. The root mean square error (RMSE) between the allocation results in iteration $j$ and the final results can be calculated as
$$\mathrm{RMSE}_j = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( x_i^{(j)} - x_i^{(K)} \right)^2}, \tag{24}$$
where $x$ represents $\lambda$, $\mu$, or $A$ in (20), $m$ is the number of component types, and $K$ is the number of iterations. RMSE decreases after each iteration, as Figure 3 illustrates, which shows that the new allocation algorithm converges well.
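The RMSE used to track convergence can be computed as in the following sketch, which compares each iteration with the final one. The function name and the iteration history are hypothetical and do not reproduce the values of Table 1.

```python
import math


def rmse(iteration_values, final_values):
    """Root mean square error between one iteration and the final results."""
    if len(iteration_values) != len(final_values) or not final_values:
        raise ValueError("both sequences must be non-empty and of equal length")
    squared = [(x - y) ** 2 for x, y in zip(iteration_values, final_values)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical allocated availabilities of three component types, per iteration
history = [
    [0.9970, 0.9980, 0.9975],
    [0.9974, 0.9976, 0.9975],
    [0.9975, 0.9975, 0.9975],
]
errors = [rmse(row, history[-1]) for row in history]
print([f"{e:.2e}" for e in errors])
```

A monotonically decreasing error sequence that ends at zero is the behavior described for Figure 3.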
If the traditional availability allocation method is used, the system reliability goal is first obtained from (1) as
$$\mathrm{MTBF}_S^* = \frac{A_S^*\, \mathrm{MTTR}_S^*}{1 - A_S^*}.$$
Then, the reliability and maintainability goals are allocated to components using the AGREE reliability allocation method and the failure rate-based maintainability allocation method described in Section 3.2. The allocation results are illustrated in Table 2. Comparing the allocation results obtained from our method and the traditional method, one can see that the RMSE is only 0.00121 and that this error is mainly caused by the different importance calculation methods. In the AGREE method, the reliability importance is the probability that the system will fail given that the component has failed, while the Birnbaum importance in our new method is the maximum loss in system reliability when the component switches from the normal state to the failed state. This case shows that the new heuristic algorithm in our availability allocation method is suitable for series systems and that the allocation difference is very low.
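The first step of the traditional route can be illustrated with the goals of the series example. The sketch assumes that the reliability goal is expressed as a system MTBF derived from the inherent availability definition, and that all components share the same operating time and unit importance, in which case the AGREE rule reduces to splitting the system failure rate in proportion to module counts. The function names are hypothetical.

```python
def system_mtbf_goal(a_goal, mttr_goal):
    """MTBF implied by A = MTBF / (MTBF + MTTR)."""
    if not 0.0 < a_goal < 1.0:
        raise ValueError("availability goal must lie strictly between 0 and 1")
    return a_goal * mttr_goal / (1.0 - a_goal)


def proportional_failure_rates(system_failure_rate, modules):
    """AGREE with unit importance and equal operating times."""
    total = sum(modules)
    return [system_failure_rate * n / total for n in modules]


mtbf_goal = system_mtbf_goal(0.99, 20.0)   # goals of the series example
lam_system = 1.0 / mtbf_goal
lam = proportional_failure_rates(lam_system, [10, 30, 20, 10])
print(round(mtbf_goal, 3), [f"{l:.3e}" for l in lam])
```

The allocated component rates sum to the system failure rate goal, which is what a series system with exponential lifetimes requires.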

Case Study
In this section, an SS7 system is used to illustrate the effectiveness of our allocation method. The topology of the system is shown in Figure 4, where we have the following.
(i) Service switching point (SSP): an end point used as a switch that originates, terminates, or tandems calls. It sends signaling messages to other SSPs to set up, manage, and release the voice circuits required, or sends a query message to a service control point to seek routing information.
(ii) Signaling transfer point (STP): a packet switch used to transfer traffic between signaling points based on routing information contained in the SS7 message.
(iii) Service control point (SCP): an end point used as a specialized database. It accepts queries from SSPs and retrieves routing information to support services.
(iv) A link (access link): connects a signaling end point (e.g., an SCP or SSP) to an STP.
(v) B link (bridge link): connects one STP to another. Typically, a quad of B links interconnects primary STPs.
(vi) C link (cross link): connects STPs performing identical functions into a mated pair. A C link is used only when an STP has no other route available to a destination signaling point due to link failures.
The data transmission process works as follows. When a customer dials a telephone number, the number is forwarded to the SSP, which recognizes it as a call requiring special handling and queries the SCP database through the STP. The response containing routing information is passed via the STP switching system back to the SSP. Finally, the virtual link is constructed, and the source and the destination are connected through the route given by the SCP.

Availability Model.
To successfully build a connection between the two telephones, at least one path needs to exist between the source telephone and one of the SCPs, and at least one path should exist between the two telephones. The RBD of the SS7 system is shown in Figure 5. One can see that it is a type of networked structure. It is assumed that links are perfect and that the system availability goal is only allocated to the components that make up the nodes.
From Figure 5, we can find 8 minimal paths, and the analytic expressions of the SS7 system reliability and availability can be obtained using the inclusion-exclusion method as functions of the reliability of the telephone, SSP, STP, and SCP and of the availability of the corresponding nodes.
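For the SS7 system, because of the multiple down states, the component availability cannot be modeled using RBD alone. Take the STP as an example. Its RBD is illustrated in Figure 6. One can see that it is a series system, and the STP reliability and availability can be calculated by
$$R_{\mathrm{STP}}(t) = \prod_{j=1}^{3} R_{\mathrm{STP},j}(t), \qquad A_{\mathrm{STP}} = \prod_{j=1}^{3} A_{\mathrm{STP},j},$$
where $R_{\mathrm{STP},1}(t)$, $R_{\mathrm{STP},2}(t)$, $R_{\mathrm{STP},3}(t)$, $A_{\mathrm{STP},1}$, $A_{\mathrm{STP},2}$, and $A_{\mathrm{STP},3}$ are the reliability and availability of the STP signal processor, packet switcher, and power supply, respectively.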
The Markov models of the STP signal processor, packet switcher, and power supply are shown in Figure 7. States drawn with a single circle are available states, and states drawn with a double circle are down states. The STP signal processor has diagnostic and recovery functions, and the packet switcher has a failure detection function. From these Markov models, the steady-state probabilities can be calculated from the flow rate equations, and the availability of each of the three STP components is obtained by summing the probabilities of its available states. Combining these availability expressions with the reliability expressions above gives the system-level expressions used for allocation. The parameters are listed in Table 3.

Availability Allocation.
Assume that the accuracy requirement is $\varepsilon = 0.000001$, and the numbers of modules in the phone, SSP, STP signal processor, STP packet switcher, STP power supply, and SCP are 50, 200, 80, 90, 30, and 200, respectively.
The system availability goal was allocated down to the reliability and maintainability requirements using our procedures in Section 3.3, and the accuracy requirement was achieved after 11 iterations. Table 4 shows the iteration process, and the bold numbers indicate the allocation results that needed more iterations. The final results were obtained in the 11th iteration. When the component importance shifts between two adjacent iterations, the mean of the two values can be used to accelerate the iteration process.
According to (24), the RMSE between the allocation results in each iteration and the final result decreases sharply as Figure 8 shows.

Conclusion
In this paper, an availability allocation method is proposed for networked systems. This method has three advantages: (1) a heuristic algorithm is proposed to handle the networked structure, whereas the traditional availability allocation methods can only be used for simple structures; (2) Birnbaum importance is applied to calculate the component importance, which is otherwise difficult to obtain for a networked structure; and (3) the Markov model is introduced into the availability modeling process in order to model components with multiple down states.

Figure 1: The topology of a simple networked system.

Figure 2: The reliability block diagram of a series system.

Figure 3: RMSE of the availability allocation iteration.

Figure 4: The topology of an SS7 system.

Figure 5: The schematic diagram of the SS7 system.

Table 1: Availability allocation process for a series system (new method).

Table 2: Availability allocation for a series system (traditional method).
The parameters of the STP availability expressions are the repair rates of the three components; the recovery rate, recovery failure probability, diagnostic return rate, and diagnostic frequency of the STP signal processor; and the detection frequency and detection probability of the STP packet switcher.