Designing Fault Tolerance Strategy by Iterative Redundancy for Component-Based Distributed Computing Systems

Reliability is a critical issue for component-based distributed computing systems: some distributed software admits large numbers of potentially faulty components on an open network. Faults are inevitable in this large-scale, complex, distributed-component setting, which may include many untrustworthy parts. Providing highly reliable component-based distributed systems is therefore a challenging problem and a critical research topic. Generally, redundancy and replication are utilized to achieve fault tolerance. In this paper, we propose a CFI (critical fault iterative) redundancy technique that guarantees efficient use of resources (e.g., computation and storage) when building fault-tolerant applications. When operating in an environment where component reliability is unknown, CFI redundancy is more efficient and adaptive than other techniques (e.g., K-Modular Redundancy and N-Version Programming). In the CFI redundancy strategy, the function invocation relationships and invocation frequencies are employed to rank the functions' importance and to identify the most vulnerable functions implemented by functionally equivalent components, trading off efficiency against reliability. A formal theoretical analysis and an experimental analysis are presented. Compared with existing methods, the reliability of component-based distributed systems can be greatly improved by applying fault tolerance to only a small set of significant components.


Introduction
With technology scaling, Internet-based services such as cloud computing and volunteer computing share resources (e.g., software, hardware platforms, and computation resources) to provide services on demand. Since Amazon introduced the Elastic Compute Cloud (EC2), cloud computing, which involves multiple components communicating over imperfectly reliable networks, has become one of the hottest research areas. As a typical cloud-based paradigm, volunteer computing uses Internet-connected computers volunteered by their owners as a source of computing power and storage. It can support applications that are significantly more data-intensive or have larger memory or storage requirements. Compared with other types of high-performance computing (e.g., grid computing), volunteer computing exhibits a high degree of diversity: the volunteered computers vary widely in software and hardware type, speed, availability, reliability, and network connectivity, as do the resource requirements and completion-time constraints of the applications [1].
The reliability of cloud computing and volunteer computing is far from perfect in practice. Traditional reliability engineering employs fault forecasting, fault prevention, fault removal, and fault tolerance. How to build highly reliable and available component-based services remains a challenging and urgent research problem in both academia and industry, and the tradeoff between efficient use of resources and system reliability must also be taken into account. Large numbers of redundant computing resources exist in cloud computing settings, especially in volunteer computing, which is built on unreliable volunteer resources. A well-known software fault tolerance technique called design diversity can be employed to tolerate faults in this setting. But when the reliability of each functionally equivalent component is low, traditional three-modular redundancy may fail to obtain a consistent result in a single deployment. For instance, when the reliability of each functionally equivalent component is 0.55, the probability that three-modular redundancy obtains two or three consistent results is C(3, 2) · 0.55² · 0.45 + C(3, 3) · 0.55³ = 0.57475, where C(n, k) denotes the binomial coefficient. A correct-result probability this low may not meet the goal of high system reliability.
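As a sanity check on this figure, the majority-vote reliability of k-modular redundancy can be computed directly. A minimal sketch in Python (the function name is ours):

```python
from math import comb

def kmr_reliability(r: float, k: int) -> float:
    """Probability that k-modular redundancy reaches a correct majority,
    given each replica returns the correct result with probability r."""
    need = k // 2 + 1  # strict majority of k votes
    return sum(comb(k, i) * r**i * (1 - r)**(k - i) for i in range(need, k + 1))

print(kmr_reliability(0.55, 3))  # ≈ 0.57475, matching the example above
```

The same function covers any odd cost factor k, which is useful later when comparing cost factors.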
We present a CFI (critical fault iterative) redundancy technique in this paper, ensuring that efficiently used redundant resources can yield high system reliability. We first construct a function ranking model based on a graph representation of the functions' invocation relationships and invocation frequencies. A function ranking algorithm identifies the Top-K significant functions via these invocation relationships and frequencies. A new iterative redundancy technique is then proposed to enhance system reliability; it does not require knowledge of component reliability (assuming the reliability of each component is ≥ 0.5). In this paper, the concepts of function and component are interchangeable; however, when a function is executed by several functionally equivalent components, there are discrepancies between the reliability of the function and the reliability of each component. CFI, based on majority voting algorithms (such as TMR [2] and NVP [3, 4]), exploits the properties of distributed computation architectures to adapt more efficiently and to achieve the same level of system reliability at a lower cost factor. Using the function ranking algorithm, we observe that a function frequently invoked by other functions generally receives a higher ranking score, while functions invoked only by low-ranked functions receive lower scores. CFI can adapt to dynamic environments by re-executing the function ranking algorithm. The key property of CFI redundancy is that resources are assigned preferentially to the most vulnerable functions to improve system reliability; its key advantage is that the reliability of each component need not be known. To show the effectiveness of the proposed method, a theoretical analysis based on probability theory and an experimental analysis based on the Pajek simulation environment [5] are conducted. The CFI method can be used by the architects and engineers of distributed computing or volunteer computing systems to design highly robust applications over untrusted components.
The main contributions of this paper are summarized in the following.
(i) This paper introduces a novel iterative fault tolerance strategy called CFI that does not need to know the components' reliability. This extends redundancy techniques that require the components' reliability and broadens the scenarios in which iterative redundancy can be applied.
(ii) We conduct function ranking inspired by Google's PageRank algorithm [6], extending PageRank with invocation frequencies to better identify significant functions for redundancy in complex component-based systems, in order to make appropriate cost-reliability tradeoffs.
(iii) A formal theoretical analysis based on probability theory and a set of experiments are designed to compare the system reliability achieved by the different strategies.
(iv) Extensive experiments evaluate the implicit effects of the cost factor and of the percentage of significant functions made redundant on system reliability.
The rest of this paper is organized as follows. Section 2 introduces the background and related work. Section 3 presents the system model, including a ranking algorithm for finding significant functions and an iterative redundancy algorithm for fault tolerance. Section 4 gives a theoretical analysis and experimental results for the proposed strategy. Section 5 presents implicit effects on system reliability, and Section 6 concludes the paper.

Background and Related Works
Many Internet services interact over unreliable networks, including cloud computing, e-commerce, search engines, and volunteer computing. These systems utilize redundancy and replication to achieve high reliability. Distributed computation architecture (DCA) systems bring highly parallel computing resources to dynamic networks; the computing resources of a DCA are built from potentially faulty and untrusted components. A widely used DCA system is the Hadoop project [7], which uses a Distributed File System (DFS) to provide high-throughput access to application data and MapReduce for parallel processing of large data sets. BOINC (Berkeley Open Infrastructure for Network Computing) [8], a form of distributed computing in which the general public volunteers processing and storage resources to scientific research projects, is used by a number of projects, including CAS@home, SETI@home, and Climateprediction.net [9]. Volunteer participants provide their idle computation resources to cure diseases, study global warming, discover pulsars, and support many other kinds of scientific research.
Oliner and Aiken [10] propose an online, scalable method for inferring the interactions among the components of large production systems, such as supercomputers, data center clusters, and complex control systems. This work computes correlations and delays between component signals: raw logs are converted into meaningful anomaly signals, which are then used to identify important relationships among components; this relationship information helps system administrators set early-warning alarms.
Automated vulnerability discovery (AVD) [11] presents a feedback-driven technique that automatically assesses the damage a small number of malicious participant nodes can inflict on the performance of a large distributed system. The work shows that scrutinizing the interface between correct and faulty nodes can help developers build high-assurance distributed systems. The smart redundancy for volunteer distributed computing proposed by Brun et al. [12] demonstrates a redundancy strategy that ensures efficient replication of computation and data given finite processing and storage resources; its shortcoming is that it targets a single computing task only.
Progressive redundancy based on a self-configuring optimistic programming technique, aimed at component-based systems, is proposed by Bondavalli et al. [13]. It addresses tolerance to both hardware and software faults in component-based hybrid fault-tolerance architectures, but it only considers minimizing response time and typically allocates finite resources to each task.
The motivation of this work is the intuition that failures of critical components in a distributed computing system have a greater impact on system reliability, so these critical components have higher fault tolerance requirements. Conversely, failures of noncritical components have less impact and need less fault tolerance, especially in circumstances where traditional three-modular redundancy may not obtain two or three consistent results in a single deployment.

Iterative Model and Fault Tolerance Strategy
The key idea of iterative redundancy as a vulnerability-driven fault tolerance strategy consists of two steps. First, it identifies significant functions via the invocation relationships and invocation frequencies of interconnected functions, each accomplished by a single component or several functionally equivalent components. Then, it uses an iterative strategy to tolerate faults in unreliable components. The details of these two steps are given below.

Function Ranking.
The purpose of function ranking is to apply redundant execution by functionally equivalent components to the most significant functions (i.e., the functions most vulnerable for system reliability), in order to improve system reliability while making a tradeoff between reliability and efficiency. The measure, based on the invocation relationships and frequencies between interconnected functions, follows the intuition of PageRank [6] that web pages linked by large numbers of significant pages are themselves important. Since the failure of these significant functions has a heavier impact on the whole system's reliability than that of other functions, these significant functions are the most vulnerable to system reliability. A component-based distributed application can be modeled as a weighted directed graph, called the Function Graph, built from invocation relationships and frequencies. A node v_i in the graph represents a function accomplished by a single component or several functionally equivalent components. A directed edge e_ij from v_i to v_j represents an invocation relationship between the two functions, and a nonnegative weight w(e_ij), where 0 ≤ w(e_ij) ≤ 1, is calculated by

w(e_ij) = fr(e_ij) / Σ_{e_kj ∈ IN(v_j)} fr(e_kj),     (1)

where fr(e_ij) is the invocation frequency of the function pair ⟨v_i, v_j⟩, fr(e_ij) = 0 means that there is no invocation relationship between v_i and v_j, and IN(v_j) is the set of incoming edges of v_j. By this definition, a larger invocation ratio w(e_ij) means that function v_j is invoked more frequently by function v_i than by the other functions in IN(v_j).
The weight of a function v_j is defined as the sum, over its incoming edges, of the edge weight w(e_ij) multiplied by the weight of the invoking function v_i:

W(v_j) = Σ_{e_ij ∈ IN(v_j)} w(e_ij) · W(v_i).     (2)

The weights of all function nodes in the Function Graph sum to 1. Based on these definitions, the component-based ranking algorithm proceeds as follows.
(i) Randomly assign an initial numerical ranking score W(v_i) to each node in the Function Graph, where 0 ≤ W(v_i) ≤ 1.
(ii) Compute the ranking score of each function v_i by

W(v_i) = (1 − d)/n + d · Σ_{e_ji ∈ IN(v_i)} w(e_ji) · W(v_j),     (3)

where n = |V|. The parameter d is a damping factor that can be set between 0 and 1 and is employed to adjust the contribution of the significance values derived from other functions. The resulting weight values W(v_i) are affected by d, but the resulting ranking order is insensitive to d. In the experiments of Section 4, the function ranking stays stable when d is set anywhere from 0.7 to 0.9; we therefore set d to 0.85, as in [6, 14]. From (3), the weight of function v_i is composed of the basic value (1 − d)/n and the weighted scores of the functions that invoke v_i. Let W be the vector of the functions' weights,

W = (W(v_1), ..., W(v_n))^T,     (4)

and let M be the matrix of the invocation relationships,

M = (w(e_ij))_{n×n}.     (5)

If v_i invokes no function, we set M(v_i, v_1), ..., M(v_i, v_n) to 1/n. The simultaneous equations can then be rewritten in vector form as

W = ((1 − d)/n) · 1 + d · M^T W,     (6)

where M^T is the transpose of M. If the computing process is viewed as a probabilistic state transition, the Function Graph can be seen as a Markov chain, and the weight of each function corresponds to its stationary probability in that chain.
(iii) Equation (6) can be solved by repeating the computation until all the ranking scores become stable.
For simplicity, instead of iterating toward the Markov chain's stationary state, we solve it in our experiments by computing the eigenvector with eigenvalue 1.
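The iteration in step (ii) can be sketched as a power iteration. A minimal sketch, assuming NumPy; the matrix layout M[i, j] = w(e_ij) and the 1/n handling of dangling functions follow the description above:

```python
import numpy as np

def function_ranking(M: np.ndarray, d: float = 0.85, tol: float = 1e-10) -> np.ndarray:
    """Power iteration for the weighted function-ranking scores of eq. (6).
    M[i, j] = w(e_ij): normalized invocation weight from function i to j;
    rows of dangling functions (no outgoing calls) are replaced by 1/n."""
    n = M.shape[0]
    M = M.copy()
    M[M.sum(axis=1) == 0] = 1.0 / n     # dangling nodes invoke all others uniformly
    W = np.full(n, 1.0 / n)             # any initial score in [0, 1] works
    while True:
        W_next = (1 - d) / n + d * (M.T @ W)
        if np.abs(W_next - W).max() < tol:
            return W_next
        W = W_next
```

For example, in a three-function graph where v_0 and v_1 both invoke v_2 and v_2 invokes v_0, the score of v_2 comes out highest, matching the intuition that frequently invoked functions rank high.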
Figure 1 shows a function invocation graph with computed weights. A node v_i represents a function accomplished by a component, the weight w(e_ij) represents the invocation frequency from function v_i to function v_j, and the weights of node v_j's incoming edges sum to 1. In this example, setting d = 0.85 yields the function ranking in Table 1, where v_1, invoked by v_2 and v_5, gets the highest ranking score, and v_6, invoked only by v_4, gets the lowest. This ranking result accords with the intuition that a function invoked by significant functions is itself important for system reliability.
With the approach above, the Top-K most significant functions, those with the highest weight scores, have been identified. In the next subsection we use redundant component execution to enhance the reliability of these functions and obtain higher system dependability.

Critical Fault Iterative Strategy.
In the function ranking step, the Top-K significant functions have been recognized. To obtain high system reliability, functionally equivalent fault tolerance components can be used to meet this target. In this paper, the CFI redundancy strategy is proposed to improve system reliability efficiently. For contrast, several well-known fault tolerance techniques are introduced first, and a formal analysis and a simulation-based empirical analysis of the system failure probability are presented.

Traditional Strategies
Primary Backup Replication (PBR). Primary backup replication and active replication are well-known techniques in distributed computing. Primary backup uses several replicas to improve system reliability, with one replica designated as primary. The primary propagates on-the-fly updates to the backups to bound the losses caused by primary-replica failures while keeping the cost of updating the replicas low. Active replication designates no primary replica, removing the centralized control of primary backup: all replicas receive the system's invocations and reply with results, which incurs a high cost for keeping all replicas synchronized. Active replication consumes more system resources than primary backup but minimizes the losses incurred when some replicas fail. Setting their different costs aside, primary backup and active replication obtain the same failure probability,

F = ∏_{i=1}^{n} f_i,

where n is the number of replicas and f_i is the failure probability of the i-th replica.
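Under the independence assumption, this product is all that is needed to evaluate a replica set. A minimal sketch (the function name is ours):

```python
from functools import reduce

def replication_failure(failure_probs) -> float:
    """Failure probability of primary backup / active replication:
    a request is lost only if every replica fails (independent faults)."""
    return reduce(lambda acc, f: acc * f, failure_probs, 1.0)

# three replicas that each fail 10% of the time
print(replication_failure([0.1, 0.1, 0.1]))  # ≈ 0.001
```

Note how quickly the failure probability shrinks with each added replica, which is why even modest replication is effective when replicas fail independently.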
K-Modular Redundancy (KMR) and NVP fault tolerance strategies obtain system reliability R at cost factor k, such that

R(k) = Σ_{i=⌈(k+1)/2⌉}^{k} C(k, i) · r^i · (1 − r)^{k−i},

where r is the reliability of each functionally equivalent component and C(k, i) is the binomial coefficient. Traditional redundancy strategies have different advantages and disadvantages. KMR and NVP must wait until all redundant replicas have executed before determining the final result, while active replication takes the first replica to respond as the final result. The scenarios in which these strategies can be employed therefore differ: active replication is employed in areas with strict response-time constraints, and primary backup is widely used in commercial fault-tolerance systems.

Progressive Redundancy Strategy.
Progressive redundancy is a step-by-step calculation process suited to component-based distributed systems whose components have high reliability and seldom return faulty results. In such an environment, traditional redundancy quickly reaches consensus, yet it still dispatches jobs that cannot change the task's output. The progressive redundancy strategy instead distributes as few jobs to functionally equivalent components as possible. Taking k-majority voting as an example, progressive redundancy first distributes only (k + 1)/2 jobs. If all jobs completed by functionally equivalent components return the same result, that consensus is taken as the final result, because any additional computation would be irrelevant. If some components return disagreeing results, so that only n_agr of the results agree with the majority, the server automatically distributes the minimum number of additional jobs, ((k + 1)/2 − n_agr), needed to produce a consensus. This process is repeated until a consensus is reached; the progressive redundancy strategy is shown in Algorithm 1.
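The dispatch loop described above can be sketched as follows. This is a minimal sketch, not the paper's Algorithm 1; the `execute` callback (one run on a functionally equivalent component) and the job budget are our assumptions:

```python
def progressive_redundancy(execute, k: int, max_jobs: int = 10_000):
    """Sketch of progressive redundancy with k-majority voting: dispatch
    only as many jobs as needed to collect (k + 1) // 2 agreeing results."""
    need = (k + 1) // 2          # votes required for a k-majority
    counts: dict = {}
    jobs = 0
    batch = need                 # first round: the minimum possible majority
    while jobs < max_jobs:
        for _ in range(batch):
            result = execute()
            counts[result] = counts.get(result, 0) + 1
        jobs += batch
        top_votes = max(counts.values())
        if top_votes >= need:
            return max(counts, key=counts.get)
        batch = need - top_votes  # top up only the missing votes
    raise RuntimeError("no consensus within job budget")
```

With reliable components the loop usually stops after the first batch of (k + 1)/2 jobs, which is exactly the saving over dispatching all k jobs up front.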
Progressive redundancy with k-majority voting succeeds when at most (k − 1)/2 functionally equivalent components fail and return disagreeing results:

R(k) = Σ_{i=0}^{(k−1)/2} C(k, i) · (1 − r)^i · r^{k−i},

where r represents the reliability of the functionally equivalent components and k represents the cost factor.

Critical Fault Iterative Redundancy.
The CFI redundancy assigns an appropriate number of components to each function according to the function ranking algorithm introduced above (in Section 3.1). It distributes the minimum number of functionally equivalent components needed to reach the desired system reliability. Since some components will fail, the results of functionally equivalent components may differ. If all results agree with the majority, the task assigned to these components is complete. If some components fail or their results disagree with the majority, the degree of confidence in the majority result decreases. For instance, suppose the reliability of each functionally equivalent component is 0.75 and the desired reliability of the function they implement is 0.96. If the function server distributes the job to only one component, there is a 0.75/(0.75 + 0.25) = 0.75 probability that the result is correct. If the server instead distributes the job to 3 functionally equivalent components and all three return the same result, the confidence that this consistent result is correct is 0.75³/(0.75³ + 0.25³) ≈ 0.964 > 0.96, so three is the minimum number of components for the function to reach the confidence threshold 0.96. However, if two of the three components return agreeing results and one disagrees, the function server must obtain at least two more components returning the agreeing result to reach the threshold 0.96. How many independent components should be allocated to a function to meet the required level of system reliability is determined by the CFI redundancy algorithm: the process repeats until the gap between the count of the majority result and the counts of the other results meets the system confidence threshold.
By Bayes' theorem we can draw the following conclusion: if the number of majority responses (n_maj) minus the number of other responses (n_oth) is constant (i.e., n_maj − n_oth = k for a constant k), we get the same degree of confidence. For example, if a function is distributed to 10 functionally equivalent components and 8 of them return result A while the remaining 2 return other results, the confidence is the same as when 108 of 210 components return result A and 102 return other results. Suppose a + b functionally equivalent components are given jobs to complete a given function, where each component returns the correct result with probability p; a components return one result and b components return another. Let c(a, b, p) denote the confidence that the result reported by the a components is correct (e.g., when it is the majority result). Then

c(a, b, p) = p^a (1 − p)^b / (p^a (1 − p)^b + p^b (1 − p)^a).

Dividing the numerator and denominator of c(a + m, b + m, p) by p^m (1 − p)^m shows that, for all m,

c(a, b, p) = c(a + m, b + m, p).

Corollary: whatever the reliability of the components, if the components return result A a total of b + k times and other results b times, the confidence that A is correct depends only on k and is independent of b. In other words, the probability that the majority result is correct is identical for every b with the same margin k. We have thus shown that a result's confidence depends only on the difference k between the count of the majority result and the counts of the others. For instance, k = 3 means that the function is distributed to functionally equivalent components until one result has been reported 3 more times than any other. We can then run the automatic critical fault iterative (CFI) redundancy algorithm, Algorithm 2, to meet the system reliability requirement.
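The margin-only property above can be checked numerically. A minimal sketch (the helper name is ours):

```python
def confidence(a: int, b: int, p: float) -> float:
    """Confidence c(a, b, p) that the result reported a times is correct,
    when b components report otherwise and each is correct w.p. p."""
    num = p**a * (1 - p)**b
    return num / (num + p**b * (1 - p)**a)

# same margin a - b = 6, very different totals: identical confidence
print(confidence(8, 2, 0.7))
print(confidence(108, 102, 0.7))
```

Both calls print the same value, since dividing numerator and denominator by p^m (1 − p)^m cancels the common count m.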
Using Algorithm 2, we only need to determine the system reliability requirement factor k (i.e., k = |n_maj − n_oth|); the system reliability is then

R_CFI(k) = r^k / (r^k + (1 − r)^k),

where r represents the reliability of the functionally equivalent components. The algorithm first distributes k jobs to functionally equivalent components and computes the difference between the number of jobs reporting the majority result and the number reporting other results; it then keeps distributing jobs automatically until one result has been reported k more times than the others. To reach the requirement factor k, a total of 2b + k jobs is distributed to functionally equivalent components for execution, of which b + k return the same result and b return the other result. The cost factor of iterative redundancy depends on r and k; for a large requirement factor k, it can be approximated by

C(k) ≈ k / (2r − 1),

where r is the reliability of the functionally equivalent components.
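The margin-driven loop and the reliability formula above can be sketched together. This is a minimal sketch under our own naming, not the paper's Algorithm 2; `execute` runs the job on one functionally equivalent component, and for clarity jobs are dispatched one at a time rather than k at a time:

```python
def cfi_redundancy(execute, k: int, max_jobs: int = 10_000):
    """Sketch of CFI iterative redundancy: dispatch jobs to functionally
    equivalent components until one result leads every other result by at
    least k votes (the requirement factor)."""
    counts: dict = {}
    for _ in range(max_jobs):
        result = execute()
        counts[result] = counts.get(result, 0) + 1
        ranked = sorted(counts.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0
        if ranked[0] - runner_up >= k:
            return max(counts, key=counts.get)
    raise RuntimeError("no result reached the required margin")

def cfi_reliability(r: float, k: int) -> float:
    """Confidence R_CFI(k) = r^k / (r^k + (1 - r)^k) of the winning result."""
    return r**k / (r**k + (1 - r)**k)
```

For r = 0.75 and k = 3, `cfi_reliability` gives ≈ 0.964, reproducing the confidence-threshold example above.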

Experiment Results
In this section, we compare the system reliability improvement of the CFI strategy against traditional strategies and discuss the experimental results.

Experimental Framework.
We use Pajek [5], a generator of scale-free directed graphs, to simulate component-based distributed systems. A scale-free graph is a graph whose degree distribution follows a power law [15]. Large self-organizing networks, such as the Internet, the World Wide Web, and social and biological networks, often exhibit power-law degree distributions. Four fault tolerance approaches are compared to study the effect of CFI redundancy on system reliability improvement:
(i) NoR: no fault tolerance strategy is employed for any function in the component-based system;
(ii) RandomR: k functions are selected at random and given a fault tolerance strategy to improve their reliability;
(iii) CFIR: the function ranking algorithm identifies the most vulnerable Top-K functions, which are given iterative redundancy to improve system reliability;
(iv) AllR: a fault tolerance strategy is applied to all functions.
To simulate the invocation behavior and invocation relationships of the component-based distributed system, we perform random traces over the scale-free directed graph generated by Pajek. A node in the directed graph stands for a function accomplished by a single component or several functionally equivalent components, an edge stands for an invocation relationship, and the weight of the edge simulates the invocation probability or invocation frequency. During the execution of the component-based system, an initial node is selected at random, and a random trace starting from the selected function is performed. We regard an execution as failed if any invoked function fails; a failure probability is assigned to the functions provided by the functionally equivalent components. If a fault tolerance strategy is employed for an invoked function, that function's reliability is improved accordingly. We conducted 100 travel traces for each generated scale-free directed graph, deployed the four fault tolerance strategies (NoR, RandomR, CFIR, and AllR) on these traces, and averaged the simulation results.
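A single trace of this simulation can be sketched as follows. This is a minimal sketch under our own data layout (adjacency lists with normalized weights), not the actual experimental harness:

```python
import random

def simulate_trace(graph, fail_prob, steps: int = 20, rng=random) -> bool:
    """One random invocation trace over the function graph.
    graph[v] is a list of (neighbor, weight) pairs with weights summing to 1;
    fail_prob[v] is the failure probability of function v after any
    fault-tolerance strategy is applied. Returns True iff the trace succeeds."""
    v = rng.choice(list(graph))              # random starting function
    for _ in range(steps):
        if rng.random() < fail_prob[v]:
            return False                     # an invoked function failed
        if not graph[v]:
            break                            # no outgoing invocations: trace ends
        nodes, weights = zip(*graph[v])
        v = rng.choices(nodes, weights=weights)[0]
    return True
```

Averaging `simulate_trace` over many runs, with `fail_prob` lowered for the functions protected by NoR/RandomR/CFIR/AllR, estimates the system failure probabilities reported in the tables.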

Reliability Comparison of Distributed Computing System.
When different fault tolerance strategies are employed, the system obtains different failure probabilities. The results of the experiment are shown in Table 2. In the experiment, a scale-free directed function graph with 5000 nodes is generated by Pajek. In all the simulated experiments, AllR always yields the lowest system failure probability and NoR the highest. These results are intuitive, since AllR employs redundancy for all functions while NoR provides no fault tolerance for any function.
In this experiment, since the failure probability of each component is less than 1%, we simply set the function requirement factor k to 3. In this setting, a function accomplished by functionally equivalent components attains high reliability.
Compared with NoR, RandomR does not noticeably improve system reliability. This observation indicates that applying fault tolerance to functions that are rarely invoked is of little use: the failures of these nonsignificant functions have little impact on system reliability. The CFI redundancy strategy makes a tradeoff between system reliability and cost factor. Compared with the CFI strategy, AllR obtains better system reliability in all the simulated experiments, but at a much higher cost. In all our experiments, the CFI fault tolerance strategy obtains better reliability than RandomR: because the significant functions identified by the function ranking step are invoked more frequently, their failures have a greater impact on the component-based distributed system, so tolerating their failures achieves better system reliability than tolerating the failures of randomly selected functions.
When the component failure probability increases from 1% to 5% and 10%, the system failure probabilities of all four strategies (NoR, RandomR, CFIR, and AllR) increase greatly. This is because, when the number of failing components grows large, tolerating only the failures of frequently invoked functions is no longer enough to provide a highly reliable system.

Implicit Effects of Cost Factor on Reliability.
To study the impact of the cost factor on the component-based distributed system's failure probability, the iterative redundancy method proposed in this paper, CFI redundancy (CFIR), is compared with traditional majority voting redundancy (MajorR). The cost factor is set from 3 to 17 with a step of 2, and the graph created by Pajek for this experiment contains 1024 functions. Table 3 shows that CFIR outperforms MajorR at every cost factor, whatever redundancy percentage is deployed (e.g., Top 1%, Top 5%, and Top 10%). As the cost factor increases from 3 to 17, the system failure probabilities of both redundancy methods decrease.
The corollary in the system model shows that implementing the CFI redundancy strategy does not require knowing the reliability of each component (assuming each component's reliability exceeds 0.5). Therefore, the system architect only needs to specify how much improvement is required to enhance system reliability.
Figure 2 shows that if the reliability of each component is above 0.75, the iterative redundancy algorithm only needs a requirement factor of 4 for the reliability of the function accomplished by these functionally equivalent components to exceed 0.95. The higher the component reliability, the smaller the cost factor needed to achieve high system reliability. Therefore, an architect who knows the component failure probabilities can choose the requirement factor more effectively.
In real-time systems with strict time constraints, a traditional fault tolerance strategy such as three-modular redundancy can dispatch a job to three components at once; with the CFI redundancy strategy, a job must first be deployed to several components, and the server must wait for their results before deciding whether to deploy more jobs to functionally equivalent components. The response time depends on the requirement factor and the component failure probability, so CFI redundancy increases the response time of jobs that need high reliability. In this case, more jobs can be deployed to functionally equivalent components at once to reduce the response time.

Implicit Effects of Top-K on Reliability.
To study the impact of the redundancy percentage on system reliability, we compare CFIR with MajorR under different redundancy percentages of components. The results are shown in Table 3, and the trend of the system failure probabilities under the different redundancy percentages is shown in Figure 3.
We conclude that as the redundancy percentage increases, the failure probabilities of CFIR and MajorR decrease. Under the different component redundancy percentage settings, CFIR consistently outperforms MajorR from Top-K = 1% to Top-K = 10%. When the component failure probability is high, a larger cost factor and a larger component redundancy percentage are needed to obtain higher system reliability. As the component failure probability increases from 1% to 9%, the distributed system failure probability of all four methods (AllR, CFIR, RandomR, and NoR) grows; CFIR outperforms RandomR in all settings and makes more effective use of redundant components.

Conclusion
This paper proposes a CFI redundancy strategy that improves on existing techniques by using resources more efficiently, especially in environments where the failure probability of

Figure 1 :
Figure 1: An example of stable invocation relationships and invocation frequencies among 6 functions.

Table 1 :
The function ranking and the in-degree (In) and out-degree (Out) of each function.

Table 2 :
System failure probability of different redundancy strategies.

Table 3 :
System failure probability under different cost factors when the component failure probability is 0.3.