A Selective Mirrored Task Based Fault Tolerance Mechanism for Big Data Application Using Cloud

With the wide deployment of cloud computing in big data processing and the growing scale of big data applications, managing the reliability of resources has become a critical issue. Unfortunately, because applications based on directed acyclic graphs (DAGs) are highly intricate and processors (virtual machines) on cloud platforms are used flexibly, existing fault-tolerant approaches fail to strike a balance between the parallelism and the topology of a DAG-based application when using the processors, which lengthens the makespan of an application and consumes more processor time (computation cost). To address these issues, this paper presents a novel fault-tolerant framework named Fault Tolerance Algorithm using Selective Mirrored Tasks Method (FAUSIT) for running a big data application on the cloud. First, we provide comprehensive theoretical analyses of how to improve the fault tolerance of a single task running on a processor. Second, considering the balance between the parallelism and the topology of an application, we present a selective mirrored task method. Finally, by employing the selective mirrored task method, FAUSIT is designed to improve the fault tolerance of DAG-based applications and incorporates two important objectives: minimizing the makespan and minimizing the computation cost. Our approach is evaluated through a rigorous performance evaluation study using real-world workflows, and the results show that the proposed FAUSIT approach outperforms existing algorithms in terms of makespan and computation cost.


Introduction
Recent years have witnessed dramatic growth in big data analysis, and the related applications are now used widely in both academia [1] and industry [2]. There is no denying that advances in cloud platform technologies have played a key role in this process; the abundance of processors in the cloud enables scholars to handle very large-scale big data processing [3][4][5][6]. However, due to voltage fluctuation, cosmic rays, thermal changes, or variability in manufacturing, chip-level soft errors and physical flaws are inevitable for a processor, even though the probability of each is extremely low [7,8]. Furthermore, because a big data application uses so many processors, this probability cannot be ignored.
Big data and big data analysis have been proposed for describing data sets and analytical technologies in large-scale complex programs, which need to be analyzed with advanced analytical methods [9,10]. Whether big data applications are developed for commercial purposes or for scientific research, most of them require a significant amount of computing resources; examples include market structure analysis, customer trade analysis, environmental research, and astrophysics data processing. Motivated by the reasonable price, the rapid elasticity, and the shifting of responsibility for maintenance, backups, and management to cloud providers, more and more big data applications have been deployed to clouds such as EC2 [11], Google Cloud [12], and Microsoft Azure [13].
Clouds provide unlimited computing resources (from the user's point of view), including CPU and GPU resources. These on-demand resources allow users to choose appropriate processors (with CPU or GPU resources) for executing their big data applications efficiently [14]. However, according to [15], many factors, such as voltage fluctuation, cosmic rays, thermal changes, or variability in manufacturing, make processors (both CPU and GPU) more vulnerable. Indeed, the fault rate of a single processor is very low. But, as already noted, many processors participate in computing a big data application, and the computing time of each processor may be very long. Together, these factors sharply increase the probability that a fault occurs during one run of a big data application.
Wireless Communications and Mobile Computing
Failures caused by processors are disastrous for a big data application deployed on them: once a fault occurs on any processor, the application has to be executed all over again, which wastes a great deal of money and time. Thus, improving the robustness (or reducing the fault rate) of running a big data application has attracted many scholars' attention, and many studies have explored this problem. According to the literature, these studies fall into two main categories: resolving the problem at the hardware level or at the software level.
From the hardware perspective, improving the mean time between failures (MTBF) [16] of the processors is the key to reducing the fault rate of a running big data application. It is impossible to eradicate failures in a processor. Apart from improving tape-out technology, [15] proposed an adaptive low-overhead fault tolerance mechanism for many-core processors, which treats fault tolerance as a device that can be configured and used by an application when high reliability is needed. Although [15] improves the MTBF, the risk of failures remains. Therefore, other scholars seek solutions at the software level.
From the software perspective, check-point technology [17] ensures that a big data application can be completed under any MTBF. Many check-point strategies have therefore been proposed to resolve this problem, such as [18,19]. These strategies incur only a small extra cost, yet they can complete applications under any MTBF. As a result, almost all cloud platforms provide a check-point interface for users. However, these strategies do not consider the data dependencies between the processors, which makes them inappropriate for big data applications running on the cloud.
To handle DAG-based applications, the copy task based method (also known as the primary-backup based method) has been proposed. In [20], Qin and Jiang proposed an algorithm, eFRD, to enable a system's fault tolerance and maximize its reliability. On the basis of eFRD, Zhu et al. [21] developed an approach for task allocation and message transmission to ensure that faults can be tolerated during workflow execution. However, copy task based methods considerably delay the makespan, because the backup tasks do not start together with the original tasks.
To the best of our knowledge, there is no check-point strategy for big data applications that also handles the data dependencies among the processors. In this paper, we propose a novel check-point strategy for big data applications that considers the effect of data communications. The proposed strategy adopts a high-level failure model to resolve the problem, which brings it closer to practice. Meanwhile, the subsequent effect caused by data dependencies after a failure is also considered in our strategy.
The main contributions of this paper are as follows:
(i) A selective mirrored task method is proposed for the fault tolerance of the key subtasks in a DAG-based application.
(ii) On the basis of the selective mirrored task method, a novel check-point framework is proposed to resolve the fault tolerance for a big data application running on the cloud; the framework is named Fault Tolerance Algorithm using Selective Mirrored Tasks Method (FAUSIT).
(iii) A thorough performance analysis is conducted for FAUSIT through experiments on randomly generated test big data applications as well as real-world application traces.
The rest of this paper is structured as follows. The related work is summarized in Section 2. Section 3 introduces the models of the big data application, the cloud platform, and the MTBF and then formally defines the problem the paper addresses. Section 4 presents the novel check-point strategy for big data applications running on the cloud. Section 5 conducts extensive experiments to evaluate the performance of our algorithm. Section 6 concludes the paper with a summary and future directions.

Related Work
Over the last two decades, owing to the increasing scale of big data applications [22], fault tolerance for big data applications has become more and more crucial, and scholars have conducted considerable research on it. In this section, we summarize this research in terms of theoretic methods and heuristic methods.
A number of theoretic methods have been explored. For a task running on a processor, Young [21] proposed a first order failure model and derived an approximation to the optimum check-point interval, $\tau_{opt} = \sqrt{2\delta M}$, where $\delta$ is the time to write a check-point file, $M$ is the mean time between failures of the processor, and $\tau_{opt}$ is the optimum computation interval between writing check-point files. However, the model in Young [21] assumes that there is never more than a single failure in any given computation interval. This assumption goes against some practical situations; for instance, more than one failure may occur in a computation interval.
Due to this downside of the model in Young [21], Daly [22] proposed a higher order failure model to estimate the optimum check-point interval. The model of Daly [22] allows more than one failure in a computation interval, which is closer to the realistic situation. The optimum computation interval derived by Daly [22] is
$$\tau_{opt} = \begin{cases} \sqrt{2\delta M}\left[1 + \frac{1}{3}\left(\frac{\delta}{2M}\right)^{1/2} + \frac{1}{9}\left(\frac{\delta}{2M}\right)\right] - \delta, & \delta < 2M, \\ M, & \delta \ge 2M, \end{cases}$$
which is a good rule of thumb for most practical systems. However, the models in both Young [21] and Daly [22] are aimed at one task running on a processor and are not applicable to a DAG-based application running on the cloud, for the following reasons. First, there are many subtasks running on different processors, and the completion time of each subtask may influence its successors. Second, the importance of each subtask in a DAG (a DAG-based application) is different; for instance, the subtasks on the critical path of the DAG are more important than the others. Therefore, some scholars have proposed heuristic methods aimed at DAG-based applications running on the cloud.
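The two interval estimates discussed above can be compared numerically. The following is a minimal sketch, assuming the first order estimate $\tau_{opt} = \sqrt{2\delta M}$ and the higher order estimate published by Daly; the function names and the sample values of $\delta$ (check-point write time) and $M$ (MTBF) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(delta: float, mtbf: float) -> float:
    """First order estimate: tau_opt = sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * mtbf)


def daly_interval(delta: float, mtbf: float) -> float:
    """Higher order estimate; falls back to M when delta >= 2M."""
    if delta >= 2.0 * mtbf:
        return mtbf
    ratio = delta / (2.0 * mtbf)
    correction = 1.0 + math.sqrt(ratio) / 3.0 + ratio / 9.0
    return math.sqrt(2.0 * delta * mtbf) * correction - delta


# Hypothetical values in seconds: 5-minute check-point write, 24-hour MTBF.
delta, mtbf = 300.0, 86400.0
print("first order estimate :", young_interval(delta, mtbf))
print("higher order estimate:", daly_interval(delta, mtbf))
```

For small $\delta / M$ the two estimates stay close to each other; the higher order correction matters more as the check-point write time approaches the MTBF.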
Aiming to resolve fault tolerance for DAG-based applications running on the cloud, Zhu [23] and Qin [24] proposed copy task based methods. In general, the basic idea of copy task based methods is to run an identical copy of each subtask on a different processor; a subtask and its copy can be mutually exclusive in time. However, these approaches assume that tasks are independent of one another, which cannot meet the needs of real-time systems where tasks have precedence constraints. In [24], for two given tasks, the authors defined the necessary conditions for their backup copies to safely overlap in time and proposed a new overlapping scheme named eFRD (efficient fault-tolerant reliability-driven algorithm), which can tolerate processor failures in a heterogeneous system with a fully connected network.
In Zhu [23], on the basis of Qin [24], the authors established a real-time application fault-tolerant model that extends the traditional copy task based model by incorporating the cloud characteristics. Based on this model, the authors developed approaches for subtask allocation and message transmission to ensure that faults can be tolerated during application execution and proposed a dynamic fault-tolerant scheduling algorithm named FASTER (fault tolerant scheduling algorithm for real-time scientific workflow). The experimental results show that FASTER performs better than eFRD [24].
Unfortunately, the disadvantages of copy task based methods, including Zhu [23] and Qin [24], are conspicuous. First, the copy of each subtask consumes additional resources on the cloud, which makes these methods uneconomical. Second, the copy of each subtask is executed only when the original subtask fails; this wastes a lot of time, and the effect is even worse in a DAG-based application; i.e., the deadline of the application cannot be guaranteed in most cases if the deadline is close to the length of the critical path.
Thus, in this paper, we combine the advantages of theoretic methods and heuristic methods to propose a novel fault tolerance algorithm for a big data application running on the cloud.
Table 1: Main notations (recoverable entries: the threshold; the set of key subtasks; the makespan of the application).

Models and Formulation
In this section, we introduce the models of the big data application, the cloud platform, and failures and then formally define the problem this paper addresses. To improve readability, we summarize the main notations used throughout this paper in Table 1.

Big Data Application Model.
The model of a big data application $A$ is denoted by $G(V, E)$, where $G(V, E)$ represents a DAG. Besides, we use $T_{cp}$ to denote the execution time of the critical path in $G(V, E)$. Each node $v_i \in V$ represents a subtask $\tau_i$ ($1 \le i \le v$) of $A$, and $v$ is the total number of subtasks in $A$. $W$ is the set of weights, in which each $w_i$ represents the execution time of subtask $\tau_i$ on a VM. $E$ is the set of edges in $G(V, E)$, and an edge $(\tau_i, \tau_j)$ represents the dependence between $\tau_i$ and $\tau_j$; i.e., a task can only start after all its predecessors have been completed. Figure 1 shows an example of a DAG for a big data workflow, consisting of twelve tasks from $\tau_1$ to $\tau_{12}$. The DAG vertices related to tasks in $A$ are represented by circles, while the directed edges denote the data dependencies among the tasks.
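To make the application model concrete, the following minimal sketch stores a DAG as a dictionary of execution times plus a list of dependency edges and computes the execution time of its critical path. The four-task graph, the task names, and the function name `critical_path_time` are hypothetical; communication times between subtasks are ignored here, which is an assumption of this sketch.

```python
from graphlib import TopologicalSorter


def critical_path_time(weights, edges):
    """Length of the longest weighted path through the DAG.

    weights: {task: execution time}; edges: [(predecessor, successor), ...].
    """
    preds = {task: set() for task in weights}
    for pred, succ in edges:
        preds[succ].add(pred)

    finish = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(preds).static_order():
        earliest_start = max((finish[p] for p in preds[task]), default=0.0)
        finish[task] = earliest_start + weights[task]
    return max(finish.values())


# Hypothetical four-task workflow: t1 -> {t2, t3} -> t4.
weights = {"t1": 2.0, "t2": 3.0, "t3": 1.0, "t4": 4.0}
edges = [("t1", "t2"), ("t1", "t3"), ("t2", "t4"), ("t3", "t4")]
cp_time = critical_path_time(weights, edges)
print("critical path time:", cp_time)
```

In this example the critical path is t1, t2, t4, so t3 is the only subtask with room to absorb a delay.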

Cloud Platform Model.
A cloud platform $C$ is modeled as a set of processors $\{p_1, p_2, \ldots, p_N\}$, where $N$ is the total number of processors on the cloud. We use $n$ to denote the number of processors rented by users for executing an application. In practice, $N$ is much greater than the number the user needs. In general, to reduce the cost (monetary cost or time cost), users apply a suitable scheduling algorithm to deploy their big data applications on the cloud. However, most scheduling algorithms do not consider failures of the processors, which may incur extra cost (monetary cost and time cost).

The Check-Point Model.
Check-point technology has been used on the cloud for years; it allows an application to complete successfully in the shortest amount of time. In general, the check-point model is defined as below. Ideally, the time to complete a task on a processor is denoted by $T_s$; we use $\tau$ to denote the check-point interval. After each computation interval ($\tau$), the processor makes a backup of the current status of the system, and the time consumed by this process is represented by $\delta$. If a failure occurs, the time consumed to read the latest backup and restart the computation is denoted by $R$. Finally, we use $T_w(\tau)$ to denote the practical completion time of the task running on a processor when the check-point interval is $\tau$.
Referring to Figure 2, the ideal completion time for the task is $T_s = 5\tau$. In practice, a failure occurs after time $t$ in the third interval, and it takes the processor time $R$ to restart the third interval. In the end, the practical time consumed by the task is $T_w(\tau) = 5\tau + t + R + 4\delta$.
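The accounting behind this example can be sketched in a few lines. The sketch assumes that a task made of several computation intervals writes one check-point between consecutive intervals (none after the last one) and that every failure costs the work lost since the latest check-point plus one restart. The function name, the parameter names, and the numbers are hypothetical.

```python
def practical_completion_time(n_intervals, tau, delta, restart, lost_times):
    """Completion time of a check-pointed task with the given failures.

    n_intervals: number of computation intervals of length tau
    delta:       time to write one check-point file
    restart:     time to read the latest check-point and restart
    lost_times:  work lost at each failure (time since the latest check-point)
    """
    checkpoint_writes = n_intervals - 1  # no check-point after the last interval
    failures = len(lost_times)
    return (n_intervals * tau
            + checkpoint_writes * delta
            + sum(lost_times)
            + failures * restart)


# Five intervals and one failure, as in the example above (values are made up).
total = practical_completion_time(5, tau=10.0, delta=1.0, restart=2.0,
                                  lost_times=[4.0])
print("practical completion time:", total)
```

With no failures the same function returns the failure-free time of five intervals plus four check-point writes.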

The Failure Model.
For a given MTBF (mean time between failures), denoted by $M$, according to [21], the life distribution model for mechanical and electrical equipment is described by an exponential model. Thus, the probability density function is
$$f(t) = \frac{1}{M} e^{-t/M}.$$
Then, the probability of a failure occurring before time $\Delta t$ for a processor is given by the cumulative distribution function
$$P(t \le \Delta t) = \int_0^{\Delta t} \frac{1}{M} e^{-t/M}\, dt = 1 - e^{-\Delta t/M}.$$
Obviously, the probability of successfully computing for a time $\Delta t$ without a failure is
$$P(t > \Delta t) = 1 - P(t \le \Delta t) = e^{-\Delta t/M}.$$
We use $T_s$ to denote the computation time of a subtask running on a processor; $\tau$ denotes the compute interval between two check-points. Then, the average number of attempts (represented by $N_{att}$) needed to complete $T_s$ is
$$N_{att} = \frac{T_s}{\tau} \cdot \frac{1}{P(t > \Delta t)} = \frac{T_s}{\tau}\, e^{\Delta t/M}.$$
Therefore, the total number of failures during $\Delta t$ is the number of attempts minus the number of successes.
Notice that this assumes that there is never more than a single failure in any given computation interval. Obviously, this assumption does not hold in a real-life scenario. Thus, in [22], the author presented a multiple failures model:
$$T_w(\tau) = M e^{R/M} \left(e^{(\tau + \delta)/M} - 1\right) \frac{T_s}{\tau}, \quad (8)$$
where $\delta \ll T_s$. The derivation of Formula (8) is detailed in [22]; we do not repeat it in this paper. We take Formula (8) as the failure model in our framework for two reasons: first, only the MTBF ($M$, mean time between failures) determined by statistics [19,25] is available in practice and, second, this model is closer to reality than the other model in [21].
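The exponential failure model and the multiple failures model can be evaluated directly. The sketch below assumes the exponential life distribution with MTBF $M$ and the expected completion time published by Daly, $T_w(\tau) = M e^{R/M}(e^{(\tau+\delta)/M} - 1)\,T_s/\tau$; the function names and all numeric values are hypothetical.

```python
import math


def failure_probability(dt: float, mtbf: float) -> float:
    """Probability that a processor fails before time dt: 1 - exp(-dt / M)."""
    return -math.expm1(-dt / mtbf)


def expected_wall_time(t_s, tau, delta, restart, mtbf):
    """Expected completion time under the multiple failures model.

    t_s: failure-free computation time, tau: compute interval,
    delta: check-point write time, restart: restart time, mtbf: M.
    """
    segments = t_s / tau
    return mtbf * math.exp(restart / mtbf) * math.expm1((tau + delta) / mtbf) * segments


# Hypothetical task: 1000 s of work, check-point every 100 s (5 s per write).
for mtbf in (500.0, 5000.0, 1e9):
    print(mtbf, expected_wall_time(1000.0, 100.0, 5.0, 10.0, mtbf))
```

As $M$ grows, the expected time approaches the failure-free cost $T_s(\tau + \delta)/\tau$, which is a quick sanity check on the formula.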
3.5. Definitions.
To give readers a better understanding of the algorithm, we first make some definitions. For a given big data application $A = G(V, E)$ and a schedule $S$, we define the following terms.

Schedule (𝑆).
A schedule $S$ is a map from the subtasks in $G(V, E)$ to the processors on the cloud; the start time and the finish time of each subtask are determined along with it.
In general, $S$ is determined by a static scheduling algorithm, such as HEFT [26] or MSMD [27].

Finish Time (𝑓𝑡(𝜏 𝑖 )).
The $ft(\tau_i)$ denotes the finish time of $\tau_i$ in the schedule $S$.
As with the start time of $\tau_i$, $ft(\tau_i)$ is different from the traditional earliest start time in a DAG.
3.5.9. Slack Time.
For a given schedule $S$, the slack time of $\tau_i$ is defined as follows:

3.5.10. Indirect Slack Time.
For a given schedule $S$, the indirect slack time of $\tau_i$ is defined as follows:
The indirect slack time denotes that the slack time of a subtask can be shared by its predecessors.

The Set of Key Subtask (𝐾𝑒𝑦𝑇).
The symbol $KeyT$ denotes the set of key subtasks for a given schedule $S$.

Problem Formalization.
The ultimate objective of this work is to provide a high-performance fault tolerance mechanism that consumes less computation cost and yields a shorter makespan. The computation cost represents the processor time consumed by all the subtasks in $A$; thus, the first objective is to minimize the computation cost. The makespan can be defined as the overall time to execute the whole workflow, determined by the finish time of the last successfully completed task. For an application $A$, the second objective is to minimize its makespan.

The Fault Tolerance Algorithm
In this section, we first discuss the basic idea of our algorithm for the fault tolerance of running a big data application on the cloud. Then, on the basis of this idea, we propose the fault tolerance algorithm using the Selective Mirrored Tasks Method.

The Basic Idea.
As shown in Section 2, the theoretic methods, which are devoted to finding the optimal $\tau_{opt}$, are not applicable to DAG-based applications, even if the $\tau_{opt}$ they determine is very accurate for one task running on a processor. Besides, the heuristic methods based on task copies waste a lot of extra resources, and the completion time of the application may be delayed considerably.
To find a better solution, we integrate the advantages of theoretic methods and heuristic methods to propose a high-performance and economical algorithm for big data applications. The check-point method with an optimal computation interval $\tau_{opt}$ is a dominant and economical method for one task running on a processor; thus, the check-point mechanism is the major means in our approach. Furthermore, owing to the parallelism and the dependencies in a DAG-based application, the importance of each subtask is different; i.e., the subtasks on the critical path are more important than the others. For these subtasks, the fault tolerance provided by the check-point method alone is insufficient, because the completion time of an application depends to a great extent on their completion time. Therefore, for the important subtasks, we improve the fault tolerance of an application by introducing the task copy based method. In the existing task copy based methods [24], the original task and the copy do not start at the same time; to reduce the completion time, in our approach the original task and the copy start at the same time.
In summary, the basic idea is as follows. First, identify the important subtasks in the DAG-based application, which are called key subtasks in the rest of this article. Then, apply the task copy based method to the key subtasks; meanwhile, all the subtasks employ the check-point technology to improve fault tolerance, but the key subtasks and the normal subtasks use different optimal computation intervals $\tau_{opt}$, the details of which are described in the following sections.
It should be noted that our fault tolerance algorithm does not schedule the subtasks on the processors; we only provide a fault tolerance mechanism on top of an existing static scheduling algorithm (such as HEFT or MSMD) to make sure that the application can be completed in the minimum time.

Fault Tolerance Algorithm Using Selective Mirrored Tasks Method (FAUSIT).
Based on the basic idea above, we propose FAUSIT to improve the fault tolerance of executing a large-scale big data application; the pseudocode of FAUSIT is listed in Algorithm 1.
As shown in Algorithm 1, the input of FAUSIT is a map from subtasks to processors, which is determined by a static scheduling algorithm, and the output is the set of fault tolerance operations determined by FAUSIT. The function DetermineKeyTasks() in Line (1) finds the key subtasks according to the schedule $S$ and the application $A$. Then, the function DeployKeyTasks() in Line (2) deploys the mirrored subtasks and determines the proper $\tau_{opt}$ for the subtasks.
In the rest of this subsection, we describe the two functions of FAUSIT.

Function of DetermineKeyTasks().
The function of DetermineKeyTasks() is to determine the key subtasks in the DAG-based application. To give readers a better understanding of this function, we first explain the key subtask and the indirect slack time.
Definition 1. Key subtask: in a schedule $S$, the finish time of a subtask influences the start time of its successors; if this influence exceeds a threshold, we define the subtask as a key subtask.
The key subtasks are central to our FAUSIT algorithm. For a given schedule $S$, in the ideal case, each subtask as well as the application finishes in accordance with $S$. In practice, a subtask may fail after it has executed for a certain time; then, the processor loads the latest check-point file to continue. The delay produced by the failed subtask may affect its successors. For a subtask that has sufficient slack time, the start time of the successors is not affected by the failure. On the contrary, if the failed subtask has little slack time, it will undoubtedly affect the start time of the successors. Given all that, we need to deal with the key subtasks, which have little slack time.
Definition 2. Indirect slack time: for two subtasks $\tau_i$ and $\tau_j$, where $\tau_j$ is the successor of $\tau_i$, if $\tau_j$ has slack time (defined in Section 3.5.9), the slack time can be shared by $\tau_i$, and the shared slack time is the indirect slack time for $\tau_i$.
The indirect slack time is a useful parameter in our FAUSIT algorithm; it allows FAUSIT to save a lot of time (makespan of the application) and cost. For a given schedule $S$, a subtask may have sufficient slack time that can be shared by its predecessors. Thus, the predecessors may have enough slack time to deal with failures; then, the completion time of the predecessors and the subtask will not delay the makespan of the application. Indeed, the indirect slack time is the key parameter for determining whether a subtask is a key subtask. Moreover, the indirect slack time reduces the number of key subtasks in a big data application, which saves a lot of cost, because every key subtask applies the mirrored task method.
The pseudocode for function DetermineKeyTasks() is shown in Algorithm 2. The input of DetermineKeyTasks() is a schedule $S$ of an application $A$ and the threshold $\alpha$; the output is the set of key subtasks. First, in Line (1), the time parameters of the subtasks are determined according to Sections 3.5.6 and 3.5.7 and Formula (12). Lines (2) and (3) then compute the slack-related parameters of each subtask, using Formula (13) and recursion, respectively. Finally, the set of key subtasks is determined according to the threshold $\alpha$.
Table 2 shows the process of DetermineKeyTasks(). For $\alpha = 0.15$, Figure 3 shows the key subtasks that adopt the mirrored task method to improve fault tolerance.
It should be noted that the threshold $\alpha$ is given by the users and is related to the makespan of the application; i.e., a higher $\alpha$ leads to more key subtasks and therefore to a shorter makespan. On the contrary, a smaller $\alpha$ leads to a longer makespan.
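The selection of key subtasks can be illustrated with a short sketch. The formulas used here are assumptions made only for illustration, not the definitions of Algorithm 2: the slack time of a subtask is taken as the gap between its finish time and the earliest start time of its successors, the indirect slack time adds the smallest indirect slack time among its successors, and a subtask is treated as key when its indirect slack time is below $\alpha$ times its execution time. All names and values are hypothetical.

```python
from functools import lru_cache


def determine_key_tasks(start, finish, succs, alpha):
    """Return the set of key subtasks under the assumed definitions.

    start, finish: {task: time} taken from a static schedule
    succs:         {task: [successor, ...]}
    alpha:         user-given threshold
    """
    makespan = max(finish.values())

    def slack(task):
        # Assumed: gap until the earliest successor starts (or until the makespan).
        nxt = succs.get(task, [])
        if not nxt:
            return makespan - finish[task]
        return min(start[s] for s in nxt) - finish[task]

    @lru_cache(maxsize=None)
    def indirect_slack(task):
        # Assumed: own slack plus the slack shared by the tightest successor.
        nxt = succs.get(task, [])
        shared = min((indirect_slack(s) for s in nxt), default=0.0)
        return slack(task) + shared

    return {t for t in start
            if indirect_slack(t) < alpha * (finish[t] - start[t])}


# Hypothetical schedule: a -> {b, c} -> d, where c finishes early.
start = {"a": 0.0, "b": 10.0, "c": 10.0, "d": 20.0}
finish = {"a": 10.0, "b": 20.0, "c": 14.0, "d": 30.0}
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
key_tasks = determine_key_tasks(start, finish, succs, alpha=0.15)
print(sorted(key_tasks))
```

Under these assumptions the subtasks on the critical path are selected, while the subtask with spare slack is left to plain check-pointing; raising the threshold selects more subtasks, in line with the remark on $\alpha$ above.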

Function of DeployKeyTasks().
The function of DeployKeyTasks() is to handle the key subtasks so as to minimize the makespan of the application. The main operation of DeployKeyTasks() is the mirrored task method; to give readers a better understanding of this function, we first explain the mirrored task method.
The mirrored task method deploys a copy of a key subtask on another processor; we refer to the two as the original subtask and the mirrored copy. The original and the copy start at the same time, and the check-point interval of both is $2\tau_{opt}$ (where $\tau_{opt}$ is determined by [22]). The distinction between them is that the first check-point interval of the original is $\tau_{opt}$, whereas the first check-point interval of the copy is $2\tau_{opt}$. Consequently, once a failure occurs on one of the processors, the interlaced check-point intervals of the two identical subtasks ensure that the delay caused by handling the failure is $R$ (the time to read a check-point file). Figure 4 shows an example of the mirrored task method. Figure 4(a) displays the ideal situation; i.e., no failure happens on either processor, and the finish time is $4\tau + 2\delta$. In Figure 4(b), only one failure happens, on one of the two processors, in the second check-point interval. First, the failed processor reads the latest check-point file. Then, as time goes by, it immediately loads the newer check-point file written by the other processor as soon as that file is generated. Thus, the finish time is $4\tau + 2\delta + R$. Figure 4(c) illustrates the worst case: both processors have a failure in the same interval (the second one), and both have to load the latest check-point file, which was written at the end of the first interval. Thus, the finish time exceeds $4\tau + 2\delta$ by the restart time plus the time to recompute the lost work.
Obviously, the mirrored task method is far better than the traditional copy task based method, in which the copy starts only after the original task has failed and therefore wastes a lot of time. Moreover, the mirrored task method is also better than the method in [22], since the probability of the worst case is far lower than the probability of a single failure on one processor.
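The interlacing of the two check-point schedules can be checked with a small sketch. It measures progress in units of computed work and ignores the check-point write time, which is a simplification; the helper name `checkpoint_times` and the numbers are hypothetical. The original subtask writes its first check-point after one interval and then every two intervals, while the mirrored copy writes every two intervals from the start.

```python
def checkpoint_times(first_interval, interval, total):
    """Points of computed work at which check-points are written before `total`."""
    times = []
    t = first_interval
    while t < total:
        times.append(t)
        t += interval
    return times


tau = 10.0           # hypothetical optimum compute interval
total = 4 * tau      # a subtask made of four such intervals

original = checkpoint_times(tau, 2 * tau, total)      # first interval is tau
mirror = checkpoint_times(2 * tau, 2 * tau, total)    # first interval is 2 * tau
merged = sorted(original + mirror)

print("original:", original)
print("mirror  :", mirror)
print("merged  :", merged)
```

Each processor writes check-points only half as often as a single check-pointed task would, yet taken together a fresh check-point exists after every interval $\tau$, which is why a single failure costs little more than one check-point read.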
The pseudocode for function DeployKeyTasks() is displayed in Algorithm 3. The loop in Line (1) makes sure that all the key subtasks are deployed on the processors. The processors that have already been applied for should be used first when deploying the key subtasks; the loop in Line (2) expresses this constraint.
Line (3) makes sure that the mirrored copy of a key subtask does not overlap with other tasks on the candidate processor. Lines (4)-(5) deploy the copy on that processor. If none of the applied processors has an idle interval for the copy (Line (8)), we have to apply for a new processor and deploy the copy on it; then, we put the new processor into the set of applied processors (Lines (9)-(11)). At last, we deploy the check-point interval ($\tau_{opt}$ or $2\tau_{opt}$) to the processors (Line (14)) and save these operations in the output (Line (15)).
It should be noted that the overlap test in Line (3) does not consider only the computation time of the mirrored copy; we also consider the delay that may be caused by failures. To avoid overlaps caused by a delayed copy, we use 1.3 times its execution time when determining whether an overlap happens, since 1.3 times the execution time is much greater than the duration of a delayed copy.
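The placement loop described above can be sketched as follows. This is an illustrative reading of the description, not the paper's Algorithm 3: the data layout, the function name `deploy_key_tasks`, the rule that a copy never shares the processor of its original, and the naming of newly applied processors are all assumptions. The padded window of 1.3 times the execution time follows the text.

```python
def deploy_key_tasks(key_tasks, busy, home, safety=1.3):
    """Place one mirrored copy per key subtask.

    key_tasks: {task: (start time, execution time)}
    busy:      {processor: [(start, end), ...]} for already applied processors
    home:      {task: processor running the original subtask}
    Returns {task: processor chosen for the mirrored copy}; `busy` is updated.
    """
    placement = {}
    for task, (start, exec_time) in key_tasks.items():
        # Padded window so that a copy delayed by failures still fits.
        window = (start, start + safety * exec_time)
        target = None
        for proc, intervals in busy.items():
            if proc == home[task]:
                continue  # assumed: the copy runs on a different processor
            if all(window[1] <= s or e <= window[0] for s, e in intervals):
                target = proc
                break
        if target is None:
            target = f"new-{len(busy)}"  # hypothetical name for a new processor
            busy[target] = []
        busy[target].append(window)
        placement[task] = target
    return placement


# Hypothetical schedule: both key subtasks originally run on p1.
busy = {"p1": [(0.0, 10.0), (10.0, 20.0)], "p2": [(0.0, 5.0)]}
key_tasks = {"a": (0.0, 10.0), "b": (10.0, 10.0)}
home = {"a": "p1", "b": "p1"}
placement = deploy_key_tasks(key_tasks, busy, home)
print(placement)
```

Reusing idle intervals on processors that are already rented keeps the extra cost low; a new processor is applied for only when no rented processor can host the padded window.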

The Feasibility Study of FAUSIT.
The operations that FAUSIT performs on the processors, such as setting different check-point intervals and deploying mirrored subtasks, are complex, and readers may doubt the feasibility of FAUSIT. Thus, we discuss the feasibility of FAUSIT in this subsection.
According to [15], Google Cloud provides the gcloud suite for users to operate the processors (virtual machines) remotely. The operations of gcloud include (but are not limited to) applying for processors, releasing processors, writing check-points, and loading check-point files. These operations allow users to implement FAUSIT easily.

Algorithm 3: DeployKeyTasks().
(1) for each key subtask in $KeyT$ do
(2)   for each processor in the set of applied processors do
(3)     if the mirrored copy does not overlap with other tasks on the processor then
(4)       Deploy the mirrored copy on the processor.
(5)       End for in Line (2).
(6)     end if
(7)   end for
(8)   if the mirrored copy has not been deployed on a processor then
(9)     Apply a new processor.
(10)    Deploy the mirrored copy on the new processor.
(11)    Put the new processor in the set of processors.
(12)  end if
(13) end for
(14) Deploy the check-point interval to the processors.
(15) Save these operations in the output.

Empirical Evaluations
The purpose of the experiments is to evaluate the performance of the developed FAUSIT algorithm. We evaluate the fault tolerance of FAUSIT by comparing it with two other algorithms published in the literature: the FASTER algorithm [23] and the method in [22]. The main differences between these algorithms and FAUSIT, and their uses, are briefly described below.
(i) FASTER: a novel copy task based fault-tolerant scheduling algorithm. On the basis of the copy task based method, FASTER adjusts the resources dynamically.
(ii) The method in [22]: a theoretic method that provides an optimal check-point interval $\tau_{opt}$.

Experiment Settings.
The DAG-based big data applications we use for the evaluation are obtained from the DAG-based application benchmark provided by the Pegasus WorkflowGenerator [28]. We use the largest application from the benchmark, i.e., Epigenomics, since a bigger application is more sensitive to failures. The detailed characteristics of the benchmark applications can be found in [29,30]. In our experiments, the number of subtasks in an application ranges from 100 to 1000. Since the benchmark does not assign deadlines to the applications, we need to specify them; we assign a deadline for the applications: it is 1.0. Table 3 gives the characteristics of these applications, including the numbers of tasks and edges, the average task execution time, and the deadlines. Because FAUSIT does not schedule the subtasks on the processors, we employ a high-performance scheduling algorithm (MSMD) to determine the map from subtasks to processors. MSMD is a novel static scheduling algorithm that reduces the cost and the makespan of an application. On the basis of MSMD, we use FAUSIT to improve the fault tolerance of an application.

Evaluation Criteria.
The main objective of FAUSIT is to find the optimal fault tolerance mechanism for a big data application running on the cloud. The first criterion for evaluating the performance of a fault tolerance mechanism is how much the completion of the application is delayed, i.e., the makespan needed to finish the application. We introduce the concept of makespan delay rate to indicate how much extra time is consumed by the fault tolerance mechanism. It is defined as follows:
$$\text{Mak.Del.Rate} = \frac{\text{The practical makespan}}{\text{The ideal makespan}},$$
where the ideal makespan represents the completion time of an application running without a fault tolerance mechanism and without any failures, and the practical makespan is the completion time on a practical system that has a fault tolerance mechanism and a probability of failures. The second goal of FAUSIT is to minimize the extra cost consumed by the fault tolerance mechanism. Undoubtedly, to improve fault tolerance, any fault tolerance mechanism has to consume extra computation time for an application. Thus, the extra computation time is a key criterion for evaluating the performance of a fault tolerance mechanism. Therefore, we define the extra cost rate for the evaluation:
$$\text{Ex.Cost.Rate} = \frac{\text{The practical processors time}}{\text{The ideal processors time}},$$
where the practical processors time denotes the time consumed by the processors to run an application on a practical system with failures, and the ideal processors time is the sum of the $w_i$ in $G(V, E)$ for an application.
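Both criteria are simple ratios and can be computed from measured runs. The sketch below assumes that the makespan delay rate is the practical makespan divided by the ideal makespan, by analogy with the extra cost rate; the function names and the sample measurements are hypothetical.

```python
def makespan_delay_rate(practical_makespan: float, ideal_makespan: float) -> float:
    """Assumed definition: practical makespan / ideal makespan."""
    return practical_makespan / ideal_makespan


def extra_cost_rate(practical_processor_time: float, ideal_processor_time: float) -> float:
    """Practical processors time / ideal processors time."""
    return practical_processor_time / ideal_processor_time


# Hypothetical measurements averaged over repeated runs of one application.
ideal_makespan, practical_makespan = 1000.0, 1100.0
ideal_cpu_time, practical_cpu_time = 8000.0, 10000.0

print("Mak.Del.Rate:", makespan_delay_rate(practical_makespan, ideal_makespan))
print("Ex.Cost.Rate:", extra_cost_rate(practical_cpu_time, ideal_cpu_time))
```

A value of 1.0 for either rate would mean that the fault tolerance mechanism adds no delay or no extra processor time, so values closer to 1.0 are better.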

The Optimal 𝛼 for FAUSIT.
Before comparing FAUSIT with the other algorithms, we need to determine the optimal α for FAUSIT. Because the optimal α is an empirical parameter, it has to be determined by experiments. We test α on the Epigenomics application with 1000 subtasks for 10 times and make the   = 1.0  ; the results are shown in Figures 5(a) and 5(b). In Figure 5(a), the Mak.Del.Rate becomes lower as α increases; i.e., a larger α leads to a shorter makespan for an application.
On the contrary, Figure 5(b) shows that a larger α leads to more cost for an application.
Through a comprehensive analysis of Figures 5(a) and 5(b), we set α = 0.033, which equalizes the Mak.Del.Rate and the Ex.Cost.Rate and keeps both rates lower than in the other settings.
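The selection rule described above, choosing the α at which the two rates are closest to each other, can be sketched as follows. This is only an illustration under stated assumptions: the `SweepPoint` type, the function name, and every number in the sweep are hypothetical and do not reproduce the measurements in Figure 5.

```python
from typing import NamedTuple, Sequence


class SweepPoint(NamedTuple):
    """One measured setting of alpha (hypothetical container)."""
    alpha: float
    mak_del_rate: float
    ex_cost_rate: float


def pick_balanced_alpha(points: Sequence[SweepPoint]) -> SweepPoint:
    """Pick the point whose two rates are closest to each other.

    Ties are broken in favor of the smaller combined overhead.
    """
    if not points:
        raise ValueError("at least one sweep point is required")
    return min(
        points,
        key=lambda p: (abs(p.mak_del_rate - p.ex_cost_rate),
                       p.mak_del_rate + p.ex_cost_rate),
    )


if __name__ == "__main__":
    # Made-up sweep: the delay rate falls and the cost rate rises as alpha grows.
    sweep = [
        SweepPoint(0.01, 1.30, 1.02),
        SweepPoint(0.02, 1.22, 1.05),
        SweepPoint(0.03, 1.12, 1.10),
        SweepPoint(0.04, 1.08, 1.18),
        SweepPoint(0.05, 1.05, 1.27),
    ]
    best = pick_balanced_alpha(sweep)
    print(best.alpha, best.mak_del_rate, best.ex_cost_rate)
```

A user who weights makespan more heavily than cost could replace the key function with a weighted sum of the two rates; the equal-gap rule shown here is just one way to express the trade-off.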

Comparison with the Other Solutions.
Since the proposed FAUSIT algorithm is a heuristic algorithm, the most straightforward way to evaluate its performance is to compare it with the optimal solution when possible. We randomly generate 10 different applications for each scale of Epigenomics shown in Table 3, and we make the   = 1.0.
Note that although FASTER is a task-copy-based method, it was developed for multiple DAG-based big data applications. As a result, its performance cannot be compared directly with that of the other two methods. To make a comprehensive comparison, we modified FASTER to handle a single DAG-based application; the modified version is denoted by FASTER*.
The averages of Mak.Del.Rate and Ex.Cost.Rate are displayed in Figure 6. In Figure 6(a), FAUSIT has the lowest Mak.Del.Rate for every application size, FASTER* has a higher Mak.Del.Rate than FAUSIT, and Daly's method has the highest. The result in Figure 6(a) shows that, among these methods, FAUSIT performs best at minimizing the makespan of an application.
In Figure 6(b), Daly's method has the lowest Ex.Cost.Rate because it adopts only the check-point mechanism. FASTER* has the highest Ex.Cost.Rate, i.e., it consumes the most cost. Interestingly, as the number of subtasks in an application increases, the Ex.Cost.Rate of FAUSIT decreases, which indicates that FAUSIT is able to handle applications of much larger scale.
In conclusion, FAUSIT outperforms FASTER* on both Mak.Del.Rate and Ex.Cost.Rate. Compared with Daly's method, FAUSIT achieves a much lower Mak.Del.Rate while consuming only 9% extra cost, which keeps FAUSIT competitive in practical systems. Besides, the parameter α in FAUSIT can accommodate the requirements of different users: users who need a shorter makespan for an application can turn α up, whereas users who care about cost can turn α down. Thus, α gives FAUSIT strong usability.

Conclusion
This paper investigates the problem of improving fault tolerance for a big data application running on cloud. We first analyze the characteristics of running a task on a processor. Then, we present a new approach, called the selective mirrored task method, to deal with the imbalance between the parallelism and the topology of a DAG-based application running on multiple processors. Based on the selective mirrored task method, we propose an algorithm named FAUSIT to improve the fault tolerance of a big data application running on cloud while minimizing the makespan and the computation cost. To evaluate the effectiveness of FAUSIT, we conduct extensive simulation experiments on randomly generated workflows based on real-world application traces. Experimental results show the superiority of FAUSIT over other related algorithms, such as FASTER and Daly's method. In future work, given the advantages of the selective mirrored task method, we will try to apply it to other big data processing scenarios, such as improving the fault tolerance of multiple applications from the perspective of cloud providers. Furthermore, we will investigate the effectiveness of the selective mirrored task method for parallel real-time applications running on cloud.

Figure 2: An example of a failure.

Figure 4: An example of mirrored tasks method.

Figure 6: The result of Mak.Del.Rate and Ex.Cost.Rate.

Table 1: Cost optimization factors.
DAG of the tasks in  
the critical path of (, )
the -th subtask in  
the weight of -th tasks in  
the set of   in  
, data dependence from   to  
the set of  , 
the time to make a check-point file
the optimal check-point interval
() the practical completion time for the task
the time to read a check-point file
the time before a failure in an interval
the computation time for a task
a schedule which maps the tasks to processors
(  ) the weight of  
(  ) the predecessor subtasks set of   in (, )
(  ) the successors subtasks set of   in (, )
(  ) the next subtask   on the same processor
(  ) the finish time of   on the 
(  ) the earliest start time of   on the 
(  ) the latest finish time of   on the 
(  ) the slack time of   on the 
(  ) the indirect slack time of   on the 
(  ) denotes the quantitative importance of

3.5.7. Earliest Start Time on the Schedule ((  )). For a given schedule , the start time of   is the (  ). It should be noted that the (  ) is different from the traditional earliest start time in DAG.

3.5.8. Latest Finish Time on the Schedule ((  )). For a given schedule , the (  ) of   is defined below: ) and  (  ) are empty, min   ∈(  )∨(  ) { (  )} , otherwise.

Table 3: The characteristics of the big data application.
Figure 5: (a) The Mak.Del.Rate and (b) the Ex.Cost.Rate of α.