
This paper presents a cost optimization model for scheduling scientific workflows on IaaS clouds such as Amazon EC2 or RackSpace. We assume multiple IaaS clouds with heterogeneous virtual machine instances, a limited number of instances per cloud, and hourly billing. Input and output data are stored on a cloud object store such as Amazon S3. Applications are scientific workflows modeled as DAGs, as in the Pegasus Workflow Management System. We assume that tasks in the workflows are grouped into levels of identical tasks. Our model is specified using mathematical programming languages (AMPL and CMPL) and allows us to minimize the cost of workflow execution under deadline constraints. We present results obtained using our model and benchmark workflows representing real scientific applications in a variety of domains. The data used for evaluation come from synthetic workflows, from general-purpose cloud benchmarks, and from measurements in our own experiments with Montage, an astronomical application, executed on the Amazon EC2 cloud. We indicate how this model can be used in scenarios that require resource planning for scientific workflows and their ensembles.

Today, science requires the processing of large amounts of data and the use of hosted services for compute-intensive tasks [

Research presented in this paper can be seen as a step towards developing a “cloud resource calculator” for scientific applications in the hosted science model [

The main contributions of this paper are summarized as follows.

We define the problem of workflow scheduling on clouds as a cost optimization problem of assigning levels of tasks to virtual machine instances, under a deadline constraint.

We specify the application model, infrastructure model, and scheduling model as mixed integer programming (MIP) problems using the AMPL and CMPL modeling languages.

We discuss alternative scheduling models for coarse-grained and fine-grained tasks.

We evaluate the models using two sources of infrastructure performance data: one obtained from CloudHarmony benchmarks and one based on our own experiments with Montage workflows on the Amazon EC2 cloud.

After outlining the related work in Section

Our work is related to heuristic algorithms for workflow scheduling on IaaS clouds. In [

The deadline-constrained cost optimization of scientific workloads on heterogeneous IaaS described in [

Pipelined workflows consisting of stages are addressed in [

An integer linear programming (ILP) method is applied to scheduling workflows on hybrid clouds in [

The core of our methodology (see Figure

An overview of our approach to workflow scheduling. The mathematical models given as input to the solver are the application, infrastructure, and scheduling models, together with their corresponding datasets.

The mathematical programming approach enables us to formally define the optimization problem. AMPL (a mathematical programming language) and CMPL (COIN mathematical programming language) are algebraic modeling languages that resemble traditional mathematical notation for describing variables, objectives, and constraints. Algebraic modeling languages can express a wide range of optimization problems: linear, nonlinear, and integer. The advantage of AMPL is that it is one of the most advanced mathematical programming languages, while CMPL is easier to use in open source projects. AMPL and CMPL enable us to separate the model definition from the instance-specific data, usually into three files: model, data, and calling script. The model file defines the abstract optimization model: its sets and parameters, objective, and constraints. The data file populates the sets and parameters with the numbers for a particular instance of the problem. Both model and data files are loaded from a calling script that may do some pre- or postprocessing. In addition, it is possible to import and export data and results in an external format such as YAML for analysis or integration with external programs.

The input to the solver has to be prepared in the form of a problem description. We separate the problem into an application model (in this case, the leveled workflows) and an infrastructure model (a cloud consisting of compute sites running virtual machines and an object store such as Amazon S3). In addition, a scheduling model has to be defined, specifying how to calculate the objective and constraints using the application and infrastructure models. The challenge is that the scheduling model must be formulated so that the solver can find a solution in a reasonable amount of time, so it must incorporate appropriate assumptions, constraints, and approximations. We discuss these assumptions in detail in Section

The scheduling problems that we deal with in this paper are formulated as mixed integer programming (MIP) problems. This class of optimization problems has a linear objective and linear constraints, while some or all of the variables are integer-valued. Such problems are solved using a branch-and-bound approach that applies a linear solver to the subproblems. Moreover, the solvers can relax the integrality of the variables in order to bound the solution, since no integer solution can be better than the solution of the same problem in the continuous domain. The difference between the best integer solution found and the noninteger bound can be used to estimate the accuracy of the solution and to reduce the search time (see Section
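The branch-and-bound idea described above can be illustrated on a toy problem. The following sketch (a hypothetical 0/1 knapsack, not the paper's scheduler) prunes any subtree whose relaxed, fractional bound cannot beat the best integer solution found so far; the gap between incumbent and relaxed bound is exactly the "MIP gap" solvers report.

```python
# Toy branch-and-bound with an LP-relaxation-style bound (illustrative only).

def relaxed_bound(items, capacity):
    """Upper bound: greedily fill by value density, allowing one fractional item."""
    total = 0.0
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if weight <= capacity:
            total += value
            capacity -= weight
        else:
            total += value * capacity / weight  # fractional part = the relaxation
            break
    return total

def branch_and_bound(items, capacity):
    best = {"value": 0}

    def search(i, value, remaining):
        if value > best["value"]:
            best["value"] = value  # new incumbent integer solution
        if i == len(items) or remaining == 0:
            return
        # Prune: even the relaxed (noninteger) optimum of this subtree
        # cannot beat the incumbent.
        if value + relaxed_bound(items[i:], remaining) <= best["value"]:
            return
        v, w = items[i]
        if w <= remaining:
            search(i + 1, value + v, remaining - w)  # take item i
        search(i + 1, value, remaining)              # skip item i

    search(0, 0, capacity)
    return best["value"]

items = [(60, 10), (100, 20), (120, 30)]  # (value, weight)
print(branch_and_bound(items, 50))  # -> 220
```

Stopping once the incumbent is within a chosen percentage of the relaxed bound is what trading accuracy for runtime via the relative MIP gap means in the evaluation section.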

In this paper, we describe two alternative scheduling models: for workflows with fine-grained and coarse-grained tasks. This is motivated by the observation [

The scheduling models have to be provided with the actual values of parameters, consisting of the application data and infrastructure data. To evaluate our models, we use two sources of application data: synthetic workflows obtained from the workflow generator gallery [

In the following sections, we describe the models and datasets used in more detail.

In this paper we focus on large-scale scientific workflows [

We assume that each workflow may be represented with a directed acyclic graph (DAG) where nodes in the graph represent computational tasks, and the edges represent data- or control-flow dependencies between the tasks. Each task has a set of input and output files. We assume that the task and file sizes are known in advance.

Based on the characteristics of large-scale workflows, we assume that a workflow is divided into several levels that can be executed sequentially, and tasks within one level do not depend on each other (see Figure

Example application structure.
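This level structure can be computed directly from the DAG: a task's level is one more than the deepest of its predecessors, so tasks sharing a level have no dependencies between them. The following is a sketch of our reading of that partitioning (not the authors' code), with a hypothetical Montage-like shape as input.

```python
# Partition a DAG into sequentially executable levels of independent tasks.
from collections import defaultdict

def workflow_levels(tasks, edges):
    """tasks: iterable of task ids; edges: list of (parent, child) pairs."""
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
    level = {}

    def depth(task):
        # Level = 1 + deepest predecessor; entry tasks get level 1.
        if task not in level:
            level[task] = 1 + max((depth(p) for p in parents[task]), default=0)
        return level[task]

    for t in tasks:
        depth(t)
    groups = defaultdict(list)
    for t, l in level.items():
        groups[l].append(t)
    return dict(groups)

# Hypothetical shape: two parallel projections, then sequential aggregation.
edges = [("proj1", "concat"), ("proj2", "concat"), ("concat", "background")]
print(workflow_levels(["proj1", "proj2", "concat", "background"], edges))
# -> {1: ['proj1', 'proj2'], 2: ['concat'], 3: ['background']}
```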

Similar to what is in [

For evaluation, we use synthetic workflows that were generated using historical data from real applications [

In this section we give the mathematical formulation of the models, beginning with the application and infrastructure models, and then describe the scheduling models for coarse-grained and fine-grained workflows. We have intentionally presented the problem in a form that differs from the routine statement of a mathematical program, so that it is easily understood by researchers engaged in workflow execution optimization.

To optimize the total cost of the workflow execution, a mixed integer programming (MIP) problem is formulated and implemented using a mathematical programming language. First, we implemented the optimization model using AMPL [

Introducing

Each instance type

This instance model assumes an hourly billing cycle, which is the case for most cloud providers, notably Amazon EC2.
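Hourly billing means an instance running for any fraction of an hour is charged for the full hour. A minimal sketch (prices here are made up, not the paper's data):

```python
# Hourly billing cycle: partial hours are rounded up before pricing.
import math

def instance_cost(runtime_seconds, price_per_hour):
    billed_hours = math.ceil(runtime_seconds / 3600)
    return billed_hours * price_per_hour

print(instance_cost(3601, 0.10))  # 61 minutes -> billed as 2 hours -> 0.2
print(instance_cost(120, 0.10))   # 2 minutes  -> billed as 1 hour  -> 0.1
```

This rounding is what couples cost to task granularity: packing short tasks onto fewer instances avoids paying for many barely used billing hours.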

Storage site

Additionally, we need to provide data transfer rates

Our application model is different from that in [

The application model assumes that the estimated execution time

In this model, we schedule groups of tasks of the same type divided into levels. We do not schedule individual tasks as in [

To keep this model in the MIP class, we had to take a different approach than in [

The variables defined in this way allow the solver to search over the space of possible assignments of instances to task groups (

The objective function

We require that the actual execution time of a level, rounded up to a full hour, gives us the level sub-deadline (

To make sure that all the instances run long enough to process all tasks allocated to them, we adjust

To reject symmetric solutions and thus reduce the search space, we add three constraints:

Finally, the constraint

The scheduling model presented above shows its advantages when the workflow tasks are about one hour long or longer and the deadline exceeds one hour. For fine-grained workflows, such as Montage, where most task execution times are on the order of seconds and the whole workflow may finish within an hour, the model can be simplified.
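The reason the fine-grained case simplifies can be sketched with illustrative numbers (not the paper's data): when everything finishes within a single billing hour, every started instance is billed exactly one hour, so cost reduces to the number of instances times the hourly price, and the scheduler only chooses how many instances of each type to start.

```python
# Within one billing hour, cost = (#instances) * (hourly price).
import math

def fine_grained_cost(num_tasks, task_seconds, deadline_seconds, price_per_hour):
    """Cheapest count of identical instances meeting a sub-hour deadline."""
    assert deadline_seconds <= 3600
    # How many tasks one instance can finish before the deadline.
    tasks_per_instance = deadline_seconds // task_seconds
    if tasks_per_instance == 0:
        return None  # infeasible: even a single task misses the deadline
    instances = math.ceil(num_tasks / tasks_per_instance)
    return instances, instances * price_per_hour

print(fine_grained_cost(100, 30, 1800, 0.10))  # -> (2, 0.2)
```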

When scheduling workflows with many short tasks and with deadlines shorter than the cloud billing cycle (one hour), we do not need to use the

In addition to these assumptions, we changed the way the data transfer time is computed. Since the data access latency is important for short tasks, in addition to transfer rate

Based on these modifications, the auxiliary parameters transfer time

The remaining part of the model has the following form.

This scheduling model yields reasonable results only for the cases when it is actually possible to complete all the workflow tasks before the deadline. If not, the solver will not find any solution.
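The latency adjustment and the feasibility behavior described above can be sketched as follows (variable names and numbers are ours, for illustration): each file transfer costs a per-request latency plus size divided by transfer rate, and if even the resulting total exceeds the deadline, there is no feasible schedule, mirroring the solver finding no solution.

```python
# Fine-grained timing: per-file latency dominates for short tasks.

def task_time(compute_seconds, file_sizes_mb, rate_mb_s, latency_s):
    # Each file pays the access latency plus its size/rate transfer time.
    transfer = sum(latency_s + size / rate_mb_s for size in file_sizes_mb)
    return compute_seconds + transfer

# 100 short tasks, two 5 MB files each, run back to back on one instance.
total = sum(task_time(2.0, [5, 5], rate_mb_s=50.0, latency_s=0.3) for _ in range(100))
deadline = 300.0
print(total, "feasible" if total <= deadline else "infeasible on one instance")
```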

The optimization models introduced in this section were implemented using CMPL and AMPL, effectively becoming workflow schedulers. The source code of the schedulers is available as an online supplement (

To perform optimization, we need to provide the optimization models defined in the previous section with data describing an application and an infrastructure. First, we used the generic infrastructure benchmarks obtained from CloudHarmony and the application data from the workflow generator gallery. Next, we performed our own experiments using the Montage workflows on Amazon EC2, which provided an application-specific performance benchmark of cloud resources together with real application data. The data gathered during these experiments are the inputs for the scheduler.

To evaluate the coarse-grained scheduler on realistic data, we used CloudHarmony [

We used the first generation of CloudHarmony CPU benchmarks described in [

Amazon instance benchmarks for different tasks compared to the generic CloudHarmony benchmark and Amazon ECU. Data was normalized to

We tested the coarse-grained scheduler with all of the applications from the gallery: Montage, CyberShake, Epigenomics, LIGO, and SIPHT, for all available workflow sizes (from 50 to 1000 tasks per workflow, and up to 5000 tasks in the case of the SIPHT workflow). We varied the deadline from 1 to 30 hours in 1-hour increments. We solved the problem for two cases, depending on whether the data are stored on S3 or on CloudFiles.

Cloud benchmarks, such as CloudHarmony [

Benchmarks usually take into account the fact that instances provide multiple virtual cores, which speed up multithreaded applications but have no impact on single-threaded ones. Montage workflow tasks are single-threaded, and therefore in our experiment the number of execution threads running in parallel was equal to the number of virtual cores. We used the HyperFlow workflow engine [

The data we gathered in the experiments may be used to calculate an application-specific, ECU-like performance metric of an instance. In Figure

The observation from this evaluation is that the benchmarks from CloudHarmony give a better approximation of task performance than the generic ECU value. Moreover, it is important to distinguish between parallel and sequential workflow levels when selecting the virtual machine instance type. The dataset obtained in this experiment was used for the evaluation of the fine-grained scheduling model in Section
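Deriving such an ECU-like, application-specific score amounts to measuring a task type's runtime on each instance type and normalizing against a reference instance. A sketch with made-up runtimes (not our measured data):

```python
# Application-specific performance score: reference runtime / measured runtime,
# so the reference instance scores 1.0 and faster instances score higher.

def app_specific_score(runtimes, reference="m1.small"):
    """runtimes: seconds for the same task per instance type."""
    base = runtimes[reference]
    return {inst: base / t for inst, t in runtimes.items()}

measured = {"m1.small": 120.0, "m1.large": 40.0, "c1.medium": 30.0}  # hypothetical
print(app_specific_score(measured))
# -> {'m1.small': 1.0, 'm1.large': 3.0, 'c1.medium': 4.0}
```

Unlike a generic benchmark, this score reflects the actual task mix of the application, which is why it approximates performance better than the ECU value.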

In this section, we present the results of optimization, obtained by applying our schedulers to the application and infrastructure data. First, we show the results of using the coarse-grained scheduler applied to the generic CloudHarmony datasets. Next, we present the results of the fine-grained scheduler applied to the dataset obtained from our experiments with the Montage workflow on EC2.

Figure

Result of coarse-grained scheduling for the Epigenomics application.

500 tasks, 4 GiB data size

400 tasks, 1 GiB data size (legend the same as above)

Ratio of the actual completion time to the deadline for the Epigenomics workflow with 500 tasks.

One interesting feature of our scheduler is that for longer deadlines it can find cost-optimal solutions whose workflow completion time is shorter than the requested deadline. This effect can be observed in Figure

Figures

Optimal cost found with the coarse-grained scheduling for CyberShake and LIGO applications.

CyberShake, 500 tasks

LIGO, 500 tasks

Optimal cost found by the scheduler for Montage and SIPHT applications.

Montage, 500 tasks

SIPHT, 5000 tasks

To investigate how the scheduler behaves for workflows with the same structure but with much longer task runtimes, we ran the optimization for the Montage workflow with tasks 1000x longer. This corresponds to a scenario where tasks are on the order of hours instead of seconds. The results in Figure

Optimal cost found by the coarse-grained scheduler for Montage workflow of 500 tasks with runtimes artificially multiplied by 1000 for different cloud infrastructures.

The runtime of the optimization algorithm for workflows with up to 1000 tasks ranges from a few seconds up to 4 minutes using the CPLEX [

Optimization time of the solver.

Epigenomics, 600 tasks

SIPHT, 5000 tasks

Figure

Solver runtime with different relative MIP gap (in percent), showing the relation between accuracy and runtime of the solver for the coarse-grained scheduler for Montage workflow of 500 tasks with runtimes artificially multiplied by 1000 for different cloud infrastructures.

We performed optimization for deadlines ranging from 13 to 60 minutes, using the Amazon EC2 cloud, with S3 or local storage. When assuming that the storage is local, we set the

The results shown in Figure

Montage workflow execution cost (

In this paper, we presented schedulers that use cost optimization for scientific workflows executing on multiple heterogeneous clouds. The models, formulated in AMPL and CMPL, allow us to find the optimal assignment of workflow tasks, grouped into levels, to cloud instances. We validated our models with a set of synthetic benchmark workflows as well as with data from a real astronomy workflow, and we observed that they gave useful solutions in a reasonable amount of computing time.

Based on our experiments with the execution of the Montage workflow on the Amazon EC2 cloud and its characteristics, we developed separate scheduling models dedicated to coarse-grained workflows and to fine-grained workflows with short deadlines. We also compared general-purpose cloud benchmarks, such as CloudHarmony, with our own measurements. The results underline the importance of application-specific cloud benchmarking, since general-purpose benchmarks can serve only as a rough approximation of actual application performance. The observed relations between the granularity of the tasks and the performance of the optimization models show the influence of the cloud billing cycle on cost-optimizing workflow scheduling.

By solving the models for multiple deadlines, we can produce trade-off plots, showing how the cost depends on the deadline. We believe that such plots are a step towards a scientific cloud workflow calculator, supporting resource management decisions for both end-users and workflow-as-a-service providers.

In the future, we plan to apply this model to the problem of provisioning cloud resources for workflow ensembles [

The authors declare that there is no conflict of interests regarding the publication of this paper.

This research was partially supported by the EC ICT VPH-Share Project (Contract 269978) and the KI AGH Grant 11.11.230.124. The work of K. Figiela was supported by the AGH Dean’s Grant. E. Deelman acknowledges support of the National Science Foundation (Grant 1148515) and the Department of Energy (Grant ER26110). Access to Amazon EC2 was provided via the AWS in Education Grant. The authors would like to express their thanks to the reviewers for their constructive recommendations that helped them improve the paper.