Specification and runtime workflow support in the ASKALON Grid environment

We describe techniques to support the runtime execution of scientific workflows in the ASKALON Grid environment. We present a formal model and three middleware services that together support effective execution in heterogeneous and dynamic Grid environments: performance prediction, scheduling, and the enactment engine. We validate our techniques with concrete experimental results for two real-world applications executed in the Austrian Grid environment.


Introduction
The workflow model, based on the loosely-coupled coordination of atomic activities, has emerged as one of the most interesting paradigms in the Grid community for programming or porting scientific applications to Grid environments. To meet this need, numerous efforts [1,4,7,9,12,13,16,17,21], among which is the ASKALON project [10], are currently developing integrated environments that support the development cycle of scientific Grid workflows through graphical modelling tools, XML-based specification languages, and middleware services for advanced resource management, scheduling, prediction, reliable execution, and monitoring.
In this paper, we present new techniques developed within the ASKALON project to support the development of scientific workflows in Grid infrastructures. First, we present a model that formally covers the most important constructs we encountered in several real-world applications that we designed as scientific workflows (see Section 2). A performance prediction service (see Section 3) performs an offline training phase, based on a well-defined experimental design, to predict the execution times of workflow activities on Grid sites with a reduced number of experiments. A scheduling service (see Section 4) converts complex hierarchical workflows into flat Directed Acyclic Graphs (DAGs) that can be effectively mapped onto the Grid using optimisation heuristics such as genetic or list scheduling algorithms. An enactment engine (see Section 5) simplifies the scheduled workflow using a partitioning algorithm that reduces the middleware overheads and thereby improves performance. We compare our approach with the relevant related work in Section 6 and conclude in Section 7.

Workflow model
We present in this section a generic abstract model for formally representing large-scale and complex scientific workflows in Grid environments. Our representation is generic and independent of any language or grammar used as the underlying implementation platform. For example, we implemented our model through the XML-based Abstract Grid Workflow Language (AGWL) [11].

Definition 1. We define a workflow as a DAG: W = (Nodes, C-edges, D-edges, IN-ports, OUT-ports), where Nodes is the set of activities, C-edges = ⋃_{N1,N2 ∈ Nodes} (N1, N2) is the set of control flow dependencies, D-edges = ⋃_{N1,N2 ∈ Nodes} (N1, N2, D-port) is the set of data flow dependencies, IN-ports is the set of workflow input data ports, and OUT-ports is the set of output data ports. An activity N ∈ Nodes is a mapping from a set of input data ports IN-ports_N to a set of output data ports OUT-ports_N: N : IN-ports_N → OUT-ports_N. A data port D-port ∈ IN-ports_N ∪ OUT-ports_N is an association between a unique identifier (within the workflow representation) and a well-defined type: D-port = (identifier, type).
The type of a data port is instantiated by the type system of the underlying implementation language, e.g. the XML schema. In our experience, the most important data type that must be supported for Grid workflows is file, alongside other basic types such as integer, float, or string. An activity N ∈ Nodes can be of two kinds.
(1) Computational activity or atomic activity represents an atomic unit of computation such as a legacy sequential or parallel application. (2) Composite activity is a generic term for an activity that aggregates multiple (atomic and composite) activities according to one of the following four patterns: (i) parallel loop activity allows the user to express large-scale workflows consisting of hundreds or thousands of atomic activities in a compact manner; (ii) sequential loop activity defines repetitive computations with a possibly unknown number of iterations (e.g. dynamic convergence criteria that depend on the runtime output data port values computed within one iteration); (iii) conditional activity models if and switch-like statements that activate one of its multiple successor activities based on the evaluation of a boolean condition; (iv) workflow activity is introduced for modularity and reuse purposes, and is recursively defined according to Definition 1.
Of special importance for this paper is the parallel loop (similar to a parameter sweep), which we represent as a tuple: N_par = (N_body, IN-ports_{N_par}, OUT-ports_{N_par}), where: (1) there exists a predefined cardinality input port (D-port_card, integer) ∈ IN-ports_{N_par} of type integer that defines the runtime cardinality of the parallel loop (i.e. the number of parallel activities), denoted as |N_par|; (2) N_body is an atomic or composite activity representing the parallel loop body, of which |N_par| independent instances are executed. The cardinality port can be instantiated either statically or at runtime, for example from an output port of a predecessor activity through a data flow dependency.
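The workflow model of Definition 1 can be rendered as plain data structures. The following Python classes are an illustrative sketch under our own naming conventions; they are not part of AGWL or the ASKALON implementation:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataPort:
    identifier: str   # unique within the workflow representation
    type: str         # e.g. "file", "integer", "float", "string"

@dataclass
class Activity:
    name: str
    in_ports: list = field(default_factory=list)    # list[DataPort]
    out_ports: list = field(default_factory=list)   # list[DataPort]

@dataclass
class ParallelLoop(Activity):
    body: Activity = None               # |N_par| independent instances run
    cardinality_port: DataPort = None   # integer port giving |N_par|

@dataclass
class Workflow:
    nodes: list     # activities (atomic or composite)
    c_edges: set    # {(src, dst)} control flow dependencies
    d_edges: set    # {(src, dst, port_id)} data flow dependencies
    in_ports: list
    out_ports: list
```

A composite activity such as a parallel loop simply nests another activity as its body, mirroring the recursive structure of the definition.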

Performance prediction
The performance prediction service is responsible for predicting the execution times of atomic activities on given Grid sites, as needed for scheduling. We employ a prediction model based on historical data collected through a well-defined experimental design and training phase. Specifically, the purpose of the experimental design phase is to devise a strategy that extracts the maximum amount of performance training information from a minimum number of experiments.
The factors affecting the response variable that we currently consider are the problem size p, incorporating the range of instances for each parameter variable; the Grid size g, comprising all Grid sites (i.e. parallel computers); and the machine size m, covering the different processor counts on a Grid site. To reduce the experimental space of p × g × m, we introduce a Performance Sharing and Translation (PST) mechanism based on several multiparameter performance relativity properties, experimentally observed for our case study applications. For example, embarrassingly parallel applications that scale linearly with the machine and problem size benefit from the following inter- and intra-platform performance relativity properties.
Inter-platform PST specifies that the performance behaviour P_g(A, p) of an application A for a problem size p relative to another problem size r on a Grid site g is the same as that for the same problem sizes on another Grid site h, i.e.:

P_g(A, p) / P_g(A, r) = P_h(A, p) / P_h(A, r).

Similarly, intra-platform PST specifies that the performance behaviour of an embarrassingly parallel application A on a Grid site g for a machine size m relative to another machine size n for a problem size p is similar to that for another problem size q, i.e.:

P_g(A, m, p) / P_g(A, n, p) = P_g(A, m, q) / P_g(A, n, q).
We choose one Grid site (the fastest, based on previous runs) as the reference site and execute the complete set of experiments on it, based on the cross product of the input problem size parameters. Later, we perform one single experiment on each of the other Grid sites and use the reference values, via the inter-platform PST, to calculate predictions for the other platforms, thus avoiding the multiplication of problem size combinations by the Grid size. Similarly, to avoid multiplying machine size combinations by the Grid size, we execute the complete set of experiments for one reference machine size and later perform one single experiment for each of the other machine sizes, translating the reference performance values using intra-platform PST.
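The reference-site strategy can be sketched in a few lines. The helper below is our own illustration, not ASKALON code: it translates a full performance profile measured on the reference site to another site using a single probe experiment and the inter-platform PST ratio:

```python
def inter_platform_pst(ref_times, probe_time, ref_size):
    """Translate a performance profile from the reference Grid site g
    to another site h.

    ref_times  -- {problem_size: measured time} on the reference site g
    probe_time -- one measured time on the target site h at ref_size r
    Applies P_h(A,p) = P_g(A,p) * P_h(A,r) / P_g(A,r) for every p.
    """
    scale = probe_time / ref_times[ref_size]
    return {p: t * scale for p, t in ref_times.items()}
```

For example, if the reference site takes 2.0 s and 4.0 s for problem sizes 10 and 20, and the target site takes 3.0 s at size 10, the predicted time at size 20 is 6.0 s.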
By means of inter-platform PST, the total number of experiments reduces from p × m × g to p × m + (g − 1) for parallel computers, and from p × g to p + (g − 1) for single processor machines. By introducing intra-platform PST, we further reduce the total number of experiments for parallel machines as Grid sites to a linear complexity of p + (m − 1) + (g − 1).
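These counts can be checked mechanically; the function names below are ours:

```python
def experiments_full(p, m, g):
    # exhaustive cross product of problem size, machine size, Grid size
    return p * m * g

def experiments_inter_pst(p, m, g):
    # full profile on one reference site, one probe on each other site
    return p * m + (g - 1)

def experiments_inter_intra_pst(p, m, g):
    # additionally: one reference machine size, one probe per other size
    return p + (m - 1) + (g - 1)
```

For p = 200, m = 20, g = 10 the full design needs 40000 experiments, while applying both PST properties needs only 228, a reduction of over 99%, consistent with the measurements reported below.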

Experiments
We report experiments for two real-world applications, WIEN2k [20] and Invmod [23], in the Austrian Grid environment. We will present these workflows in detail in Sections 4.3 and 5.3, while in this section we concentrate on the two most computationally expensive activities of these workflows: LAPW1 and WasimB2C.
We analysed the scalability of our experimental design strategy by varying the problem size from 10 to 200 for fixed values of the remaining factors: 10 Grid sites with a machine size of 20, plus 50 single processor machines. We observed a reduction in the total number of experiments (i.e. the quantity previously denoted as p × g × m) of 96% to 99%, as shown in Fig. 1a. A reduction of 77% to 97% in the total number of experiments was observed when we varied the machine size from one to 80, for the fixed factors of 10 parallel machines, 50 single processor Grid sites, and a problem size of five. From another perspective, we observed that the total number of experiments increased by only 7% to 9% when the Grid size was increased from 15 to 155, for the fixed factors of five parallel machines with a machine size of 10 and a problem size of 10. We observed an overall reduction of 78% to 99% when we varied all factors simultaneously: five parallel machines with machine sizes from one to 80, single processor Grid sites from 10 to 95, and problem sizes from 10 to 95.
We comparatively show the predicted results obtained with the inter-platform PST method versus the measured values for LAPW1 and WasimB2C in Figs 1b and 1c, respectively. The lowest curve represents the execution values on the base Grid site, which are used as the reference in the PST mechanism. In both cases the curves of the measured and predicted values are very similar; however, they are closest to each other near the reference problem size and diverge as the distance from the reference problem size increases. For this reason, whenever possible, we take the reference problem size as close as possible to the target value to be predicted. We observed that the average variation of the predicted values from the measured values, if made on the basis of the maximum available value, is at most 10%, which yields 90% prediction accuracy. As more data become available during the actual runs, the probability of finding parameter values closer than the one calculated in the training phase increases, which further improves the prediction accuracy.

Scheduler
One major difficulty of the scheduling service is that the objective function (i.e. execution time) cannot be precisely calculated for dynamic workflows comprising conditional activities or sequential and parallel loops with an unknown number of iterations. We therefore approach the workflow scheduling problem in two phases implemented by two modules: (1) a workflow converter (see Section 4.1) transforms compact hierarchical workflow representations into flat static DAGs; (2) a scheduling engine (see Section 4.2) implements heuristic algorithms for achieving good mappings of the generated DAGs onto the Grid resources.

Workflow converter
The purpose of the workflow converter is to transform hierarchical Directed Graph (DG)-based scientific workflows containing sequential loops into flat DAGs of atomic activities suitable for optimised scheduling using heuristic algorithms. The four constructs corresponding to the four composite activities described in Section 2 are handled by the converter through four corresponding transformations: conditional activities through branch expansion, sequential and parallel loops through loop unrolling, and sub-workflows through workflow inlining. A complete specification of the conversion algorithm can be found in [19]. These transformations usually require additional prediction information, such as the probability of execution of each branch in conditional activities or the number of iterations of sequential and parallel loops, which we compute from historical data. Transformations based on correct assumptions can yield substantial performance benefits, while incorrect assumptions require appropriate runtime adjustments, such as undoing existing optimisations or rescheduling based on newly available information (see Section 5.2.1).

Scheduling engine
The scheduling engine is responsible for the actual mapping of a workflow application, converted into a DAG, onto the Grid resources such that the execution time objective function is minimised. The scheduling engine employs the performance prediction service to obtain the expected execution times of individual activities.

Definition 2. Let W = (Nodes, C-edges, D-edges, IN-ports, OUT-ports) denote a scientific workflow application. We evaluate a workflow schedule by constructing the Gantt chart in which the end timestamp of each activity N ∈ Nodes is recursively defined as:

end(N) = T_N^{S_N}, if pred(N) = ∅;
end(N) = T_N^{S_N} + max_{N' ∈ pred(N)} end(N'), otherwise,

where end : Nodes → R+* (R+* denoting the set of positive real numbers), T_N^{S_N} is the predicted execution time of N on the Grid site S_N onto which it is scheduled, pred(N) = {N' | (N', N) ∈ C-edges} is the set of predecessors of activity N, and ∅ is the empty set.

Table 1
The HEFT weight and rank calculations for the sample workflow depicted in Fig. 2
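The recursive end-timestamp evaluation can be sketched as follows, under the simplifying assumption (ours) that data transfer times are folded into the predicted execution times:

```python
def end_time(activity, schedule, exec_time, preds, memo=None):
    """Recursive Gantt-chart end timestamp (sketch of Definition 2).

    schedule  -- {activity: site} mapping S_N
    exec_time -- exec_time[activity][site], predicted time T_N^{S_N}
    preds     -- preds[activity] = set of control flow predecessors
    """
    if memo is None:
        memo = {}
    if activity in memo:
        return memo[activity]
    t = exec_time[activity][schedule[activity]]
    if preds.get(activity):
        # an activity can only start once all predecessors have ended
        t += max(end_time(p, schedule, exec_time, preds, memo)
                 for p in preds[activity])
    memo[activity] = t
    return t
```

The memo dictionary keeps the evaluation linear in the number of activities even for DAGs with shared predecessors.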
In this section we present two heuristics that we use to implement the scheduling engine, alongside a genetic algorithm that we described in [18]: (1) the Heterogeneous Earliest Finish Time (HEFT) [27] algorithm, a list scheduling heuristic purposely tuned for scheduling complex DAGs in heterogeneous environments; (2) a myopic just-in-time algorithm acting like an opportunistic resource broker, similar to Condor DAGMan [1].

Heterogeneous Earliest Finish Time Algorithm (HEFT)
The HEFT algorithm, illustrated in pseudocode in Algorithm 1, is an extension of the classical list scheduling algorithm to heterogeneous environments and consists of three distinct phases: (1) the weighting phase (lines 3-8); (2) the ranking phase (lines 9-20); (3) the mapping phase (lines 21-24). We explain these three phases through the concrete example depicted in Fig. 2.
Weighting. During the weighting phase (lines 3-8), adjusted for heterogeneous Grid environments, we assign to each workflow activity a weight equal to its predicted execution time, which we estimate based on the training phase presented in Section 3. The weight associated with a computational activity N ∈ Nodes is the average of the predicted execution times T_N^P on every individual processor P available on the Grid (lines 3-5): w_N = avg_{∀P ∈ GRID} T_N^P, ∀N ∈ Nodes. Similarly, we compute the weight associated with a data dependency as the average of the predicted transfer times across all pairs of Grid sites (rather than processors, lines 6-8). In the example depicted in Fig. 2 and Table 1, the Grid consists of three processors P1, P2, and P3; therefore, the weight of activity N1 is calculated as w_{N1} = (T_{N1}^{P1} + T_{N1}^{P2} + T_{N1}^{P3}) / 3. Table 1 displays the weights of all workflow activities calculated using the same formulas.
Ranking. The ranking phase (lines 9-20) is performed by traversing the workflow graph upwards and assigning to each activity a rank value equal to the weight of the activity plus the maximum, over all successors, of the communication weight plus the successor's rank value (line 13): R_N = max_{∀N_succ ∈ succ(N)} { w_N + w_{(N, N_succ)} + R_{N_succ} }, where succ(N) = {N' | (N, N') ∈ C-edges}. For example, the rank of activity N1 is calculated as follows: R_{N1} = max { w_{N1} + w_{(N1,N2)} + R_{N2}, w_{N1} + w_{(N1,N3)} + R_{N3} } = max {7 + 5 + 26, 7 + 3 + 15} = 38. The activity list is then sorted in descending order of the activity ranks (line 20), i.e. N1, N2, N3, and N4.
Mapping. Finally, in the mapping phase (lines 21-24), each ranked activity N is scheduled onto the processor S_N that delivers its earliest completion time according to Definition 2, i.e. the processor minimising end(N).
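The three phases can be condensed into a short sketch. This is a simplified rendering under our own assumptions (zero transfer cost between activities mapped to the same processor, one activity per processor at a time), not the exact Algorithm 1:

```python
def heft(nodes, succ, exec_time, comm_time, processors):
    """Sketch of the HEFT weighting, ranking, and mapping phases.

    succ      -- succ[n] = list of control flow successors of n
    exec_time -- exec_time[n][p], predicted time of n on processor p
    comm_time -- comm_time[(n1, n2)], average transfer weight
    """
    # 1. Weighting: average predicted execution time over all processors.
    w = {n: sum(exec_time[n][p] for p in processors) / len(processors)
         for n in nodes}
    # 2. Ranking: upward rank = weight + max(comm + successor rank).
    rank = {}
    def compute_rank(n):
        if n not in rank:
            rank[n] = w[n] + max((comm_time.get((n, s), 0) + compute_rank(s)
                                  for s in succ.get(n, [])), default=0)
        return rank[n]
    for n in nodes:
        compute_rank(n)
    # 3. Mapping: in descending rank order (a topological order for
    # positive weights), pick the processor with the earliest finish time.
    ready = {p: 0.0 for p in processors}   # when each processor frees up
    finish, schedule = {}, {}
    for n in sorted(nodes, key=lambda x: -rank[x]):
        pred_done = max((finish[q] for q in nodes
                         if n in succ.get(q, [])), default=0.0)
        best = min(processors,
                   key=lambda p: max(ready[p], pred_done) + exec_time[n][p])
        finish[n] = max(ready[best], pred_done) + exec_time[n][best]
        ready[best] = finish[n]
        schedule[n] = best
    return schedule, finish
```

Descending rank order visits every activity after all of its predecessors, because a predecessor's rank always exceeds its successors' ranks when weights are positive.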

Myopic algorithm
To compare the effectiveness of the scheduling heuristics, we developed a simple and inexpensive method that makes the mapping based on locally optimal decisions, similar to the matchmaking mechanism performed by a resource broker like Condor DAGMan [1] (see Algorithm 2). The algorithm traverses the workflow in the top-down direction (lines 5 and 6), analyses every activity separately, and assigns it to the processor that delivers the earliest completion time (line 7).
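A sketch of the myopic mapping under the same simplifying assumptions as before (our naming): it simply takes activities in top-down topological order, with no weighting or ranking and no look-ahead:

```python
def myopic(nodes_topo, exec_time, preds, processors):
    """Myopic just-in-time mapping: each activity, in topological
    order, goes to the processor giving its earliest completion time.

    nodes_topo -- activities in a topological (top-down) order
    exec_time  -- exec_time[n][p], predicted time of n on processor p
    preds      -- preds[n] = set of control flow predecessors of n
    """
    ready = {p: 0.0 for p in processors}
    finish, schedule = {}, {}
    for n in nodes_topo:
        pred_done = max((finish[q] for q in preds.get(n, ())), default=0.0)
        best = min(processors,
                   key=lambda p: max(ready[p], pred_done) + exec_time[n][p])
        finish[n] = max(ready[best], pred_done) + exec_time[n][best]
        ready[best] = finish[n]
        schedule[n] = best
    return schedule, finish
```

Because each decision ignores the critical path, long-running branches receive no priority, which is exactly the weakness exposed by the imbalanced Invmod workflow in Section 4.3.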

Layered partitioning
Layered partitioning combines the HEFT and myopic algorithms by considering as input to the conversion algorithm only a sub-workflow W_n = (Nodes_n, C-edges_n, D-edges_n, IN-ports_{W_n}, OUT-ports_{W_n}) of a given depth of n activities of the workflow W = (Nodes, C-edges, D-edges, IN-ports_W, OUT-ports_W), comprising the first n layers of activities together with their induced control and data flow dependencies and data ports. This method is similar to the one described informally in [7] and is better suited to workflows with regular structures and a large number of activities (see Section 5.3), since it needs less scheduling time to compute optimised mappings of smaller sub-workflows while preserving the overall quality of the solution.

Invmod
Invmod [23] is a hydrological application designed at the University of Innsbruck for the calibration of parameters of the WaSiM tool developed at the Swiss Federal Institute of Technology Zurich. Invmod uses the Levenberg-Marquardt algorithm to minimise the least squares of the differences between the measured and the simulated runoff over a given time period. We re-engineered the monolithic Invmod application into a Grid-enabled scientific workflow with two levels of parallelism, as depicted in Fig. 3: (1) each iteration of the outermost parallel loop, called random run, performs a local search optimisation starting from an arbitrarily chosen initial solution; (2) alternative local changes are examined separately for each calibrated parameter, which is done in parallel in the inner nested parallel loop. The number of sequential loop iterations is variable and depends on the actual convergence of the optimisation process; however, it is usually equal to the input maximum iteration number.
We performed the experiments on seven heterogeneous Grid sites of the Austrian Grid [22] infrastructure illustrated in Table 2, aggregating 116 processors in total. The Invmod workflow is a typical case of a strongly imbalanced workflow in which one of the outermost parallel loop iterations is significantly longer than the others due to a different number of inner sequential loop iterations. In our case, the converted DAG consists of 100 parallel iterations, one of which contains 20 sequential iterations of the inner optimisation loop, while the other 99 iterations contain only 10 optimisation iterations each (see Fig. 3b). This means that one parallel iteration needs approximately twice the execution time of the others.
The experimental results for the Invmod workflow illustrated in Fig. 4 show how each of the three algorithms deals with such strongly imbalanced workflow structures. As expected, the myopic algorithm provides the worst results, approximately 32% worse than HEFT. The genetic algorithm produces quite good results; however, it is worse than HEFT since its optimisation process does not consider the execution order of parallel activities scheduled on the same processor. In addition, we applied incremental scheduling using 10, 20, and 30 partitioning layers and compared the results against full-ahead scheduling of the complete workflow consisting of 44 layers. For such strongly imbalanced workflows, the activities on the much longer critical path should be given priority, which is well handled by the entire-workflow scheduling strategies based on optimisation heuristics like HEFT and the genetic algorithm. Therefore, scheduling strategies based on workflow partitioning deliver worse results than those based on full workflow analysis, although their results are still better than the one found by the myopic algorithm. The genetic algorithm requires two orders of magnitude more time than the list scheduling algorithms to converge to good solutions.
In addition, we ran the scheduling algorithms with and without prediction information to study the value and impact of prediction availability. Scheduling with prediction information delivers 25%-33% better results than scheduling without it, in which case all activities are assumed to have equal execution times.

Enactment engine
The enactment engine is the service responsible for the effective workflow execution, performed in three major steps: (1) first, the (XML-based) workflow representation is delivered to the scheduler for appropriate mapping onto the Grid, as presented in Section 4; (2) once the concrete workflow schedule is received, the engine simplifies the workflow through a partitioning algorithm (see Section 5.1); (3) at runtime, the workflow execution is improved by a dynamic steering algorithm (see Section 5.2).

Workflow partitioning
Our experience in running real-world applications in the Austrian Grid environment revealed that executing one computational activity on a remote Grid site incurs on average about 10-20 seconds of overhead, mainly due to mutual authentication latency and polling for job termination. This overhead may be significantly larger if access to Grid sites is performed through local job management systems, and it therefore becomes critical for large scientific workflows comprising hundreds to thousands of activities.
In this section we propose a partitioning algorithm to decrease the number of activities and data dependencies in a workflow. Computing an optimal partitioning of a set of n elements is intractable in general, since the number of possible partitions is given by the n-th Bell number, a classical quantity of combinatorial mathematics that grows faster than exponentially. Some related partitioning approaches have already been proposed, although with different goals [2,7]. Definition 3. We define a workflow partition as the largest sub-workflow W_P = (Nodes_P, C-edges_P, D-edges_P) with the following properties: (1) all activities are scheduled on the same Grid site: S_{N1} = S_{N2}, ∀N1, N2 ∈ Nodes_P; (2) there must be no control flow or data flow dependencies to/from activities that have predecessors/successors within the partition: pred(N) = ∅ ∨ pred(N) ⊆ Nodes_P, ∀N ∈ Nodes_P.
The goal of the partitioning algorithm is to generate a partitioned workflow W_P = (Nodes_P, C-edges_P, D-edges_P) from a workflow W = (Nodes, C-edges, D-edges), where Nodes_P = {P_1, ..., P_n} is the set of workflow partitions that fulfil Definition 3, P_i ∩ P_j = ∅ for all i ≠ j (i.e. the partitions are disjoint), ⋃_{i=1}^{n} P_i = Nodes (i.e. the partitions cover all workflow activities), and n is minimal. We base our partitioning algorithm on graph transformation theory [3] as the formal background to rigorously express it. We define several rules for valid workflow partitions that aim to decrease the complexity of the algorithm (to polynomial) and to create the set of cooperating workflow partitions.
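Definition 3, read literally, can be checked mechanically. The following sketch (our naming, not the ASKALON implementation) validates a candidate partition against its two conditions:

```python
def is_valid_partition(partition, schedule, preds):
    """Check a candidate partition (a set of activities) against
    Definition 3: (1) all activities are scheduled on the same Grid
    site; (2) every activity either has no predecessors at all or
    all of its predecessors lie inside the partition, so that no
    dependency crosses into the middle of the partition.

    schedule -- {activity: site} mapping S_N
    preds    -- preds[activity] = set of control/data predecessors
    """
    if len({schedule[n] for n in partition}) > 1:
        return False          # condition (1) violated
    for n in partition:
        p = set(preds.get(n, ()))
        if p and not p <= partition:
            return False      # condition (2) violated
    return True
```

A greedy algorithm can then grow partitions activity by activity, accepting an activity only while this predicate still holds.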
Let (W, R) denote a workflow transformation system, where R denotes the set of graph transformation rules. We approach the workflow partitioning problem using a four-step transformation sequence whose intermediate workflows (partition sets such as W_CF and W_RM2) are generated using different transformation rules that preserve the control and data flow dependencies of the original workflow W. We omit the workflow input and output data ports for clarity, since they are irrelevant to our partitioning algorithm.
Step 1: Partition the workflow according to the following control flow dependency rules R_CF: 1. every activity of the workflow must belong to exactly one partition: ∀N ∈ Nodes, ∃P ∈ Nodes_CF: N ∈ P ∧ N ∉ P', ∀P' ∈ Nodes_CF \ {P}; 2. every partition is one composite or atomic activity. Currently we perform this step by using additional information provided by the user in the XML-based workflow representation [11] and mapping one composite activity (e.g. a parallel activity) to one partition; 3. no control flow dependencies between intermediate activities in different partitions are allowed (pred and succ denoting the predecessor, respectively the successor, of an activity in the workflow); 4. the number of activities inside one composite activity must exceed the average number of processors on one Grid site. We introduce this rule to avoid too fine-grained partitions in the workflow that would start slave engines on sites with little workload.
For example, in Fig. 5a we partition the atomic activities of each of the composite activities N_if, N_loop, and N_seq into one partition each, which produces the control flow partitioning shown in the figure.

Step 4: W_RM2 =⇒ W_P. Since the partitioning may still be too fine grained, we merge the partitions connected through control flow dependencies using the following two merge rules: 1. merge partitions that are connected through control flow dependencies but have no data flow dependencies (i.e. they are scheduled on the same site); 2. in the final partitioning, there must be no control and data flow dependencies to/from activities that have predecessors/successors within the partitions, which is achieved by iteratively applying the merge step within a fixed point algorithm until nothing changes and the largest partitions are obtained.

Workflow partitioning allows the engine to aggregate the activities belonging to the same partition and execute them as one single job submission, which drastically reduces the job management latencies of workflows with a large number of activities. In addition, the data dependencies between activities belonging to different partitions are archived, packed, and transferred as one file transfer job, which drastically reduces the (GridFTP) connection latencies: D-edges_P = ⋃_{∀P1,P2 ∈ Nodes_P} {(P1, P2, D-port_archive)}, where D-port_archive is a compressed archive of all data dependencies between partitions P1 and P2 (typically instantiated during execution by files).

Virtual single execution environment
As a specialisation of workflow partitioning, we propose another simple technique called Virtual Single Execution Environment (VSEE) to reduce the management overheads of workflows characterised by a large number (hundreds to thousands) of activities with complex data dependencies that are relatively small in size. VSEE replaces the data dependencies between activities with the full data environment, recursively defined for a partition P as follows: V_P = ⋃_{∀(P', P, D-port) ∈ D-edges_P} V_{P'} ∪ ⋃_{∀(P', P, D-port) ∈ D-edges_P} {D-port}. Clearly, V_{P'} ⊆ V_P holds for every data flow predecessor P' of P. Upon executing a workflow partition on a Grid site, each slave engine automatically creates and removes one working directory that represents its execution environment. The VSEE mechanism transforms complex data dependencies between activities into one environment dependency between partitions that is packaged and transferred at runtime as one single data transfer activity. In addition to noticeably reducing the latencies and the number of data transfers for compute intensive Grid applications with large amounts of small-sized data dependencies, VSEE presents two additional advantages: (1) it reduces the overhead of activity migration upon rescheduling (see Sections 5.2 and 5.3); (2) it shields the user from the complexity of the workflow definition and the error-prone task of specifying tens or hundreds of input and output data ports between activities.
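The recursive environment V_P can be sketched as follows; representing D-edges_P as a set of (source partition, destination partition, port) triples is our own assumption about the data layout:

```python
def vsee_environment(partition, d_edges, memo=None):
    """Recursively collect the full data environment V_P of a partition:
    everything accumulated by its data flow predecessor partitions,
    plus the ports transferred directly to it.

    d_edges -- set of (src_partition, dst_partition, port) triples
    """
    if memo is None:
        memo = {}
    if partition in memo:
        return memo[partition]
    env = set()
    for src, dst, port in d_edges:
        if dst == partition:
            env |= vsee_environment(src, d_edges, memo)  # V_{P'}
            env.add(port)                                # {D-port}
    memo[partition] = env
    return env
```

The set union makes the inclusion property V_{P'} ⊆ V_P immediate: a partition's environment always contains everything its predecessors accumulated.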
Figure 6 illustrates a sample workflow (see also Section 5.3) scheduled on three Grid sites {M1, M2, M3}. First, the workflow is split into seven partitions, Nodes_P = ⋃_{i=1}^{7} P_i, based on the algorithm presented in Section 5.1 (see Fig. 6a). Then, the data flow between partitions is optimised according to the VSEE-based relationships depicted in Table 3; for example, transferring data between partitions only according to the data flow dependencies would require P6 to receive data from several predecessor partitions. For certain compute intensive applications characterised by large numbers of small data dependencies like WIEN2k, the VSEE mechanism can drastically decrease the number of file transfers (by up to orders of magnitude), as we will experimentally illustrate in Section 5.3.

Table 3
The VSEE results for the WIEN2k workflow

Workflow steering
Many external factors can affect the execution of large workflows in dynamic Grid environments, causing it to deviate from the original plan computed by the scheduler. Such unpredictable factors include queuing times, external load on processors (e.g. on Grid sites that also serve as student workstation rooms in our real Grid environment), varying availability of processors on workstation networks (e.g. when a student shuts down a machine or reboots it into Windows operating system mode), jobs belonging to other users on parallel machines, congested networks, or simply inaccurate prediction information. Moreover, we often encountered in our real Grid environment sites that offer a reduced capacity for certain resources, for example a small number of input and output nodes, which effectively suffer a denial of service when the number of concurrent file transfers (often used to increase bandwidth utilisation) exceeds a certain limit. The steering module of the enactment engine aims to minimise the losses due to such unpredictable situations, which violate the optimised static mapping computed by the scheduler, through appropriate rescheduling techniques.

Rescheduling events
The steering module of the enactment engine continuously monitors the workflow execution and triggers appropriate rescheduling events whenever any of the following situations occurs: -cardinality port value change, which implies modifications in the workflow shape, in particular in the size of parallel loops (see Section 2 for the formal definition and Section 5.3 for a real-world example); -inaccurate prediction of various workflow characteristics based on newly available execution performance data, in particular branch probabilities in conditional activities, numbers of iterations in sequential and parallel loops, or more accurate execution time estimations of computational activities; -resource change, in particular in the availability of Grid sites (i.e. the number of processors available) where workflow activities are scheduled, or when new powerful parallel computers become available; -performance contract violation, caused by activity executions that no longer follow the original optimised plan computed by the scheduler.

Definition 4.
Let N be a submitted activity, W_N its assigned work (e.g. floating point operations), T_N its estimated execution time, and start(N) = end(N) − T_N its start timestamp, where the end timestamp end(N) was defined in Section 4 (see Definition 2). We define the performance contract [26] of an activity N on site S_N at a time instance t, with start(N) ≤ t < end(N), as the ratio of the elapsed time to the estimated execution time: PC(N, S_N, t) = (t − start(N)) / T_N. The steering module of the enactment engine triggers a rescheduling event for activity N at time instance t whenever PC(N, S_N, t) > f_N, where f_N is the predefined performance contract elapse factor of activity N. Currently, the value of the performance contract elapse factor f_N, which represents a certain percentage of the predicted activity execution time T_N, needs to be statically defined by the user for each activity (as activity properties in the workflow specification [11]). After rescheduling, the affected workflow activities are restarted.
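One plausible reading of the performance contract trigger, comparing elapsed time against the predicted execution time T_N scaled by the elapse factor f_N (a reconstruction of Definition 4, not necessarily the paper's exact formula), can be sketched as:

```python
def performance_contract_violated(t, start, predicted_time, elapse_factor):
    """Rescheduling trigger sketch: the contract of an activity is
    considered violated once the elapsed time (t - start(N)) exceeds
    elapse_factor (f_N) times the predicted execution time (T_N)."""
    return (t - start) / predicted_time > elapse_factor
```

With f_N = 1.5, for example, an activity predicted to take 10 s triggers a rescheduling event once it has been running for more than 15 s.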

Steering algorithm
An activity N ∈ Nodes of the running workflow can be at a certain time instance t in one of the following states: queued, running, completed, or failed, denoted as state(N , t ).
In this section we propose a simple steering algorithm, depicted in Algorithm 3, based on the repeated invocation of the static scheduling algorithm, as informally outlined by the following execution steps: (1) the algorithm receives as input a scientific workflow compliant with the model presented in Section 2 (lines 1-2); (2) the workflow is converted into a DAG and scheduled onto the Grid using optimisation heuristics as presented in Section 4 (lines 3-4); (3) the workflow is submitted for execution based on the initial schedule (line 5); (4) the workflow is monitored until it completes its execution (lines 6-14); (5) whenever one of the events presented in the previous section occurs, a rescheduling event is triggered (line 7); (6) all activities that violate their performance contract are cancelled and reported as failed (lines 8-11); (7) the workflow is converted once again based on the new runtime information and rescheduled (lines 12-13).
To efficiently handle workflow rescheduling at runtime, we extended the workflow conversion algorithm introduced in Section 4.1 (see [19]) such that only the relevant (i.e. still to be executed) part of the workflow is considered in the optimisation process (lines 17-30). More specifically, the following activities are eliminated and not considered for rescheduling (lines 26-27): (1) all properly running activities that fulfill their performance contract; (2) all completed activities that do not have sequential loops as parents and, therefore, will not be re-executed.
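One steering step, namely cancelling contract violators and filtering out the activities eliminated under rules (1) and (2), can be sketched as follows; the two predicate callbacks stand in for the monitoring and workflow model services and are hypothetical:

```python
def steering_step(activities, t, violates, in_sequential_loop):
    """One steering iteration over a running workflow.

    activities: dict mapping activity name to its state at time t,
                one of "queued", "running", "completed", "failed".
    violates(name, t): True if the activity breaks its performance contract.
    in_sequential_loop(name): True if the activity may be re-executed.
    Returns (failed, relevant): cancelled activities and the sub-workflow
    that must be reconverted and rescheduled.
    """
    failed, relevant = [], []
    for name, state in activities.items():
        # cancel and report as failed all contract violators
        if state == "running" and violates(name, t):
            activities[name] = "failed"
            failed.append(name)
        state = activities[name]
        # eliminate properly running activities within their contract
        if state == "running" and not violates(name, t):
            continue
        # eliminate completed activities that will never be re-executed
        if state == "completed" and not in_sequential_loop(name):
            continue
        relevant.append(name)
    return failed, relevant
```

Everything in `relevant` is then converted to a DAG and handed back to the static scheduler, exactly as in the initial scheduling pass.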

WIEN2k
WIEN2k is a program package for performing electronic structure calculations of solids using density functional theory, based on the full-potential (linearised) augmented plane wave ((L)APW) and local orbital (lo) method. We first ported the application onto the Grid by splitting the monolithic code into several coarse-grained activities coordinated in a workflow, as illustrated in Fig. 6. The LAPW1 and LAPW2 activities can be solved in parallel by a fixed number of so-called k-points. A final activity called Converged, applied to several output files, tests whether the problem convergence criterion is fulfilled. The number of sequential loop iterations is statically unknown.
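The control structure can be sketched as nested constructs, here as a plain Python data structure rather than the actual XML workflow specification; the MIXER activity name is an assumption not taken from the text, and the cardinality of both parallel loops is only known at runtime, after LAPW0 executes:

```python
# Hypothetical nested representation of the WIEN2k workflow structure.
# The outer while-loop count is statically unknown; "kpoints" is the
# cardinality determined at runtime by LAPW0.
wien2k = ("while", "convergence", [
    ("activity", "LAPW0"),
    ("parallel_for", "kpoints", ("activity", "LAPW1")),
    ("activity", "LAPW2_FERMI"),
    ("parallel_for", "kpoints", ("activity", "LAPW2")),
    ("activity", "MIXER"),          # illustrative name, see lead-in
    ("condition", "Converged"),     # tests the convergence criterion
])

def count_activities(node, kpoints):
    """Activities per loop iteration once the k-point cardinality is known."""
    kind = node[0]
    if kind in ("activity", "condition"):
        return 1
    if kind == "parallel_for":
        return kpoints * count_activities(node[2], kpoints)
    return sum(count_activities(child, kpoints) for child in node[2])
```

With 250 k-points, as in the experiments below, this structure expands to over 500 activities per iteration, which is what makes scheduling and partitioning non-trivial.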
We executed the WIEN2k workflow in a subset of the Austrian Grid infrastructure [22] consisting of a number of parallel computers and workstation networks accessible through the Globus toolkit and local job managers, as depicted in Table 4. We chose a problem size that produces at runtime 250 parallel k-points, which means a total of over 500 workflow activities. We first executed the workflow application on the fastest site available (i.e. altix1.jku in Linz), which gives an indication of what can be achieved for this application by using only local compute resources. Then we incrementally added the next fastest sites for this application (i.e. in top-down order in Table 4) and observed the benefits or losses obtained by executing the same problem size in a larger Grid. We compared the performance delivered by three of our workflow enactment techniques: partitioning, partitioning and steering, and VSEE.
Figure 7a presents the number of WIEN2k partitions computed by the partitioning algorithm for each Grid site configuration. The number of partitions depends on the workflow structure and the execution plan computed by the scheduler, and is proportional to the number of sites used in each execution. Figure 7b shows the execution times for running the same WIEN2k problem on different Grid size configurations ranging from one to six aggregated sites. Similarly, Fig. 7c displays the workflow speedup, computed as the ratio between the fastest single-site execution time T_seq (altix1.jku in Linz) and the current Grid execution time T_W: S = T_seq / T_W. Without any optimisation, the performance and the speedup deteriorate with the increase in the number of Grid sites used for scheduling and running the workflow. With optimisation and steering, the WIEN2k execution time improves because of the simplified data flow and the balanced execution of the LAPW1 and LAPW2 parallel loops.
Figures 7d and 7e show that the number of file transfers and remote job submissions, respectively, is considerably reduced when optimisation is applied, which explains the performance results obtained. Figure 7f shows that the size of the data transferred under VSEE is obviously larger than in the other cases; however, VSEE offers the biggest execution improvement since it reduces the number of file transfers by three orders of magnitude, which drastically reduces the latencies (i.e. mutual authentication to the GridFTP service) while effectively utilising the bandwidth. The steering improvement is caused by a large load imbalance in the workflow parallel loops, produced by external load on the Altix shared memory machines and external jobs on the workstation networks. As a consequence, the execution deviated too much from the schedule, which triggered a rescheduling event. Since the number of activities in the parallel loop exceeds the number of processors on the Grid testbed, the scheduler was able to halve the load imbalance of the loop (see Fig. 7g) by rescheduling the jobs not yet executed. We exhibit a slowdown from five to six Grid sites when using control and data flow optimisation because of the increased communication time across six distributed sites.
Figure 7h compares the data transfer overheads of the activity migration upon steering with and without the VSEE mechanism. One important aspect is that the data transfer overhead upon migrating LAPW1 and LAPW2 activities is zero when using the VSEE mechanism. The reason is that the sequential activities LAPW0 and LAPW2_FERMI replicate all their output files to the sites where the subsequent LAPW1 and LAPW2 parallel loop activities are scheduled. Therefore, these activities find their inputs already prepared on the sites where they are migrated, which eliminates the data transfer overhead.
To better understand the steering algorithm, we generated three experimental WIEN2k workflows (i.e. two DAG-based and one DG-based containing one sequential loop) that correspond to different application input cases (i.e. the number of atoms and matrix sizes) with different parallelisation sizes (i.e. number of k-points). To achieve a more fine-grained execution trace, we generated periodic rescheduling events in addition to those presented in Section 5.2.1.
Figure 8a traces the value of the makespan objective function optimised by the scheduling algorithm at consecutive scheduling events during the execution of each experimental workflow. A characteristic of the WIEN2k workflow is that the cardinality of the LAPW1 and LAPW2 parallel loops is determined by a cardinality port generated by LAPW0. Since the value of this port is statically unknown, the scheduler assumes one activity iteration, which serialises the workflow activities onto the fastest Grid site. As soon as the cardinality port is instantiated, the workflow is rescheduled and the predicted makespan increases. As the remaining workflow activities are scheduled, execute, and complete, the makespan of the remaining DAG1 and DAG2 sub-workflows obviously decreases with the number of scheduling events. The abrupt decreases of the makespan happen after the submission of all the LAPW1 k-points, which are the most time-consuming workflow activities and no longer need to be considered by the scheduler.
At this point, we artificially created external load by submitting several external jobs to the parallel machines. This caused abrupt increases in the makespan due to LAPW1 activities that violated their performance contract and needed to be reconsidered by the scheduler for rescheduling, migration, and restart. In the case of the DG-based workflow, the scheduler always receives the complete workflow as input, but with a different control precedence relation between activities, which explains why the makespan stays relatively constant.
Figure 8b traces the overall predicted workflow makespan (i.e. the time the entire workflow is expected to execute) at consecutive scheduling events during the workflow execution. The peaks are again due to the same performance contract violations of several LAPW1 activities which, after being rescheduled, bring the next predicted makespan close to the originally predicted value. We achieve through rescheduling an estimated twofold improvement in the overall makespan. Since the workflow referred to as DAG2 represents a larger problem size than DAG1, the benefit obtained through rescheduling and activity migration is higher. The final makespan of the DAG-based workflows is, however, about twice as large as originally predicted by the scheduler. For the DG-based workflow, we could not estimate the makespan of the entire workflow (i.e. beyond the execution of one sequential loop iteration) since the number of loop iterations is statically unknown. As a consequence, Fig. 8 represents the DG makespan of one workflow iteration only, which was successfully kept relatively constant through activity migration in two critical situations.

Related work
The GrADS project pioneered the idea of runtime adaptation based on performance contracts through online monitoring and analysis performed by the Autopilot tool [24]. The work focused primarily on Grid support for numerical libraries, high-performance parallel applications [5], and parameter studies, which is complementary to our research on scientific workflows.
Gridbus [6] provides an XML-based language oriented towards parametrisation and Quality of Service requirements. No branches or loops are supported. Gridbus provides a scheduler supporting deadline and budget constraints based on genetic algorithms. The work is based on a simulated environment, while our work targets real-world applications. Performance prediction is not addressed.
The ICENI [16] workflow specification contains low-level enactment engine-specific constructs such as start and stop activities. Scheduling is done using random, best-of-n-random, simulated annealing, and game theory algorithms. Prediction work focuses on improving Grid predictability through advance reservation.
Similar to our approach, Karajan [25] can specify hierarchical workflows using an XML language that includes sequential and parallel loops. Workflows can be modified at runtime through interaction with a workflow repository or with schedulers for the dynamic association of resources to tasks. Karajan applies opportunistic round-robin or lookahead scheduling policies based on the maximum number of jobs allowed for a Grid site. No workflow optimisation or steering techniques are addressed.
Kennedy et al. [14] compared several task-based and workflow-based approaches to resource allocation for workflow applications, based on simulation rather than on real executions as in our work. They also study the impact of uncertainty on the overall workflow schedule, but do not propose runtime optimisation or steering techniques.
Kepler [15] extends the Ptolemy system with new features and components for scientific workflow design, such as branches, sequential loops, and data dependencies. Parallel loops are not supported, and neither is automatic scheduling based on performance prediction.
Pegasus [8] uses Condor DAGMan [1] as its enactment engine, enhanced with data derivation techniques that simplify the workflow at runtime based on data availability. Pegasus provides a layered workflow restructuring method which takes place before the scheduling phase. Our approach partitions and optimises the workflow after scheduling and uses this mapping information to further improve the partitioning.
Taverna [17] was originally designed to provide Grid support for bioinformatics applications, with a recent focus on semantic annotations, provenance, and workflow reuse. Taverna is based on Web services and uses the Simple Conceptual Unified Flow Language for workflow choreography, which is limited to DAGs. Taverna enables users to construct, share, and enact workflows using a customised fault-tolerant enactment engine based on opportunistic just-in-time scheduling.
Triana [21] uses the Grid Application Toolkit interface to the Grid through JXTA and Web services. It provides support for conditional activities and sequential loops, but lacks compact mechanisms for expressing large parallel loops. Scheduling is done just-in-time, with no optimisations or performance estimates. Recent preliminary work targets a generic architecture for monitoring and steering legacy applications.
In [12], a pattern-based software engineering tool for Grid environments is described. The authors propose a broad set of structural patterns, classified into topological (e.g. pipeline, ring, star) and non-topological (e.g. adapter, facade, proxy) categories. Additionally, they propose a set of structural operators to modify and manipulate patterns, including increase, decrease, extend, reduce, embed, and extract. The approach is validated through an implementation of the structural patterns as an extension to the Triana tool. The approach is generic and operates at a high level of abstraction, and could be used to express ASKALON workflows too.
UNICORE [9] provides graphical composition of directed graph-based workflows, with no support for parallel loops. Scheduling is done manually, by allowing the user to allocate resources and specify data transfer tasks through the graphical user interface.
JISGA [13] is a Jini-based service-oriented architecture for Grid computing that, in addition to the basic functionalities of a general Jini system, supports workflow applications by implementing the XML-based Service Workflow Language (SWFL). SWFL extends IBM's Web Services Flow Language (WSFL) through improved conditional and loop control constructs, extended data flow patterns, more data mappings including arrays, and support for assignment statements. Optimisation is supported through the discovery of a set of semantically equivalent services and selection based on real-time or historical data. Global workflow optimisations are not supported.

Conclusions
In this paper, we have presented techniques used in the ASKALON project for supporting effective modelling and high-performance execution of scientific workflows in Grid environments. Our approach is new or different from existing approaches in several aspects. As part of our abstract and generic hierarchical model, we introduced the concept of a cardinality input port of parallel loops that changes the workflow structure dynamically at runtime. A performance prediction service supports the scheduler with accurate execution time information based on a well-defined training phase with a reduced number of experiments. A modular scheduling service employs advanced heuristics such as list scheduling, matchmaking, and genetic algorithms to find good mappings of workflow activities onto the Grid resources. An enactment engine service ensures scalable execution of large scientific workflows through techniques such as partitioning, control and data flow optimisation, and runtime steering adaptation. In contrast to related work, which is often based on simulation, we validated our techniques by modelling, scheduling, and analysing the scalability of two real-world applications from the material science and hydrology fields on our national Grid infrastructure.
(Figure: experiment reduction with problem, machine, and Grid size; panel (b) shows the relative LAPW1 execution times in seconds over the parameter values (k-max) for the sites agrid1, altix1.ui, altix1.jku, hydra.gup, and schafberg.)

Step 2: W^RDF ⟹ W^DF. Partition the original workflow according to three data flow dependency rules R^DF:
1. each activity of the workflow must belong to exactly one partition: ∀N ∈ Nodes, ∃P ∈ Nodes^DF : N ∈ P ∧ N ∉ P′, ∀P′ ∈ Nodes^DF \ {P};
2. the data dependencies between activities scheduled on the same Grid site are eliminated: D-edges^DF = D-edges \ {(N₁, N₂, D-port) | N₁, N₂ ∈ Nodes ∧ S_N₁ = S_N₂};
3. activities scheduled on the same Grid site belong to the same partition: ∀N₁ ∈ P ∈ W^DF, ∀N₂ ∈ Nodes : S_N₁ = S_N₂ ⟹ N₂ ∈ P.
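The three rules amount to grouping activities by their scheduled Grid site and dropping intra-site data dependencies; a minimal sketch (the `site` callback standing in for the scheduler's mapping is hypothetical):

```python
def partition_by_site(nodes, d_edges, site):
    """Step 2 sketch: W^RDF => W^DF.

    nodes: activity names; d_edges: (source, target, data_port) triples;
    site(name): the Grid site an activity is scheduled on.
    Returns (partitions, remaining_edges).
    """
    partitions = {}
    for n in nodes:
        # rules 1 and 3: every activity lands in exactly one partition,
        # shared by all activities scheduled on the same site
        partitions.setdefault(site(n), []).append(n)
    # rule 2: data dependencies within one site are eliminated,
    # only inter-site transfers remain
    kept = [(a, b, p) for (a, b, p) in d_edges if site(a) != site(b)]
    return list(partitions.values()), kept
```

Each resulting partition can then be handled by a single job submission per site, which is what reduces the middleware overheads discussed in Section 5.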

Table 2: Austrian Grid testbed for Invmod scheduling experiments.

Table 4: The Austrian Grid testbed for WIEN2k experiments.