Energy-Efficient Scientific Workflow Scheduling Algorithm in Cloud Environment

Scheduling large-scale, deadline-aware scientific applications (usually referred to as workflows) is a difficult task. This research provides a virtual machine (VM) placement and scheduling approach for effectively scheduling workflow tasks in the cloud environment while maintaining dependency and deadline constraints. The aim of the proposed model is to reduce the application's energy consumption and total execution time while respecting these dependency and deadline constraints. An energy-efficient VM placement (EVMP) algorithm is presented to select the VM for each task and to dynamically deploy/undeploy VMs on the hosts based on the tasks' requirements. The results demonstrate that the proposed approach outperforms the existing PESVMC (power-efficient scheduling and VM consolidation) algorithm.


Introduction
Large-scale, complex scientific applications (workflows) are executed and analyzed in multidisciplinary research areas such as astronomy and physics [1]. A workflow contains a large number of mutually dependent tasks, which are executed according to their dependency constraints [2]. Because of these constraints, a child task can start its execution only when its parent task has finished. A directed acyclic graph (DAG) is used to represent a workflow. Workflows often have disparate requirements (such as storage and CPU) and constraints (dependency) that need to be accounted for during their execution. For example, the scientific workflow Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) [3] is a resource-intensive workflow with a good degree of scalability [4]. The strict demands placed on the computing infrastructure make the execution of scientific applications difficult and costly [5]. Cloud computing provides virtualized resources as a service on an on-demand, pay-per-use basis [6,7]. Characteristics of cloud computing such as elasticity and flexibility make this environment a major trend for computation and storage services, and they motivate the execution of scientific applications in the cloud environment [8].
Scientific workflows are composed of distinct tasks with complex dependencies. Resource provisioning and the order in which workflow tasks are executed are challenging problems. Inefficient utilization of resources while executing workflows increases the number of provisioned but unused resources, and these unused resources consume energy without performing any useful operation [9]. Resource utilization can be increased by efficient resource provisioning. An energy-efficient scheduling algorithm can be used to manage the resources that the tasks require while these scientific workflow tasks are executed. Numerous workflow scheduling algorithms have been proposed in the literature. These algorithms focus on reducing makespan and cost under limited resources. Selecting and designing a competent and effective workflow scheduling algorithm is itself a challenging task [10]. An energy-aware scheduling algorithm must provision, from the offered resources, a resource that is efficient enough to complete the workflow tasks within their deadline constraints while decreasing energy consumption. To minimize energy consumption, technologies such as Dynamic Power Management (DPM) [11,12], Dynamic Voltage and Frequency Scaling (DVFS) [12-15], resource consolidation with migration techniques [6,16], virtualization [6], and green policies [17] are used. Energy consumption has also been minimized by reducing the computational power of the resources; however, this reduction increases the workflow makespan.
A combination of software- and hardware-based techniques is necessary to reduce energy consumption. In this paper, the EVMP algorithm is proposed to schedule a scientific workflow on virtual machines (VMs). The EVMP algorithm integrates both hardware and software policies to minimize energy consumption. Virtualization technology is exploited to create VMs on a server. The DVFS technique is used to save energy when the server or a CPU core is idle. Dynamic provisioning of heterogeneous types of available resources is considered in order to model the Infrastructure-as-a-Service (IaaS) cloud service. An energy model is presented to monitor and calculate the energy consumption. During the scheduling of tasks, server overloading is also prevented by monitoring the server status [18].
1.1. Paper Outline. The next section presents the cloud workflow model, task model, and energy model to execute the workflow. In Section 3, the workflow task scheduling algorithm is presented. The experimentation and performance metrics are presented in Section 4. Section 5 demonstrates the simulation results and discussion. The conclusion of the paper along with future directions is presented in the last section.

System Model
This section describes the cloud model, workflow model, task model, and energy model.

Cloud Model.
In this paper, a large number of heterogeneous hosts (physical servers) are deployed. The $k$th host is denoted $host_k = \{cpu_k, pre_k, ram_k, network_k, storage_k\}$, where $cpu_k$, $pre_k$, $ram_k$, $network_k$, and $storage_k$ are the Central Processing Unit (CPU) capacity, number of processing elements, Random Access Memory (RAM) capacity, network bandwidth capacity, and storage capacity of $host_k$, respectively. $cpu_k$ is divided equally among the $pre_k$ processing elements. The capacities of CPU, RAM, network bandwidth, and storage are measured in million instructions per second (MIPS) [6], megabytes (MB), gigabits per second (Gbps), and gigabytes (GB), respectively [19]. VMs are used to execute the workflow, and more than one VM can be deployed on a host. Let $j$ be the number of VMs deployed on $host_k$; this set is denoted $VM_{jk} = \{vm_{1k}, vm_{2k}, \cdots, vm_{jk}\}$. VMs are dynamically deployed/undeployed on the host as the workflow demands. To execute the workflow, a fraction of the host resources is allocated to each VM: $vm_{jk}$ denotes the $j$th VM on the $k$th host, and $f^{cpu}_{jk}$, $f^{pre}_{jk}$, $f^{ram}_{jk}$, $f^{network}_{jk}$, and $f^{storage}_{jk}$ denote the fractions of CPU, processing elements, RAM, network bandwidth, and storage allocated to it, respectively. In this paper, hosts are switched on/off dynamically. Based on their utilization, hosts are divided into three categories: underloaded, overloaded, and normal. If the resource utilization is less than the lower threshold value, the host is categorized as underloaded. For an underloaded host, the scheduler tries to migrate the deployed VMs elsewhere and switch off the host, which helps to minimize energy consumption. If the resource utilization is more than the upper threshold value, the host is categorized as overloaded. Some of the VMs are migrated away from an overloaded host because overloaded hosts consume more energy. Otherwise, the host is in the normal category.
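The three host categories described above can be sketched in a few lines. This is a minimal illustration only: the function name `classify_host` and the default threshold values (0.2 and 0.8) are assumptions made for the example, since the text does not state the thresholds used.

```python
from enum import Enum


class HostState(Enum):
    UNDERLOADED = "underloaded"
    NORMAL = "normal"
    OVERLOADED = "overloaded"


def classify_host(utilization: float, lower: float = 0.2, upper: float = 0.8) -> HostState:
    """Classify a host by its resource utilization, given as a fraction in [0, 1].

    The default thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if utilization < lower:
        # Candidate for migrating its VMs away and switching the host off.
        return HostState.UNDERLOADED
    if utilization > upper:
        # Candidate for migrating some VMs to another host.
        return HostState.OVERLOADED
    return HostState.NORMAL
```

A scheduler could call such a function on each monitoring tick and trigger migration only for hosts that fall outside the normal category.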

Workflow Model.
A workflow (W) is described as a set of interdependent computational tasks [20]. Many scientific workflows exist in the literature [21], such as LIGO, Montage, Cybershake, Epigenomics, and Pan-STARRS. In this paper, the Pan-STARRS scientific workflow is considered for task execution. The Pan-STARRS project continuously monitors the entire sky to detect moving or variable objects, using the PS1 telescope. Johns Hopkins University and Microsoft manage the generated astronomy data using two types of workflows: the PSLoad workflow and the PSMerge workflow. The PSLoad workflow collects the data from the telescope and stores it in the database. The PSMerge workflow updates the database. The PSLoad and PSMerge workflows are described in Figures 1 and 2, respectively. Table 1 describes the detailed characteristics of the workflows.

Task Model.
A workflow task is an activity that is carried out as part of the workflow description [20]. Each workflow task specifies the resources it needs to complete its execution, together with a set of constraints: task length/size in million instructions (MI), number of processing elements, deadline in seconds, data transfer file size in MB, list of child tasks, and list of parent tasks.

Wireless Communications and Mobile Computing
(3) Start time: the start time $st_{ijk}$ of a task $t_i$ on $vm_{jk}$ at $host_k$ depends on how the task is placed:
If $t_i$ is an entry task, $st_{ijk}$ is calculated from $rt_{jk}$, the ready time of $vm_{jk}$ at $host_k$.
If $t_i$ is not an entry task and the same VM executes both the child task and its parent task, the start time of the child task is calculated from $ft_{pjk}$, the finish time of the parent task on $vm_{jk}$ at $host_k$.
If $t_i$ is not an entry task and is allocated to a VM other than the one on which its parent was executed, a separate expression is used for the start time.
If $t_i$ is not an entry task and a new VM ($vm_{jk}$) is deployed for its execution, $st_{ijk}$ also takes into account $ct_{jk}$, the creation time of $vm_{jk}$ at $host_k$.
If a VM $vm_{xk}$ is migrated to a new host and a new VM $vm_{jk}$ is placed on the host, the start time of the task also takes into account $mt(vm_{xk})$, the migration time of $vm_{xk}$.
If a new host is activated, the start time of the task also takes into account $st(host_k)$, the start time of $host_k$.
(4) Finish time: the finish time $ft_{ijk}$ of task $t_i$ on $vm_{jk}$ at $host_k$.
(5) Makespan time: the workflow makespan ($w_{makespan}$) is the total time taken to complete the execution of the workflow; its calculation uses $subTime_{workflow}$, the submission time of the workflow.
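The timing quantities above can be illustrated with a short sketch. The paper's own equations are not reproduced here, so the formulas below are stated assumptions based on common practice in simulators such as CloudSim: execution time is task length divided by VM speed, finish time is start time plus execution time, and makespan is the latest finish time minus the workflow submission time. The `Task` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Hypothetical task record holding the attributes listed in the task model."""
    name: str
    length_mi: float                 # task length in million instructions (MI)
    deadline_s: float                # deadline in seconds
    processing_elements: int = 1
    transfer_size_mb: float = 0.0    # data transfer file size in MB
    parents: List[str] = field(default_factory=list)
    children: List[str] = field(default_factory=list)


def execution_time(task: Task, vm_mips: float) -> float:
    """Assumed: execution time in seconds = length (MI) / VM speed (MIPS)."""
    if vm_mips <= 0:
        raise ValueError("vm_mips must be positive")
    return task.length_mi / vm_mips


def finish_time(start_time: float, task: Task, vm_mips: float) -> float:
    """Assumed: finish time = start time + execution time."""
    return start_time + execution_time(task, vm_mips)


def makespan(finish_times: List[float], submission_time: float) -> float:
    """Assumed: makespan = latest task finish time - workflow submission time."""
    return max(finish_times) - submission_time
```

For instance, under these assumptions a 2000 MI task started at 1 s on a 1000 MIPS VM would finish at 3 s.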

Energy Model.
CPUs, network interfaces, memory, and storage devices are the most energy-intensive components of host servers. The CPU consumes approximately 37% to 43% of total server energy [22,23], and network devices consume approximately 33% of total data center energy [24]. In the proposed work, the energy consumption of the CPU [6] and data transfer between VMs [24] are taken into account, and the total energy consumption is calculated in five different scenarios. These scenarios are defined as follows.
Scenario 1. This scenario covers the energy consumed while task $t_i$ executes on $vm_{jk}$ at $host_k$, denoted $ec_{ijk}$. It is calculated from $ecr_{jk}$, the energy consumption rate of $vm_{jk}$ on $host_k$. The energy consumed to execute the whole workflow, $ec_{exec}$, combines the values $ec_{ijk}$ using $x_{ijk}$, which symbolizes the mapping of task $t_i$ on $vm_j$ at $host_k$: $x_{ijk}$ is 1 if task $t_i$ is scheduled on $vm_j$ at $host_k$ for execution; otherwise, $x_{ijk}$ is 0.
Scenario 2. This scenario applies when a server/host is active but no VM is running on it. Energy consumption is reduced by switching the host to a low voltage and frequency for some time (up to a threshold duration). The energy consumption of the idle hosts, $ec_{allIdle}$, is calculated from $ecr'_k$ and $it_k$, the energy consumption rate of $host_k$ in idle mode and the idle time of $host_k$, respectively.
Scenario 3. This scenario applies when a server/host is partially idle, for example, when an idle VM is deployed on the host. The VM is left idle up to the threshold period. The energy consumption of the partially idle hosts, $ec_{partIdle}$, is calculated from $t^{partIdle}_{j}$, $dt_{jk}$, and $ct_{jk}$, which are the idle time of $vm_j$ at $host_k$, the time at which $vm_j$ is undeployed from $host_k$, and the time at which $vm_j$ is deployed at $host_k$, respectively.
Scenario 4. This scenario covers the energy of the unused resources of the servers/hosts. Energy consumption is minimized by applying core-level DVFS. According to [25], energy usage falls by about 50% when the voltage is reduced to 70% of its peak value. Frequency scaling itself takes very little time, on the order of nanoseconds [6], so the scaling time is neglected in the calculations. The energy consumption of the unused resources of the hosts, $ecur$, is calculated over periods $s$, where $s$ is a period during which the number of VMs on a host differs from the previous period.
The total computational energy consumption ($ec_{computational}$) is the sum of the above four scenarios, as shown in Equation (16):
$$ec_{computational} = ec_{exec} + ec_{allIdle} + ec_{partIdle} + ecur. \quad (16)$$
Scenario 5. This scenario covers the energy consumed during data transfer from one VM to another when parent and child tasks are not executed on the same VM. The energy consumption to transfer data ($ec_{transfer}$) is calculated from $ecrBW_{xj}$ and $tt_{cp}$, the energy consumption rate of the network bandwidth and the transfer time of data from one VM to another, respectively.
The total energy consumption of a data center during workflow execution is calculated by using Equations (16) and (17) as
$$ec_{total} = ec_{computational} + ec_{transfer}. \quad (18)$$
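The way the five scenarios combine can be sketched as follows. Only Equations (16) and (18) are taken directly from the text. The per-scenario formulas used here (energy as a consumption rate multiplied by a duration, summed over tasks, hosts, or transfers) are assumptions inferred from the symbol definitions, and the partially idle and unused-resource terms are passed in as precomputed values because their exact expressions are not given. All function names are hypothetical.

```python
from typing import Iterable, Tuple

RateAndTime = Tuple[float, float]  # (energy consumption rate, duration in seconds)


def execution_energy(scheduled: Iterable[RateAndTime]) -> float:
    """Assumed form of ec_exec: sum of ecr_jk * execution time over tasks with x_ijk = 1."""
    return sum(rate * duration for rate, duration in scheduled)


def idle_host_energy(idle_hosts: Iterable[RateAndTime]) -> float:
    """Assumed form of ec_allIdle: sum of ecr'_k * it_k over idle hosts."""
    return sum(rate * duration for rate, duration in idle_hosts)


def transfer_energy(transfers: Iterable[RateAndTime]) -> float:
    """Assumed form of ec_transfer: sum of ecrBW_xj * tt_cp over inter-VM transfers."""
    return sum(rate * duration for rate, duration in transfers)


def computational_energy(ec_exec: float, ec_all_idle: float,
                         ec_part_idle: float, ecur: float) -> float:
    """Equation (16): sum of the four computational scenarios."""
    return ec_exec + ec_all_idle + ec_part_idle + ecur


def total_energy(ec_computational: float, ec_transfer: float) -> float:
    """Equation (18): computational energy plus data transfer energy."""
    return ec_computational + ec_transfer
```

Keeping each scenario in its own function makes it easy to see which component dominates the total for a given schedule.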

Energy Efficient VM Placement (EVMP) Algorithm
This section describes the proposed algorithm which is used to execute the workflow in an energy-efficient manner and within the deadline constraint as shown in Figure 3. To execute the workflow, there is a need to follow some set of rules, and these rules are presented in the form of the algorithm.
The following steps are used during workflow scheduling:
(Step 1) On the arrival of a new workflow, it is analyzed to obtain the type of the workflow, the number of tasks, and the dependencies between them. The tasks are then stored in the task pool queue ($task_{Pool}$), and the parent tasks of each task are checked. If the task is an entry task, a new host is activated, a new VM is created on it based on the task requirement, and the task is allocated to that VM for execution. After that, the start time, execution time, and finish time of the task and the ready time of the VM are updated.
(Step 2) When any task executes successfully, its child tasks are checked. If any child task is ready for execution, it is transferred from $task_{Pool}$ to the ready queue ($ready_{task}$).
(Step 3) When any task is in $ready_{task}$, the relationship of that task with its parent tasks is checked. If the task can be executed on the same VM on which its parent task(s) were executed without violating its deadline, the task is allocated to that VM.
(Step 4) If Step 3 is not possible, the already deployed VMs are sorted by their energy consumption rate. If any VM fulfills the task requirement without violating the deadline, the task is allocated to that VM.
(Step 5) If Step 4 is not possible, a new VM is created based on the task requirement, and the task is allocated to the newly created VM. There are three cases for deploying the new VM on a host. In the first case, the new VM is deployed on an already active host. If this is not possible, the scheduler tries to migrate a VM from one host to another and deploy the new VM on the freed host. If this is also not possible, a new host is activated and the VM is deployed on the newly activated host.
(Step 6) The system status, such as energy consumption, makespan, and resource utilization, is updated.
These scheduling steps are used to execute the workflow and are described in Algorithm 1.
Algorithm 1 is used to get the tasks that are ready for execution. Initially, all the tasks are stored in the task pool queue ($task_{Pool}$), and the ready task queue ($ready_{task}$) is set to null (see lines 1 and 2). If all the immediate parents of a task have finished their execution, or the task is an entry task, then that task is ready for execution. It is stored in $ready_{task}$ and removed from $task_{Pool}$ (see lines 4-7). If there is any task in $ready_{task}$, the EVMP algorithm is used to schedule the tasks for execution (see lines 8-10). This algorithm is called automatically on the arrival of a new workflow or on the completion of any task within the workflow.
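The ready-task selection described for Algorithm 1 can be sketched as below. This is not the paper's listing: the function name `collect_ready_tasks` and the use of a dictionary of parent sets are assumptions chosen for the example.

```python
from typing import Dict, List, Set


def collect_ready_tasks(task_pool: List[str],
                        parents: Dict[str, Set[str]],
                        finished: Set[str]) -> List[str]:
    """Move every task whose immediate parents have all finished to the ready queue.

    task_pool: ids of tasks still waiting (modified in place)
    parents:   task id -> set of immediate parent ids (empty set for an entry task)
    finished:  ids of tasks that have completed execution
    """
    # A task is ready when its parent set is a subset of the finished set;
    # entry tasks have no parents, so they are ready immediately.
    ready = [t for t in task_pool if parents.get(t, set()) <= finished]
    for t in ready:
        task_pool.remove(t)
    return ready
```

Calling the function again after each task completion reproduces the behavior in which the algorithm is triggered on workflow arrival and on every task completion.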
Algorithm 2 is used to schedule the tasks. Initially, the variables findVM and findFlag are set to null and false, respectively (see lines 2 and 3). If the task is an entry task, the VM type that can fulfill the task requirement is selected. After that, a new host is started and added to the active list $H_a$. The VM is deployed on the new host, the task is scheduled on the newly deployed VM, and the ready time of the VM is updated (see lines 4-12). If the task is not an entry task, the algorithm first tries to execute the task on the same VM on which its parent was executed. If this is possible, the task is scheduled on the parent VM, and the ready time and transfer time are updated (see lines 13-23). If this step is not possible, the alreadyDeployedVM() function is called (see lines 24-26).
Algorithm 3 reuses the deployed VMs for workflow execution to save VM creation time as well as energy consumption. In this function, the deployed VMs are first sorted according to their energy consumption rate (see line 2). If any VM can execute the task without violating the deadline, the task is scheduled on that VM, and system parameters such as the ready time and transfer time are updated (see lines 3-10). If this step is not possible, the scaleUp() function is called (see lines 11-13). The scale-down function is adopted from [18] to shut down VMs and hosts to save energy.
Algorithm 4 is used to add new resources for workflow execution. When the already deployed VMs are unable to complete the workflow tasks, the scheduler calls this algorithm to deploy a new VM. This function is implemented from [26] with some variations. In this algorithm, a VM type that can fulfill the task requirement is selected first (see line 1). The new VM may be placed on an already active host without migration, depending on the host resources (see lines 5-8). If this is not possible, the new VM may be deployed on an already active host with live VM migration (see lines 9-19). If migration is not possible, a new host is activated and the new VM is deployed on it (see lines 20-24). The task is allocated to the new VM and removed from $ready_{task}$ (see line 25). The VM ready time is updated and, if the task has a parent, so is the data transfer time from the parent task to the child task (see line 25).
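The order in which Algorithms 2 to 4 look for a VM can be sketched as a simple fallback chain. This is a simplified illustration under stated assumptions, not the paper's pseudocode: the `VM` class, `meets_deadline`, and `select_vm` are hypothetical names, execution time is assumed to be task length divided by VM speed, and data transfer time, VM creation time, and migration are ignored. A return value of `None` stands for the case where the scale-up step would have to deploy a new VM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VM:
    """Hypothetical VM record used only for this sketch."""
    name: str
    mips: float          # processing speed
    ready_time: float    # time at which the VM becomes free
    energy_rate: float   # energy consumption rate of the VM


def meets_deadline(vm: VM, length_mi: float, deadline: float, now: float = 0.0) -> bool:
    """Assumed check: the task starts when the VM is free and must finish by the deadline."""
    start = max(vm.ready_time, now)
    return start + length_mi / vm.mips <= deadline


def select_vm(length_mi: float, deadline: float,
              parent_vm: Optional[VM], deployed_vms: List[VM]) -> Optional[VM]:
    """Return the VM to use, or None if a new VM must be deployed (scale-up)."""
    # First choice: the VM that executed the parent task (avoids data transfer).
    if parent_vm is not None and meets_deadline(parent_vm, length_mi, deadline):
        return parent_vm
    # Second choice: already deployed VMs, lowest energy consumption rate first.
    for vm in sorted(deployed_vms, key=lambda v: v.energy_rate):
        if meets_deadline(vm, length_mi, deadline):
            return vm
    # No deployed VM can meet the deadline: the caller has to scale up.
    return None
```

Trying the parent VM before the energy-sorted list reflects the preference for avoiding inter-VM data transfer, which the energy model charges separately.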

Experimentation and Performance Metrics
In this section, the workflow model, simulation parameters, and performance metrics used in the proposed model are presented.

Considered Workflow Model.
A workflow (W) is defined as a set of interdependent computational tasks [20]. Many scientific workflows exist in the literature [21], such as LIGO, Montage, Cybershake, Epigenomics, and Pan-STARRS.
In this paper, the Pan-STARRS scientific workflow is considered for task execution. The Pan-STARRS project continuously monitors the entire sky to detect moving or variable objects, using the PS1 telescope. Johns Hopkins University and Microsoft manage the generated astronomy data using two types of workflows: the PSLoad workflow and the PSMerge workflow. The PSLoad workflow collects the data from the telescope and stores it in the database. The PSMerge workflow updates the database. The PSLoad and PSMerge workflows are described in Figures 1 and 2, respectively. Table 1 describes the detailed characteristics of the workflows.

Simulation Parameters.
The CloudSim framework [27] is used to simulate the cloud environment and to evaluate the proposed scheduling model. The detailed simulation parameters are as follows:
(i) Two types of hosts are deployed: HP ProLiant ML110 G4 and HP ProLiant ML110 G5 [28]
(ii) The energy consumption rates of these two types of hosts are 117 Watts per second (Ws$^{-1}$) and 135 Ws$^{-1}$ [28]
(iii) The energy consumption rate to transfer 1 GB of data is 2.3 W [29]
(iv) Four types of VM [19]
(vi) Between VMs, the average bandwidth is set to 20 MBPS, which is the approximate bandwidth offered by Amazon Web Services [31]
(vii) The Pan-STARRS real-world scientific workflow is considered. Each scientific workflow is divided into three groups based on the number of tasks, as defined in Table 1 [21]
[Figure 3: flowchart of the EVMP workflow scheduling steps]
4.3. Performance Metrics
4.3.1. Average Resource Utilization (ARU). ARU is defined as the ratio of the computing resources assigned to accomplish the scientific workflow tasks to the total computing resources available on the server. Its calculation uses $at_k$, the active time of host $h_k$.

Total Energy Consumption (TEC).
It defines the total energy which is consumed by the servers to execute a scientific workflow. TEC is computed using Equation (19).

Makespan or Total Execution Time.
Makespan is the time taken to execute the scientific workflow from the start task to the end task. It is computed using Equation (11).

Results and Discussion
The proposed EVMP algorithm is compared with the existing PESVMC algorithm [32] to establish its improved performance. In the PESVMC algorithm, workflow tasks are allocated to the VM that consumes the least energy. The deadline of a task is not considered when it is assigned to a VM. Tasks are selected according to their parent-child relationship, but this relationship is not considered during VM allocation. As a result, both the execution time and the data transfer time increase, which in turn affects both makespan and energy consumption. The performance of the EVMP algorithm is evaluated based on the ARU, total energy consumption, and workflow makespan.
5.1. Performance Impact on Resource Utilization. The ARU of EVMP and PESVMC is observed for the PSLoad and PSMerge scientific workflows with varying numbers of workflow tasks. The experimental result in terms of average resource utilization is shown in Figure 4. The result shows that EVMP performs better than PESVMC in terms of resource utilization. EVMP performs better because of its dynamic nature. In the proposed algorithm, new VMs are created only when the currently deployed VMs are not sufficient to complete the tasks within the deadline, so resources are properly utilized. The VM migration policy is also used to consolidate resources, which further increases resource utilization. On average, resource utilization increases by 8.6% in comparison to the existing algorithm.

Performance Impact on Total Energy Consumption.
The total energy consumption of EVMP and PESVMC is observed for the PSLoad and PSMerge scientific workflows with varying numbers of workflow tasks. The experimental result in terms of total energy consumption (measured in kilowatts (kW)) is shown in Figure 5. In the existing algorithm, all the resources are active, which consumes energy without doing any useful work. The EVMP algorithm, in contrast, deploys resources according to the needs of the workflow tasks, which reduces the energy consumption. During the scheduling of workflow tasks, the existing algorithm does not consider the parent-child relationship, which leads to high data transfer energy consumption. The proposed algorithm considers the parent-child relationship when scheduling tasks on VMs, which helps to reduce the data transfer energy consumption and the workflow makespan. On average, the EVMP algorithm reduces energy consumption by 42.3% in comparison to the PESVMC algorithm.

The experimental result in terms of makespan (measured in seconds (s)) is shown in Figure 6. The result shows that EVMP performs better in terms of makespan. On average, makespan is reduced by 98% in comparison to the PESVMC algorithm. This is because the PESVMC algorithm considers only limited resources, which reduces the parallel execution of tasks. In the existing algorithm, the parent-child relationship is also not considered when scheduling tasks on the VMs, which further affects the makespan of the workflow. Hence, the makespan of PESVMC increases significantly for large workflows.

Conclusion
This paper presents an energy-aware VM placement model for dependent scientific workflows in the cloud, which achieves the scheduling objectives and energy efficiency and improves system performance for real-world scientific workflows. The proposed EVMP algorithm reduces energy consumption by applying DVFS (a hardware technique) to VMs, hosts, and computing resources that are idle, and software techniques to VMs and hosts that remain idle beyond the preestablished threshold time. The data transfer energy consumption is minimized by scheduling tasks on or around the parent VM (where the parent task is executed), which also reduces the execution delay by decreasing the transfer time and VM creation time. The EVMP algorithm is implemented on the CloudSim framework, and the Pan-STARRS real-world scientific workflows are used to evaluate its performance. In comparison to the PESVMC algorithm, the EVMP algorithm increases resource utilization by 8.6%, decreases energy consumption by 42.3%, and reduces makespan by 98%. In the future, the proposed EVMP algorithm will also be implemented on a public cloud platform, along with the evaluation of additional performance metrics for security and fault tolerance.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.