Scheduling Parallel Jobs Using Migration and Consolidation in the Cloud

An increasing number of high performance computing parallel applications leverage the power of the cloud for parallel processing. How to schedule these parallel applications to improve the quality of service is the key to hosting them successfully in the cloud. The large scale of the cloud makes parallel job scheduling more complicated, as even a simple parallel job scheduling problem is NP-complete. In this paper, we propose a parallel job scheduling algorithm named MEASY. MEASY adopts migration and consolidation to enhance the most popular EASY scheduling algorithm. Our extensive experiments on well-known workloads show that our algorithm substantially improves the quality of service. For two common parallel job scheduling objectives, our algorithm improves the average response time by up to 41.1% (23.1% on average) and the average slowdown by up to 82.9% (69.3% on average). Our algorithm is robust in that it tolerates inaccurate CPU usage estimation and high migration cost. Our approach involves only trivial modification of EASY and requires no additional technique; it is practical and effective in the cloud environment.


Introduction
Cloud computing provides an easy-to-use and cost-effective solution for running high performance computing (HPC) parallel applications nowadays. There are efforts [1-4] to run HPC parallel applications on commercial cloud computing platforms. Some public IaaS providers such as Amazon EC2 have launched HPC cluster instances [5] to cater for HPC parallel applications. Low utilization is a major issue in datacenters, mainly because datacenter operators provision computing resources based on the peak load [6]; thus, a utilization lower than 50% is very common in many datacenters [7]. For a datacenter that [...] (3) the estimated CPU usage of the process(es) of the job, which can be obtained from historical data or test runs [26].
Our evaluation results show that our algorithm significantly outperforms the commonly used EASY algorithm on well-known parallel workloads.
The remainder of this paper is organized as follows: Section 2 proposes the parallel workload consolidation method and our migration and consolidation based algorithm. Section 3 describes the evaluation of our algorithm, and Section 4 concludes the paper.

Scheduling Algorithms
In this section, we describe our migration and consolidation based parallel job scheduling algorithms. We first introduce the parallel workload consolidation method, then a basic algorithm, and finally two refined algorithms that try to improve the basic one.

Parallel Workload Consolidation Method (See [27])
The hardware virtualization technology used in cloud computing gives an easy-to-use way to consolidate parallel workload in a datacenter. In order to improve the CPU utilization of a physical processor, we first partition each processor into two tiers of virtual machines (VMs) by pinning two virtual CPUs (VCPUs) on the processor and then allocating these two VCPUs to two VMs. However, the execution time of jobs running in the VMs is stretched if there is no CPU priority control on the VMs. Thus, we secondly assign the VMs on one tier the highest CPU priority and the VMs on the other tier the lowest CPU priority. We call the tier with high CPU priority the foreground (fg) tier and the one with low CPU priority the background (bg) tier. In this setting, a background VM only uses CPU cycles when its corresponding foreground VM is idle. Under this prioritized two-tier architecture, experiments in our small cluster lead to the following conclusions.
(1) The average performance loss of jobs running in the foreground tier is between 0.5% and 4% compared to those running on the processors exclusively (one-tier VM); we simply model the loss as a uniform distribution.
(2) Jobs running in the background tier can make good use of the idle CPU cycles left by the foreground tier. For a single-process background job, the utilization of the idle CPU cycles is between 80% and 92%, which roughly obeys a uniform distribution; for a multiprocess background job, the value is between 19.8% and 76.6%, which is likely to obey a normal distribution with μ = 0.428 and σ = 0.144.
(3) When a foreground VM runs a job with a CPU utilization higher than 96%, collocating a VM in the background tier does not benefit the job running in it, because the background VM has little chance to run and context switch overhead is incurred.
Based on the two-tier architecture, we discuss our scheduling algorithms in the following sections.

Algorithm 1: FCFS with KEASY backfilling and consolidation.
Input: J_fg: jobs running in fg VMs; J_bg: jobs running in bg VMs; J_q: jobs waiting in the queue.
begin
    /* Step 1: Schedule runnable jobs according to FCFS */
    sort J_q ∪ J_bg according to job arrival time;
    for each job j in J_q ∪ J_bg do
        N_j ← process number of j;
        N_idle ← idle fg VM number;
        if N_j > N_idle then
            break;
        else if Deploy(j, 'K') then
            remove j from J_q or J_bg, insert it into J_fg;
    if J_q ∪ J_bg is empty then return;
    /* Step 2: Make reservation for the first job in J_q ∪ J_bg, then backfill */
    let T_S = 0 (shadow time), N_E = 0 (extra fg VM number);
    HdJob ← first job in J_q ∪ J_bg;
    N_R ← process number of HdJob;
    N_future ← current idle fg VM number;
    sort J_fg in ascending order of their termination time;
    for each job j in the sorted J_fg do
        N_j ← process number of j;
        N_future ← N_future + N_j;
        if N_future ≥ N_R then
            T_S ← the termination time of j;
            N_E ← N_future − N_R;
            break;
    /* Backfill runnable jobs */
    for each job j in J_q ∪ J_bg do
        N_j ← process number of j;
        N_idle ← idle fg VM number;
        if N_j > N_idle then continue;
        t_r ← the runtime of j;
        t_c ← current time;
        if t_r + t_c ≤ T_S or N_j ≤ N_E then
            if Deploy(j, 'K') then
                remove j from J_q or J_bg, insert it into J_fg;
                if t_r + t_c > T_S then N_E ← N_E − N_j;
    /* Step 3: Try to deploy jobs to the background tier */
    sort J_q in ascending order of their runtime;
    for each job j in the sorted J_q do
        N_j ← process number of j;
        N_b_idle ← idle bg VM number;
        if N_j ≤ N_b_idle then Dispatch('BG', j);

Algorithm Description
The basic algorithm, which we proposed in [27], is named FCFS with KEASY (job Kill based EASY) in this paper and is shown in Algorithm 1. In the KEASY algorithm, the shadow time is the latest time at which the reservation of foreground VMs (here and hereafter) for a job starts. The scheduling process of FCFS with KEASY (shortened to KEASY in the rest of this paper) is as follows.
Step 1. Use FCFS to schedule all possible jobs to foreground VMs.
Step 2. Use EASY to make a reservation for the current head job of the list of jobs not running in the foreground (a merged list containing both the jobs waiting in the queue and the jobs running in the background tier; see J_q ∪ J_bg in Algorithm 1), and then backfill all possible jobs to foreground VMs.
Step 3. Use SJF to deploy all possible jobs into background VMs.
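The reservation and backfilling test of Step 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the `Job` class and the function names `reserve` and `can_backfill` are hypothetical, and each process is assumed to occupy exactly one foreground VM.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int          # number of processes (one fg VM each, by assumption)
    runtime: float      # estimated execution time
    end: float = 0.0    # estimated termination time when running in fg


def reserve(head, running_fg, idle_fg):
    """Return (shadow_time, extra_vms) for the head job, as in Step 2."""
    future = idle_fg
    for job in sorted(running_fg, key=lambda j: j.end):
        future += job.procs              # VMs released when this job ends
        if future >= head.procs:
            return job.end, future - head.procs
    return 0.0, 0                        # initial values used in Algorithm 1


def can_backfill(job, now, idle_fg, shadow_time, extra_vms):
    """A job may backfill if it fits now and does not delay the head job."""
    if job.procs > idle_fg:
        return False
    return now + job.runtime <= shadow_time or job.procs <= extra_vms


# Hypothetical state: 5 fg VMs, one idle, head job needs 4.
running = [Job("A", 2, 10, end=10), Job("B", 2, 5, end=5)]
shadow, extra = reserve(Job("H", 4, 7), running, idle_fg=1)
```

With this hypothetical state, the head job can start once job A ends, leaving one extra foreground VM that a one-process job may use regardless of its runtime.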
The following four points need to be emphasized.
(1) KEASY is called to schedule jobs to run in foreground VMs when a job arrives or departs the foreground tier; SJF is called to schedule jobs to run in background VMs when a job departs the background tier.
(2) When selecting jobs to run in foreground VMs, both the jobs waiting in the queue and the jobs running in the background VMs are candidates.
(3) Only a background VM whose corresponding foreground VM's CPU utilization is less than a threshold (96%) can accommodate a process.
(4) Jobs' execution time estimates are used whenever this information is needed for scheduling decisions.
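Step 3 together with point (3) can be sketched as below. The sketch assumes a simplified state representation (a dictionary of foreground utilizations and a set of occupied background VMs); the names `eligible_bg_slots` and `sjf_to_background` are hypothetical and do not come from the paper.

```python
THRESHOLD = 0.96  # fg utilization threshold from point (3)


def eligible_bg_slots(fg_utilization, bg_busy):
    """Processors whose bg VM is free and whose fg VM is below the threshold,
    in ascending order of fg utilization."""
    free = [p for p, u in fg_utilization.items()
            if p not in bg_busy and u < THRESHOLD]
    return sorted(free, key=lambda p: fg_utilization[p])


def sjf_to_background(queue, fg_utilization, bg_busy):
    """queue: list of (name, procs, estimated_runtime) tuples.
    Returns a dict mapping each dispatched job to its processors."""
    placements = {}
    busy = set(bg_busy)
    # Shortest job first, using the runtime estimates (point 4).
    for name, procs, _ in sorted(queue, key=lambda j: j[2]):
        slots = eligible_bg_slots(fg_utilization, busy)
        if procs <= len(slots):
            chosen = slots[:procs]
            busy.update(chosen)
            placements[name] = chosen
    return placements


fg_util = {"P1": 1.0, "P2": 0.5, "P3": 0.7, "P4": 0.6}
queue = [("a", 2, 30), ("b", 1, 5), ("c", 2, 10)]
placed = sjf_to_background(queue, fg_util, set())
```

In this hypothetical run, P1 is excluded because its foreground VM is fully utilized, so only the two shortest jobs fit into the background tier.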

Examples
Figure 1(a) gives an example of the KEASY algorithm. Let there be 5 processors (P1-P5) and 12 jobs (J1-J12) at the initial time. Each job is denoted by (n, t), in which n is the processor requirement and t is the execution time. Each processor has two tiers of VMs, denoted as fg and bg in Figure 1. For convenience of illustration, we assume that the process in a single-process job incurs a CPU usage of 100% and that the processes within a multiprocess job incur a CPU usage less than the utilization threshold. At time 0, J1 is placed onto P4-5 according to FCFS; J3 and J4 are deployed onto the foreground VMs of P1-3 according to EASY backfilling; J5, J6, J7, and J8 are scheduled onto the background VMs of P2-5 by SJF. We use a simple process to collocate a background VM with a foreground VM, as shown in the Dispatch function in Algorithm 2. The process matches the background VM that is likely to incur high processor utilization to the foreground VM that is likely to incur low processor utilization.
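The collocation rule in the last sentence can be illustrated with a short Python sketch. It assumes that per-process CPU usage estimates and per-processor foreground utilizations are available as dictionaries; the function name `match_bg_to_fg` is hypothetical.

```python
def match_bg_to_fg(process_usage, fg_utilization):
    """Pair the most demanding background processes with the least
    utilized foreground VMs.

    process_usage: dict process -> estimated CPU usage (0..1)
    fg_utilization: dict processor -> fg VM utilization (0..1)
    Returns a dict process -> processor.
    """
    procs = sorted(process_usage, key=process_usage.get, reverse=True)
    hosts = sorted(fg_utilization, key=fg_utilization.get)
    return dict(zip(procs, hosts))


pairing = match_bg_to_fg(
    {"p0": 0.4, "p1": 0.9, "p2": 0.6},
    {"P1": 0.7, "P2": 0.2, "P3": 0.5, "P4": 0.9},
)
```

Here the heaviest process p1 lands behind the lightly loaded foreground VM of P2, and the most heavily loaded processor P4 is left without a background process.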
At time 5 (we assume here that J5-8 all advance 2 time units during time 0-5), J3 and J5 depart the system. J6 is backfilled onto the foreground VM of P3 by killing its run on P4 and then restarting it from the very beginning in the foreground VM of P3; J7 is backfilled onto the foreground VM of its original processor by swapping the CPU priorities of the foreground VM and background VM of P2. Then, J9 and J10 are arranged onto the background VMs of P5 and P4, respectively. Note that although J8 still stays in the background VM of P3, it hardly advances because there is no idle CPU cycle left by J6.
At time 8, J6 and J7 depart the system. Although J8 cannot be backfilled as its original execution time is 4, it still finishes at time 10 by running in the background VM (which is effectively the foreground VM because its foreground VM is idle). J1 finishes along with J8 at time 10 as well. At time 11 (we assume here that J9 advances 2 time units during time 5-11), J4 finishes. J9 is restarted from the very beginning on the foreground VM of P1 because it is now the head job.
If J10 finishes earlier than time 16, the reservation for J10 should be deleted and a reservation for J11 should be made. In this situation, J11 can run on the foreground VMs of P1-5 immediately at time 16. If J10 cannot finish before going to the foreground tier, the reservation for J11 is made at time 16 and, in this case, J12 leapfrogs J11.
In the example described above, the idle CPU cycles unused by the multiprocess jobs give KEASY the opportunity to improve the scheduling performance. We further observe that even when all processes of all jobs consume the whole computing capacity (100% CPU usage) of the processors, KEASY may still produce better performance than EASY. Figures 2 and 3 show two scenarios in which KEASY outperforms EASY if jobs' runtimes are accurate. Figure 4 shows a case in which KEASY outperforms EASY due to the overestimation of jobs' runtime. From the three examples, we can see that KEASY is capable of dispatching a job to run in background VMs even when it is not qualified for backfilling according to EASY. There is a chance that the corresponding foreground VMs of these background VMs are idle during the job's lifetime, which leads to performance improvement.
Initial queue in the figure: J1(1, 10), J2(3, 5), J3(5, 5), J4(4, 25), J5(5, 5), J6(1, 20).

Two Improved Algorithms
The basic algorithm described above kills jobs during scheduling, which wastes some computing resources. In this section, we try to remove job kill and present two refined algorithms.

REASY-Using Reservation instead of Job Kill
The REASY algorithm is shown in Algorithm 3. In REASY, job kill is not allowed in the scheduling; once a job is deployed onto the background VMs of a set of processors, its run is pinned onto this set of processors. Only when all the foreground VMs of this set of processors are available can this job run in the foreground tier. Thus, REASY differs from KEASY in making the reservation for the head job and in invoking the Deploy function. For reservation making, if a reservation is being made for a job running in the background tier, the shadow time is the last termination time of the jobs running in its foreground VMs; the extra foreground VMs are the ones that are now idle and have no process of the job running in their background VMs. Figure 5 illustrates an instance of this situation: when making the reservation for J3, the shadow time is 10 as J1 finishes at time 10, and the extra foreground VM is the foreground VM on P1. For invoking the Deploy function, REASY passes "R" rather than "K" to the Deploy function (Algorithm 2) so that it only moves a job now running in the background tier to the foreground tier when all the foreground VMs of its host processors are idle.

Algorithm 2: Deploy(j, BackfillType), the job deploy function.
    [...] FG then
        p ← sorted idle processors in ascending order of their utilization;
        for each process j_i in the sorted list do
            place j_i to p_i and run j_i in the foreground VM;
    else
        p ← sorted idle processors in ascending order of their utilization;
        for each process j_i do
            place j_i to p_i and run j_i in the background VM;
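The REASY reservation rule for a head job that already runs in the background tier can be sketched as follows. The processor layout used in the example is a hypothetical one chosen to be consistent with the description of Figure 5 (it is not taken from the figure itself), and the function name `reasy_reservation` is an assumption.

```python
def reasy_reservation(host_processors, fg_end_time, idle_fg):
    """Reservation for a head job pinned to the bg VMs of host_processors.

    fg_end_time: dict processor -> termination time of the job in its fg VM
                 (only processors with a busy fg VM appear here)
    idle_fg: set of processors whose fg VM is idle now
    Returns (shadow_time, extra_fg_vms).
    """
    # Shadow time: last termination among the fg jobs above the head job.
    shadow = max(
        (fg_end_time[p] for p in host_processors if p in fg_end_time),
        default=0,
    )
    # Extra fg VMs: idle now and not hosting any process of the head job.
    extra = sorted(p for p in idle_fg if p not in host_processors)
    return shadow, extra


# Hypothetical layout: J3 runs in the bg VMs of P2-P5, J1 (ends at 10)
# occupies the fg VMs of P2-P3, J2 (ends at 5) those of P4-P5, P1 is idle.
shadow, extra = reasy_reservation(
    {"P2", "P3", "P4", "P5"},
    {"P2": 10, "P3": 10, "P4": 5, "P5": 5},
    {"P1"},
)
```

Under this assumed layout the sketch yields a shadow time of 10 and the foreground VM of P1 as the only extra VM, which agrees with the values stated for Figure 5.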
An example of REASY is given in Figure 1(b). In the example, we assume that J6 advances 3 time units during time 0-8 and J9 advances 5 time units during time 5-16. One can see that, at time 5, J6 still stays in its original background VM rather than restarting on the foreground VM of P1. At time 11, J9 also still runs on the background VM of P5, but a reservation for J9 is made because it is now the head job.
Input: J_fg: jobs running in fg VMs; J_bg: jobs running in bg VMs; J_q: jobs waiting in the queue.
begin
    /* Make reservation for the first job in J_q ∪ J_bg, then backfill */
    let T_S = 0 (shadow time), N_E = 0 (extra fg VM number);
    HdJob ← first job in J_q ∪ J_bg;
    N_R ← process number of HdJob;
    N_future ← current idle fg VM number;
    if HdJob ∈ J_q then
        sort J_fg in ascending order of their termination time;
        for each job j in the sorted J_fg do
            [...]

MEASY-Using Migration instead of Job Kill
The only difference between KEASY and MEASY lies in the Deploy function (Algorithm 2). MEASY does not kill a selected job but migrates it instead.
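The practical effect of this difference can be illustrated with a small sketch that compares the work left after a job is moved to the foreground tier. It assumes that a killed job restarts from zero (as in the KEASY example) and that a migrated job keeps its progress and pays the migration cost as an added delay; how the cost is actually charged is not specified here, and the mode label "M" and the function name `remaining_time` are hypothetical.

```python
def remaining_time(runtime, progress, mode, migration_cost=20.0):
    """Time still needed after moving a background job to the foreground.

    runtime: total execution time of the job
    progress: amount of work already done in the background tier
    mode: "K" (kill and restart) or "M" (migrate; label assumed here)
    migration_cost: delay charged per migration (assumed additive)
    """
    if mode == "K":
        return runtime                      # all background progress is lost
    if mode == "M":
        return (runtime - progress) + migration_cost
    raise ValueError(f"unknown mode: {mode!r}")


killed = remaining_time(100.0, 40.0, "K")
migrated = remaining_time(100.0, 40.0, "M")
```

Under these assumptions, migration pays off whenever the progress made in the background tier exceeds the migration cost.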

Evaluation (See [27])
The evaluation is performed by trace-driven simulation. During the simulation, once a job arrives, the simulator is informed of the processor need, the execution time estimate, and the execution time (but the execution time is not an input of the scheduling algorithms). Upon job departure, the simulator is also notified. The number of processors in the simulated system is 320; a default migration cost of 20 seconds is configured in MEASY.

Workloads
We use the following commonly used workload models in our simulation. As the generated workload does not contain CPU usage information for each process, we assign the CPU usage to a process according to the following rules.
(1) If a job has only one process, the process is assigned a CPU usage of 100%.
(2) If a job has more than one process, the average CPU usage of each process is a random number between 40% and 100%.
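The two rules can be written as a short Python function. The text does not state which distribution the random number follows; the sketch below assumes a uniform distribution, and the function name `assign_cpu_usage` is hypothetical.

```python
import random


def assign_cpu_usage(n_processes, rng):
    """Average CPU usage (0..1) assigned to each process of a job."""
    if n_processes == 1:
        return 1.0                     # rule (1): single-process job
    return rng.uniform(0.40, 1.00)     # rule (2): uniform draw is an assumption


rng = random.Random(42)                # seeded for repeatable experiments
single = assign_cpu_usage(1, rng)
multi = assign_cpu_usage(16, rng)
```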
All the models mentioned in this section are available in [32].

The Progress of a Process in the Two-Tier Architecture
Running a foreground VM and a background VM simultaneously on a physical processor incurs overhead due to context switching. We model the progress of a process running in the foreground VM in a time slice in terms of T and loss, in which T is the length of a time slice, denoting the progress of a process running on a dedicated processor in a time slice, and loss is the performance degradation of jobs running in the foreground tier. According to our experimental results described in Section 2.1, loss is randomly generated from a uniform distribution between 0.5% and 4%. In a time slice, the progress of a background process is calculated from eff, CPU_req, and CPU_idle. Here eff is a variable between 0 and 1 that represents how much time in a time slice effectively contributes to the progress of the process. A background process is frequently preempted, and eff characterizes the associated overhead. According to Section 2.1, for a single-process job, eff is a random number between 0.80 and 0.92, which is also modeled by a uniform distribution; for a multiprocess job, eff is randomly generated from a normal distribution with μ = 0.428 and σ = 0.144. CPU_req is the CPU utilization of the background process on a dedicated processor. CPU_idle is the portion of unused CPU cycles in the processor, that is, the portion that is not utilized by the foreground VM. Furthermore, the progress of a job depends on the progress of the slowest process in the job.
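One way to turn these definitions into a simulator step is sketched below. The exact progress formulas are not reproduced in this text, so the functional forms here are assumptions: the foreground progress is taken as T reduced by loss, and the background progress as T scaled by eff and by the share of the requested CPU that the idle cycles can cover. Clamping eff to [0, 1] for the normal draw is also an assumption, and all function names are hypothetical.

```python
import random


def sample_loss(rng):
    """Foreground performance loss, uniform between 0.5% and 4%."""
    return rng.uniform(0.005, 0.04)


def sample_eff(rng, single_process):
    """Effective share of idle cycles used by a background process."""
    if single_process:
        return rng.uniform(0.80, 0.92)
    # Normal draw clamped to [0, 1]; the clamping is an assumption.
    return min(1.0, max(0.0, rng.gauss(0.428, 0.144)))


def fg_progress(T, loss):
    """Assumed form: dedicated-processor progress reduced by loss."""
    return T * (1.0 - loss)


def bg_progress(T, eff, cpu_req, cpu_idle):
    """Assumed form: share of the requested CPU covered by idle cycles."""
    return T * eff * min(cpu_req, cpu_idle) / cpu_req


def job_progress(process_progress):
    """A job advances only as fast as its slowest process."""
    return min(process_progress)
```

For example, under these assumed forms a background process that needs 80% of a CPU but finds only 40% idle, with eff = 0.5, advances a quarter of a time slice.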

Performance Metrics
The performance metrics used are the average response time and the average bounded slowdown. The response time of a job is defined as $rt = t_f - t_a$, and the bounded slowdown of a job is defined as $b_{sld} = (t_f - t_a) / \max(\Gamma, t_e)$, where $t_f$ is the finish time of the job, $t_a$ is the arrival time of the job, and $t_e$ is the execution time of the job. Compared with slowdown, the bounded slowdown metric is less affected by very short jobs as it contains a minimal execution time element $\Gamma$ [14, 33]. According to [14], we set $\Gamma$ to 10 seconds. The batch means method [34] is used in our simulation analysis.
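The two metrics translate directly into code. The sketch below follows the definitions as given above; the function names are hypothetical.

```python
GAMMA = 10.0  # minimal execution time element, in seconds


def response_time(t_f, t_a):
    """Response time: finish time minus arrival time."""
    return t_f - t_a


def bounded_slowdown(t_f, t_a, t_e, gamma=GAMMA):
    """Response time divided by the execution time, bounded below by gamma."""
    return (t_f - t_a) / max(gamma, t_e)


# A 2-second job that arrives at t = 100 and finishes at t = 130.
rt = response_time(130.0, 100.0)
bsld = bounded_slowdown(130.0, 100.0, 2.0)
```

For the 2-second job in this example, the plain slowdown would be 15, whereas the bounded slowdown is 3 because the denominator is raised to Γ.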
The system load of the simulated system is modified by multiplying the interarrival times by a factor. For instance, if the model produces a default load of 0.45, a higher load of 0.9 can be created by multiplying all interarrival times by a factor of 0.5 (= 0.45/0.9) [14].
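The load adjustment amounts to a single scaling step, sketched here in Python. The function name `scale_interarrivals` is hypothetical.

```python
def scale_interarrivals(interarrivals, current_load, target_load):
    """Scale interarrival times so the offered load moves to target_load.

    Returns the scaling factor and the scaled interarrival times.
    """
    factor = current_load / target_load
    return factor, [t * factor for t in interarrivals]


# Raising the load from 0.45 to 0.9 halves every interarrival time.
factor, scaled = scale_interarrivals([10.0, 20.0, 30.0], 0.45, 0.9)
```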

Results
Figures 6 and 7 show the scheduling results of our algorithms for FWload and JWload; the average number of migrations is shown in Figure 8. We have the following observations from the figures.
(1) REASY produces the worst performance among our algorithms. For JWload, it is even worse than EASY. This is because JWload contains about 40.8% single-process jobs, whereas the percentage in FWload is less than 18.0%, as shown in [...]
(3) The number of job migrations in JWload is greater than in FWload when the system load is high, but JWload needs fewer migrations than FWload when the system load is low. This is also due to the characteristics of JWload. In JWload, more jobs can be deployed onto the foreground tier directly when the system load is low; when the system load is high, jobs running in the background tier easily meet the backfilling criteria, which increases the number of migrations.
In the results described above, MEASY shows better performance than the other algorithms in all aspects. We will use MEASY for comparison in the following discussions.

CPU Usage Estimation Error Tolerance in MEASY
Our scheduling algorithms rely on the CPU usage information of the parallel processes of jobs to make scheduling decisions. The information can be obtained by profiling a job in test runs or from users' estimates. Either way, inaccurate information can be a problem. In this section, we assume that the CPU usage estimation is not available at all.
As shown in Figures 9 and 10, both MEASY and KEASY outperform EASY even without any CPU usage information of parallel processes, and MEASY is always better than KEASY.

Average CPU Usage of Processes Tolerance in MEASY
Our algorithms make use of the remaining computing capacity of each processor. In this section, we further investigate the impact of the average CPU usage of parallel processes on the performance of our algorithms. We change the average CPU usage of multiprocess jobs and examine the performance change of our algorithms.
Figures 11 and 12 show the results for FWload and JWload when the average CPU usage is 100%. The results show that even when the average CPU usage is 100%, our algorithms perform better than EASY despite the context switching overhead. MEASY is always better than KEASY as well. This is due to the situations shown in Figures 2, 3, and 4.

Migration Cost Tolerance in MEASY
In this section, we examine the impact of migration cost on the performance of MEASY.
Figures 13 and 14 show the results for FWload and JWload under different job migration costs. The figures show that MEASY can outperform KEASY on both workloads even with a high job migration cost. For FWload, if the migration cost is less than 900 seconds, MEASY always performs better than KEASY; if the migration cost is greater than 1500 seconds, MEASY can only produce performance comparable to (a little worse than) that of KEASY. For JWload, these values are 480 and 900, respectively. This is because the number of migrations in JWload is greater than in FWload, as mentioned above. Note that if the migration cost is greater than an upper bound, KEASY, which needs no migration, may be the best choice in practice.

Conclusions
Parallel job scheduling is increasingly important for a datacenter in the cloud computing era. It determines the quality of service of the datacenter. In this paper, we employed a two-tier processor partition architecture and put forward three parallel job scheduling algorithms (KEASY, REASY, and MEASY), culminating with MEASY. KEASY is the basic algorithm, and REASY and MEASY extend KEASY. Both KEASY and MEASY produce significantly better performance than EASY, but REASY fails in some cases. Moreover, MEASY is always better than KEASY, and it is robust in that it tolerates inaccurate CPU usage estimation of parallel processes and high job migration cost.
For future work, we will further study REASY and try to improve its performance, for example, by only allowing jobs that satisfy some constraints to run in background VMs; we will also evaluate the performance of policies other than SJF for scheduling jobs to background VMs; and we will explore mechanisms that can effectively partition the computing capacity of a processor into k tiers, which may further improve the processor utilization and job responsiveness for parallel workload in the cloud, particularly for multicore CPUs.

Figure 2: Benefit for head job.

Figure 4: Benefit for overestimated execution time: a job is described by a tuple (n, t_e, t_a), in which n is the number of processes, t_e is the estimated execution time, and t_a is the actual execution time.

Figure 6: Performance for FWload. The bars represent the standard deviation.

Figure 7: Performance for JWload. The bars represent the standard deviation.

Figure 8: Average number of migrations in FWload and JWload.

Figure 9: Performance for FWload without the CPU usage information.

Figure 10: Performance for JWload without the CPU usage information.

Figure 11: Performance for FWload with the average CPU usage of 100%.

Figure 12: Performance for JWload with the average CPU usage of 100%.

Figure 13: Performance for FWload with varying migration costs.

Figure 14: Performance for JWload with varying migration costs.

Table 1: Size distribution of workload models used in the experiment.

Algorithm 3: REASY backfilling.
    [...] termination time of j;
    N_E ← N_future − N_R;
    [...] fg h ← the jobs running in the foreground VMs of HdJob;
    SumSize ← the size sum of jobs in J [...]
    /* Backfill runnable jobs */
    for each job j in J_q ∪ J_bg do
        N_j ← process number of j;
        N_idle ← idle fg VM number;
        if N_j > N_idle then continue;
        t_r ← the runtime of j;
        t_c ← current time;
        if t_r + t_c ≤ T_S or N_j ≤ N_E then
            if Deploy(j, 'R') then
                remove j from J_q or J_bg, insert it into J_fg;
                if t_r + t_c > T_S then N_E ← N_E − N_j;

Figure 5: Example of shadow time and extra foreground VMs. Jobs: J1(2, 10), J2(2, 5), J3(4, 5).

(1) Feitelson workload, denoted by FWload: a general model based on data from six different traces [14, 28]. It contains 200,000 jobs in our simulation; its average parallelism is 23.4, and its average runtime is 2606.6 seconds.
(2) Jann workload, denoted by JWload: a workload model for MPP that fits the actual workload of the Cornell Theory Center supercomputer [29]. It contains 100,000 jobs in our simulation; its average parallelism is 10.5, and its average runtime is 11267.0 seconds.
The job size distributions of the two workloads are shown in Table 1. The execution time estimates of jobs are generated by the model proposed by Tsafrir et al. in [30, 31], denoted by TModel.