A Correlated Model for Evaluating the Performance and Energy of a Cloud System Given System Reliability

The energy consumption of high-performance computing systems has become a serious issue and has attracted much attention; performance and energy saving are now both important measures of a computing system. In a cloud computing environment, the system usually allocates various resources (such as CPU, memory, and storage) to multiple virtual machines (VMs) for executing tasks. The resource allocation for running VMs therefore has a significant influence on both system performance and energy consumption. For different processor utilizations assigned to a VM, there is a tradeoff between energy consumption and task completion time when a given task is executed by the VM. Moreover, hardware failures, software failures, and restoration characteristics also have an obvious influence on overall performance and energy. In this paper, a correlated model is built to analyze both performance and energy in the VM execution environment under a reliability restriction, and an optimization model is presented to derive the most effective processor utilization for the VM. The tradeoff between energy saving and task completion time is then studied and balanced when the VMs execute given tasks. Numerical examples illustrate the performance-energy correlated model and evaluate the expected values of task completion time and consumed energy.


Introduction
One of the important criteria for appraising a modern computing system is whether it satisfies the increasing demand for high performance and energy saving [1,2]. Owing to the growing energy consumption of large-scale computing systems, many efficient techniques, such as dynamic voltage and frequency scaling [1] and virtual resource management [3], have been proposed to control energy consumption. On the other hand, distributed resource sharing technology [4], which effectively improves system performance, has been widely employed in computing systems, especially cloud computing systems. Meanwhile, guaranteeing the reliability of a complex system remains an important research issue.
Although these technologies and methods can address the corresponding issues individually, handling the metrics separately is inadequate: the existing approaches cannot capture the correlation among energy, performance, and reliability.
Cloud computing is an emerging technology with numerous novel features, such as large-scale resource sharing, dynamic and flexible resource management, and on-demand resource provisioning [5]. Cloud computing takes advantage of Grid technology, which enables the integration of resources across distributed, heterogeneous, dynamic virtual organizations [6]. A grid service is designed to execute a certain task under the control of the resource management system (RMS) [7]. Similarly, a cloud computing system has a cloud operating system (COS) that flexibly schedules computational resources (including CPU, memory, storage, and bandwidth) for task execution. Moreover, autonomic computing technologies can be applied in local computational environments, enabling dynamic application scale-out and live migration of virtual machines (VMs) to achieve more efficient resource utilization and to address dynamic workload requirements [8,9].
In a cloud computing system, a task is usually executed by a VM whose computational ability directly depends on the resources assigned by the COS. If the COS decreases the number of CPU cores or the CPU utilization allotted to the VM, the power consumption can be effectively reduced. However, such an approach also lowers the computational speed of the VM, which results in a longer task completion time and a greater chance of failures. Conversely, the occurrence of failures increases the task completion time, since performance is lost while the task waits to be redone, and the reexecution in turn consumes more electric power. Thus, energy, performance, and reliability are closely related and affect one another, and they should not be separated in modeling and scheduling.
To address the huge energy waste that typically exists in large-scale distributed systems, there are many studies on energy reduction. The low average utilization of resources in computing systems generally wastes enormous computing ability and causes high energy consumption in cooling and other overheads [10]. As this situation is the most obvious cause of energy problems, most existing research focuses on energy-efficient consolidation of computing resources based on energy consumption prediction [11], required quality of service [12], memory-aware virtual machine scheduling [13], load balancing strategies [14], and control-theoretic techniques for multiple high-density servers [15]. However, as mentioned above, the energy problem cannot be adequately solved without considering reliability and performance. Taking consolidated processors as an example, once a hardware failure of a processor occurs, none of the tasks executed on that processor can proceed, which constitutes a common cause failure (CCF) and reduces reliability. This situation is typical in cloud computing systems, which widely employ virtualization technology to improve the average utilization of computing resources. Thus, a precise evaluation of energy consumption should consider not only software failures but also hardware failures. Dai et al. [16] studied correlated software failures of multiple types and analyzed the uncertainty of software reliability based on the maximum-entropy principle [17]. However, the reliability analysis of distributed computing systems should also take hardware failures into account. Accordingly, Dai et al.
[18] studied the combination of various failures interacting with one another and presented a hierarchical model for grid service reliability analysis and evaluation. There are also many other reliability models for software-hardware systems; Markov models are usually used to analyze and evaluate reliability [19,20]. Performance, too, has long been a research focus. Meyer [21] proposed the notion of performability, which can effectively evaluate both performance and reliability. Performability evaluation was subsequently studied for multiprocessor systems [22], fault-tolerant computer systems [23], and distributed real-time systems [24]. For a grid computing system, Dai et al. [25] presented a combined model of performance and reliability, in which the precedence constraints caused by data dependence and the common cause failure were considered. Moreover, optimal resource allocation for both performance and reliability in grid systems has also been studied [26].
Since reliability, performance, and energy cannot be treated separately, this paper proposes a correlated model for evaluating both performance and energy based on an analysis of hardware and software reliability. The primary innovation of this correlated model is the essential connection between performance, energy, and reliability that resource allocation in cloud systems provides. A semi-Markov process is formulated to model software/hardware reliability, and the evaluation of performance and energy is based on the Laplace-Stieltjes transform and a Bayesian approach. A new functional relationship between expected energy consumption and processor utilization is constructed. From the analysis of the derivative of this function, it is easy to derive an optimal resource allocation that minimizes the energy consumed in a task completion procedure. This optimal resource allocation also balances the tradeoff between power consumption and task completion time.
The remainder of the paper is organized as follows. Section 2 describes a performability model considering both hardware failures and software failures in cloud computing systems. Section 3 presents a power consumption model to evaluate the expected energy consumption; based on this evaluation, a feasible approach that derives an optimal processor utilization to reduce energy consumption is proposed. Section 4 illustrates several numerical examples.

Performability Model for Task Process
In a cloud computing system, tasks are usually executed by virtual machines, which provide isolation technology to ensure a noninterfering share of various computing resources, such as CPU, memory, and hard disk. Considering that the energy consumed by processor operation is the major constituent of the total energy consumption of servers [10], a reasonable CPU allocation for running VMs has a significant effect on the tradeoff between reliability, performance, and energy consumption. The following model first analyzes and evaluates performability based on the processor frequencies assigned to VMs for completing given tasks.

2.1. Hardware and Software Reliability Model
The reliability model presented in this paper considers hardware failures of the processor and software failures of VMs. In cloud computing environments, a single physical computing node usually runs multiple VMs to execute tasks simultaneously, and a hardware failure of the processor terminates the operation of all VMs. As a consequence, hardware failures play an important role in reliability. The design method that considers the quality tradeoff between hardware and software components is called hardware/software codesign [27]. According to the properties of cloud computing, the following assumptions are made for the reliability modeling of a running VM:
(A1) Once a hardware failure of the processor occurs, the system cannot operate and starts to restore. A running VM is aborted when a hardware failure occurs, and it is reexecuted after the recovery of the hardware.
(A2) A software failure of a VM is an obvious failure which the cloud operating system can detect immediately. A running VM is instructed to suspend as soon as a software failure is detected.
(A3) A software restoration action halts a VM that has been suspended and creates a new instance of the same virtual image. The given task executed by the VM is restarted anew (preemptive repeat mode) when the software restoration action is complete.
(A4) For all VMs created from the same VM template, the software parameters do not change. The executions of these VMs are independent and identically distributed (i.i.d.).
(A5) If a VM finishes a given task, it is shut down by the cloud operating system immediately.
Let the stochastic process {N(t), t ≥ 0} represent the state of the system at the time point t, as shown in Figure 1. State i represents the start of the ith run of the VM. If the VM does not finish the given task within the ith run (i.e., a hardware or software failure occurs before the completion of the given task), N(t) finally transits to i + 1 and the VM is restarted to reexecute the given task. State H represents the occurrence of a hardware failure of the processor; according to assumption (A1), the hardware failure also induces the termination of the VM and the start of a restoration action. Both the hardware uptime T_hu (time to hardware failure) and the hardware downtime T_hd (time to hardware repair) are random variables. Similarly, state S represents the occurrence of a software failure of the VM; T_su and T_sd are the random times representing the software uptime (time to software failure) and the software downtime (software restoration time), respectively. In general, T_hu, T_hd, T_su, and T_sd follow exponential distributions with means 1/λ_h, 1/μ_h, 1/λ_s, and 1/μ_s, respectively [28].

2.2. Cumulative Distribution Function of Time between Two Successive Runs
In this paper, we call the random time interval from the beginning of the ith (i = 1, 2, 3, ...) run of the VM to the beginning of the next, (i + 1)st, run of the VM (i.e., from state i to state i + 1) the ith instance lifetime of the VM. According to Figure 1, the VM keeps operating until a failure (i.e., a hardware or software failure) occurs during a run of the VM. Denote the random operation time of the VM as T_VM; it is determined not only by the software failure but also by the hardware failure, that is, T_VM = min(T_hu, T_su).
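As a quick sanity check on this construction, the minimum of two independent exponential uptimes is itself exponential with the summed rate λ_h + λ_s, and the failing component is the hardware one with probability λ_h/(λ_h + λ_s). The following Monte Carlo sketch (ours, with illustrative rates) confirms both facts:

```python
import math
import random

# Sketch: T_VM = min(T_hu, T_su) for independent exponential uptimes is
# exponential with rate lam_h + lam_s, and a failure is a hardware one
# with probability lam_h / (lam_h + lam_s).  Rates here are illustrative.
lam_h, lam_s = 0.05, 0.2          # hardware / software failure rates (1/h)
lam = lam_h + lam_s

rng = random.Random(1)
n = 200_000
samples = []
hw_fail = 0
for _ in range(n):
    t_hu = rng.expovariate(lam_h)  # time to hardware failure
    t_su = rng.expovariate(lam_s)  # time to software failure
    samples.append(min(t_hu, t_su))
    if t_hu < t_su:
        hw_fail += 1

mean_vm = sum(samples) / n
print(round(mean_vm, 2), round(1 / lam, 2))          # both near 4.0 hours
print(round(hw_fail / n, 3), round(lam_h / lam, 3))  # both near 0.2
```

Note that `random.expovariate` takes the rate λ, not the mean 1/λ.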
If assumption (A5) is not considered, the distribution of T_VM can be obtained as

F_VM(t) = Pr{T_VM ≤ t} = 1 − e^{−(λ_h + λ_s)t}. (1)

Suppose the given task needs to execute a number of instructions W (i.e., its work requirement). In an idealistic failure-free scenario, the completion time of the given task is determined by the computational speed of the VM; denote this idealistic task completion time as T_0. In a realistic hardware and software failure scenario, however, the task is interrupted upon a failure. The probability that the given task completes within one run of the VM is given by

p = Pr{T_VM > T_0} = e^{−(λ_h + λ_s)T_0}. (2)

Moreover, we should note that the idealistic task completion time T_0 gives a bound on the operational time of the VM under assumptions (A3) and (A5); that is, all possible values t of the operation time T_VM must satisfy 0 ≤ t ≤ T_0. In fact, this bound on the operation time is especially relevant to the definitions of performance and energy consumption. Similar to the analysis of Sheahan et al. [29], the cdf of the bounded operation time T_VM is given by

F(t) = 1 − e^{−(λ_h + λ_s)t} for 0 ≤ t < T_0; F(t) = 1 for t ≥ T_0. (3)

Then, the probability density function (pdf) of T_VM consists of the density

f(t) = (λ_h + λ_s) e^{−(λ_h + λ_s)t} for 0 ≤ t < T_0, (4)

together with the probability mass p = e^{−(λ_h + λ_s)T_0} at t = T_0. The Laplace-Stieltjes transform (LST) of F(t) is defined as

F̃(s) = ∫_0^∞ e^{−st} dF(t). (5)

For the stochastic process {N(t), t ≥ 0}, let Q_{i,j}(t) represent the one-step transition probability from state i to state j during the time interval t. Note that the state transition from i to H implies that the hardware failure occurs before the software failure; similarly, the transition from i to S implies that the software failure occurs before the hardware failure. Subject to the bound T_VM < T_0, the expressions for Q_{i,H}(t) and Q_{i,S}(t) are given by

Q_{i,H}(t) = (λ_h / (λ_h + λ_s)) (1 − e^{−(λ_h + λ_s) min(t, T_0)}),
Q_{i,S}(t) = (λ_s / (λ_h + λ_s)) (1 − e^{−(λ_h + λ_s) min(t, T_0)}). (6)

Denote the first passage time of N(t) from state i to state i + 1 as θ_{i,i+1}; that is, θ_{i,i+1} (i = 1, 2, 3, ...)
represents the ith instance lifetime of the VM, and each of these lifetimes is an i.i.d. random variable under assumption (A4). Then the cumulative distribution of the time between two successive runs of the VM (i.e., of an instance lifetime of the VM) is

F_{i,i+1}(t) = Q_{i,H}(t) * G_h(t) + Q_{i,S}(t) * G_s(t), (7)

where F_{i,i+1}(t) represents the distribution of θ_{i,i+1}, G_h(t) and G_s(t) are the distributions of the repair times T_hd and T_sd, and "*" denotes the Stieltjes convolution of two functions. Applying the LST to (7), we obtain

F̃_{i,i+1}(s) = Q̃_{i,H}(s) G̃_h(s) + Q̃_{i,S}(s) G̃_s(s). (8)

Then, applying the LST to (6) and substituting the corresponding transformed expressions into (8) yields

F̃_{i,i+1}(s) = (1 − e^{−(s + λ_h + λ_s) T_0}) [(λ_h / (s + λ_h + λ_s)) (μ_h / (s + μ_h)) + (λ_s / (s + λ_h + λ_s)) (μ_s / (s + μ_s))]. (9)

2.3. Expected Task Completion Time
Since a given task is executed by the VM repeatedly until the first operational time T_VM longer than T_0, the number of runs of the VM is a random variable which follows a geometric distribution.
Suppose the completion procedure of the given task takes exactly n + 1 runs of the VM; that is, it contains n unsuccessful runs (in which the given task is not completed) and one successful run (in which it is). Let Pr(n) be the probability that the task completion procedure occupies n unsuccessful runs and one successful run. From (2), we can obtain

Pr(n) = (1 − e^{−(λ_h + λ_s) T_0})^n e^{−(λ_h + λ_s) T_0}. (10)

Here, the completion time of the given task is defined as the time interval from the moment the task is first executed by the VM to the moment it is finally finished by the VM; in general, we can set the time origin t = 0 as the starting time. Under the condition of (10), the completion time of the given task, denoted T_task, consists of the sum of n instance lifetimes of the VM and a final operational time that equals T_0. Let F_task(t | n) = Pr{T_task < t | n} represent the conditional distribution of T_task. It can in principle be found by convolving F_{i,i+1}(t) with itself n times and then with U(t − T_0), the unit step at T_0. Since F̃_{i,i+1}(s) is already given by (8), the LST of the conditional distribution F_task(t | n) can be obtained as

F̃_task(s | n) = [F̃_{i,i+1}(s)]^n e^{−sT_0}, (11)

where e^{−sT_0} is the LST of U(t − T_0) and [F̃_{i,i+1}(s)]^n is the LST of F_{1,2}(t) * F_{2,3}(t) * ⋅⋅⋅ * F_{n,n+1}(t). As mentioned above, the instance lifetimes θ_{i,i+1} (i = 1, 2, 3, ...) are i.i.d., so F̃_{1,2}(s) = F̃_{2,3}(s) = ⋅⋅⋅ = F̃_{n,n+1}(s) = F̃_{i,i+1}(s). Now, for the unconditional distribution F_task(t) = Pr{T_task < t}, using (10) and the Bayesian theorem on conditional probability, the condition in (11) can be removed and the LST of the task completion time becomes

F̃_task(s) = Σ_{n=0}^∞ e^{−(λ_h + λ_s)T_0} [F̃_{i,i+1}(s)]^n e^{−sT_0} = e^{−(λ_h + λ_s)T_0} e^{−sT_0} / (1 − F̃_{i,i+1}(s)). (12)

Since the LST has the moment-generating property, the expected value of T_task can be derived as

E[T_task] = −dF̃_task(s)/ds |_{s=0}. (13)
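For the exponential case, differentiating the LST in (13) at s = 0 collapses to a simple closed form, E[T_task] = (e^{λT_0} − 1)(1 + λ_h/μ_h + λ_s/μ_s)/λ with λ = λ_h + λ_s. This is our own algebraic reduction, not an expression quoted from the paper, so the sketch below cross-checks it by simulating the preemptive-repeat process directly:

```python
import math
import random

# Sketch of Eq. (13) in the exponential case.  Our reduction of the LST
# derivative gives
#   E[T_task] = (exp(lam*T0) - 1) * (1 + lam_h/mu_h + lam_s/mu_s) / lam,
# cross-checked here against a direct simulation of the preemptive-repeat
# process.  T0 is illustrative; the failure/repair rates follow Section 4.
lam_h, mu_h = 0.05, 6.0    # hardware failure / repair rates (1/h)
lam_s, mu_s = 0.2, 12.0    # software failure / restoration rates (1/h)
T0 = 0.5                   # idealistic task completion time (h)
lam = lam_h + lam_s

def expected_completion_time():
    return (math.exp(lam * T0) - 1) * (1 + lam_h / mu_h + lam_s / mu_s) / lam

def simulate_one(rng):
    t = 0.0
    while True:
        t_hu, t_su = rng.expovariate(lam_h), rng.expovariate(lam_s)
        run = min(t_hu, t_su)
        if run > T0:                 # task finishes within this run
            return t + T0
        t += run                     # wasted operation time before the failure
        t += rng.expovariate(mu_h if t_hu < t_su else mu_s)  # restoration

rng = random.Random(7)
sim = sum(simulate_one(rng) for _ in range(100_000)) / 100_000
print(round(expected_completion_time(), 4), round(sim, 4))  # values agree closely
```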

Power Consumption Modeling
To estimate the energy consumed in the entire task completion procedure, a power consumption model is of critical importance besides the random task completion time. Several studies have introduced power consumption models for the processor. Choi et al. [11] discussed statistical models for the power usage distribution and an analytical method for nonlinear power consumption curves. Wang et al. [15] developed a piecewise linear function for the power consumption of the processor. Furthermore, Lee [30] considered that an imperfectly linear model of the power consumption can be linearized to decrease complexity and computation overhead. In this paper, we apply the power consumption model introduced by Zhu et al. [31], which can be summarized as

P = C_ef ⋅ f^α + P_ind, (14)

where P and f are the power consumption and the processing frequency of a processor, respectively. C_ef ⋅ f^α in (14) is the frequency-dependent active power, in which C_ef is the effective switching capacitance and α is the dynamic power exponent. P_ind in (14) is the sum of the sleep power maintaining the basic circuits and the frequency-independent active power. These parameters are system-dependent constants which can be estimated by statistical analysis. For ease of discussion, the frequency of the processor can be normalized by the processor utilization. Suppose the maximum frequency of the processor is f_max, that is, f = u ⋅ f_max, in which u (0 < u ≤ 1) is the utilization of the processor. Letting a = C_ef ⋅ f_max^α and b = P_ind, (14) can be transformed to

P(u) = a ⋅ u^α + b (0 < u ≤ 1). (15)
We should notice that b is the basic power consumption required to keep the computational node working, and it usually has a relatively large value. In fact, this also implies a significant overhead for turning a computational node on or off (P(u) ≥ b for u ≠ 0, while P(0) = 0). To analyze the power consumption within an instance lifetime, we divide the instance lifetime into two phases: the operational phase and the restoration phase. The operational phase is the time interval in which the VM keeps operating until a failure occurs; the processing frequency of the processor remains unchanged during this phase, so the power consumption is P(u) = a ⋅ u^α + b with a fixed utilization u. In contrast, the processing frequency differs in the restoration phase, in which a restoration action starts immediately after a failure occurs. In the restoration phase the power supply for the basic circuits, the clock, and the processor is still sustained, but the VM cannot operate until the restoration action is complete. Based on this, the following assumptions are made for the power consumption in the restoration and operational phases:
(1) In a restoration phase, the relatively small utilization of the processor by the restoration action is negligible. Thus the power consumption in a restoration phase is P(0+) = lim_{u→0+} P(u) = b.
(2) Owing to the underlying resource virtualization technology, the COS keeps the utilization u of the processor, and hence the power consumption P(u), unchanged for any VM during an operational phase.
The power consumption for an instance lifetime of the VM is shown in Figure 2.
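The two-phase accounting above can be sketched in a few lines. The function names are ours; the HP G5 constants (a = 41 W, b = 94 W, so a peak power of 135 W) anticipate the Section 4 example, and α is scenario dependent:

```python
# Sketch of the normalized power model of Eq. (15), P(u) = a*u**alpha + b,
# and of the two-phase energy of one instance lifetime (Figure 2):
# full P(u) while the VM operates, basic power b while restoring.
# a = 41 W, b = 94 W are the HP G5 values of Section 4; alpha is assumed.
def power(u, a=41.0, b=94.0, alpha=5):
    if u == 0.0:
        return 0.0               # node switched off entirely
    return a * u**alpha + b      # operational power at utilization u (W)

def instance_energy(u, t_op, t_rep, a=41.0, b=94.0, alpha=5):
    # operational phase billed at P(u), restoration phase at the basic power b
    return power(u, a, b, alpha) * t_op + b * t_rep   # Wh if times are hours

print(round(power(1.0), 1))                   # peak power: 135.0 W
print(round(instance_energy(0.8, 2.0, 0.5), 1))
```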

Expected Energy Consumption for a Task Completion Procedure
Let E be a random variable representing the random energy consumption. Energy consumption is the product of a power consumption and a time, that is, E = P(u) ⋅ T, and the distribution of E can be derived as

F_E(x) = Pr{E ≤ x} = F_T(x / P(u)), (16)

where F_T(⋅) is the distribution of the random time T. The power consumption P(u) is a constant for a fixed utilization u. Applying the properties of the LST, we can get the LST of F_E(x) as

F̃_E(s) = F̃_T(P(u) ⋅ s). (17)

As mentioned above, a given task has a work requirement W, which in this paper is measured in the number of commands or instructions to be executed. Suppose the maximum computational speed of the processor is s_max and the utilization of the processor that the cloud operating system assigns to the VM is u (0 < u ≤ 1). According to the previous similar study [25], the idealistic task completion time T_0 in a failure-free scenario is

T_0 = W / (u ⋅ s_max). (18)

Thus, under the bound that the operation time T_VM must be less than T_0, we can obtain the energy distribution between two successive runs of the VM. For the stochastic process {N(t), t ≥ 0}, let Q^E_{i,j}(x) represent the transition probability from state i to state j within the energy consumption x. Substituting (17) into (6) gives

Q^E_{i,H}(x) = Q_{i,H}(x / P(u)), Q^E_{i,S}(x) = Q_{i,S}(x / P(u)). (19)

Let E_{i,i+1} denote the random energy consumed in the ith instance lifetime of the VM. Since the restoration phase consumes the basic power b, the distribution of the energy consumed in the ith instance lifetime of the VM is

F^E_{i,i+1}(x) = Pr{E_{i,i+1} < x} = Q^E_{i,H}(x) * G_h(x / b) + Q^E_{i,S}(x) * G_s(x / b). (20)

Then, from (20), the LST of F^E_{i,i+1}(x) becomes

F̃^E_{i,i+1}(s) = Q̃_{i,H}(P(u) s) G̃_h(b s) + Q̃_{i,S}(P(u) s) G̃_s(b s). (21)

Under condition (10), the conditional cumulative energy distribution for the task completion can be denoted F^E_task(x | n), which can be derived from (11), (17), and (21) as

F̃^E_task(s | n) = [F̃^E_{i,i+1}(s)]^n e^{−P(u) T_0 s}. (22)

As with the expected task completion time, applying the Bayesian theorem with (10) and removing the condition on n give

F̃^E_task(s) = Σ_{n=0}^∞ e^{−(λ_h + λ_s)T_0} [F̃^E_{i,i+1}(s)]^n e^{−P(u) T_0 s}. (23)

Then we can derive the expected energy consumption for a task completion procedure as

E[E_task] = −dF̃^E_task(s)/ds |_{s=0}. (24)
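In the exponential case, the expectation in (24) again admits a closed form. Under the stated assumptions, billing operation time at P(u) and restoration time at b gives E[E_task] = (e^{λT_0} − 1)(P(u)/λ + b(λ_h/μ_h + λ_s/μ_s)/λ). This reduction is ours, not quoted from the paper, but with the Section 4 parameters it reproduces the coefficients of the paper's Eq. (31) up to the rounding of the 0.0392 exponent:

```python
import math

# Sketch of Eq. (24) in the exponential case (our closed-form reduction):
#   E[E_task] = (exp(lam*T0) - 1)
#               * (P(u)/lam + b*(lam_h/mu_h + lam_s/mu_s)/lam)   [Wh]
# with the HP G5 and Section 4 parameter values.
lam_h, mu_h, lam_s, mu_s = 0.05, 6.0, 0.2, 12.0
lam = lam_h + lam_s
a, b, alpha = 41.0, 94.0, 5        # HP G5: peak 135 W, basic 94 W
W, s_max = 3.0e6, 2 * 2660.0       # work (MI) and max speed (MIPS)

def expected_energy(u):
    T0 = W / (u * s_max) / 3600.0  # idealistic completion time, hours
    P = a * u**alpha + b
    repair = (lam_h / mu_h + lam_s / mu_s) / lam
    return (math.exp(lam * T0) - 1) * (P / lam + b * repair)

# agrees with (164*u**alpha + 385.4)*(exp(0.0392/u) - 1) up to rounding
u = 0.9
print(round(expected_energy(u), 2))
print(round((164 * u**alpha + 385.4) * (math.exp(0.0392 / u) - 1), 2))
```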

Expected Optimal Processor Utilization to Reduce Energy
Based on the derivations above, the expected task completion time and the expected energy consumption are given by (13) and (24), respectively. Moreover, for a fixed work requirement W, these two important indices are functions of the processor utilization u: substituting (18) into (13) gives the functional relationship E[T_task] = f_T(u) between the expected task completion time and the processor utilization, and substituting (15) and (18) into (24) gives E[E_task] = f_E(u) for the expected energy consumption. The derivative of f_T(u) satisfies

d f_T(u)/du < 0 for all u > 0, (25)

which is easily explained: the higher the utilization of the processor, the shorter the task completion time and the lower the risk that the task is interrupted by a failure. However, energy consumption is as important as performance. For the expected energy consumption we do not give an explicit expression of d f_E(u)/du here, because the power consumption function P(u) has different parameters determined by different physical computing nodes. If there exists an expected optimal processor utilization 0 < u_opt < 1 that reduces the energy consumption, it can be derived by solving the equation

d f_E(u)/du = 0 (26)

[32].

Numerical Examples
Here, we apply a physical node (an HP ProLiant ML110 G5 server) whose power consumption characteristics have been numerically analyzed by Beloglazov and Buyya [33]. Based on the real numerical results for power consumption [33], we can estimate the power consumption model of the HP G5 server as

P(u) = 41 ⋅ u^α + 94, (27)

in which the peak power is 135 W and the basic power is 94 W. These two parameters can usually be measured exactly by analytical tools. The parameter α may take different values in different application scenarios: even if the processor frequency dominates the power consumption of a physical node, the change in power consumption also depends on the concrete configuration of the physical node and the utilization of other components, such as memory, disk, and GPU. However, for a specific kind of task or a specific application scenario, it is feasible to estimate a specific value of this parameter by numerical statistical analysis. The CPU of the G5 server is an Intel Xeon 3075 (2 cores × 2660 MHz). The frequency of the server's CPU can be mapped onto MIPS ratings: 2660 MIPS for each core of the HP G5 server. A processor utilization of 100% is achieved when all cores of the processor work at maximum frequency in parallel. In fact, a task can be split in order to enhance the utilization of the processor [34]. In this paper, the task executed by the VM is assumed to use multicore programming; that is, the maximum computational speed of the HP G5 server is s_max = 2 × 2660 MIPS.
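The monotonicity claim in (25), and the possibility of an interior energy minimizer, can be checked numerically with the HP G5 expressions of Section 4: the expected-energy curve (164u^α + 385.4)(e^{0.0392/u} − 1) appears in the paper, while the completion-time factor 4.1 follows from the parameter values under our reading. A sketch with α = 5:

```python
import math

# Numeric check (a sketch with the HP G5 coefficients of Section 4):
# E[T_task](u) = 4.1*(exp(0.0392/u) - 1) is strictly decreasing in u,
# while E[E_task](u) = (164*u**5 + 385.4)*(exp(0.0392/u) - 1) is not
# monotone, so an interior minimizer u_opt can exist.
def t_task(u):
    return 4.1 * (math.exp(0.0392 / u) - 1)

def e_task(u, alpha=5):
    return (164 * u**alpha + 385.4) * (math.exp(0.0392 / u) - 1)

us = [0.5 + 0.01 * k for k in range(51)]          # grid over u in [0.5, 1.0]
t_decreasing = all(t_task(a) > t_task(b) for a, b in zip(us, us[1:]))
e_monotone = all(e_task(a) > e_task(b) for a, b in zip(us, us[1:]))
print(t_decreasing, e_monotone)   # True False: energy turns back up near u = 1
```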

Expected Task Completion Time
Set λ_h = 0.05, μ_h = 6.0, λ_s = 0.2, and μ_s = 12.0 as the numerical values (in reciprocal hours) for the parameters of the random times T_hu, T_hd, T_su, and T_sd. Suppose the work requirement of the given task is W = 3 × 10^6 million instructions. From (18), we can derive the idealistic task completion time as

T_0 = W / (u ⋅ s_max) = 3 × 10^6 / (5320 ⋅ u) s ≈ 0.1566 / u hours. (29)

Then, from (13), the expected task completion time can be obtained as

E[T_task] = 4.1 ⋅ (e^{0.0392/u} − 1) hours. (30)

Figure 3 displays the relationship between task completion time and processor utilization. The curves in Figure 3 show that the task completion time (idealistic or expected) varies inversely with the processor utilization, which coincides with the analytical conclusion derived from (25). Moreover, the difference between the idealistic completion time and the expected completion time increases gradually as the processor utilization decreases. This phenomenon shows that the processor frequency also influences reliability: a lower processor frequency induces an increase in the task completion time, and a longer idealistic task completion time implies a higher risk of failure, which finally results in a reduction in reliability.
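These numbers can be reproduced directly; the 4.1-hour coefficient and the 0.0392 exponent follow from the parameter values under our reading of the model:

```python
import math

# Sketch reproducing the Section 4.1 numbers: with lam = 0.25/h and
# T0 = W/(u*s_max) = 3e6 MI / (u * 5320 MIPS), the expected completion
# time is E[T_task] = 4.1*(exp(0.0392/u) - 1) hours.  It approaches the
# idealistic time T0 as u -> 1 and grows quickly as u shrinks.
W, s_max = 3.0e6, 2 * 2660.0

def t0_hours(u):
    return W / (u * s_max) / 3600.0            # idealistic time, Eq. (29)

def expected_t_task(u):
    return 4.1 * (math.exp(0.0392 / u) - 1)    # expected time, Eq. (30)

for u in (1.0, 0.8, 0.6):
    print(u, round(t0_hours(u), 4), round(expected_t_task(u), 4))
```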

Expected Energy Consumption
By substituting (27) and (29) into (24), we can obtain the expected energy consumption for the completion of the given task as

E[E_task] = f_E(u) = (164 ⋅ u^α + 385.4)(e^{0.0392/u} − 1). (31)

Because of the tradeoff between power consumption and task completion time, there may exist an optimal frequency of the processor which effectively reduces the energy consumed in completing a task. As mentioned above, the estimation of α is affected by the particular application scenario. Figure 4(a) illustrates that different values of α have an obvious effect on the optimal processor utilization u_opt. In general, the power consumption and the expected task completion time are monotonically increasing and decreasing functions of the processor utilization, respectively; thus the optimal processor utilization indicates the balance of the tradeoff between power consumption and task completion time. Finding the optimal processor utilization is meaningful because it achieves the minimum energy consumption for the completion of the task. From (26), the optimal processor utilizations for α = 5, α = 7, and α = 9 are derived as u_opt = 0.9059, u_opt = 0.8779, and u_opt = 0.8651, respectively. We should notice that the optimal processor utilization is also determined by the other parameters of the physical node, and it becomes more important when the difference between the peak power and the basic power is especially large. For example, the power consumption of the IBM iDataPlex dx360 M3 server is P(u) = 250 ⋅ u^α + 100 [32]. The CPU of the IBM dx360 M3 usually adopts the Xeon E5606, whose maximum frequency is 2.13 GHz. For the same work requirement and the same reliability parameters, the expected energy consumption can be obtained as

f_E(u) = (1000 ⋅ u^α + 410)(e^{0.0978/u} − 1), (32)

and the optimal utilizations for α = 2, α = 3, α = 4, and α = 5 are u_opt = 0.6887, u_opt = 0.6138, u_opt = 0.6241, and u_opt = 0.6462, respectively. This means that the tradeoff between power consumption and task completion time is more obvious for the IBM dx360 M3 server than for the HP G5 server, as shown in Figure 4(b). Based on the analysis above, it is reasonable to make the VM work at a processor utilization not less than the optimal one. If the present utilization assigned to the VM is u < u_opt, raising the processor utilization to u_opt decreases not only the task completion time but also the energy consumption, which means u_opt is preferable to every u < u_opt for both performance and energy. However, when the present utilization already satisfies u ≥ u_opt, improving performance by increasing the processor utilization incurs an extra energy cost. In fact, all u ≥ u_opt are noninferior solutions of the energy-performance multiobjective optimization.
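Assuming each expected-energy curve above is unimodal in u on (0, 1] (which holds for these parameter values), a simple ternary search recovers utilizations close to the quoted u_opt values; small residual differences come from the rounded coefficients in (31) and (32):

```python
import math

# Sketch: ternary search for the utilization u_opt minimizing the
# expected-energy curves of Eq. (31) (HP G5) and Eq. (32) (IBM dx360 M3),
# assuming each curve is unimodal on the search interval.
def argmin_unimodal(f, lo=0.05, hi=1.0, tol=1e-6):
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2            # minimum lies in [lo, m2]
        else:
            lo = m1            # minimum lies in [m1, hi]
    return (lo + hi) / 2

def hp_g5(u, alpha):
    return (164 * u**alpha + 385.4) * (math.exp(0.0392 / u) - 1)

def ibm_dx360(u, alpha):
    return (1000 * u**alpha + 410) * (math.exp(0.0978 / u) - 1)

for alpha in (5, 7, 9):        # paper's values: 0.9059, 0.8779, 0.8651
    print(alpha, round(argmin_unimodal(lambda u: hp_g5(u, alpha)), 4))
for alpha in (2, 3, 4, 5):     # paper's values: 0.6887, 0.6138, 0.6241, 0.6462
    print(alpha, round(argmin_unimodal(lambda u: ibm_dx360(u, alpha)), 4))
```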

Conclusions and Future Work
The problem of energy consumption has been a serious research topic for the last decade. Cloud computing is a newly developed technology for flexible resource assignment, in which a given task is usually executed by a VM. If the COS has a reasonable resource assignment strategy for the VM, the energy consumed in the task completion procedure can be effectively reduced. This research proposed a modeling framework for the joint analysis of reliability, performance, and energy. The model considers both hardware and software failures and is capable of evaluating the expected performance and energy consumption. In addition, this research considered the tradeoff between power consumption and task completion time; the proposed model provides a feasible approach to find an optimal processor utilization which balances this tradeoff. Based on the analysis of the optimal processor utilization, the system can achieve minimum energy consumption when it completes a given task with a VM.
In cloud computing environments, a physical node can run multiple VMs in parallel, so that multiple tasks are executed on the same physical node simultaneously. This is another effective approach to saving energy. However, the parallel execution of multiple VMs on a single physical node is a more complicated situation. For example, even though cloud isolation technology ensures noninterference between VMs, a hardware failure still has a serious influence on reliability, performance, and energy: once a hardware failure occurs, all of the VMs are terminated, and the subsequent restoration action and repeated executions of the VMs decrease the performance and increase the energy consumption. The reliability-, performance-, and energy-associated modeling of running multiple VMs in parallel remains future work.

Figure 1: Markov interpretation of the VM execution process.

Figure 2: Power consumption for an instance lifetime of the VM.

Figure 3: Relationship between task completion time and processor utilization.

Figure 4: Optimal frequency to reduce energy.