Modeling and Analysis of Queueing-Based Vary-On/Vary-Off Schemes for Server Clusters

A cloud system consists of clusters hosting various applications. To satisfy increasing demands, especially for front-end web applications, the computing capacity of a cloud system is often allocated for the peak demand. Such installations cause resource underutilization during off-peak hours. Vary-On/Vary-Off (VOVO) schemes concentrate workloads on a subset of servers instead of distributing them across all servers in a cluster, thereby reducing idle energy waste. Recent VOVO schemes adopt queueing theory to model the arrival process and the service process for determining the number of powered-on servers. For the arrival process, a Poisson process can safely be assumed in web services due to the large number of independent sources. On the other hand, heavy-tailed distributions of service times are observed in real web systems. However, there are no exact solutions for determining the performance of M/heavy-tailed/m queues. Therefore, this paper presents two queueing-based sizing approximations, for Poisson and non-Poisson governed service processes. The simulation results of the proposed approximations are analyzed and evaluated by comparison with a simulated system running at full capacity. This relative measurement indicates that the Pareto distributed service process may be adequately modeled by memoryless queues when VOVO schemes are adopted.


Introduction
Internet requests are not uniformly distributed over time: there is a huge number of requests during the peak hours. Cloud service providers tend to install surplus server nodes to handle the bursty load. Clearly, these servers waste a lot of energy during off-peak periods. Dynamically adjusting the number of active servers, that is, the Vary-On/Vary-Off (VOVO) scheme, improves the energy efficiency of server clusters. However, overly shrinking the number of powered-on servers may degrade service quality. Therefore, finding the right number of active servers to balance energy consumption and operational performance is the primary issue of an applicable VOVO scheme.
VOVO schemes date back to the early 2000s [1,2]. The basic idea of earlier VOVO schemes is to dynamically size a cluster according to CPU utilization or resource usage. This resource provisioning problem in a cluster is analogous to the staff sizing problem in a telephone call center: the customers are callers, the servers are telephone agents, and tele-queues consist of callers awaiting service by an agent. The well-known Erlang-C model [3] has been widely applied to this problem. Many recent VOVO studies [4-9] adopt queueing analysis to manage the resource usage of clusters.
Most available analytic solutions in queueing theory rely on independence assumptions and Poisson processes [10]. Internet traffic patterns are well known to possess extreme variability and a bursty structure [11]. Heavy-tailed distributions of service times are observed in real web systems [12,13]; this characteristic is captured by self-similar processes [14]. The Pareto distribution is a popular model of self-similar processes [15]. However, queueing models with Pareto distributed service times are very difficult to analyze [16]. Although heavy-tailed service processes in web systems are widely documented, memoryless queues are still used for evaluating system performance in many studies [17-21]. On the other hand, studies [22-25] that adopt general/Pareto distributions need approximations for the analytically intractable distributions to obtain the performance measures.

Mathematical Problems in Engineering
The Poisson arrival process is particularly appropriate when the arrivals come from a large number of independent sources [10], such as users of web services. However, exploring the difference between modeling service times with Poisson process governed queues and non-Poisson process governed queues remains a challenging research topic, since many queueing models remain analytically intractable [26]. To understand this performance difference, a series of simulations is conducted in this study. Compared with mathematical analysis and numerical methods, simulation is more time and memory consuming, but it is sometimes the only way to obtain reasonably accurate results [27].
This paper presents approximations of VOVO cluster sizing for systems modeled as M/M/m and M/G/m queues. Randomly generated workload traces with Pareto and exponentially distributed service times are simulated using the approximations. Two distinct types of real web access logs are simulated as well. A relative performance evaluation method is proposed and used for gauging the simulation results. Through the evaluation, the performance difference between modeling service times with Poisson process and non-Poisson process governed queues is identified. The result suggests that an M/M/m-based sizing approach may be adequate when a queueing-based VOVO scheme is adopted in a cluster.
This paper is organized as follows. Section 2 shows the approximation methods for cluster sizing. Section 3 details the simulation setup and the evaluation metric. Section 4 presents the simulation process and discusses the results. Section 5 concludes this paper.

Queueing-Based Cluster Sizing
Investigations of a system to which queueing theory is applied mainly aim at obtaining the performance measures, which are the probabilistic properties of random variables including the number of customers in the system, the number of waiting customers, the utilization of the servers, the response time of a request, the waiting time of a customer, and the idle and busy times of a server. These measures depend heavily on the assumptions concerning the distributions of interarrival times and service times as well as the number of servers and the service discipline. Queueing analysis can be naturally applied to the performance measures of server clusters. Server clusters have been widely adopted in many cloud data centers to meet increasing user needs [28]. Although heterogeneity is common in multifunctional cloud data centers, the server closets or blade systems that form the basic computing units usually consist of homogeneous nodes. Therefore, this work focuses on single-queue homogeneous systems.
The symbols and definitions used in this paper to describe the performance measures of queueing systems are shown in Symbols and Definitions.
In classical queueing analysis, supposing that requests are handled by a single-queue homogeneous m-server system with the First-Come First-Served (FCFS) discipline, exponentially distributed service times, and Poisson process governed arrival intervals, the system can be modeled as an M/M/m system. The traffic intensity ρ must be less than 1 (ρ < 1) for an M/M/m system to be in a stable state. Many performance measures of a stable M/M/m system have been thoroughly studied and are shown in (1) to (7). The calculations and proofs of these equations can be found in many textbooks, for example, [29, p. 412].

2.1. Approximation for Sizing M/M/m Modeled Clusters.

In a homogeneous M/M/m system, the service rate μ of each server is identical. From (5), E[T] of an M/M/m system can be considered as a function of λ, denoted by T_m(λ). Let λ_m be the arrival rate at which an M/M/m system maintains a targeted response time τ, given τ > 1/μ. The curves of T_m(λ) for m = n − 2, m = n − 1, m = n, m = n + 1, and m = n + 2 with a targeted response time τ are shown in Figure 1 (T_m(λ) versus arrival rate λ with a targeted response time τ). λ_1 can be easily obtained from (5). For m > 1, E[T], based on (7), can be represented as in (8).
From (1) and (2), λ_1 follows. Therefore, to get λ_m for m ≥ 2, equation (13) has to be solved. λ_2 can also be easily obtained by solving (13) with m = 2, which gives (14). It is difficult to get a closed-form expression of λ_m in terms of μ, τ, and m when m > 2; therefore, an approximation of λ_m is proposed for m > 2. Assume that this approximation is applicable for systems with at most n servers. Every T_m(λ), n ≥ m ≥ 2, is shifted with an offset of −(m − 1)μ, and the shifted curve is denoted by T̂_m(λ). Figure 2 shows the combination of the curves of T_1(λ) and T̂_m(λ), n ≥ m ≥ 2, with emphasis on the intersections between the targeted response time τ and these curves. Observing Figure 2, the distances between all consecutive λ_{m−1} − (m − 2)μ and λ_m − (m − 1)μ approximately form an exponentially decaying series {d_1, d_2, . . . , d_n}. Let the series be approximated by an exponential decay function with initial quantity a and exponential decay constant b. An element d_m of the series can then be expressed as in (16). Let the initial quantity a = d_1; a can be obtained from (10), as in (17). From (17), (16), (10), and (14), d_2 follows, and b can be obtained by rearranging (18). Therefore, λ_m can be represented as in (20).
Substituting a and b into the series, let κ denote the resulting constant in (20). For a positive integer m ≥ 1, λ_m can then be approximated as in (21). Consequently, with an anticipated arrival rate λ and the measured service rate μ, the number of servers that maintains the targeted mean response time τ can be approximated as in (22).
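Since the constants of the closed-form approximation (22) are not fully reproduced in this excerpt, the following sketch illustrates the same sizing task the other way around: it evaluates the standard Erlang-C mean response time of an M/M/m queue and searches for the smallest m meeting the target τ. The function names are illustrative, not the paper's.

```python
import math

def erlang_c(m, a):
    """Erlang-C: probability that an arriving job must wait,
    with m servers and offered load a = lam/mu (requires a < m)."""
    idle_terms = sum(a**k / math.factorial(k) for k in range(m))
    wait_term = (a**m / math.factorial(m)) * (m / (m - a))
    return wait_term / (idle_terms + wait_term)

def mmm_response_time(lam, mu, m):
    """Mean response time E[T] of a stable M/M/m queue."""
    a = lam / mu
    if a >= m:
        return math.inf              # unstable: response time diverges
    return 1.0 / mu + erlang_c(m, a) / (m * mu - lam)

def size_cluster(lam, mu, tau, m_max=64):
    """Smallest m whose M/M/m mean response time meets the target tau."""
    for m in range(1, m_max + 1):
        if mmm_response_time(lam, mu, m) <= tau:
            return m
    return m_max                     # cap at the cluster size
```

For m = 1 this reduces to the textbook M/M/1 result E[T] = 1/(μ − λ), which is the basis of the λ_1 case above.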

2.2. Approximation for Sizing M/G/m Modeled Clusters.
Internet workload characterization has found that the distribution of service times is not exponential but heavy-tailed in real web systems [12,14,30].
In other words, a single-queue m-server cluster should be modeled as an M/G/m queue for Internet services. There are no exact formulas for the mean response time of an M/G/m system, but numerous approximations can be used. Kingman's Exponential Law of Congestion is a popular approximation that combines the coefficient of variation of the service times with the known solutions of M/M/m queues; it is expressed as in (23). Let ν = (1 + C²)/2. The mean response time of the M/G/m system can then be expressed as in (24). Let T_m^+(λ) represent the mean response time of the M/G/m system as a function of the arrival rate; based on (8), T_m^+(λ) can be expressed as in (25). Although T_m^+(λ) rises at a more precipitous rate than T_m(λ), the correlation observed in Figure 2 and the aforementioned approximation remain valid.
Let the variables {λ_1^+, λ_2^+, . . . , λ_n^+}, {d_1^+, d_2^+, . . . , d_n^+}, and κ^+ be the counterparts in the M/G/m model of the variables {λ_1, λ_2, . . . , λ_n}, {d_1, d_2, . . . , d_n}, and κ previously introduced for the M/M/m model. The mean response time of the M/G/1 system can be approximated based on the Pollaczek-Khinchine formula, as in (26). Suppose that the targeted response time is still τ; then (27) holds. Similar to the derivation from (14) to (20), the equations in (28) can be derived. With an anticipated arrival rate λ and the measured service rate μ, the number of servers, denoted by m^+, that is expected to maintain the required mean response time τ can be approximated as in (29).
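Kingman's congestion law scales the M/M/m waiting time by ν = (1 + C²)/2, so an M/G/m sizing routine can reuse the memoryless solution almost unchanged. The sketch below applies that correction; the names are illustrative, and the Erlang-C helper is included so the block is self-contained.

```python
import math

def erlang_c(m, a):
    """Erlang-C waiting probability for offered load a = lam/mu, a < m."""
    s = sum(a**k / math.factorial(k) for k in range(m))
    w = (a**m / math.factorial(m)) * (m / (m - a))
    return w / (s + w)

def mgm_response_time(lam, mu, m, cs2):
    """Kingman-style M/G/m approximation: E[T+] = 1/mu + nu * W_{M/M/m},
    where nu = (1 + cs2)/2 and cs2 is the squared CV of service times."""
    a = lam / mu
    if a >= m:
        return math.inf
    w_mmm = erlang_c(m, a) / (m * mu - lam)   # M/M/m mean waiting time
    return 1.0 / mu + (1.0 + cs2) / 2.0 * w_mmm

def size_cluster_mgm(lam, mu, tau, cs2, m_max=64):
    """Smallest m whose approximated M/G/m response time meets tau."""
    for m in range(1, m_max + 1):
        if mgm_response_time(lam, mu, m, cs2) <= tau:
            return m
    return m_max
```

With cs2 = 1 (exponential service times) the correction factor ν is 1 and the M/M/m result is recovered, which is why the two sizing rules coincide for low-variability workloads.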

Simulation Setup and Evaluation Metric
A cluster managed by a VOVO scheme periodically adjusts the number of active servers that provide the required services. In general, there are several key functional components: (1) Job queue: the job queue holds the waiting requests. Each request enters the tail of the queue and waits for service in FCFS manner; in this work, all jobs share a common queue. (2) Workload distributor: the workload distributor retrieves a job from the head of the job queue and distributes it to an available node. (3) Cluster sizing unit: this unit decides the number of active servers. The decision may be based on predefined thresholds of certain resources, for example, CPU utilization, job throughput, and energy usage; in this work, the decision is calculated by (22) or (29) from the given arrival rate, mean service rate, and targeted response time. (4) On/off controller: the on/off controller periodically activates or deactivates server nodes according to the number given by the sizing unit. (5) Managed servers: the cluster consists of a group of identical computer nodes, which may be commodity servers. Each server node processes the assigned jobs and reports its working status to the workload distributor.
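The interplay of the sizing unit and the on/off controller can be sketched minimally as follows. The utilization-cap rule inside `target_servers` is my own illustrative stand-in for (22)/(29), not the paper's formula; the structure (one sizing decision per control interval) is what matters here.

```python
import math

def target_servers(lam, mu, rho_max=0.7, m_min=1, m_max=10):
    """Sizing-unit sketch: keep estimated per-server utilization below rho_max."""
    if lam <= 0:
        return m_min
    m = math.ceil(lam / (mu * rho_max))
    return max(m_min, min(m_max, m))

def onoff_schedule(interval_rates, mu, **kw):
    """On/off-controller sketch: one sizing decision per control interval
    (300 s in the paper's setup), given a forecast arrival rate per interval."""
    return [target_servers(lam, mu, **kw) for lam in interval_rates]
```

Swapping `target_servers` for a queueing-based rule turns this skeleton into the scheme described above.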

The Design of Simulation Program.
A simulation program for the VOVO managed system is developed to investigate the performance of the proposed sizing methods. This program is written in the C++ programming language. In a real VOVO managed system, every incoming job is queued, and an event notification is issued to the workload distributor upon the arrival of a job. If there are available nodes, the workload distributor dispatches the queued jobs to them. When a node completes its assigned job, it also sends an event notification to inform the distributor of its availability. The instructions for node activation and deactivation are periodically issued by the on/off controller. If a deactivation command is issued to a busy node, the node completes the job it is processing before turning itself off. However, it would be extremely time consuming to simulate the system with a time-based event-driven process. Since the input workload traces have to be fully prepared before the simulation, this work adopts a sequential process that significantly reduces the simulation time. The simulation process is shown in Algorithm 1.
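Algorithm 1 is not reproduced in this excerpt, but the sequential core of such a trace replay can be sketched as follows for a fixed number of servers; a full VOVO simulator would additionally resize the server pool every control interval. This sketch is mine, not the paper's C++ code.

```python
import heapq

def simulate_fcfs(trace, m):
    """Sequentially replay a trace on an m-server FCFS cluster.
    trace: (arrival_time, service_time) pairs sorted by arrival time.
    Returns the response time of every job."""
    free_at = [0.0] * m                        # earliest instant each server is free
    heapq.heapify(free_at)
    responses = []
    for arrival, service in trace:
        soonest = heapq.heappop(free_at)
        start = max(arrival, soonest)          # job queues if all servers are busy
        finish = start + service
        heapq.heappush(free_at, finish)
        responses.append(finish - arrival)
    return responses
```

Two unit-length jobs arriving together finish after 1 and 2 time units on one server, but after 1 and 1 on two servers, which is the behavior the evaluation metrics below are computed from.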

Randomly Generated Traces.
A set of randomly generated traces and two real-world traces are simulated in this work. The Pareto distribution is the most widely used heavy-tailed distribution for service times [31]. The Poisson distribution is appropriate if the arrivals come from a large number of independent sources, such as web requests [10,32]. Therefore, the randomly generated traces have Pareto distributed service times with tail indexes from 0.1 to 4.0 in steps of 0.1 and exponentially distributed arrival intervals with traffic intensities from 0.05 to 0.95 in steps of 0.05.
A randomly generated trace with tail index α and traffic intensity ρ is denoted by T_{α,ρ}. T_{α,ρ} is a series of pairs of an arrival time a_i and a service time s_i; supposing that T_{α,ρ} has n elements, it can be represented as T_{α,ρ} = {(a_1, s_1), (a_2, s_2), . . . , (a_n, s_n)}. Each unique combination of α and ρ is randomly generated 10 times; that is, there are 10 different traces for each combination of α and ρ. Each trace contains values covering 36,000 time units. All traces are generated with the same mean service time. Therefore, 7,600 randomly generated traces have been simulated in this study. The generating functions for Pareto distributed values and exponentially distributed values can be found in many textbooks, for example, [10, p. 509]. The coefficient of variation, the ratio of the standard deviation to the mean, is often used to measure the relative variation in the data. For Pareto distributed values with tail index α > 2, the coefficient of variation, denoted by CV_Pareto, can be calculated as [10] CV_Pareto = 1/√(α(α − 2)). The coefficient of variation of exponentially distributed values is 1. The coefficients of variation of the service times and arrival intervals of the generated traces are shown in Figures 3(a) and 3(b), respectively.
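Such a trace can be drawn with inverse-transform sampling. The generator below is a sketch with illustrative parameter choices: the Pareto scale is set so every service-time distribution shares the same mean, as the text requires (this needs α > 1 for the mean to exist).

```python
import random

def pareto_sample(alpha, mean, rng):
    """Pareto(alpha) value scaled so the distribution mean equals `mean`."""
    xm = mean * (alpha - 1.0) / alpha        # Pareto mean is alpha*xm/(alpha-1)
    u = 1.0 - rng.random()                   # uniform on (0, 1], avoids div by 0
    return xm / u ** (1.0 / alpha)

def generate_trace(alpha, rho, m, mu, horizon, seed=0):
    """(arrival_time, service_time) pairs: exponential interarrivals at
    rate lambda = rho*m*mu, Pareto service times with mean 1/mu."""
    rng = random.Random(seed)
    lam = rho * m * mu
    t, trace = 0.0, []
    while True:
        t += rng.expovariate(lam)            # exponential interarrival gap
        if t >= horizon:
            return trace
        trace.append((t, pareto_sample(alpha, 1.0 / mu, rng)))
```

For α = 2.5 and ρ = 0.5 on a 10-server cluster with μ = 1, a 36,000-time-unit horizon yields roughly 180,000 jobs whose service times average about one time unit, matching the trace dimensions described above.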

Real-World Traces.

This simulation adopts two real-world workload traces: a publicly available trace and a trace acquired from a university campus. The service time of a request is assumed to be proportional to its responded page size in the simulation. The publicly available trace was recorded at the 1998 World Cup web site [30]. This workload trace is one of the few logs providing server activation records. It is known for having a heavy-tailed page-size distribution with a tail index of 1.37 [30]. Each request recorded in the log contains an arrival time, a responded page size, and a server identification. The second workload trace was acquired from a university with a student population of 4,219, including 3,531 undergraduates. This web access log was collected from 12:03:59 September 19, 2014, through 00:01:39 October 21, 2014, a total of 31 days. The log is from a site hosting a student information system that provides course information, handout/homework systems, a message system, an email system, and other campus information. The log exhibits the following characteristics: 7,054,170 requests; 8 hosting servers; 5,991.64 bytes per response on average; 74.74 requests per second per server (the peak service rate); an average service time of 0.0134 seconds per request with a standard deviation of 0.227; and a tail index of 0.154 for the service time distribution.
The hourly traffic patterns of the 1998 World Cup log and the 2014 campus log are shown in Figures 4(a) and 4(b), respectively. The two logs represent two distinct service patterns: an occasional service pattern (the 1998 World Cup) and a regular service pattern (the student information system). The World Cup log shows a growth-decay pattern. An iterative pattern analogous to daily working hours is observed in the campus log. Note that there were a school break and a scheduled maintenance during the recorded period.
For the World Cup log, the simulated cluster consists of 33 servers, based on the information given in the log. For the campus log, the simulated cluster consists of 8 servers. For the randomly generated traces, the simulated cluster consists of 10 servers. The on/off controller periodically sizes the simulated cluster with the interval set at 300 seconds, which is long enough to compensate for machine boot-up delays and short enough to reflect demand changes [1,2,33].

Evaluation Metric.
Three simulation scenarios, referred to as all-on, M/M/m, and M/G/m, are performed. All servers in a cluster are always powered on in the all-on scenario; this scenario is expected to consume the most energy but to have the best service quality. The M/M/m scenario uses (22) to approximate the number of servers. The M/G/m scenario is similar to M/M/m except that (29) is used for the sizing approximation. Nielsen's [34] response time limits for usability are adopted by setting the targeted response time at 1 second and the failure threshold at 10 seconds. The objective of a VOVO scheme is to reduce energy consumption while maintaining a reasonable service quality. To gauge the performance of an approach x (denoted by R_x), measures relative to all-on are adopted instead of absolute measurements, since the all-on scenario must have the shortest response times and the highest energy consumption. The considered factors of a scenario x are as follows: (1) satisfaction, denoted by S_x, which is the portion of responses conforming to the targeted response time; (2) acceptance, denoted by A_x, which is the portion of responses being admissible (i.e., under the failure threshold); (3) energy, denoted by E_x, which is the average number of activated servers, since all servers are identical and have the same power profile.
The measurements of S_x, A_x, and E_x relative to the all-on scenario are defined in (30) to (32), for E_all-on > 0. Let w_S, w_A, and w_E be the weighting coefficients for the relative satisfaction, acceptance, and energy, respectively. The relative performance, denoted by R_x, is defined in (33).
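Since equations (30)-(33) are not reproduced in this excerpt, the sketch below shows one plausible normalization, which is my assumption rather than the paper's exact form: each factor is divided by its all-on value, quality shortfalls are penalized, and lower energy lowers the score, so that a smaller R_x is better.

```python
def relative_performance(x, allon, w_s=1.0, w_a=1.0, w_e=1.0):
    """Illustrative relative-performance score; x and allon are dicts with
    's' = satisfactory share, 'a' = acceptable share,
    'e' = average number of active servers."""
    s_rel = x["s"] / allon["s"]   # closer to 1 is better
    a_rel = x["a"] / allon["a"]   # closer to 1 is better
    e_rel = x["e"] / allon["e"]   # lower is better
    return w_s * (1.0 - s_rel) + w_a * (1.0 - a_rel) + w_e * e_rel
```

With unit weights, a scheme that keeps the same service quality on 60% of the servers scores 0.6, beating the all-on baseline's score of 1.0, which matches the "smaller is better" reading of the results.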

Simulation Results
With this relative measurement, that is, (33), the optimal solution produces the minimal value of R_x. The simulation results of the randomly generated traces are summarized by the relative performance of the simulated scenarios to all-on with all weighting coefficients set to 1. To make the results easily comprehensible, the relative performances of the M/M/m and M/G/m scenarios are visualized using gray levels. Figure 5 shows the relative performance R_x of the M/M/m and M/G/m scenarios. It is very difficult to visually differentiate Figures 5(a) and 5(b). Using the averaged values, as shown in Figure 6, it can be found that M/G/m has a slightly better performance than M/M/m. On average, based on Figure 6, the M/M/m and M/G/m scenarios outperform all-on in most cases except when the tail index is between 0.4 and 0.9. Furthermore, the averaged relative performances shown in Figure 6(a) are clearly correlated with the coefficient of variation of the service times (as shown in Figure 3(a)). This simulation result indicates that both M/M/m and M/G/m yield a worse performance than all-on for highly variable access patterns, which may imply that these approaches undersize the cluster when the variation of service times is high.
In Figures 6(a) and 6(b), the curves of M/M/m and M/G/m are indistinguishable at those scales. In fact, the relative performances of the M/M/m and M/G/m scenarios are not identical. Figure 7(a) shows the ratios of R_{M/G/m} to R_{M/M/m}. There are some regions between tail indexes 0.3 and 1.3 where the ratios are not 1, that is, not identical. In Figure 7(b), the average of R_{M/G/m}/R_{M/M/m} is always less than or equal to 1, which means that M/G/m-based sizing is more effective than M/M/m-based sizing. However, Figure 7 also shows that the difference is very small, that is, under 1% on average. Given the fluctuating nature of web traffic, M/M/m-based sizing may be adequate for empirical practices.
To examine the above findings, the two real-world traces are simulated under the previously mentioned scenarios, that is, all-on, M/M/m, and M/G/m. Figure 8 shows the cumulative distributions of the response times of the simulated real-world traces. As shown in Figure 8(a), all requests in the all-on scenario can be served within 1 second, but only approximately 80% of requests meet this targeted response time in the M/M/m and M/G/m scenarios. The curves of M/M/m and M/G/m are also indistinguishable in Figure 8(a). In Figure 8(b), more than 99.96% of requests in the all-on scenario can be served within 1 second, and more than 97% of requests meet this targeted response time in the M/M/m and M/G/m scenarios.
The curves of M/M/m and M/G/m are also indistinguishable in Figure 8(b).
Based on the relative performance R_x, Table 1 shows that M/M/m and M/G/m are very similar in both cases. As expected, all-on always has the shortest mean response time but the highest energy consumption. The proposed queueing-based sizing approaches, that is, M/M/m and M/G/m, can significantly reduce energy consumption while maintaining a reasonable service quality.

Analysis and Comparison.
Energy consumption and service quality of the server machines are two major performance measures for a cloud service provider. The above results are fully based on simulation. To evaluate the proposed strategy on a real system, a 6-hour log extracted from the World Cup trace is fed to a cluster consisting of 33 computers. In addition to the 33-node cluster, an external computer and a network switch are used in the evaluation. The on/off controller periodically sizes the cluster with the interval set at 300 seconds. The interval energy data of the cluster, excluding the external computer and the network switch, is instrumented and stored by a digital multimeter (DMM). The evaluation result is shown in Figure 9 and conforms to the simulation results. As shown in Figure 9(a), with all nodes turned on, that is, the all-on scenario, all requests are responded to within 1 second, while only approximately 92% of requests can be responded to within 1 second in either the M/M/m or the M/G/m scenario. On the other hand, both the M/M/m and M/G/m scenarios consume much less energy than the all-on scenario, as shown in Figure 9(b). Similar to the simulation results, the curves of M/M/m and M/G/m are also very close to each other in both Figures 9(a) and 9(b). The VOVO strategy has been studied for more than a decade. Many VOVO approaches [33,35-37], which dynamically size a cluster according to a preset threshold of CPU utilization or resource usage, were developed based on the designs proposed by Chase et al. [1] or Pinheiro et al.
[2]. To compare the proposed queueing-based approach with the threshold-based approaches, Pinheiro's approach [2] is simulated and denoted as the vovo scenario. In vovo, the service demand is smoothed and estimated using a cumulative moving average. vovo periodically activates one more node of the cluster when the estimated utilization rate exceeds a predefined threshold and deactivates one node otherwise. The World Cup trace is also used in the simulation of vovo. Since vovo uses a threshold on the CPU utilization rate instead of the response time as the controlling factor, 3 different threshold values, 0.7, 0.8, and 0.9, are simulated to obtain a comparable result.
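One control step of such a threshold scheme can be sketched as follows. This is an illustration of the described behavior, not Pinheiro's exact algorithm: the utilization estimate is a cumulative moving average, and the cluster grows or shrinks by one node per interval.

```python
def vovo_step(cma, n, observed_util, threshold, m_on, m_max):
    """One control interval of a threshold-based VOVO sketch.
    cma: cumulative moving average of utilization over the first n samples."""
    cma = cma + (observed_util - cma) / (n + 1)   # fold in the new sample
    if cma > threshold and m_on < m_max:
        m_on += 1                                  # activate one more node
    elif cma <= threshold and m_on > 1:
        m_on -= 1                                  # deactivate one node
    return cma, m_on
```

Note how the behavior hinges on the threshold value, which is exactly why several thresholds have to be tried to obtain a comparable result.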
The simulation results of vovo are evaluated with the metric proposed in Section 3.4 and compared with all-on and M/M/m, as shown in Table 3. From this comparison, the threshold on the CPU utilization rate has to be less than 0.8 for vovo to obtain a result comparable with M/M/m. Although vovo outperforms M/M/m with the threshold set at 0.7, it requires more nodes and therefore consumes more energy than M/M/m. To obtain a reasonable threshold value for vovo, it may be necessary to go through several runs of simulation or other lengthy procedures. On the other hand, the proposed approach minimally requires only the anticipated arrival rate λ, the service rate μ, and the desired response time τ to approximate the required number of servers, that is, (22).

Conclusion
This paper proposes two queueing-based sizing methods to periodically adjust the number of servers in a cluster. The proposed methods aim at achieving a fair energy-delay performance trade-off for server clusters. The proposed approximation formulas, that is, (22) and (29), are simple closed-form expressions, which may be implemented in a network switch for real-time processing.
From the simulation results, the schemes with the proposed approximation formulas reduce a considerable amount of energy consumption while maintaining comparable service performance under gentle service time fluctuations. However, the proposed methods tend to underestimate the number of required servers for service processes with high variability, that is, tail index between 0.3 and 1.3. A similar observation has also been documented in [5].
The relative measurements of M/M/m and M/G/m are almost undifferentiated, except that M/G/m is very slightly better than M/M/m for service processes with high variability. Although Internet workload characterization has found that the distribution of service times is heavy-tailed, periodically resizing the cluster can alleviate the situation of long jobs blocking short jobs in the waiting queue. This is because once a deactivation command is issued to a busy node, the node becomes a pending-off node that has to complete the unfinished job before turning itself off. If a long job is handled by this pending-off node, the queued jobs can be quickly assigned to other newly activated nodes in the next period without waiting for the long job to finish. Therefore, sizing the cluster based on the M/M/m model or the M/G/m model makes little difference. Based on the simulation results, the simpler M/M/m model may be adequate and preferable for sizing clusters adopting queueing-based VOVO schemes.
Server clusters are widely adopted in cloud data centers [28]. To support various kinds of services, including user-end applications and back-end activities, heterogeneity becomes common in multifunctional cloud data centers. It is common for a data center to have different groups of servers with different computation capacities. Since the basic computing units that are grouped for a specific function usually consist of the same type of machines, the proposed approach is built on the assumption of homogeneous nodes. Therefore, the proposed approach is particularly pertinent for the computing units forming the underlying base of cloud data centers. Nevertheless, extending this work to heterogeneous environments is an immediate future work of this study. The multitier system is an obvious case of server heterogeneity and is widely adopted in many enterprise systems. Many approaches have been proposed to address the applicability of queueing models to multitier systems, such as Multitier Internet Applications [25], Heterogeneous Multitier Web Clusters [38], Layered Queueing Networks (LQN) [39,40], and Power-Saving Server Farms [41]. Job dispatching [42,43] and scheduling [44,45] also arise as important issues in a heterogeneous environment. Considering these related developments and integrating the proposed approach with the existing work may be a practical way to extend this study to a heterogeneous environment.

Symbols and Definitions

λ: The job arrival rate of a queueing system
μ: The mean service rate of a server in a queueing system
1/μ: The mean service time of a server in a queueing system
σ: The standard deviation of the service times in a queueing system
m: The number of servers in a queueing system
ρ: The traffic intensity, ρ = λ/(mμ)
n: A system state, which is the same as the number of jobs in the system
P_n: The probability of state n
C: The coefficient of variation of service times in a queueing system, C = σμ
N: The number of jobs in the M/M/m system
N_1: The number of jobs in the M/M/1 system
B: The number of busy servers in the M/M/m system
E[N]: The mean value of N
E[N_1]: The mean value of N_1
E[B]: The mean value of B
T: The response time of a job in the M/M/m system
T_1: The response time of a job in the M/M/1 system
T^+: The response time of a job in the M/G/m system
T_1^+: The response time of a job in the M/G/1 system
W: The waiting time of a job in the M/M/m system
W^+: The waiting time of a job in the M/G/m system

Figure 3: The coefficients of variation (CV) of service times and arrival intervals of the generated traces.
Figure 4: The hourly traffic patterns of the 1998 World Cup log and the 2014 campus log.
E[T]: The mean value of T
E[T_1]: The mean value of T_1
E[T^+]: The mean value of T^+
E[T_1^+]: The mean value of T_1^+
E[W]: The mean value of W
E[W^+]: The mean value of W^+.

Table 1: Relative performance.

GB of memory. All nodes use Linux 2.6 as the operating system with Apache 2.2 installed. The average power demand is 20.83 Watts when an idle node waits for a request with all its parts turned on. The instrumented peak power level of a node is 26.33 Watts. The node profile of the test cluster is shown in Table 2.

Table 2: Node profile of the test cluster.

Table 3: Performance comparison of all-on, M/M/m, and vovo.