Dynamic Energy Storage Control for Reducing Electricity Cost in Data Centers

As the scale of data centers increases, electricity cost is becoming the fastest-growing element of their operating costs. In this paper, we investigate opportunities for reducing electricity cost by exploiting the energy storage facilities that data centers deploy as uninterruptible power supply (UPS) units. The basic idea is to combine the temporal diversity of electricity prices with energy storage to devise a strategy for reducing the electricity cost. The electricity cost minimization is formulated in the framework of a finite state-action discounted cost Markov decision process (MDP). We apply the Q-Learning algorithm to solve the MDP optimization problem and derive a dynamic energy storage control strategy, which does not require any a priori information on the Markov process. To address the slow convergence of the Q-Learning based algorithm, we introduce a Speedy Q-Learning algorithm. We further discuss the offline optimization problem and obtain the optimal offline solution as a lower bound on the performance of the online learning theoretic problem. Finally, we evaluate the performance of the proposed scheme using real workload traces and electricity price data sets. The experimental results show the effectiveness of the proposed scheme.


Introduction
Cloud computing is an emerging Internet-based computing paradigm that offers on-demand computing services to cloud consumers. To meet the increasing demand for computing and storage resources in cloud computing, there is a growing trend toward large-scale data centers. As more data centers are deployed and their scale increases, energy consumption cost, which includes the computing energy cost, cooling energy cost, and other energy overheads, is becoming the fastest-growing element of their operating costs. It has been estimated that energy consumption cost may amount to 30%-50% of the operating cost of large-scale data centers built by companies such as Google, Microsoft, and Facebook [1]. In fact, data centers accounted for approximately 1.5% of all electricity consumption worldwide in 2010, which was about 56% higher than the preceding five years [2,3]. In the near future, the energy consumption cost problem of data centers is likely to worsen and become more challenging as technology infrastructures emerge and are upgraded following a recessionary period. Hence, efficiently controlling the electricity cost of data centers has attracted intensive attention from both academia and industry in recent years.
Electricity cost depends not only on the total amount of energy consumed by the data centers but also on the electricity price, which is therefore an important factor in their electricity cost. With the development of smart grid technology for the next-generation power grid, more and more electricity markets are undergoing deregulation, in which the market operators offer dynamic electricity rates to large industrial and commercial customers instead of traditional flat rates at the retail level. Thus, there is an opportunity to reduce the electricity cost of data centers by observing and exploiting the time-varying electricity price in deregulated electricity markets.
Normally, UPS units are deployed in data centers to provide emergency power from stored energy when the main power system experiences an outage, until the backup diesel generators (DG) can start up and operate as a secondary power source. Usually, the transition from the main power system to the secondary power source takes 10-20 seconds. As an improvement on the rechargeable battery, however, UPS units have enough energy storage capacity to keep a data center working for 5-30 minutes at its maximum power demand [4]. Hence, this excess energy storage capacity offers a good opportunity for electricity cost saving by using the UPS units to dynamically control energy storage.
Based on the above two facts, the basic principle for saving electricity cost is to recharge the UPS units residing in the data center when the electricity price is low and to discharge them to power the data center when the electricity price is high. Hence, this paper focuses on a dynamic energy storage control strategy for reducing the electricity cost of data centers. Dynamic energy storage control is expected to adapt to fluctuations in the electricity price and the workload by dynamically making recharge/discharge decisions for the UPS units. It aims to achieve substantial electricity cost savings without performance degradation.
In this paper, we formulate the electricity cost reduction problem utilizing energy storage facilities as a discounted cost Markov decision process. Since statistical information about the workload arrivals and the electricity price is not available, we propose an online algorithm based on the Q-Learning and Speedy Q-Learning approaches to solve the optimization problem. The main contributions of this paper are summarized as follows.
(i) The problem of electricity cost minimization in data centers with energy storage facilities, under time-varying electricity prices in deregulated electricity markets, is modeled as a discounted cost Markov decision process, which achieves cost savings through decisions to recharge/discharge the battery.
(ii) In order to solve the optimization problem, we propose a dynamic energy storage control strategy based on the Q-Learning algorithm, which avoids reliance on any prior knowledge of the workload and the electricity prices. Furthermore, we introduce a Speedy Q-Learning algorithm to accelerate the convergence of standard Q-Learning.
(iii) We formulate an offline optimization problem of electricity cost minimization to obtain the optimal offline solution as a lower bound on the performance of the online learning theoretic problem. The offline optimization problem is solved by mapping it into a tractable mixed integer linear program instead of a nonlinear program.
(iv) Finally, experiments are carried out on real workload traces and electricity price data sets to evaluate the performance of the proposed scheme. Although the real traces may not provably satisfy the Markovian assumption, the results show that the proposed scheme generally performs well.
The rest of the paper is organized as follows. Section 2 presents and discusses related work in this area. Section 3 describes a system model for an energy management system using energy storage facilities in data centers. Section 4 formulates the problem of electricity cost minimization in data centers with energy storage facilities as a discounted cost Markov decision process. Section 5 is devoted to designing a dynamic battery energy storage control strategy based on the Q-Learning and Speedy Q-Learning algorithms to solve the optimization problem. The optimal offline solution is discussed in Section 6. In Section 7, we provide the numerical evaluation results and performance comparisons. Finally, conclusions are drawn in Section 8.

Related Work
The severe energy consumption problem in data centers has motivated many works on reducing their electricity cost. These works may be roughly categorized into two basic types of mechanisms: (1) reducing the energy consumption or improving the energy efficiency of the data centers; and (2) exploiting the temporal and geographical variation of electricity prices to achieve electricity cost savings.
Regarding the first mechanism, new hardware designs and engineering techniques such as energy-efficient chips, multicore servers [5], DC power supplies [6], advanced cooling systems [7,8], and virtualization [9,10] have been developed in order to improve the power usage effectiveness (PUE) of data centers. From the perspective of algorithm design, energy saving can operate at two different levels: the server level and the data center level [4]. At the server level, dynamic voltage-frequency scaling (DVFS) [11] offers a way to reduce power consumption by adapting both the voltage and the frequency of the CPU to changing workloads. However, DVFS is applicable only to components (like the CPU) that support multiple speed and voltage levels. DVFS based power saving policies can be found in [12,13]. Dynamic power management (DPM) is another energy conservation approach, which turns off the power or switches the system to a low-power state when it is inactive. It can be employed for any system component with multiple power states. In [14], DPM is applied to achieve energy-efficient computation by selectively turning off (or reducing the performance of) system components when they are idle (or partially unexploited).
At the data center level, approaches such as dynamic cluster reconfiguration (DCR) [15] and VM migration and consolidation for load balancing and power management [16] are widely discussed for reducing energy consumption in data centers. DCR in [15] develops an online measurement based algorithm that decides the number of servers to power on/off to save energy while keeping the overload probability below a desired threshold, and it makes decisions without any prior knowledge of the workload statistics. VM migration and consolidation [16] save energy by continuously consolidating VMs according to current resource utilization, the virtual network topologies connecting VMs, and the thermal state of computing nodes. The methods mentioned above mainly focus on reducing energy consumption to save electricity cost. They can operate in a complementary way alongside the method proposed in this paper to further reduce the electricity cost.
The second mechanism for reducing electricity cost relies on the notable temporal and geographical variations in electricity prices. In [1], Qureshi et al. develop and analyze a new method for reducing the electricity costs of running large Internet-scale systems. The key idea of the method is to distribute more traffic to data centers with low electricity prices. In [17], Rao et al. utilize both the location diversity and the time diversity of electricity prices in a multiple electricity market environment to minimize the total electricity cost while guaranteeing the quality of service (QoS). Luo et al. [18] propose a novel spatiotemporal load balancing approach that leverages both geographic and temporal variations of electricity price to minimize energy cost for distributed internet data centers (IDC). However, the works mentioned above do not utilize the energy storage facilities residing in data centers, which may be used to achieve further electricity cost savings. Compared with existing techniques for electricity cost reduction, energy storage methods cause no performance degradation of the data center. In this paper, our work focuses on the problem of electricity cost minimization in data centers with energy storage facilities under deregulated electricity markets where the electricity prices exhibit temporal variation, which is mainly motivated by [19]. In [19], an online control algorithm using Lyapunov optimization theory is proposed for reducing the time average electric utility bill in a data center, and the solution has a threshold structure. Although simple, the technique of Lyapunov optimization is unable to learn the system dynamics and therefore may not lead to an optimal control of energy storage. Alternatively, by exploiting a Markov decision process approach and reinforcement learning tools, the proposed algorithms learn the system dynamics and adapt the control decisions accordingly to save more electricity cost. Generally, the optimal control policies for Markov decision processes suffer from the "curse of dimensionality." In our work, we consider the total energy consumption of all components in the data center as the energy consumption state instead of each component's consumption individually. Furthermore, there are only three actions on the battery, that is, recharging, discharging, and doing neither. Together, these considerations may effectively alleviate the "curse of dimensionality."

System Model
In this section, we describe the system architecture model for energy management in a data center, present the models for the battery, energy consumption, and electricity cost, and formulate the problem of dynamic energy storage control to minimize the expected total electricity cost.

System Architecture.
A general system architecture model for a data center with energy storage facilities, depicted in Figure 1, is composed of an energy management system (EMS) and a data center facility. The EMS acts as the heart of the energy management framework and manages the energy provision in the data center, while the data center facility provides computation and storage resources for executing the submitted tasks. The key components of the EMS are the information collector (IC) and the energy storage management unit (ESMU). The IC periodically collects information on the electricity prices, the energy storage, and the energy demand generated by the data center, while the ESMU uses the information collected by the IC to decide whether to recharge or discharge the energy storage facilities so as to minimize the electricity cost. The energy storage unit (ESU), that is, the UPS, is capable of storing energy drawn from the power grid and discharging the stored energy to power the data center. Below, we use the terms UPS and battery interchangeably. The main work of this paper is to propose a dynamic energy storage control strategy for the ESMU.

The basic running process of the EMS can be described as follows. The IC periodically collects the battery level information as well as the electricity price information from the grid. The data center submits its energy demand information to the IC, and the ESMU uses this information to decide whether the energy supply is drawn from the grid or from the battery. Finally, the data center provides services using the energy managed by the EMS.

Mathematical Model.
In this subsection, we introduce the time-slotted system model used in this paper, in which time is divided into slots of equal duration of $\tau$ minutes. Note that a small slot size $\tau$ captures the state variation of the system at a fine time granularity and thus yields a better cost saving policy through prompt adaptation to changes in the system state. However, it may increase the battery cost owing to the increased frequency of switching between recharging and discharging. Therefore, the slot size should be selected appropriately. The energy storage control decisions are made at the beginning of each slot, and the system's state is assumed to be constant throughout each slot.
From [4], we know that the energy consumption demand of the data center in each slot is proportional to the total number of workload requests to be served in that slot. The workload requests served in each slot consist of the unfinished requests from the previous slot and the new requests arriving in the current slot, which implies that the energy consumption demand in each slot depends on the demand in the previous slot but not on earlier demands; that is, it satisfies the Markov property. Thus, we model the energy consumption demands of the data center as a correlated time process following a first-order discrete-time Markov model. The energy consumption demand in each slot is assumed to be known at the beginning of the slot. In reality, it has to be estimated, and there are several effective methods for estimating the workload, such as autoregressive moving-average (ARMA) models. Let $l_t$ be the energy consumption demand of the data center in slot $t$, with $l_t \in \mathcal{L} \triangleq \{l_1, l_2, \ldots, l_{N_{\mathcal{L}}}\}$, where $N_{\mathcal{L}}$ is the number of elements in $\mathcal{L}$. The elements of $\mathcal{L}$ are nonnegative and finite; that is, $0 \le l_i \le l_{\max}$ for $i = 1, \ldots, N_{\mathcal{L}}$. The state transition probability $q_{\mathcal{L}}(l_i, l_j)$ denotes the probability that the demand moves from $l_i$ to $l_j$.
The energy market usually consists of a Day-Ahead market and a Real-Time market [1]. In this paper, we consider data centers in the Real-Time market. The Real-Time market is a spot market in which the current real-time price is calculated every five minutes or so, based on actual grid operating conditions rather than expected load. The electricity price in the Real-Time electricity market in slot $t$ is denoted by $p_t$, where $p_t \in \mathcal{P} \triangleq \{p_1, p_2, \ldots, p_{N_{\mathcal{P}}}\}$, $N_{\mathcal{P}}$ is the number of elements in $\mathcal{P}$, and the elements are assumed to be nonnegative and finite; that is, $0 \le p_i \le p_{\max}$ for $i = 1, \ldots, N_{\mathcal{P}}$. Following [20], we model the electricity price $p_t$ as a Markov chain, and $q_{\mathcal{P}}(p_i, p_j)$ denotes its state transition probability.
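As an illustration of the Markov chain model for demand and price, the following minimal sketch samples a trajectory from a finite first-order chain. The function name `sample_markov_chain`, the three price levels, and the transition matrix are hypothetical values chosen for illustration; they are not taken from the paper's data sets.

```python
import random


def sample_markov_chain(states, transition, start_index, num_slots, seed=0):
    """Sample a trajectory of a finite first-order Markov chain.

    states     -- list of state values (for example, price levels)
    transition -- row-stochastic matrix; transition[i][j] is the
                  probability of moving from states[i] to states[j]
    """
    rng = random.Random(seed)
    index = start_index
    path = [states[index]]
    for _ in range(num_slots - 1):
        # Draw the next state index using the current row as weights.
        index = rng.choices(range(len(states)), weights=transition[index])[0]
        path.append(states[index])
    return path


# Hypothetical three-level price chain (illustrative numbers only).
price_levels = [20.0, 40.0, 80.0]
price_transition = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
]
prices = sample_markov_chain(price_levels, price_transition, 0, 12)
```

The same helper could be used for the demand process by passing demand levels and their transition matrix instead.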
In current data centers, UPS units typically use lead-acid batteries, whose operation has several practical characteristics. Each recharge-discharge cycle loses energy to AC-DC conversion, so the battery is not perfectly efficient; its performance is governed by the recharge efficiency $\eta_r \in (0, 1]$ and the discharge efficiency $\eta_d \in (0, 1]$ [21]. The energy in the battery also dissipates over time; that is, the battery is leaky. However, since this storage leakage is much smaller than the quantities of interest to us, it is negligible for lead-acid batteries [19]. The recharging rate $r$ is assumed to be constant, which is a reasonable assumption when the battery is recharged in constant-current mode. To account for the impact of repeated recharging and discharging on the battery's lifetime, we assume that each recharge and discharge operation incurs a fixed cost of $C_r$ and $C_d$, respectively. From [19], we have $C_r = C_d = C_b / N$ when a new battery costs $C_b$ dollars and can sustain $N$ recharge/discharge cycles. Let $b_t$ be the battery energy level in slot $t$, which is no more than the battery capacity $b_{\max}$; that is, $b_t \le b_{\max}$ for all $t$. The UPS unit is mainly employed to power the data center from stored energy in case of a power failure, before the backup diesel generators start up and provide power. In order to ensure the reliability of the data center, the battery energy level $b_t$ is required to stay above a minimum energy level $b_{\min} \ge 0$; that is, $b_t \ge b_{\min}$ for all $t$. Hence, the battery energy level $b_t$ is subject to the constraint

$$b_{\min} \le b_t \le b_{\max}. \tag{1}$$

Let $x_t \in \{-1, 0, 1\}$ be the decision variable indicating whether the battery is recharged or discharged in slot $t$. We assume that recharge and discharge operations cannot be done simultaneously; that is, we can recharge the battery, discharge it, or do neither, but not both. Thus, $x_t$ can be defined as follows:

$$x_t = \begin{cases} 1, & \text{if recharging the battery in slot } t, \\ -1, & \text{if discharging the battery in slot } t, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

Let $R_t$ represent the amount of energy bought to recharge the battery in slot $t$, and let $D_t$ denote the battery energy used towards satisfying the demand in slot $t$. Then, $R_t$ and $D_t$ can be expressed in terms of $x_t$ (equation (3)), being nonzero only when $I(x_t) = 1$ and $I(-x_t) = 1$, respectively, where $I(\cdot)$ is an indicator function, defined as

$$I(x) = \begin{cases} 1, & x > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

The update equation for the battery energy level $b_{t+1}$ in slot $t + 1$ can be expressed as

$$b_{t+1} = b_t + \eta_r R_t - \frac{1}{\eta_d} D_t, \tag{5}$$

where $\eta_r R_t$ implies that the energy purchased to recharge the battery is reduced by the recharge efficiency $\eta_r$, while $(1/\eta_d) D_t$ implies that only a fraction $\eta_d$ of the discharged energy is converted into electricity under the discharge efficiency $\eta_d$. According to inequality (1), the battery level $b_{t+1}$ can neither exceed its maximum capacity nor fall below the minimum battery level. Therefore, $R_t$ and $D_t$ have to satisfy the following constraints:

$$0 \le R_t \le \frac{b_{\max} - b_t}{\eta_r}, \qquad 0 \le D_t \le \eta_d \, (b_t - b_{\min}). \tag{6}$$

Let $G_t$ represent the external energy drawn from the power grid in slot $t$, which is used to power the data center and recharge the battery. As shown in Figure 1, in order to meet the energy consumption demand $l_t$ of the data center in slot $t$, we have

$$l_t = G_t - R_t + D_t. \tag{7}$$

Thus, the total amount of energy drawn from the grid in slot $t$ can be written as

$$G_t = l_t + R_t - D_t. \tag{8}$$

For notational simplicity, $G_t$ can also be written with the indicator function $I(\cdot)$ by substituting (3) into (8), which gives (9). Define $C_t$ as the total immediate cost incurred in slot $t$. Then, for all $t$, we have

$$C_t = p_t G_t + I(x_t) C_r + I(-x_t) C_d, \tag{10}$$

where $p_t$ is the electricity price in slot $t$, the term $p_t G_t$ is the electricity cost of the energy drawn in slot $t$, and the term $I(x_t) C_r + I(-x_t) C_d$ represents the battery cost of each recharge and discharge operation.
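The battery dynamics and per-slot cost described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `battery_step`, the default efficiencies, capacity limits, and operation costs are assumptions, and the recharged and discharged energy amounts are passed in explicitly because their exact expressions in terms of the decision are not reproduced here.

```python
def indicator(x):
    """Indicator I(x): 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0


def battery_step(b, demand, price, x, recharge_amount, discharge_amount,
                 eta_r=0.8, eta_d=0.8, b_min=1.0, b_max=10.0,
                 c_r=0.1, c_d=0.1):
    """Advance the battery by one slot and return (b_next, grid_energy, cost).

    x is the decision: 1 = recharge, -1 = discharge, 0 = neither.
    All default parameter values are illustrative assumptions.
    """
    if x not in (-1, 0, 1):
        raise ValueError("x must be -1, 0, or 1")
    R = indicator(x) * recharge_amount       # energy bought for recharging
    D = indicator(-x) * discharge_amount     # battery energy serving demand
    if D > demand:
        raise ValueError("cannot discharge more than the demand")
    # Battery update: recharge is scaled by eta_r, discharge costs D / eta_d.
    b_next = b + eta_r * R - D / eta_d
    if not (b_min <= b_next <= b_max):
        raise ValueError("battery level would leave [b_min, b_max]")
    G = demand + R - D                       # energy drawn from the grid
    cost = price * G + indicator(x) * c_r + indicator(-x) * c_d
    return b_next, G, cost
```

For example, recharging one unit at price 10 with a demand of 2 draws 3 units from the grid, while discharging 2 units to cover the whole demand draws nothing and only incurs the discharge operation cost.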
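In this paper, the goal of dynamic energy storage control is to minimize the expected total electricity cost in data centers with energy storage facilities. Based on the above models, with $C_t$ the total immediate cost in slot $t$ and $x_t$ the recharge/discharge decision, the problem can be formulated as follows:

$$\min_{\{x_t\}} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t C_t\right] \quad \text{s.t. (5) and (6)}, \tag{11}$$

where $\mathbb{E}[\cdot]$ denotes the expectation operator, and $0 < \gamma < 1$ is the discount factor that represents value reduction over time.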
The reason for considering discounted electricity costs is to emphasize early decisions and costs, in order to emulate the effect of reduced battery efficiency over time. Note that the total discounted electricity cost is finite, since the per-slot costs are bounded. We call this problem the expected total electricity cost minimization problem (ETC-problem), as the data center aims at minimizing the total electricity cost.
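The finiteness claim follows from a geometric series. As a short worked bound, assume a uniform bound $C_{\max}$ on the per-slot cost (this symbol is introduced here only for illustration) and write $\gamma$ for the discount factor:

```latex
0 \le C_t \le C_{\max} \ \text{for all } t
\quad\Longrightarrow\quad
\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} C_t\Big]
  \le \sum_{t=0}^{\infty} \gamma^{t} C_{\max}
  = \frac{C_{\max}}{1-\gamma},
\qquad 0 < \gamma < 1 .
```

For instance, with $\gamma = 0.9$ the total discounted cost is at most $10\,C_{\max}$.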
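According to (10), (11) can be rewritten as

$$\min_{\{x_t\}} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \left( p_t G_t + I(x_t) C_r + I(-x_t) C_d \right)\right] \quad \text{s.t. (5) and (6)}, \tag{12}$$

where $p_t$ is the electricity price, $G_t$ is the energy drawn from the grid, and $C_r$ and $C_d$ are the fixed costs of a recharge and a discharge operation.

Discussion.
In a data center, lower-level management routines such as server consolidation and the instantiation of new VMs may be executed. Different management routines may have different energy consumption demand profiles. However, once the lower-level management routine is given, the demand profile for the workload is determined and can be modeled mathematically. Hence, the above model can still be applied to save electricity cost.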

Cost Management Problem as an MDP
In this section, we map problem (12) into the framework of a Markov decision process (MDP). A Markov decision process, also referred to as a discrete time stochastic control process, provides a mathematical framework for modeling decision-making situations where outcomes are partly random and partly under the control of the decision maker [22]. An MDP can be defined via a 4-tuple $\langle \mathcal{S}, \mathcal{A}, P_a(s, s'), C_a(s, s') \rangle$, where
(i) $\mathcal{S}$ is the finite set of states,
(ii) $\mathcal{A}$ is the finite set of actions,
(iii) $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ denotes the probability that the system is in state $s' \in \mathcal{S}$ at the $(t + 1)$th slot when the decision maker chooses action $a \in \mathcal{A}$ in state $s$ at the $t$th slot,
(iv) $C_a(s, s')$ denotes the immediate cost incurred when the state of the system at the $t$th slot is $s$, action $a \in \mathcal{A}$ is selected, and the system occupies state $s'$ at the $(t + 1)$th slot.
The energy management system in the data center, as described above, can be formulated as a finite-state discrete-time MDP. In the model, let $s_t$ denote the joint state (hereafter, state) of the system at the $t$th slot, which consists of the energy consumption demand $l_t$, the battery energy level $b_t$, and the electricity price $p_t$. Thus, $s_t$ can be expressed as the triple $(l_t, b_t, p_t)$. Since all components of $s_t$ are discrete and finite, the number of elements in $\mathcal{S}$ is finite, and the set of states can be denoted by $\mathcal{S} = \{s_1, s_2, \ldots, s_{N_{\mathcal{S}}}\}$, where $N_{\mathcal{S}}$ is the number of elements in $\mathcal{S}$. An action set represents all allowable actions in all possible states. According to the definition of $x_t$ in (2), let $\mathcal{A} = \{-1, 0, 1\}$ be the set of actions for the system, where action $-1$ indicates that the battery is discharged, action $1$ indicates that the battery is recharged, and action $0$ indicates that the battery is neither recharged nor discharged. A policy specifies the decision rule to be used at all decision epochs, where a decision epoch is a time at which a decision is made.
The policy provides the decision maker with a prescription for action selection under any possible future system state or history. The policy $\pi = \{\pi_t, t \ge 0\}$ maps the state space to the action space. In this paper, we restrict our attention to stationary deterministic policies, which do not depend on time but only on the current state. Let $P_{a_t}(s_t, s_{t+1})$ denote the transition probability from state $s_t$ to state $s_{t+1}$ when action $a_t$ is taken. The immediate electricity cost is $C_{a_t}(s_t, s_{t+1})$ when the action $a_t \in \mathcal{A}$ is taken in state $s_t$ at the $t$th slot and the state then changes to $s_{t+1}$ at the $(t + 1)$th slot. Thus, the objective of the MDP is to find the optimal energy storage control policy $\pi(\cdot) : \mathcal{S} \to \mathcal{A}$ that minimizes the expected total discounted electricity cost of the energy consumption in the data center over an infinite time horizon. Here the immediate cost function can be expressed as

$$C_{a_t}(s_t, s_{t+1}) = p_t G_t + I(a_t) C_r + I(-a_t) C_d, \tag{13}$$

where $G_t$ is the energy drawn from the grid in slot $t$. The expected total discounted electricity cost is equivalent to (12), and $a_t = \pi(s_t)$ is the action taken when the system is in state $s_t$.
As described in Section 3, the energy consumption demand and the electricity price are described by their state transition probability functions, while the battery energy level $b_{t+1}$ is uniquely determined by the update equation (5) under the given policy $\pi$ and the current system state $s_t$. Since the system state consists of the energy consumption demand, the battery energy level, and the electricity price, the transition of the system state depends only on the current state and the current action. This means that the model described above satisfies the Markov property: the next state depends only on the current state and not on earlier states. Thus, we can make use of dynamic programming (DP) and reinforcement learning (RL) theories to solve problem (12). For convenience, we introduce the definitions of the state-value function and the action-value function before solving the MDP problem [23].
In searching for an optimal policy, the decision maker needs a way to compare the desirability of possible successor states in order to decide on the best action. A common way to rank states is to compute a so-called state-value function, which estimates the expected discounted sum of costs when starting in a specific state $s_t$ and taking actions determined by policy $\pi$. Accordingly, the state-value function for policy $\pi$ is defined as follows:

$$V^{\pi}(s_t) = \sum_{s_{t+1} \in \mathcal{S}} P_{\pi(s_t)}(s_t, s_{t+1}) \left[ C_{\pi(s_t)}(s_t, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) \right]. \tag{14}$$

Equation (14) is also called the Bellman equation for $V^{\pi}$, and it expresses a relationship between the value of a state and the values of its successor states.
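Similarly, define the value of taking action $a_t$ in state $s_t$ under the policy $\pi$, denoted by $Q^{\pi}(s_t, a_t)$, as the expected discounted cost of starting from $s_t$, taking the action $a_t \in \mathcal{A}$, and thereafter following policy $\pi$. $Q^{\pi}(s_t, a_t)$ is expressed as

$$Q^{\pi}(s_t, a_t) = \sum_{s_{t+1} \in \mathcal{S}} P_{a_t}(s_t, s_{t+1}) \left[ C_{a_t}(s_t, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) \right] \tag{15}$$

and is referred to as the action-value function for policy $\pi$.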
For finite MDPs, an optimal policy can be precisely defined in the following way. A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if the expected discounted cost of $\pi$ is less than or equal to that of $\pi'$ for all states. In other words, $\pi \ge \pi'$ if and only if $V^{\pi}(s_t) \le V^{\pi'}(s_t)$ for all $s_t \in \mathcal{S}$. Let $\pi^*$ be the optimal policy, which is better than or equal to all other policies. Accordingly, the state-value function under the optimal policy $\pi^*$ is

$$V^{\pi^*}(s_t) = \min_{a_t \in \mathcal{A}} \sum_{s_{t+1} \in \mathcal{S}} P_{a_t}(s_t, s_{t+1}) \left[ C_{a_t}(s_t, s_{t+1}) + \gamma V^{\pi^*}(s_{t+1}) \right]. \tag{16}$$

Intuitively, (16) expresses the fact that the value of a state under the optimal policy $\pi^*$ must equal the expected discounted cost of the best action from that state. The optimal policy is therefore greedy with respect to the optimal values. According to (16), the optimal action-value function $Q^{\pi^*}(s_t, a_t)$ under the optimal policy $\pi^*$ can be written as

$$Q^{\pi^*}(s_t, a_t) = \sum_{s_{t+1} \in \mathcal{S}} P_{a_t}(s_t, s_{t+1}) \left[ C_{a_t}(s_t, s_{t+1}) + \gamma \min_{a_{t+1} \in \mathcal{A}} Q^{\pi^*}(s_{t+1}, a_{t+1}) \right]. \tag{17}$$

As seen from the above analysis, in order to minimize the expected total electricity cost, we can obtain the optimal policy from the optimal Q-values instead of estimating the demand and real-time electricity prices to solve (11) directly. However, evaluating (17) requires prior information on the values of $P_{a_t}(s_t, s_{t+1})$. Unfortunately, an accurate probability distribution of the state transitions is usually difficult to know beforehand in practice. Consequently, $Q^{\pi^*}$ and $\pi^*$ cannot be computed using value iteration. To overcome this difficulty, we apply a model-free learning theoretic algorithm based on RL to arrive at an optimal policy $\pi^*$, which minimizes the expected discounted total cost by taking actions and observing their corresponding costs. The detailed learning theoretic algorithm is presented in the next section.
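To make the role of the transition probabilities concrete, the sketch below runs value iteration on a tiny MDP whose model is fully known, which is exactly the information that is unavailable in the data center setting. The two-state toy problem, the function name `value_iteration`, and the data layout of `transitions` are hypothetical and serve only to illustrate the Bellman optimality recursion for cost minimization.

```python
def value_iteration(states, actions, transitions, gamma, tol=1e-10):
    """Solve a small cost-minimizing MDP with a known model.

    transitions[(s, a)] is a list of (probability, next_state, cost).
    Returns the optimal state values and a greedy (optimal) policy.
    """
    def q_value(V, s, a):
        # Expected immediate cost plus discounted value of the successor.
        return sum(p * (c + gamma * V[s2]) for p, s2, c in transitions[(s, a)])

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = min(q_value(V, s, a) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: min(actions, key=lambda a: q_value(V, s, a)) for s in states}
    return V, policy


# Hypothetical deterministic toy MDP: state 1 is free to stay in,
# state 0 costs 1.0 per slot to stay in and 1.5 to leave.
toy = {
    (0, 0): [(1.0, 0, 1.0)],
    (0, 1): [(1.0, 1, 1.5)],
    (1, 0): [(1.0, 1, 0.0)],
    (1, 1): [(1.0, 0, 0.0)],
}
V, policy = value_iteration([0, 1], [0, 1], toy, gamma=0.5)
```

Every call to `q_value` reads `transitions`, so this approach cannot be used when the transition probabilities are unknown, which motivates the model-free method.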

Learning Theoretic Algorithm
In this section, we introduce the learning theoretic algorithms, namely, Q-Learning and Speedy Q-Learning, that we use to find the optimal energy storage control policy. Q-Learning is a reinforcement learning (RL) algorithm for solving MDP problems, and it directly estimates $Q^{\pi^*}$ under the assumption that the system's dynamics are completely unknown a priori. It is a well-known model-free algorithm whose main advantages are its simple implementation and its online operation [24]. Hence, Q-Learning is well suited to our ETC-problem. The core of the Q-Learning algorithm is a Q-table together with a rule for updating the Q-table and choosing actions. The Q-table $Q(s_t, a_t)$ is a matrix indexed by state $s_t$ and action $a_t$, whose entries estimate the expected discounted cost of taking action $a_t$ in state $s_t$.
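According to (15), the action-value function $Q^{\pi}(s_t, a_t)$ can be expressed as a combination of the expected immediate cost and the state-value function $V^{\pi}(s_{t+1})$ of the next state when following the policy $\pi$. Note that $Q^{\pi}(s_t, a_t)$ captures the expected long-term consequences of each state-action pair. The action incurring the lowest expected cost can then be taken as the optimal action simply by inspecting $Q^{\pi^*}(s_t, a_t)$. Hence, the optimal action-value function allows optimal actions to be selected without knowledge of the transition probabilities. Q-Learning estimates it iteratively from observed transitions with the update

$$Q_{k+1}(s_t, a_t) = (1 - \alpha_k) Q_k(s_t, a_t) + \alpha_k \left[ C_{a_t}(s_t, s_{t+1}) + \gamma \min_{a \in \mathcal{A}} Q_k(s_{t+1}, a) \right], \tag{18}$$

where $\alpha_k$ is the learning rate in the $k$th iteration, which weighs the newly learnt experience. The sequence $Q_k(s_t, a_t)$ can be proven to converge with probability 1 to $Q^{\pi^*}(s_t, a_t)$ as $k \to \infty$ when $\alpha_k$ satisfies the stochastic approximation conditions $0 < \alpha_k < 1$, $\sum_{k} \alpha_k = \infty$, and $\sum_{k} \alpha_k^2 < \infty$. $Q_0(s_t, a_t)$ can be initialized arbitrarily for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$.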
Based on the above discussion, the estimate $Q_k(s_t, a_t)$ can be used for determining an action. However, the optimal action can be determined only from an accurate estimate of $Q^{\pi^*}(s_t, a_t)$. Otherwise, there will always be cases in which the action with the current minimum estimated cost does not produce the truly lowest cost. During the learning process, unguided randomized exploration cannot guarantee acceptable performance, while taking greedy actions that exploit the information available in $Q_k(s_t, a_t)$ can guarantee a certain level of performance but prevents the discovery of better actions. In order to estimate $Q^{\pi^*}(s_t, a_t)$ accurately, the action selection method should balance exploitation and exploration so that the EMS reinforces its evaluation of the actions it already knows to be good while also exploring new actions. Here, we consider the $\epsilon$-greedy method. At each slot, this method selects a random action (explores) with probability $\epsilon$ and the best action (exploits), that is, the one with the lowest Q-value at the moment, with probability $1 - \epsilon$, where $0 < \epsilon < 1$. The exploration probability $\epsilon$ therefore allows Q-Learning to keep exploring other possible actions in a new environment even when they do not currently have the lowest estimated cost.
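A minimal sketch of tabular Q-Learning with epsilon-greedy action selection for cost minimization is given below. The function names, the `env_step` callback interface, the visit-count based default learning rate, and the two-state toy environment are assumptions made for illustration; in the data center setting the environment would return the next (demand, battery level, price) state and the per-slot electricity cost.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the lowest-cost action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return min(actions, key=lambda a: Q[(s, a)])


def q_learning_update(Q, s, a, cost, s_next, actions, alpha, gamma):
    """One Q-Learning step for costs (min over next actions instead of max)."""
    target = cost + gamma * min(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target


def q_learning(env_step, states, actions, start, num_steps, gamma=0.9,
               epsilon=0.1, learning_rate=lambda n: 1.0 / (n + 1), seed=0):
    """Run tabular Q-Learning; env_step(s, a, rng) returns (s_next, cost)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    visits = {key: 0 for key in Q}
    s = start
    for _ in range(num_steps):
        a = epsilon_greedy(Q, s, actions, epsilon, rng)
        s_next, cost = env_step(s, a, rng)
        visits[(s, a)] += 1
        # The learning rate depends on how often this pair has been visited.
        alpha = learning_rate(visits[(s, a)])
        q_learning_update(Q, s, a, cost, s_next, actions, alpha, gamma)
        s = s_next
    return Q


def toy_env(s, a, rng):
    """Hypothetical deterministic two-state environment (illustration only)."""
    table = {(0, 0): (0, 1.0), (0, 1): (1, 1.5),
             (1, 0): (1, 0.0), (1, 1): (0, 0.0)}
    return table[(s, a)]


# A constant learning rate is used here only because the toy is deterministic.
Q = q_learning(toy_env, [0, 1], [0, 1], start=0, num_steps=20000,
               gamma=0.5, epsilon=0.2, learning_rate=lambda n: 0.5)
```

The learner never reads transition probabilities; it only observes the sampled next state and cost returned by `env_step`.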
Although it has been shown that the sequence $Q_k(s_t, a_t)$ converges to the optimal action-value function $Q^{\pi^*}(s_t, a_t)$, Q-Learning suffers from slow convergence when the discount factor $\gamma$ is close to one. To address this problem, the asynchronous Speedy Q-Learning (ASQL) method is applied to improve the convergence rate. At each slot, ASQL uses two successive estimates of the action-value function to update the Q-values, which achieves a faster convergence rate than standard Q-Learning. The update process for ASQL is described as follows:

$$Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + \alpha_k \left( Q_{\text{est1}}(s_t, a_t) - Q_k(s_t, a_t) \right) + (1 - \alpha_k) \left( Q_{\text{est2}}(s_t, a_t) - Q_{\text{est1}}(s_t, a_t) \right), \tag{19}$$

where the action $a_t$ is chosen in state $s_t$ using the $\epsilon$-greedy exploration method, and the system then occupies state $s_{t+1}$. Then, $\forall k \ge 0$, $Q_{\text{est1}}(s_t, a_t)$ and $Q_{\text{est2}}(s_t, a_t)$ are calculated, respectively, by

$$Q_{\text{est1}}(s_t, a_t) = C_{a_t}(s_t, s_{t+1}) + \gamma \min_{a \in \mathcal{A}} Q_{k-1}(s_{t+1}, a), \tag{20}$$

$$Q_{\text{est2}}(s_t, a_t) = C_{a_t}(s_t, s_{t+1}) + \gamma \min_{a \in \mathcal{A}} Q_k(s_{t+1}, a). \tag{21}$$

In the ASQL algorithm, $\alpha_k$ decays linearly with time; that is, $\alpha_k = 1/(k + 1)$, where $k$ is the number of learning iterations. Note that other (polynomial) learning steps can also be used with Speedy Q-Learning. However, it has been shown that the rate of convergence of ASQL is optimized for $\alpha_k = 1/(k + 1)$ [26]. Intuitively, the third term on the right-hand side of (19) plays no role for small $k$, when $\alpha_k \approx 1$, and aggressive steps are taken as $k$ increases while the difference $Q_{\text{est2}}(s_t, a_t) - Q_{\text{est1}}(s_t, a_t)$ is large. Further, when $k$ is very large, the estimation error goes to zero as $Q_k$ approaches its optimal value $Q^{\pi^*}$, so that $Q_{\text{est2}}(s_t, a_t) \approx Q_{\text{est1}}(s_t, a_t)$ and the third term no longer affects the updates.
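The Speedy Q-Learning update can be sketched as a single function that keeps both the current and the previous Q-table. This is a minimal illustration under stated assumptions: the function name `speedy_q_update` and the dictionary-based tables are hypothetical, and only the entry of the visited state-action pair is shifted into the previous table, which is one common way to realize an asynchronous variant.

```python
def speedy_q_update(Q, Q_prev, s, a, cost, s_next, actions, alpha, gamma):
    """One asynchronous Speedy Q-Learning step for cost minimization.

    Q      -- current estimates, dict keyed by (state, action)
    Q_prev -- estimates from the previous update of each pair
    """
    # First estimate uses the previous table, second uses the current one.
    est1 = cost + gamma * min(Q_prev[(s_next, b)] for b in actions)
    est2 = cost + gamma * min(Q[(s_next, b)] for b in actions)
    new_value = (Q[(s, a)]
                 + alpha * (est1 - Q[(s, a)])
                 + (1.0 - alpha) * (est2 - est1))
    # Shift the current value into the previous table before overwriting.
    Q_prev[(s, a)] = Q[(s, a)]
    Q[(s, a)] = new_value
    return new_value
```

With a learning rate of 1 the last term vanishes and the update reduces to replacing the entry by the first estimate, which matches the intuition that the correction term only matters once the learning rate has decayed.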
By applying the proposed scheme, we can obtain the optimal energy storage control policy for minimizing the electricity cost using the storage facilities in data centers. The detailed procedure of the proposed scheme is presented in Algorithm 1.

Optimal Offline Solution
In this section, we give a lower bound on the performance of the learning theoretic problem through the optimal offline solution, which serves as a benchmark for evaluating the optimality of the proposed learning theoretic algorithm. To formulate the offline optimization problem, we assume that all future workload arrivals and electricity price variations are known noncausally before the energy storage control decisions are made. This information can be obtained in advance from the traces of the workload and electricity price. The online learning theoretic problem optimizes the expected total electricity cost over an infinite horizon, while the offline solution does so over a realization of the MDP for a finite number of time slots. As previously described, an MDP realization is a sequence of state transitions of the workload, battery energy level, and electricity price state processes for a finite number of time slots. Hence, in the offline problem we can optimize $x_n$ such that the expected total electricity cost is minimized for a given MDP realization. According to (12), the offline optimization problem can be written as the minimization problem (22a)-(22e), in which $x_n \in \{0, 1\}$, $n = 0, \ldots, N$, where $\mathbf{X} = [x_0, x_1, \ldots, x_N]$ and $\mathbf{B} = [b_0, b_1, \ldots, b_N]$.
From definition (4), it can be seen that the function defined there is nonlinear. Hence, the problem in (22a)-(22e) is a nonlinear programming (NLP) problem, in which the objective function or some of the constraints are nonlinear [27]. Nonlinear optimization problems are generally difficult to solve. For this reason, we show that (22a)-(22e) can be mapped into a tractable linear program before solving it.
Let us define the following binary variables for the recharge and discharge operations in slot $n$: $r_n$ indicates the recharge operation in slot $n$, while $d_n$ indicates the discharge operation in slot $n$. Under the assumption that the recharge and discharge operations cannot be performed simultaneously, $r_n$ and $d_n$ are constrained in each slot as follows:

$$r_n + d_n \le 1.$$

Here, we define the vector $a_n = (r_n, d_n)$ as the joint decision variable that controls the recharge/discharge operation in slot $n$. As a result, the optimization problem (22a)-(22e) can be rewritten as the minimization problem (26a)-(26f), in which $r_n, d_n \in \{0, 1\}$, $n = 0, \ldots, N$, where $\mathbf{A} = [a_0, a_1, \ldots, a_N]$ is an optimal sequence of control decisions for (26a)-(26f).
From (26a)-(26f), we can observe that the objective and constraint functions are linear. Moreover, the recharge and discharge decision variables $r_n$ and $d_n$ are constrained to be binary. Therefore, the problem in (26a)-(26f) is a mixed integer linear programming (MILP) problem. Many existing tools can solve MILP problems, such as GLPK [28], YALMIP [29], and lp_solve [30]. In this paper, we employ lp_solve to solve the proposed MILP problem. lp_solve is a free linear (integer) programming solver based on the revised simplex method and the branch-and-bound method for the integers, and it can solve pure linear, (mixed) integer/binary, semicontinuous, and special ordered sets (SOS) models.
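To make the role of the binary recharge/discharge pair concrete, the following sketch finds an offline optimum for a very short, fully known trace by exhaustive enumeration. It is not the MILP formulation and does not use lp_solve; the battery model (a fixed recharge amount `step` per slot, discharge limited by the stored energy and the demand, a fully efficient battery) and all names are simplifying assumptions for illustration.

```python
from itertools import product


def offline_optimum(prices, demands, b_max, step, b0=0.0):
    """Exhaustively search recharge/discharge sequences for a known trace.

    Each slot uses a pair (r, d) of binary decisions with r + d <= 1:
    (1, 0) recharges by `step`, (0, 1) discharges up to `step`, (0, 0) idles.
    Returns (minimum total cost, best sequence of (r, d) pairs).
    """
    actions = [(0, 0), (1, 0), (0, 1)]  # r + d <= 1 rules out (1, 1)
    best_cost, best_plan = float("inf"), None
    for plan in product(actions, repeat=len(prices)):
        level, total, feasible = b0, 0.0, True
        for (r, d), price, demand in zip(plan, prices, demands):
            if r:
                if level + step > b_max:  # recharge would overflow the battery
                    feasible = False
                    break
                level += step
                total += price * (demand + step)  # buy demand plus stored energy
            elif d:
                used = min(step, level, demand)  # limited by stock and by need
                level -= used
                total += price * (demand - used)
            else:
                total += price * demand
        if feasible and total < best_cost:
            best_cost, best_plan = total, plan
    return best_cost, best_plan
```

The search space grows as $3^N$ in the number of slots, which is why a real trace calls for a MILP solver rather than enumeration.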

Performance Evaluation
In this section, we quantitatively characterize the performance of the proposed dynamic energy storage control scheme. Real-world workload traces and electricity price data sets are employed to evaluate the performance of the proposed scheme. In the following, we elaborate on the design of the experiments and present the experimental results.

Experimental Setup.
In the experiments, we simulated a cloud-scale data center that hosts up to $2 \times 10^4$ servers [4]. For simplicity, we assume that the servers in the data center are homogeneous; the experiments can be extended to heterogeneous servers with minor modifications. In order to evaluate the performance of the proposed scheme, we conducted experiments based on real-world workloads and electricity price data sets.

Workload Data.
The real workload requests are extracted from trace data gathered from the Intel Netbatch Grid in 2012 [31]. We set the time slot size to 15 minutes and count the number of job requests executed in each slot. The original trace covers only one month, so we repeat it to obtain a three-month workload trace for the performance evaluation. Figure 2 shows the variations of workload requests in each 15 min period over four days. To perform the experiment with the larger-scale workload of a data center, the number of requests extracted from the Intel Netbatch Grid has to be scaled up. One approach to scaling up the workload is to capture the underlying structure of the trace by separating the steady part from the random part of the original workload trace, scaling up the steady part, and adding the random part back to it. However, this requires an appropriate method for capturing the characteristics of the random part, and we leave this approach to future work. In the current experiment, we assume that the number of users is scaled up by 1000 times, and accordingly the number of requests is statistically scaled up by 1000 times. The normal power consumption demand $l_n$ for the workload requests of the data center in each slot $n$ can be approximated by the following formula [4]:

$$l_n = \mathrm{PUE} \cdot M_n \left( \kappa \mu_n^{\nu} + \beta \right),$$

where $\kappa$, $\beta$, $\nu$, and PUE are constants determined by the data center. In particular, $\beta$ is the average energy consumption of a server in one slot when it is idle, and $\mu_n$ denotes the number of workload requests served by one server in slot $n$. Hence, $\kappa \mu_n^{\nu} + \beta$ gives the energy consumption of one server when it serves $\mu_n$ requests in one slot. $M_n$ denotes the number of active servers in each slot $n$ and has the maximum value $M = 2 \times 10^4$. PUE is the ratio of the total power drawn by a data center facility (including cooling power) to the IT equipment power. In today's energy-efficient data centers, the value of PUE generally lies in the interval from 1.1 to 2.0; for example, Google data centers had an average PUE of 1.12 in 2012 [32]. In our experiment, we set PUE = 1.2. The Intel Netbatch Grid is used by Intel for running chip-simulation workloads, and it takes considerable time for one server to serve one request. According to the real workload trace, the average service time of each workload request served by one server is 7-8 minutes, so we set $\mu_n = 2$.
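According to [4], the power coefficient $\kappa = 12.5$, the idle power $\beta = 150$ Watt, and the exponent $\nu = 3$ when the CPU type of the servers in the data center is AMD Athlon and the service rate is 2 requests/s. Accordingly, after calculation we set $\kappa = 11250$, $\beta = 13500$ Watt, and $\nu = 3$ in our experiment.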
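The demand model described above (a per-server term that grows with the $\nu$-th power of the requests served plus an idle term, multiplied by the number of active servers and by PUE) can be sketched as follows. The multiplicative form is an assumption inferred from the definitions in the text, and the function and parameter names are hypothetical.

```python
def slot_energy_demand(active_servers, requests_per_server, coef, idle, nu=3, pue=1.2):
    """Approximate the data center energy demand in one slot.

    Per-server energy is coef * requests_per_server**nu + idle; the facility
    total multiplies this by the number of active servers and by PUE.
    The units follow whatever units coef and idle are given in.
    """
    per_server = coef * requests_per_server ** nu + idle
    return pue * active_servers * per_server
```

With PUE set to 1, the result reduces to the IT equipment demand alone, which makes the facility overhead easy to isolate.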

Electricity Price Data.
We use real-time electricity prices for Houston obtained from the Electric Reliability Council of Texas (ERCOT); these prices vary on a 15 min basis [33]. The time horizon we consider in the experiment covers the period from January 1 to March 31, 2013. In these three months, there are 8640 real-time electricity price samples. Figure 3 shows the variation of the real-time electricity price at Houston from January 3 to January 6, 2013.
In the experiments, we simulated a time-slotted system with a slot duration of 15 minutes, that is, $\tau = 15$. The unit for energy consumption and battery energy level is MWh, and the unit for the real-time electricity price is $/MWh. We discretize the energy consumption demand into 4 bins, with the boundaries specified by {[0, 950), [950, 1000), [1000, 1050), [1050, +∞)} MWh, and choose the energy consumption demand state space to be $\mathcal{L}$ = {950, 1000, 1050, 1100} MWh. Similarly, the real-time electricity price is discretized into 4 bins, with the boundaries specified by {[0, 20), [20, 30), [30, 40), [40, +∞)} $/MWh, and $\mathcal{C}$ = {20, 25, 35, 40} $/MWh is chosen as the electricity price state space. For a given maximum battery capacity $B_{\max}$, we also discretize the battery energy level into 4 equal-interval bins, with the boundaries specified by {[0, 0.25$B_{\max}$), [0.25$B_{\max}$, 0.5$B_{\max}$), [0.5$B_{\max}$, 0.75$B_{\max}$), [0.75$B_{\max}$, $B_{\max}$]} MWh, and choose the battery energy level state space to be $\mathcal{B}$ = {0.25$B_{\max}$, 0.375$B_{\max}$, 0.625$B_{\max}$, 0.875$B_{\max}$} MWh. Meanwhile, we let the energy used for recharging the battery in one slot be 500 MWh, that is, $R = 500$ MWh, and let the discount factor $\gamma = 0.9$, which has been justified in [34]. Since the minimum energy level $B_{\min}$ is a constant, the value of $B_{\min}$ has no effect on the experimental results, and we set $B_{\min} = 0$ in the experiments.

As benchmarks for the proposed control algorithm, we considered the Lyapunov optimization algorithm [19] and the offline optimization problem. The Lyapunov optimization algorithm makes decisions to recharge/discharge the battery so as to minimize the electricity cost, using a solution with a threshold structure. The solution of the offline optimization problem can be considered as a lower bound on the performance of the proposed learning theoretic algorithm and the Lyapunov optimization algorithm.
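The mapping from a continuous measurement to one of the discrete states above can be sketched with a sorted-boundary lookup. Pairing the bins with the state values in listed order is an assumption, as are the function and constant names.

```python
from bisect import bisect_right


def discretize(value, boundaries, levels):
    """Map a continuous value to the representative state of its bin.

    boundaries -- sorted interior bin edges, e.g. [20, 30, 40]
    levels     -- one representative value per bin, len(boundaries) + 1 entries
    Bins are half-open on the right, so a value equal to an edge falls in
    the higher bin.
    """
    return levels[bisect_right(boundaries, value)]


PRICE_EDGES = [20, 30, 40]        # $/MWh: [0, 20), [20, 30), [30, 40), [40, +inf)
PRICE_LEVELS = [20, 25, 35, 40]   # price state space from the text
DEMAND_EDGES = [950, 1000, 1050]  # MWh
DEMAND_LEVELS = [950, 1000, 1050, 1100]
```

For instance, `discretize(27.3, PRICE_EDGES, PRICE_LEVELS)` places a price of 27.3 $/MWh in the bin [20, 30) and returns the state 25.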

Impact of the Number of Learning Iteration.
In the first experiment, we investigate the convergence rate and performance improvement of the proposed scheme using the real-world workload and electricity price traces. Let $N_L$ denote the number of learning iterations. The value of $N_L$ ranges over {1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000}, and the maximum battery capacity $B_{\max}$ is chosen to be 5000 MWh. The initial battery energy level is set to zero, that is, $b_0 = 0$, and we evaluate the optimal policy for a fully efficient battery ($\eta_c = \eta_d = 1$). The cost of a new battery involves a unit price $P_b$ (in $ per MWh) [35]. That is, for a new battery with capacity $B$, the battery cost is given by $C_b = P_b B$. Here, we set $P_b$ to 100 $/MWh and the number of recharge/discharge cycles $K$ to 2800. The other parameters for the Lyapunov optimization are set as follows: the constant $\chi_{\min} = C_{\max}$, where $C_{\max}$ denotes the maximum of the real-time electricity price; the control parameter $V = V_{\max}$; and the maximum power that can be drawn from the grid in any slot $P_{\mathrm{peak}} = L_{\max} + R$, where $L_{\max}$ is the maximum energy consumption demand in any slot. The parameters set above for the Lyapunov optimization are justified in [19].
In Figure 4, we illustrate the expected total electricity cost of the Q-Learning based approaches against the number of learning iterations, $N_L$, together with the performance of the Lyapunov optimization approach. As shown in Figure 4, for $N_L \ge 3000$ the Q-Learning based approaches incur a lower total electricity cost than the Lyapunov optimization, and for $N_L \ge 5000$ the Speedy Q-Learning algorithm with $\epsilon = 0.07$ yields approximately 11% more electricity cost than the offline solution. We also see that the expected total electricity costs of the Q-Learning based approaches decrease as the number of learning iterations $N_L$ increases, while the costs of the Lyapunov optimization and of the data center without energy storage facilities do not vary with $N_L$. The reason for this trend is that a larger $N_L$ yields a more accurate estimate of $Q^{*}(s, a)$, so the policy derived from the estimate is closer to the optimal policy and a lower cost is obtained. The result shows that the Q-Learning based approaches approximate the optimal policy with increasing accuracy as $N_L$ increases. From Figure 4, it can also be observed that the Speedy Q-Learning algorithm learns more slowly with a low exploration probability ($\epsilon = 0.001$) than with $\epsilon = 0.07$. The Speedy Q-Learning algorithm with $\epsilon = 0.07$ also shows faster convergence of $Q_n(s, a)$ to $Q^{*}(s, a)$ than the standard Q-Learning algorithm ($\epsilon = 0.07$). This is because the Speedy Q-Learning algorithm uses two successive estimates of the state-action value function to update the Q-values in order to achieve faster convergence. Since a larger $\epsilon$ makes it more likely that better actions, which might otherwise remain unexplored, are tried, it can accelerate the convergence of $Q_n(s, a)$ to $Q^{*}(s, a)$. Therefore, with a suitable choice of $\epsilon > 0$, the Speedy Q-Learning algorithm may be able to strike a balance in the exploration versus exploitation trade-off and achieve a faster convergence rate.
Figure 5 shows the long-run average electricity cost for different numbers of learning iterations $N_L$. It can be observed from Figure 5 that the long-run average electricity costs of the Q-Learning based approaches decrease as the number of learning iterations $N_L$ increases, while the costs of the Lyapunov optimization and of the data center without energy storage facilities remain unchanged as $N_L$ varies. For $N_L \ge 4000$, the Speedy Q-Learning algorithm ($\epsilon = 0.001$ or $\epsilon = 0.07$) yields a lower average cost than the Lyapunov optimization algorithm. Compared with the Speedy Q-Learning algorithm with $\epsilon = 0.001$ and the standard Q-Learning algorithm, the Speedy Q-Learning algorithm with $\epsilon = 0.07$ performs better. The reason is that, for a smaller number of learning iterations $N_L$, the error between the Q-values estimated by the Q-Learning based algorithms and the optimal Q-values is larger, so the policies are not optimal and the average costs are higher. As the number of learning iterations increases, more accurate Q-values are estimated, and Speedy Q-Learning with a larger $\epsilon$ also accelerates the convergence of the Q-values, so more cost can be saved.

Impact of Battery Capacity.
In this subsection, we carried out a further experiment to investigate the impact of the battery capacity of the data center by setting $B_{\max}$ = 2000, 3000, 4000, 5000, and 6000 MWh. We chose the number of learning iterations $N_L = 6000$ and the exploration probability $\epsilon = 0.07$. The other parameters and the simulation settings were the same as those in Section 7.2.1. In Figure 6, we show the impact of the battery capacity, $B_{\max}$, on the expected total electricity cost for $N_L = 6000$. It can be observed that the expected total electricity cost decreases as $B_{\max}$ increases; that is, the larger the battery capacity, the more cost the Speedy Q-Learning based scheme can save. Additionally, we see that the Speedy Q-Learning algorithm with $\epsilon = 0.07$ yields at most approximately 10% more electricity cost than the offline solution, and a lower cost than the Lyapunov optimization algorithm. The reason is that, for a larger battery capacity, the Speedy Q-Learning based scheme is likely to find a policy that stores more power at lower prices, whereas the threshold-structured solution of the Lyapunov optimization algorithm has no capability of learning the system dynamics: it stores power at prices that are lower than the thresholds but higher than the prices used by the Speedy Q-Learning based scheme.
Figure 7 shows the long-run average electricity costs for different battery capacities $B_{\max}$. We plot the performance of the Speedy Q-Learning based scheme for $N_L = 6000$ and $\epsilon = 0.07$, compared with the other approaches. From Figure 7, it can be observed that as $B_{\max}$ increases, the average costs yielded by the Speedy Q-Learning algorithm and the Lyapunov optimization algorithm decrease, while the Speedy Q-Learning algorithm achieves a greater average cost saving than the Lyapunov optimization algorithm.

Conclusion
In this paper, we investigated the problem of electricity cost minimization for data centers using energy storage under time-varying electricity prices in deregulated electricity markets, which was formulated as a discounted cost Markov decision process. A dynamic energy storage control strategy based on the Q-Learning algorithm was designed to reduce the electricity cost, and we also applied the Speedy Q-Learning algorithm in order to accelerate convergence. The advantage of the proposed scheme is that it makes decisions without any a priori information about the energy management system of the data centers, and it can also adapt to variations in the workload and the electricity prices. We also studied the offline optimization problem, which was characterized as an MILP problem, and its optimal solution can be considered as a lower bound on the performance of the proposed algorithm. In the experiments, real workload traces and electricity price data sets were used to verify the performance of the proposed scheme. The results illustrated the effectiveness of the proposed scheme in saving electricity cost via comparison with the benchmark algorithm. Results for the real traces, which may not provably follow the Markovian assumption, also show that the proposed scheme generally performs well.

Figure 1: Energy management framework using energy storage facilities in data center.

Algorithm 1: Dynamic energy storage control strategy based on Q-Learning and Speedy Q-Learning algorithms.

Figure 2: Workload arrival patterns of Intel Netbatch Grid for four days.

Figure 3: Real-time electricity prices at Houston from January 3 to January 6, 2013.

Figure 4: Expected total discounted electricity cost with respect to the number of learning iterations $N_L$.

Figure 5: Long-run average electricity cost with respect to the number of learning iterations $N_L$.

Figure 6: Expected total discounted electricity cost for different $B_{\max}$ and $N_L = 6000$.

Figure 7: Long-run average electricity cost for different $B_{\max}$ and $N_L = 6000$.
knowing anything about the underlying Markov process, and we can derive the optimal policy $\pi^{*}$ by estimating $Q^{*}(s_n, a_n)$. The Q-Learning process tries to find $Q^{*}(s_n, a_n)$ in a recursive manner. Let $Q_n(s_n, a_n)$ be the estimate of $Q^{*}(s_n, a_n)$ in the $n$th iteration. Then, in each slot the update process of the estimate $Q_n(s_n, a_n)$ can be described as follows:
(i) observe the current state $s_n \in \mathcal{S}$,
(ii) choose an action $a_n \in \mathcal{A}$, and then perform the chosen action $a_n$,
(iii) observe the next state $s_{n+1} \in \mathcal{S}$, and receive an immediate cost $g(s_n, a_n)$,
(iv) update the estimate $Q_n(s_n, a_n)$ according to

$$Q_n(s_n, a_n) = (1 - \alpha_n) Q_{n-1}(s_n, a_n) + \alpha_n \left[ g(s_n, a_n) + \gamma \min_{a \in \mathcal{A}} Q_{n-1}(s_{n+1}, a) \right].$$
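Step (iv) of the procedure above can be sketched as a small in-place update, assuming the standard cost-minimizing Q-Learning rule. The dict-of-lists table layout and the function name are hypothetical choices for illustration.

```python
def q_learning_update(q, s, a, cost, s_next, alpha, gamma=0.9):
    """Apply one Q-Learning step in place for the visited pair (s, a).

    q -- dict mapping each state to a list of estimated costs, one per action
    Returns the updated estimate for (s, a).
    """
    # Immediate cost plus the discounted lowest estimated cost in the next state.
    target = cost + gamma * min(q[s_next])
    # Blend the old estimate with the new target using the learning step alpha.
    q[s][a] = (1 - alpha) * q[s][a] + alpha * target
    return q[s][a]
```

With `alpha = 1` the old estimate is discarded entirely, and with `alpha` near 0 the estimate barely moves, which is why a decaying learning step is typically used.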