Power Optimization of Multimode Mobile Embedded Systems with Workload-Delay Dependency

This paper proposes to take the relationship between delay and workload into account in the power optimization of microprocessors in mobile embedded systems. Since the components outside a device continuously change their values or properties, the workload to be handled by the system becomes dynamic and variable. In this paper, this variable workload is formulated as a staircase function of the delay taken at the previous iteration and applied to the power optimization of DVFS (dynamic voltage-frequency scaling). In doing so, a graph representation of all possible workload/mode changes during the lifetime of a device, the Workload Transition Graph (WTG), is proposed. Then, the power optimization problem is transformed into finding a cycle (closed walk) in WTG which minimizes the average power consumption over it. From the obtained optimal cycle of WTG, one can derive the optimal power management policy of the target device. It is shown that the proposed policy is valid for both continuous and discrete DVFS models. The effectiveness of the proposed power optimization policy is demonstrated with the simulation results of synthetic and real-life examples.


Introduction
Today's mobile embedded systems often interact with physical processes or external environments and are then referred to as Cyber-Physical Systems (CPSs). Such systems are usually modeled with interactions between the physical world and the devices [1]. For instance, in a smart building, handheld or stationary embedded systems need to continuously interact with their environments [2]. The system performs a computational task and responds through an actuator to the physical side, while the resulting change at the physical side, in turn, causes a variation on the input (sensor) of the device. In order to keep this control loop stable, it is common that the embedded system has a real-time constraint within which all the computation should be completed.
In a class of applications, the computational workload of the embedded system depends on the variation of the sampled input value, while the computation delay, in turn, affects the input variation of the next iteration. Usually, if the system invests more time in processing information at one iteration, it has more work to do at the next iteration.
One example of such delay-workload dependency can be found in object tracking, which is frequently used in drones, surveillance cameras, or augmented reality [3][4][5]. The image obtained from the camera is processed by the object tracker to follow an object. As the object may continuously change its position in the meantime, the object tracker should reactively take an image from the adjusted position/angle to make the next decision. The more time the tracker spends on an image, the farther the object will have moved.
Such workload-delay relations can commonly be found in modern mobile embedded systems, which rely on computer vision algorithms to capture what happens in the external world. In those applications, it is typical that the current internal state is maintained to figure out the difference caused by what happened in the external world. Examples of such internal states range from a simple snapshot of a sensor reading to a complicated model of the scene obtained from a camera. No matter what the model is, it is generally true that a longer execution delay between two consecutive invocations of the algorithm results in a larger workload in the successive iteration, as the degree of heterogeneity gets bigger.

Mobile Information Systems
The workload-delay dependency can also be found in many different types of applications. Real-time pattern matching over event streams [6], for instance, exhibits similar behavior: the queries can be handled either in small amounts (shorter delay, less workload) or in an aggregated manner (longer delay, more workload). Similarly, haptic rendering in Human-Computer Interface (HCI) uses adaptive sampling techniques to deal with the stringent real-time constraint [7], and the rendering algorithm can be warm-started to exploit the temporal coherence [8]. In essence, applications which exploit temporal coherence have possible workload-delay dependencies. That is, any iterative algorithm that can be warm-started can lead to one.
Nowadays, most modern microprocessors used in mobile embedded systems support dynamic voltage-frequency scaling (DVFS) [9] for power-efficient operations. Generally, delay and energy in systems with DVFS are in a tradeoff relationship for a given workload. That is, given a certain amount of work to be handled, a faster solution (with a higher frequency) is less energy efficient. Considering this control knob together with the aforementioned delay-workload dependency, the power optimization problem gets very challenging. Conventionally, it has been understood that "working as slow as possible" within the real-time constraint is the best discipline in terms of minimizing the power dissipation. However, with the existence of the workload-delay dependency, this is no longer valid since a slower execution may cause a bigger workload at the next iteration. On the other hand, "as fast as possible" is not optimal either, as the power consumption is a strong function of the operating speed [10].
The workload-delay dependency was first modeled and applied to the DVFS optimization in [11]. There, the workload is assumed to be a continuous and monotonically increasing function of the delay, under which a simple yet effective power management technique was proposed. Specifically, it was shown that staying in a certain DVFS mode is better than alternating between different DVFS modes dynamically. Later, the optimization was generalized to various power models and formally proven to be optimal [12].
This work differs from our previous work [12] in that we take a different optimization approach tailored for discrete workload levels. We observed that the continuity assumption does not always hold in reality. Rather, there are a number of applications that have discrete levels of workload. For instance, recall that many image or signal processing algorithms handle input data in units of macroblocks or frames. In such application domains, the workload tends to grow in a discrete manner. In this paper, the workload is modeled as a staircase function of the delay taken in the previous iteration. Since the solution obtained by the previous work [11,12] is no longer optimal, or does not exist at all, in the staircase model, a new power management technique is proposed. The contributions of this paper can be summarized as follows:
(i) The workload-delay dependency is modeled as a staircase function, generalizing the previous model, and validated with a real-life example.
(ii) A novel data structure, Workload Transition Graph (WTG), is proposed to represent all possible operation workload/mode changes of a device.
(iii) Based on WTG, a power management policy is derived and shown to be optimal.

Related Work
Bogdan and Marculescu [13] observed that workloads from physical processes tend to be nonstationary but exhibit some systematic relationship in space and time. They proposed a workload characterization approach based on statistical physics and showed how workload-awareness can improve the design of electronic systems. Zhang et al. [14] studied the relationship between control stability and workload in inverted pendulum control. While enlarged invocation periods may lower the degree of stability, more inverted pendulums can be controlled by a system, as the lengthened invocation periods lower the utilization of the algorithm. This can be seen as trading off control stability for resource efficiency. In other words, they proposed to sacrifice stability to accommodate more workload in a system. The proposed technique also deals with variable workload in electronic systems but differs from the abovementioned works in that the effect of execution delay on workload is systematically considered. Recently, Pant et al. [15] proposed a codesign of computation delay and control stability based on anytime algorithms. An anytime algorithm can be stopped at any point in time and still provides a decent solution. Typically, the quality of the solution is an increasing function of the computation delay. In their work, it is the duty of the control algorithm to adaptively change the real-time deadline constraint and error bound (quality of control). In contrast, in the proposed technique the relationship between execution delay and workload is formally described in the form of a workload-delay function; thus no explicit runtime monitoring/control is required.
A design guideline for flexible delay constraints in distributed embedded systems was proposed by Goswami et al. [16], where some of the samples are allowed to violate the given delay deadline. They demonstrated the applicability of their approach using the FlexRay dynamic segment as a communication medium. Their work is similar to the proposed approach in the sense that it does not stick to a given fixed real-time deadline. While they could avoid resource overprovisioning by trading off the hard real-time constraints, the dependency of the workload on the delay was not considered. Moreover, from a real-time standpoint, the proposed work is more rigorous as it allows no real-time constraint violations.

Problem Definition
This section presents the system model assumed in this paper, which is followed by the formulation of the power optimization problem.

Dynamic Voltage-Frequency Scaling.
In this paper, we assume that a system has multiple operation modes due to the DVFS feature, where the operating frequency and voltage can be modulated. For simplicity, we first assume that there are infinitely many operation modes available, among which one is chosen at each iteration. It will be shown in Section 5 that the proposed technique can be applied to a discrete DVFS as well. The operation mode at the $i$th iteration is represented with the speed scaling factor $k_i$ ranging from $k_{\min}$ to $k_{\max} = 1$ ($k_{\min} \le k_i \le k_{\max} = 1$). Then, the operating frequency of the $i$th iteration, $f_i$, is
$$f_i = k_i \cdot f_{\max}, \tag{1}$$
where $f_{\max}$ is the maximum frequency of the microprocessor.

Workload.
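The workload is defined to be the number of clock cycles elapsed to complete the given computation. We denote the number of cycles elapsed to handle the workload of the $i$th iteration at the full speed of the microprocessor ($k_i = 1$) as $w_{\mathrm{ref},i}$. That is, with the workload expressed as the full-speed execution time $w_i = w_{\mathrm{ref},i} / f_{\max}$, the elapsed time of the $i$th iteration is
$$t_i = \frac{w_{\mathrm{ref},i}}{f_i} = \frac{w_i}{k_i}. \tag{2}$$
Note that the elapsed time $t_i$ increases as the speed is scaled down ($k_i < 1$). Then, the delay $t_i$ is automatically determined when a speed scaling factor $k_i$ is chosen for the given workload ($t_i = w_i / k_i$).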

Real-Time Constraint.
The delay $t_i$ cannot be unboundedly long as the system is associated with a real-time constraint $T$. For all iterations, the elapsed time should be no more than $T$:
$$t_i \le T. \tag{3}$$

Delay-Workload Dependency.
As stated earlier, the workload is dependent upon the previous execution delay. Usually, the workload is not a continuous function of the delay variation. Rather, the changes happen in a discrete manner. Therefore, the workload at the $i$th iteration is a monotonically increasing staircase function of the delay of the previous iteration, $t_{i-1}$: $w_i = W(t_{i-1})$. If the given system has $n$ workload levels, the workload function $W$ can be formulated as follows:
$$W(t) = \begin{cases} wl_1, & 0 < t \le th_1, \\ wl_2, & th_1 < t \le th_2, \\ \;\vdots \\ wl_n, & th_{n-1} < t \le T, \end{cases} \tag{4}$$
in which the workload levels are $wl_1 < wl_2 < \cdots < wl_n$ and the delay thresholds (workload changing moments) are $th_1 < th_2 < \cdots < th_{n-1}$.
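The staircase workload function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `make_workload_function` and the numeric levels and thresholds are hypothetical, and each step is taken to be closed on the right, following the delay range $(th_{j-1}, th_j]$ used for workload transitions.

```python
from bisect import bisect_left


def make_workload_function(levels, thresholds):
    """Return W(t) for levels wl_1 < ... < wl_n and thresholds th_1 < ... < th_{n-1}.

    Hypothetical helper: the name and the argument layout are assumptions
    made for this sketch, not part of the paper.
    """
    if len(levels) != len(thresholds) + 1:
        raise ValueError("need exactly one more level than thresholds")
    if sorted(levels) != list(levels) or sorted(thresholds) != list(thresholds):
        raise ValueError("levels and thresholds must be increasing")

    def workload(prev_delay):
        # bisect_left counts the thresholds strictly below prev_delay, so a
        # delay equal to th_j still maps to wl_j (steps are closed on the right).
        return levels[bisect_left(thresholds, prev_delay)]

    return workload


# Made-up example: three workload levels, thresholds at 5 and 10 time units.
W = make_workload_function([2.0, 4.0, 7.0], [5.0, 10.0])
print(W(5.0), W(5.1), W(12.0))
```

The real-time constraint $T$ is not checked here; a caller is expected to reject delays above it separately.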

Execution Trace.
At the $i$th iteration, the speed scaling factor $k_i$ uniquely defines an execution mode as the delay is fixed accordingly by (2). The initial workload is assumed to be given as $w_1$. Then, an execution trace $tr$ of length $N$ is defined to be a sequence of the speed scaling factors of $N$ iterations:
$$tr \triangleq (k_1, k_2, \ldots, k_N). \tag{5}$$

Average Power Consumption.
The dynamic power consumption of CMOS circuits is $C V_{dd}^2 f$, where $C$, $V_{dd}$, and $f$ are capacitance, operating voltage, and frequency, respectively. As the operating frequency is proportional to $(V_{dd} - V_{th})^2 / V_{dd}$ [10], the power consumption is an increasing function of $f$. It is worth noting that the proposed model is not dependent upon any specific DVFS model. We denote the energy consumption of a unit workload at the full speed ($k = 1$) as $e_u$ and assume that energy dissipation grows linearly with the size of the workload. Then, the reference energy of a workload $w_i$ at the full speed is $e_u \cdot w_i$. Given a DVFS energy model $\varepsilon$ as a function of the speed scaling factor $k$, the energy consumption at the $i$th iteration, $EG_i$, is formulated as follows:
$$EG_i = e_u \cdot w_i \cdot \varepsilon(k_i), \tag{6}$$
in which $\varepsilon(0) = 0$ and $\varepsilon(1) = 1$. Then, the average power consumption of a trace $tr$ can be formulated as follows:
$$AVG(tr) = \frac{\sum_{i=1}^{N} EG_i}{\sum_{i=1}^{N} t_i}. \tag{7}$$
It is worthwhile to mention that the proposed technique is not specific to a certain workload-energy model. While we adopt a linear model for the workload-energy relation for ease of presentation, any model, possibly nonlinear, can be used in (6).
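As a rough illustration of how the average power of an execution trace could be evaluated, the sketch below accumulates energy and delay iteration by iteration and divides the two totals. The function name `average_power`, its parameters, and the reading "energy per iteration = unit energy x workload x energy model value" are assumptions made for this example; the numbers are made up.

```python
def average_power(trace, w1, workload_fn, energy_model, unit_energy=1.0, deadline=None):
    """Average power of a trace of speed scaling factors (assumed model).

    trace        -- sequence of speed scaling factors, one per iteration
    w1           -- initial workload
    workload_fn  -- maps the previous delay to the next workload
    energy_model -- DVFS energy model as a function of the scaling factor
    unit_energy  -- energy of a unit workload at full speed (assumed constant)
    deadline     -- optional real-time constraint checked at every iteration
    """
    total_energy = 0.0
    total_delay = 0.0
    workload = w1
    for k in trace:
        delay = workload / k                      # delay fixed by workload and speed
        if deadline is not None and delay > deadline:
            raise ValueError("real-time constraint violated")
        total_energy += unit_energy * workload * energy_model(k)
        total_delay += delay
        workload = workload_fn(delay)             # next workload depends on this delay
    return total_energy / total_delay


# Made-up example: constant workload of 2.0, quadratic energy model, k = 0.8.
power = average_power([0.8] * 3, 2.0, lambda d: 2.0, lambda k: k ** 2, unit_energy=5.0)
print(round(power, 4))
```

Any other workload-energy relation can be substituted by changing the single line that accumulates `total_energy`.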

Problem Formulation.
Our objective is to minimize the average power consumption of a given system as follows: given the modeling constant $e_u$, the DVFS energy modeling function $\varepsilon$, the workload function $W$, and the real-time constraint $T$, determine an execution trace $tr$ such that the average power consumption formulated in (7) is minimized.

Proposed Technique
In this section, we describe the proposed operation management policy as an answer to the problem defined in the previous section. In doing so, we first derive the condition for feasible and schedulable systems. Then, we study when the workload changes and how it affects the power dissipation. Based on that, we propose a novel graph representation that captures all possible workload transitions in the power-optimal operation. Finally, we derive the power-optimal operation policy with the given workload function $W$.

Feasibility.
In this subsection, we examine under which condition a given system is feasible. First, the system should be schedulable within the given real-time constraint $T$ at every iteration.
Proof. Suppose that the delay at the $i$th iteration is $t_i$. Then, since $k_{i+1} \le 1$,
$$t_{i+1} = \frac{W(t_i)}{k_{i+1}} \ge W(t_i),$$
which exceeds $t_i$ whenever $W(t) > t$ holds. That is, the delay is increasing as iteration goes by and will eventually reach the real-time constraint: $t_i = T$. At the next iteration, the system becomes unschedulable even with the full speed, as $W(t_i = T) = 1 \cdot t_{i+1} > T$ requires $t_{i+1} > T$.
Once the workload gets bigger than $T$, the system is trivially not schedulable afterwards even with the full speed, $k_i = 1$. Thus, the workload must not be bigger than $T$ at any time. Moreover, once the workload reaches $T$, $k$ should remain at the full speed $k = 1$ afterwards. We can make the upper bound of the workload even tighter if there exists $t_u$ such that $W(t) > t$ for all $t \in (t_u, T]$. In this case, a workload larger than $W(t_u)$ is not allowable as it makes the execution delay longer and longer, eventually violating the deadline.
Given the workload function $W$ and the initial workload $w_1$, one can calculate the lower bound of the workload as well. If a value $\hat{t}$ exists which satisfies $W(\hat{t}) = \hat{t}$ and $W(\hat{t}) < w_1$, the workload will never become smaller than $W(\hat{t})$. In other words, even with the full speed, the execution delay never goes below $\hat{t}$.
Then, the valid workload levels and the execution delay range during the lifetime of a given system can be formulated as below.
Definition 2 (valid ranges). Given the workload function $W$ and the initial workload $w_1$, the minimum and maximum workload levels of a system, $wl_{\min}$ and $wl_{\max}$, are defined by (8) and (9). Then, the valid range of the execution delay is formulated as $[t_{\min}, t_{\max}]$ according to $wl_{\min}$ and $wl_{\max}$ with (10).

Workload Transitions.
In this subsection, we examine when a workload transition between valid workload levels possibly occurs and how it affects the system. As presented in (4), the workload is a function of the delay taken at the previous iteration. Suppose that the delay taken at an iteration is $t$ with $th_{j-1} < t \le th_j$, so that the next workload is $wl_j$. If the system instead works fast enough to result in a shorter delay, $t' \le th_{j-1}$, the next workload will get smaller than $wl_j$. Similarly, in case that the delay gets longer ($t' > th_j$), the system will need to handle a larger workload than $wl_j$ at the successive iteration.
However, such workload transitions can occur only within limited ranges. Figure 1 depicts valid and invalid transitions from one workload level. Figure 1(a) shows two transitions from a workload level $wl_i$ to lower ones $wl_j$ and $wl_m$ ($m < j < i$). To make the next workload level $wl_j$, the delay should be in the range of $(th_{j-1}, th_j]$. Given the current workload $wl_i$, the speed scaling factor should be larger than or equal to $wl_i / th_j$ from (2). If $wl_i / th_j \le k_{\max} = 1$, this workload transition can possibly occur. In contrast, if $wl_i / th_m > 1$ for another workload level $wl_m$, the transition from $wl_i$ to $wl_m$ never happens because the delay never goes below $th_m$ even with the full processing speed.
The same principle also applies to the transition from a workload level to higher ones. If the delay can be lengthened properly with a speed scaling factor within the range $[k_{\min}, k_{\max}]$, the transitions are valid. Figure 1(b) illustrates that the transition from $wl_i$ to $wl_j$ is valid, while the one to $wl_m$ is not. One can tell if a transition can happen or not with the following definition.
Definition 3 (valid transition). A workload transition from $wl_i$ to $wl_j$ is said to be valid if $wl_i / th_j \le k_{\max} = 1$ and $wl_i / th_{j-1} \ge k_{\min}$.
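The validity test of Definition 3 can be written as a small predicate. The sketch below is hedged: the function name and argument layout are hypothetical, the lowest level is given a lower threshold of 0, and the highest level is bounded by the real-time constraint, which are conventions assumed here rather than stated in the definition. The thresholds reuse the values of the synthetic example in the experiments; the workload values 0.5 and 0.1 are made up.

```python
def is_valid_transition(wl_i, j, thresholds, deadline, k_min):
    """Check Definition 3 for a transition from workload wl_i to level j (1-based).

    Assumed conventions for this sketch: th_0 = 0 and th_n = deadline.
    """
    n = len(thresholds) + 1
    upper = thresholds[j - 1] if j < n else deadline   # th_j
    lower = thresholds[j - 2] if j > 1 else 0.0        # th_{j-1}
    # Full speed must be able to bring the delay down to th_j ...
    fast_enough = wl_i / upper <= 1.0
    # ... and the slowest speed must be able to push it past th_{j-1}.
    slow_enough = lower == 0.0 or wl_i / lower >= k_min
    return fast_enough and slow_enough


thresholds = [0.2, 0.6, 0.7]
print(is_valid_transition(0.5, 1, thresholds, 1.0, 0.4))  # level 1 needs k >= 2.5
print(is_valid_transition(0.5, 2, thresholds, 1.0, 0.4))
print(is_valid_transition(0.1, 4, thresholds, 1.0, 0.4))  # cannot be slowed down enough
```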

Workload Transition Graph.
The essential difficulty of the presented power optimization problem lies in the fact that two conflicting forces should be handled at the same time. On one hand, in order to minimize the power, one tries to scale down the speed (and thus lengthen the delay) as much as possible, as described in (2) and (6). On the other hand, the lengthened delay is not desirable as it imposes a bigger workload in the successive iteration, as shown in (4).
Therefore, no single simple intuition can be exploited to solve the problem. Rather, we need to compare different modes in a comprehensive way. In order to explore all possible execution modes and quantify their effects, we need to devise a data structure that includes elementary information on how workload transitions change the system status and power dissipation. In line with that purpose, we propose a graph representation of the workload evolution, the Workload Transition Graph (WTG), which captures all possible transition scenarios of the workload changes during the lifetime of a system.
A valid workload transition from one workload level to another can be caused by any delay within the corresponding range. A transition from $wl_i$ to $wl_j$ in Figure 1(a), for instance, can be caused by any delay within the range of $(th_{j-1}, th_j]$. In other words, when handling a workload of $wl_i$, any scaling factor within the range of $[wl_i / th_j, wl_i / th_{j-1})$ can cause the transition. In the power-optimal execution trace, however, only one specific scaling factor is always chosen for a certain transition even though it happens many times. We show this in the next theorem.
Theorem 4 (optimal scaling factor). In the power-optimal trace $tr = (k_1, k_2, \ldots, k_{|tr|})$, if the workload level handled by $k_x$ is $wl_i$ and the next workload level is $wl_j$,
$$k_x = \max\left(\frac{wl_i}{th_j}, k_{\min}\right). \tag{12}$$
Proof. We prove this by contradiction. Let us suppose that there exists a power-optimal trace $tr$ where $k_x$ (which results in the transition from $wl_i$ to $wl_j$) is not equal to $\max(wl_i / th_j, k_{\min})$. Then, we make a new execution trace $tr'$ from $tr$ by replacing $k_x$ with $k'_x = \max(wl_i / th_j, k_{\min})$. Note that all other execution modes in $tr'$ are the same as in $tr$. That is, $tr(y) = tr'(y)$, $\forall y \ne x$.
By the definition of the scaling factor,
$$\frac{wl_i}{th_j} \le k_x < \frac{wl_i}{th_{j-1}} \tag{13}$$
since the transition is from $wl_i$ to $wl_j$. Thus,
$$tr(x) = k_x > tr'(x) = k'_x \tag{14}$$
and accordingly we obtain
$$EG_x > EG'_x \quad \text{and} \quad t_x < t'_x. \tag{15}$$
Then, from (6) and (7), the average power in $tr'$ is lower than that of $tr$, which contradicts the assumption that $tr$ is power-optimal.
From Theorem 4, we know that only one speed scaling factor is associated with all transitions of a certain type in the power-optimal operation scenario.So, we define a scaling factor of a valid transition as follows.
Definition 5 (scaling factor of a transition). The scaling factor of a valid workload transition from $wl_i$ to $wl_j$ is
$$sf(wl_i, wl_j) = \max\left(\frac{wl_i}{\min(th_j, T)}, k_{\min}\right). \tag{16}$$
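Definition 5 amounts to picking the slowest speed that still lands in the target delay range. A minimal sketch follows; the function and parameter names are hypothetical, and the numbers are made up for illustration.

```python
def scaling_factor(wl_i, th_j, deadline, k_min):
    """Slowest scaling factor for a valid transition from wl_i into level j.

    th_j is the upper delay threshold of the target level; the delay is also
    capped by the real-time constraint, and the speed cannot drop below k_min.
    """
    return max(wl_i / min(th_j, deadline), k_min)


# Made-up values: thresholds 0.6 and 2.0, deadline 1.0, minimum factor 0.4.
print(scaling_factor(0.5, 0.6, 1.0, 0.4))  # limited by the threshold
print(scaling_factor(0.1, 0.6, 1.0, 0.4))  # limited by k_min
print(scaling_factor(0.5, 2.0, 1.0, 0.4))  # limited by the deadline
```

The function assumes the transition has already been checked for validity; it does not verify that the returned factor is at most 1.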
Now, we define WTG.
Definition 6 (Workload Transition Graph). WTG is defined to be a graph $G = (V, E)$, where $V$ and $E$ are the sets of vertices and edges, respectively. Each valid workload level forms a vertex, while a valid transition from a workload level to another forms an edge between vertices. That is, there exists an edge from $v_i$ to $v_j$ corresponding to a valid transition from $wl_i$ to $wl_j$:
$$E = \{(v_i, v_j) \mid \text{the transition from } wl_i \text{ to } wl_j \text{ is valid}\}. \tag{17}$$
We denote the source and destination vertices of an edge $e \in E$ as $src(e)$ and $dst(e)$, respectively.
With these definitions, a power-optimal execution trace $tr = (k_1, k_2, \ldots, k_{|tr|})$ can be represented as a walk (a sequence of vertices where any pair of consecutive vertices are connected through an edge) of $|tr| + 1$ vertices in WTG. In a different form, the execution trace is a sequence of $|tr|$ edges in WTG, $(e_1, e_2, \ldots, e_{|tr|})$, where the workload transition from $src(e_i)$ to $dst(e_i)$ is caused by the execution mode $sf(src(e_i), dst(e_i)) = k_i$.
Algorithm 1 shows how WTG is generated out of the given workload function $W$ and the initial workload $w_1$. After the initialization, all valid workload levels are added as vertices in lines (5)-(6). Then, for each permutation of two workload levels (lines (7)-(8)), it is checked whether the in-between transition is valid or not in line (9). If valid, it is added as an edge in line (10).
Let us take Figure 2 as an example of workload function $W$. Once given the initial workload, one can easily get the valid workload levels according to (8) and (9). When $w_1 = wl_2$, for instance, $wl_{\min} = wl_2$ and $wl_{\max} = wl_6$. Figure 3(b) shows the WTG of the workload function $W$ shown in Figure 2 with the initial workload of $wl_2$. It is not a complete graph, as some pairs of vertices cannot be connected directly because the transition between them is not valid according to Definition 3. The feasible delay range to handle workload $wl_4$ is highlighted in Figure 2, justifying that vertex $wl_4$ has three outgoing edges to $wl_6$, $wl_5$, and itself. To be more specific, when the workload is given as $wl_4$, the shortest possible delay is obtained when the speed is chosen as $k_{\max}$. Then, the delay (the horizontal coordinate of the crossing point of the level $wl_4$ and the line for $k = k_{\max}$) is between $th_3$ and $th_4$ as shown in Figure 2. This means that the lowest possible workload in the next iteration is $wl_4$. Similarly, the biggest possible workload can be calculated as $wl_6$, as the biggest possible delay is between $th_5$ and $T$. Note also that some of the vertices, such as vertex $wl_3$, may not have an edge directed to itself (self-loop). Such a vertex has outgoing edges only to higher workload levels. This means that the computation burden of that state is so big that it only results in higher workload levels at the next iteration even with the full speed.
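The construction can be sketched as follows. This is not the paper's Algorithm 1 verbatim: instead of first computing the valid workload range, the sketch simply explores the levels reachable from the initial workload, which is an assumption made here. The function name, the dictionary layout, and the workload levels in the example are hypothetical; the thresholds are those of the synthetic example in the experiments.

```python
def build_wtg(levels, thresholds, deadline, k_min, w1):
    """Return {level index: {next level index: scaling factor}} (0-based indices)."""
    # Delay range of level j is (bounds[j], bounds[j + 1]]; th_0 = 0 and
    # th_n = deadline are conventions assumed for this sketch.
    bounds = [0.0] + list(thresholds) + [deadline]

    def scaling_factor(i, j):
        lower, upper = bounds[j], bounds[j + 1]
        k = max(levels[i] / upper, k_min)          # slowest admissible speed
        # Strict comparison: a delay exactly at the lower threshold still
        # belongs to the level below (differs from ">=" only on the boundary).
        reaches_level = lower == 0.0 or k < levels[i] / lower
        return k if k <= 1.0 and reaches_level else None

    wtg = {}
    pending = [levels.index(w1)]
    while pending:
        i = pending.pop()
        if i in wtg:
            continue
        wtg[i] = {}
        for j in range(len(levels)):
            k = scaling_factor(i, j)
            if k is not None:
                wtg[i][j] = k                      # edge i -> j with its scaling factor
                pending.append(j)
    return wtg


wtg = build_wtg([0.1, 0.3, 0.5, 0.65], [0.2, 0.6, 0.7], 1.0, 0.4, w1=0.3)
for src in sorted(wtg):
    print(src, {dst: round(k, 3) for dst, k in sorted(wtg[src].items())})
```

In this made-up example the lowest level is never reached from the initial workload, so it does not appear as a vertex.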
Different initial workload levels may result in different WTGs as shown in Figures 3(a)-3(c). The WTG derived from the initial workload level of $wl_1$ is illustrated in Figure 3(a). In contrast to Figure 3(b), vertex $wl_1$ is included in the graph. The WTG derived from the higher initial workload levels, $wl_4$, $wl_5$, and $wl_6$, is shown in Figure 3(c). Note that the WTGs in Figures 3(a) and 3(c) are not strongly connected. Vertex $wl_2$ in Figure 3(b), for example, is not reachable from $wl_4$. However, from the definition of valid workload levels, all vertices are reachable from the initial workload level. This property is important for deriving the optimal operation policy that will be presented in the next subsection.

Proposed Operation Policy.
In this subsection, we present the proposed operation policy that balances the energy-delay tradeoff caused by the delay-workload dependency.
As stated earlier, a power-optimal execution trace $tr$ can be represented as a walk of WTG. Then, we have the following definition.
A cycle (closed walk) of WTG is a walk whose starting and ending vertices are the same. That is, a walk $(e_1, e_2, \ldots, e_n)$ is a cycle if $src(e_1) = dst(e_n)$. Hence, if the corresponding walk of a trace in WTG is a cycle, the trace can be repeated indefinitely. We argue that, in case that the length of the trace is sufficiently long ($|tr| \to \infty$), the average power consumption is minimized when the cycle which minimizes (20) repeats over and over again in the trace.

Theorem 8 (optimal cycle of WTG). Suppose that a cycle of WTG, $cy_{opt} = (e_1, e_2, \ldots, e_m)$, has the minimum value of (20) among all cycles of the WTG. Then, if the length of the trace is long enough, the average power consumption of the optimal trace converges to $AVG(cy_{opt})$.
Proof. An arbitrary walk of a WTG, $(e_1, e_2, \ldots, e_n)$, can be decomposed into a path (a walk with distinct vertices) from $src(e_1)$ to $dst(e_n)$ and a set of cycles (see, e.g., Section 10.3 of [17]). Figure 4 depicts a walk example of length 8, where the initial workload level is $wl_2$ and the last vertex that it traverses is $wl_6$. If a path from $wl_2$ to $wl_6$, $(e_1, e_2, e_5)$, highlighted in the dashed arrows, is removed from the walk, the remaining part is a set of cycles. Now, consider a power-optimal trace $tr$ of length $N$ that starts from the workload level of $w_1$. The corresponding walk of $tr$ can be decomposed into a path $pt$ starting at $w_1$ and a set of cycles $CY$. Then, the average power consumption of the trace $tr$ is the total energy over $pt$ and all cycles in $CY$ divided by the total delay over them. Since $pt$ visits each vertex at most once, its contribution is bounded and becomes negligible as $N$ grows, while no cycle in $CY$ has a lower average power than $cy_{opt}$. Therefore, the average power consumption of the power-optimal trace $tr$, $AVG(tr)$, will get arbitrarily close to $AVG(cy_{opt})$.
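The step from "a set of cycles" to "the best single cycle" rests on a standard inequality for ratios of sums. The worked form below is a sketch in notation assumed here: $E_c$ and $D_c$ stand for the energy and the delay accumulated over one traversal of a cycle $c \in CY$, with $D_c > 0$.

```latex
% Let r = \min_{c \in CY} E_c / D_c. Then E_c \ge r \, D_c for every cycle c, and summing gives
\[
  \sum_{c \in CY} E_c \;\ge\; r \sum_{c \in CY} D_c
  \quad\Longrightarrow\quad
  \frac{\sum_{c \in CY} E_c}{\sum_{c \in CY} D_c}
  \;\ge\; \min_{c \in CY} \frac{E_c}{D_c}
  \;\ge\; \mathrm{AVG}(cy_{opt}).
\]
% With the path pt included, the average power of the whole trace is
\[
  \mathrm{AVG}(tr) \;=\; \frac{E_{pt} + \sum_{c \in CY} E_c}{D_{pt} + \sum_{c \in CY} D_c},
\]
% where E_{pt} and D_{pt} stay bounded (a path has at most |V| - 1 edges),
% while the cycle sums grow without bound as the trace length N increases.
```

Equality in the first chain is attained when every cycle in $CY$ is a repetition of $cy_{opt}$, which is why repeating the optimal cycle is asymptotically optimal.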
From Theorem 8, it is understood that changing the DVFS mode of the given system following the optimal cycle presented above results in asymptotically optimal average power consumption. Thus, we propose the following rules for the DVFS operation policy:
(i) If the current workload level vertex is in the optimal cycle $cy_{opt}$, just follow the cycle repeatedly; that is, at the $i$th iteration, the speed scaling factor is chosen to be $sf(e)$ such that $e$ is in the optimal cycle and $src(e) = w_i$.
(ii) Otherwise, take a path $pt_{opt} = (e_1, e_2, \ldots, e_m)$ that has the minimum $AVG(pt_{opt})$ value and whose destination $dst(e_m)$ is in the optimal cycle $cy_{opt}$. In other words, try to get in the optimal cycle with the minimum cost.
The optimal cycle $cy_{opt}$ of WTG can be found by using an existing cycle enumeration algorithm. In this paper, we use the one proposed by Tarjan [18]. The minimum path to the optimal cycle, $pt_{opt}$, can also be found by simple enumeration. It is worthwhile to mention that we cannot simply use a minimum weight cycle search algorithm, as the cost of a cycle is not simply a summation of edge weights but a more complex function of them, as presented in (20).
Figure 5 shows the proposed operation policy in a flowchart diagram. Given the initial workload and the workload function, we generate WTG, from which $cy_{opt}$ and $pt_{opt}$ are derived in the next steps. It is worth mentioning that this can be done in tractable time and is a one-time effort taken offline. As long as the initial workload vertex is not included in $cy_{opt}$, the system follows the trace represented in $pt_{opt}$ until it reaches the optimal cycle of WTG. Then, it simply repeats the trace implied in $cy_{opt}$ from that iteration on.
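The search for the optimal cycle can be illustrated with a plain depth-first enumeration of simple cycles. This is a simplified stand-in, not Tarjan's algorithm, and it assumes that the cost of a cycle is its total energy divided by its total delay, with the per-edge energy taken as unit energy x workload x energy model value. The graph, levels, and all names below are hypothetical.

```python
def cycle_power(cycle, wtg, levels, energy_model, unit_energy=1.0):
    """Average power of one traversal of a cycle given as a list of vertices."""
    energy = delay = 0.0
    for pos, v in enumerate(cycle):
        k = wtg[v][cycle[(pos + 1) % len(cycle)]]   # factor of the edge leaving v
        energy += unit_energy * levels[v] * energy_model(k)
        delay += levels[v] / k
    return energy / delay


def simple_cycles(wtg):
    """Yield each simple cycle once, starting at its smallest vertex."""
    def extend(path, root):
        for nxt in wtg[path[-1]]:
            if nxt == root:
                yield list(path)
            elif nxt > root and nxt not in path:
                path.append(nxt)
                yield from extend(path, root)
                path.pop()

    for root in sorted(wtg):
        yield from extend([root], root)


def optimal_cycle(wtg, levels, energy_model, unit_energy=1.0):
    """Cycle with the lowest average power among all simple cycles."""
    return min(simple_cycles(wtg),
               key=lambda c: cycle_power(c, wtg, levels, energy_model, unit_energy))


# Made-up WTG with two levels (threshold 2.0, deadline 4.0, minimum factor 0.25).
levels = [1.0, 1.8]
wtg = {0: {0: 0.5, 1: 0.25}, 1: {0: 0.9, 1: 0.45}}
best = optimal_cycle(wtg, levels, lambda k: k ** 2)
print(best, round(cycle_power(best, wtg, levels, lambda k: k ** 2), 6))
```

Restricting the search to simple cycles loses nothing under the assumed cost, because any closed walk decomposes into simple cycles and a ratio of sums is never below the smallest individual ratio. The exhaustive enumeration is only practical for small graphs.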

Extension to Discrete DVFS
Whilst we assume a continuous DVFS model for ease of presentation and generality, modern microprocessors in reality have finite DVFS modes with a set of predefined operation voltages and frequencies. In this section, we show that the continuity of the model presented in (1) can be relaxed by modifying Definitions 3 and 5 without harming the effectiveness of the proposed technique.
Let us suppose that we now have a system with a discrete and finite DVFS model, where only $k_i \in K$ can be chosen as a speed scaling factor at the $i$th iteration. Valid transitions, in Definition 3, are redefined: the transition is valid if there is $k \in K$ that meets the same requirement.
Definition 9 (valid transition in discrete DVFS). Given a set of feasible scaling factors $K$, a workload transition from $wl_i$ to $wl_j$ is said to be valid if
$$\exists k \in K : \; th_{j-1} < \frac{wl_i}{k} \le \min(th_j, T). \tag{26}$$
Likewise, the scaling factor of a valid transition, in Definition 5, is redefined to be the minimum $k \in K$ that keeps the elapsed delay within the range which results in the same transition, as stated in Definition 10.
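For a finite set of scaling factors, validity and the choice of the scaling factor reduce to a filter over that set. The sketch below uses the thresholds, the real-time constraint, and the factor set of the synthetic example in the experiments; the workload value 0.5 and the function name are hypothetical.

```python
def discrete_scaling_factor(wl_i, lower, upper, deadline, factors):
    """Smallest factor whose delay wl_i / k lies in (lower, min(upper, deadline)].

    Returns None when no factor qualifies, i.e. when the transition is not
    valid under the discrete DVFS model.
    """
    limit = min(upper, deadline)
    feasible = [k for k in factors if lower < wl_i / k <= limit]
    return min(feasible) if feasible else None


K = [0.4, 0.8, 0.9]
print(discrete_scaling_factor(0.5, 0.6, 0.7, 1.0, K))  # delay 0.625 fits (0.6, 0.7]
print(discrete_scaling_factor(0.5, 0.2, 0.6, 1.0, K))  # only k = 0.9 is fast enough
print(discrete_scaling_factor(0.5, 0.7, 1.0, 1.0, K))  # k = 0.4 would miss the deadline
```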

Experiments
In this section, we validate the proposed model and operation policy with experimental analysis and simulations.

A Case Study: Object Tracking.
We first show that the proposed workload-delay dependency is clearly observed in an object tracking application. The performance of a commonly used object tracking method [19] is profiled using the publicly available implementation [20]. We choose Exynos5422 [21] as the target mobile embedded computing platform, which has 2 GB of main memory and runs the Linux operating system. The actual power dissipation of the processor is measured individually for five different DVFS modes. That is, $K = \{0.2, 0.4, 0.6, 0.8, 1.0\}$, and at the maximum speed the core operates at 2 GHz.
Leveraging a priori knowledge of the maximum speed of the object, the maximum distance that the object could have moved between two iterations is calculated. In this experiment, it is assumed that the object's speed never exceeds 10 pixels/ms. If an iteration takes $t$ ms, for instance, the search area for the next iteration is given as a square with the side length of $(10 \cdot 2 \cdot t + 125)$ pixels. This search area grows in a discrete manner at every 5 ms. The real-time constraint $T$ is set to 25 ms. The staircase-shaped workload function is illustrated in Figure 6 with five guiding lines, each of which denotes $k \cdot t$ for $k \in K$.
We compare the proposed power management policy with two others. The first one is ALAP, where the speed is chosen to be the slowest one with respect to the real-time constraint. The other comparison is made against a stable trace with the maximum speed as the other extreme (ASAP). When the initial workload is $wl_3$, that is, $w_1 = wl_3 = 9.833$, ASAP and ALAP result in the average power consumption of 2.5032 W and 1.2623 W, respectively. The proposed power management policy outperforms both with an average power consumption of 0.7592 W. The optimal cycle of its WTG is the self-loop of node $wl_2$, which implies a stable operation mode, $\forall i, k_i = 0.6$.
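The growth of the search window can be tabulated with a short helper. The rounding of the delay up to the next multiple of 5 ms is an assumption about how the "discrete manner" is realized, and the function and parameter names are hypothetical; only the constants 10 pixels/ms, 125 pixels, and 5 ms come from the text.

```python
import math


def search_side_pixels(delay_ms, max_speed=10.0, base_pixels=125.0, step_ms=5.0):
    """Side length of the square search area after an iteration of delay_ms.

    Assumption: the delay is rounded up to the next multiple of step_ms
    before the formula 2 * max_speed * t + base_pixels is applied.
    """
    quantized = math.ceil(delay_ms / step_ms) * step_ms
    return 2.0 * max_speed * quantized + base_pixels


for t in (5.0, 7.5, 25.0):
    print(t, search_side_pixels(t))
```

Under this reading, a delay up to the 25 ms constraint yields five distinct window sizes, one per 5 ms step.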

Stable versus Alternating Operation Modes.
There are two kinds of cycles in WTG. The first one is a self-loop, which implies a stable operation mode where no mode changes happen over the edge. The second one is a non-self-loop cycle. From the perspective of the operation policy, a non-self-loop cycle implies a predefined sequence of mode changes that repeats over and over again. We call this an alternating operation mode. Many examples, including the real-life application shown in the previous subsection as well as the one presented in [11], tend to be optimal in a stable operation mode. However, in principle, the optimal solution cannot be achieved in a stable mode in some configurations. In order to illustrate this, we present a counterexample in Figure 7 and apply the proposed technique to it, with a DVFS power modeling function of $P(k) = 5 \cdot k^3$ and $K = \{0.4, 0.8, 0.9\}$. The delay thresholds and the real-time constraint are $th_1 = 0.2$, $th_2 = 0.6$, $th_3 = 0.7$, and $T = 1.0$, respectively. Figure 7(b) shows the derived WTG for the synthetic example.
The average power consumption of all self-loops in the synthetic example is tied at 2.5600 W. On the other hand, the alternating operation mode implied in $wl_2 \to wl_4 \to wl_3 \to wl_2$ shows the least average power consumption in the discrete model. This is due to the fact that a computer system cannot always operate at an ideal design point. In case that the theoretically optimal design point cannot be realized by commodity hardware, the proposed technique is particularly useful. It can effectively explore the design space and find the best solution among alternating modes.
In principle, a stable operation mode corresponds to the case that the staircase workload function $W(t)$ has a crossing point with $k \cdot t$. If this $k$ is sufficiently small, it is likely to be a near-optimal operation mode. However, it does not always result in a near-optimal power consumption. Particularly, in case that only a limited number of DVFS modes are available in a microprocessor, the $k$ which crosses the current workload level may not exist in $K$.
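The tie at 2.5600 W can be checked by hand. A self-loop repeats a single scaling factor, so its average power equals the power model evaluated at that factor. The reading that every self-loop in the example runs at $k = 0.8$ is an inference from the reported value, not something stated explicitly.

```latex
\[
  5 \cdot k^3 \Big|_{k = 0.8} = 5 \cdot 0.512 = 2.56\ \text{W},
  \qquad
  5 \cdot k^3 \Big|_{k = 0.4} = 0.32\ \text{W},
  \qquad
  5 \cdot k^3 \Big|_{k = 0.9} = 3.645\ \text{W}.
\]
```

Only $k = 0.8$ reproduces the reported 2.5600 W, which is consistent with all self-loops of the synthetic example sharing that factor.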

Conclusion
This paper formulates the delay-workload dependency in the power optimization problem of embedded systems as a staircase function of the delay taken at the previous iteration. In applying it to the power optimization of DVFS-enabled electronic devices, a novel graph representation, called WTG, is proposed for exploring all possible workload/mode changes. Then, it is shown that the power optimization problem is equivalent to finding a cycle of the graph that has the minimum average power consumption. The effectiveness of the proposed operation policy is demonstrated by power simulations of synthetic and real-life examples. It has been observed that staying at a low speed scaling factor in a stable operation mode (a self-loop in WTG) is often the best discipline. However, alternating modes, where the DVFS modes change over a predefined pattern, sometimes outperform the stable ones.

Figure 1: Examples of workload transitions: (a) transitions to lower workload levels and (b) transitions to higher workload levels. The transitions from $wl_i$ to $wl_j$ are valid, while the ones from $wl_i$ to $wl_m$ are not.

Figure 2: A workload staircase function example with six workload levels.

Figure 3: The corresponding WTG of the workload function $W$ shown in Figure 2 with the initial workload of (a) $wl_1$, (b) $wl_2$ or $wl_3$, and (c) $wl_4$, $wl_5$, or $wl_6$.

Figure 5: Flowchart diagram of the proposed operation policy.

Definition 10 (scaling factor in discrete DVFS). Given a set of feasible scaling factors $K$, the scaling factor of a valid workload transition from $wl_i$ to $wl_j$ is
$$sf(wl_i, wl_j) = \operatorname*{arg\,min}_{k \in K} \left\{ k \;\middle|\; th_{j-1} < \frac{wl_i}{k} \le \min(th_j, T) \right\}. \tag{27}$$