Power-Conscious Scheduling for Real-Time Embedded Systems Design

Power-efficient design of real-time embedded systems based on programmable processors becomes more important as system functionality is increasingly realized through software. We address a power optimization method for real-time embedded applications on a variable speed processor. The method combines off-line and on-line components. The off-line component determines the lowest possible maximum processor speed while guaranteeing the deadlines of all tasks. The on-line component dynamically varies the processor speed or brings the processor into a power-down mode to exploit execution time variations and idle intervals. Experimental results show that the proposed method obtains a significant power reduction across several kinds of applications.


INTRODUCTION
Recently, power consumption has become a critical design constraint in the design of digital systems due to widely used portable systems such as cellular phones and PDAs, which require low power consumption with high speed and complex functionality. The design of such systems often involves reprogrammable processors such as microprocessors, microcontrollers, and DSPs in the form of off-the-shelf components or cores. Furthermore, an increasing amount of system functionality tends to be realized through software, which is leveraged by the high performance of modern processors. As a consequence, reduction of the power consumption of processors is important for the power-efficient design of such systems.
Recognizing the need to reduce the power consumption of processors, a number of methods have been proposed at the hardware and software levels. The methods at the software level can be loosely classified into power-aware compilation techniques [1-3] and Operating System (OS) directed power management techniques. The latter approach has recently grown in importance because the OS is recognized to play a central role in the power management of overall system components.
Broadly, there are two kinds of methods to reduce the power consumption of processors at the OS level. The first is to bring a processor into a power-down mode, where only certain parts of the processor such as the clock generation and timer circuits are kept running. The second is to use a variable speed processor (VSP), which can change its speed by varying the clock frequency along with the supply voltage when the required performance of the processor is lower than the maximum.
Reducing the power consumption of processors is fundamentally equivalent to exploiting the idle intervals of processors. Thus, we should first identify the sources of idle intervals in order to reduce the power dissipated by processors efficiently. Our approach is strongly motivated by the fact that there are several sources of idle intervals in a schedule of a real-time task set. In the case of priority-based preemptive scheduling, which is one of the most widely used scheduling methods for real-time systems, we identify three sources. The first occurs when a system is not tightly designed for a given processor, meaning that there is room for design change or improvement: introducing more tasks, replacing certain tasks with upgraded versions, using other processors with lower performance, and so on. The second source is that, even if the system is tightly designed, fixed-priority scheduling still leaves idle intervals, which depend strongly on the relative values of the periods of the tasks comprising the system. The third source is the run-time variation of the execution time of each task; that is, the execution time of each task at run time is not constant due to data-dependent computation, over-estimation of the worst-case execution time, and so on. Each of these will be elaborated in more detail in Section 3.
To exploit these idle intervals for low power, we propose a power optimization method for real-time embedded applications on a VSP with a power-down mode. The proposed method consists of two components: an off-line component based on real-time analysis of a task set, which exploits the first source of idle intervals, and an on-line component based on priority-based real-time scheduling, which exploits both the second and the third sources. Specifically, for a given real-time task set, we first compute the lowest possible maximum processor speed, that is, the speed below which at least one deadline would be violated. With the maximum speed of the VSP set to the computed value, we then dynamically vary the speed of the VSP or bring the VSP into a power-down mode to exploit the execution time variation of each task and the idle intervals present in the schedule.
Note that all kinds of idle intervals can be exploited by the on-line component alone [4]. However, we show that combining the off-line and on-line components brings about more power saving.
The remainder of the paper is organized as follows. In the next section, we review related work, which focuses on the reduction of the power consumption of processors. In Section 3, we present the system model for power optimization, the off-line component, and the on-line component. In Section 4, experimental results are presented to evaluate the proposed method. Finally, a conclusion follows in Section 5.

Power-down Modes
In most embedded systems, a processor often waits for events from its environment, wasting power. To reduce this waste, modern processors are often equipped with various levels of power modes. In the case of the PowerPC 603 processor [5], there are four power modes (Full On, Doze, Nap, and Sleep), which can be selected by setting the appropriate control bits in a register. Each mode is associated with a level of power saving and delay overhead. In the conventional approach employed in most portable appliances, a processor enters a power-down mode after it has stayed in an idle state for a predefined time interval. Since the processor still wastes energy while in the idle state, this approach fails to obtain a large reduction in energy when idle intervals occur frequently but are short. In [6,7], the length of the next idle period is predicted based on a history of processor usage. The predicted value becomes the metric to determine whether it is beneficial to enter a power-down mode or not. This method focuses on event-driven applications such as user interfaces, where the latency caused by a mismatch between the predicted value and the actual value can be tolerated. However, when power-down modes are to be applied in a hard real-time system, an exact value or a lower bound for the next idle period is needed instead of a predicted value.

2.2. Scheduling on a Variable Speed Processor
It is a well-known fact that power consumption in CMOS circuits can be decomposed into two parts: static and dynamic. The dynamic power consumption, which is the dominant factor, is described by
$$P_{dynamic} = \alpha \cdot f \cdot C_L \cdot V_{dd}^2 \qquad (1)$$
where $\alpha$ is the expected number of transitions per cycle, called switching activity, $f$ is the clock frequency, $C_L$ is the average load capacitance, and $V_{dd}$ is the supply voltage. As (1) suggests, reducing $V_{dd}$ is the most effective way to reduce the power consumption. However, reducing $V_{dd}$ leads to an increase in circuit delay, denoted by $t$, which can be approximated by
$$t = \frac{k \cdot V_{dd}}{(V_{dd} - V_t)^c} \qquad (2)$$
where $k$ is a constant, $V_t$ is the threshold voltage, and $c$ is a constant satisfying $1 < c < 2$. A digital system designed with a fixed supply voltage ($V_{dd}$) works at a fixed speed and can then be made idle if the computational demand is less than the maximum. If the supply voltage is instead lowered dynamically to the lowest value satisfying the required speed constraint of the system, as given by (2), less power is consumed. This kind of adaptive scaling of the supply voltage was exploited in self-timed circuits [8] and DSP systems [9]. Recently, the same mechanism was adapted to a microprocessor architecture [10,11]. For example, [11] reports a processor based on the ARM microprocessor core, where the operating voltage is set by a feedback loop that compares the current and target frequencies.
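To make the interplay between the power and delay relations concrete, the following Python sketch finds the lowest supply voltage that still sustains a given clock frequency and reports the resulting relative dynamic power. It assumes the delay follows the usual form $t = k V_{dd}/(V_{dd} - V_t)^c$. The threshold voltage `V_T`, the exponent `C_EXP`, and all function names are illustrative assumptions, not values reported for the processor in [11]; only the reference point of 100 MHz at 3.3 V comes from the processor model used in this paper.

```python
# Illustrative assumptions: V_T and C_EXP are hypothetical, not from the paper.
V_REF = 3.3      # reference supply voltage (V)
F_REF = 100.0    # reference clock frequency (MHz)
V_T = 0.8        # assumed threshold voltage (V)
C_EXP = 1.5      # assumed exponent c, chosen inside 1 < c < 2


def relative_speed(vdd):
    """Speed (1/delay) at vdd relative to the speed at V_REF."""
    def speed(v):
        return (v - V_T) ** C_EXP / v   # 1/t up to the constant k
    return speed(vdd) / speed(V_REF)


def voltage_for_frequency(f_mhz, tol=1e-9):
    """Lowest Vdd whose delay still supports f_mhz, found by bisection."""
    target = f_mhz / F_REF
    lo, hi = V_T, V_REF                 # speed is 0 at V_T and 1 at V_REF
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if relative_speed(mid) < target:
            lo = mid                    # too slow: raise the voltage
        else:
            hi = mid                    # fast enough: try a lower voltage
    return hi


def relative_dynamic_power(f_mhz):
    """Dynamic power at f_mhz relative to the reference point (alpha, C_L fixed)."""
    vdd = voltage_for_frequency(f_mhz)
    return (f_mhz / F_REF) * (vdd / V_REF) ** 2
```

Because the voltage drops together with the frequency, `relative_dynamic_power(50.0)` comes out below 0.5 under these assumptions, which is the reason a VSP saves more than simply idling for half of the time at full speed.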
A scheduling method to reduce the power consumption of a VSP was first proposed in [12] and was later extended in [13]. The basic idea is that short-term processor usage is predicted from a history of processor utilization. From the predicted value, the speed of the processor is set to the appropriate value. Because latency exists when the prediction fails, these methods cannot be applied to real-time systems.
Static scheduling methods for real-time systems were proposed in [14][15][16]. The underlying model of these approaches is a set of tasks with a single period. When the periods of tasks differ from each other, which is the conventional model employed in real-time system design, we can transform the problem by taking the LCM (Least Common Multiple) of the tasks' periods as a single period and treating each instance of the same task occurring within the LCM as a different task. This can cause a practical problem because an excessively large memory space is required to save a statically computed schedule, whereas the size of memory is one of the design constraints in a typical embedded system. Furthermore, the LCM becomes excessively large when the periods of tasks are mutually prime. Another problem is that a schedule is computed based on the assumption that a fixed amount of execution time is required for each task. As a result, the full potential of power saving cannot be obtained when variations of execution time exist.
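The growth of the LCM can be illustrated with a short Python sketch. The function names and the period values in the usage note are hypothetical and chosen only for illustration.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def instances_in_hyperperiod(periods):
    """Number of task instances a static schedule must store per LCM."""
    h = hyperperiod(periods)
    return sum(h // t for t in periods)
```

For the hypothetical periods 50, 80, and 100, the LCM is 400 and a static schedule holds 17 task instances. For the mutually prime periods 7, 11, and 13, the LCM is already 1001 with 311 instances, although the periods themselves are much smaller.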
A dynamic scheduling method, called Average Rate Heuristic (AVR), was also proposed in [14] with the same model as in the static version. Associated with each task is its average-rate requirement, which is defined by dividing its required number of cycles by its time frame (deadline minus arrival time). At any time t, AVR sets the speed of the processor to the sum of the average-rate requirements of the tasks whose time frame includes t. Among the available tasks, AVR resorts to the earliest deadline policy [17] to choose a task. Because average-rate requirements are computed statically with fixed numbers of execution cycles, the same problem occurs when variations of execution time exist.
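The speed selection rule of AVR can be sketched in a few lines of Python. The tuple layout `(arrival, deadline, cycles)` and the half-open treatment of the time frame are assumptions made for this sketch, not details taken from [14].

```python
def avr_speed(tasks, t):
    """Sum of the average-rate requirements of tasks whose time frame contains t.

    Each task is a tuple (arrival, deadline, cycles). The time frame is
    treated as the half-open interval [arrival, deadline), an assumption.
    """
    return sum(cycles / (deadline - arrival)
               for arrival, deadline, cycles in tasks
               if arrival <= t < deadline)
```

With two hypothetical tasks `(0, 10, 20)` and `(5, 15, 30)`, the rule yields a speed of 2 cycles per time unit at t = 2, 5 at t = 7 where both frames overlap, and 3 at t = 12.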

System Model
For the processor model, we assume a VSP similar to [11]. The reference clock frequency, denoted by $f_{ref}$, and the reference supply voltage, denoted by $V_{ref}$, of the VSP are 100 MHz and 3.3 V, respectively. The clock frequency can be varied from 100 MHz down to 8 MHz with a step size of 1 MHz. The supply voltage is 3.3 V for the 100 MHz clock and, for lower clock frequencies, follows (2). We assume that there is only one power-down mode available. The average power consumed by the processor in power-down mode is 5% of that in the fully active mode, and it takes 10 clock cycles to return from the power-down mode to the fully active mode. The processor model described above serves only the simulation presented in Section 4. Our method can therefore be applied to other processor models, for example a processor with only two speed levels [18], although the resulting power saving may differ.
In a typical real-time embedded application, there are many periodic tasks that share hardware resources. To ensure that each task satisfies its timing constraint, the execution of tasks should be coordinated in a controlled manner. This is often done via a priority-based preemptive scheduling algorithm. There are two kinds of algorithms based on priority assignment: fixed-priority (or static-priority) algorithms such as rate-monotonic (RMS) [17] and deadline-monotonic (DMS) [19], and dynamic-priority algorithms such as earliest deadline first (EDF) [17]. Priority-based scheduling is quite simple to implement in most kernels, and it typically requires little if any extra hardware support. Also, there are many analytical methods to check the schedulability of the system.
The real-time embedded application is modeled as a set of tasks, $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$, which are numbered in order of decreasing priority in the case of fixed-priority scheduling (FPS). The parameters of $\tau_i$ include its period $T_i$ (the minimum inter-arrival time between successive requests in the case of a sporadic task), deadline $D_i$, and worst-case execution time (WCET) $C_i$. A task set is called feasible if the deadline of each task is satisfied at all times. Note that $C_i$ is measured or estimated [20] when the VSP is running at the reference speed ($f_{ref}$ and $V_{ref}$).
To minimize energy consumption while guaranteeing the feasibility of a task set, we first determine the lowest possible speed such that the task set is feasible if the VSP runs at that speed throughout, and infeasible if it runs at any lower speed. This can be done with an off-line method, as illustrated in the next subsection. Note that the worst-case scenario (all tasks execute for their WCET at all times) must be assumed in the off-line method. However, during operation of the system, the execution time of each task frequently deviates from its WCET, sometimes by a large amount. In many cases, the probability of a task running for its WCET is very low. Furthermore, the complex architecture of modern processors (pipeline, instruction cache, data cache, and so on) makes the static estimation of WCET difficult, thereby resulting in over-estimation of WCET. As examples of this variation in execution time, Figure 1 shows the ratio between the best-case execution time (BCET) and WCET obtained from [21] for a number of applications.
This execution time variation cannot be exploited with the off-line method alone. Furthermore, with fixed-priority scheduling, idle intervals still remain even if the VSP runs at the lowest possible speed throughout. To exploit the execution time variation and the idle intervals, we use an on-line method, where we dynamically vary the speed of the VSP or bring the VSP into a power-down mode according to the status of the task set.

FIGURE 1 The ratio between BCET and WCET for a number of applications.

Example 1 Consider the three tasks given in Table I. Rate monotonic priority assignment is a natural choice because the periods ($T_i$) are equal to the deadlines ($D_i$). Priorities are assigned in row order as shown in the fifth column of the table (a lower value means a higher priority). Assume all tasks are released simultaneously at time 0. A typical schedule, which assumes that tasks run for their WCETs ($C_i$), is shown in Figure 2a. If the speed of the processor is lowered by half, or if a processor with half the performance is used, meaning that $C_i$ is doubled, the schedule becomes as shown in Figure 2b. Note that the task set scheduled in Figure 2b is only just feasible. For example, if $\tau_2$ were to take a little longer to complete, $\tau_3$ would miss its deadline at time 100. Even though the system is tightly constructed, there are still idle intervals, as can be seen in Figure 2b. When some task instances complete earlier than their WCETs, there are more idle intervals, as shown in Figure 2c. These idle intervals are sources of power reduction for the on-line method.

3.2. Computation of Maximum Speed
For a given task set, the analysis of the schedulability of the task set is required in order to determine the lowest possible maximum processor speed (and thus the lowest possible maximum clock frequency, denoted by $f_{max}$, and the lowest possible maximum supply voltage, denoted by $V_{max}$). We first present the approach for fixed-priority algorithms and then the approach for dynamic-priority algorithms.
The schedulability analysis for fixed-priority scheduling is based on the critical instant theorem [17], which says that if a task meets its deadline whenever it is requested simultaneously with requests for all higher-priority tasks, then the deadline will always be met for all task phasings. This implies that the analysis must be performed from time 0 up to the LCM of all task periods under the assumption that all tasks are requested simultaneously at time 0. This, in turn, requires the analysis to be performed over a continuous time interval. Lehoczky et al. [22] show that the analysis is actually needed only at discrete time points instead of over a continuous time interval. The set of time points, called scheduling points, for task $\tau_i$ is defined by
$$S_i = \{\, k T_j \mid j = 1, \ldots, i;\ k = 1, \ldots, \lfloor T_i / T_j \rfloor \,\} \qquad (3)$$
when $T_i = D_i$. If $D_i$ is different from $T_i$, (3) can be modified as
$$S'_i = (S_i - \{\, t \mid t \in S_i,\ t > D_i \,\}) \cup \{D_i\}. \qquad (4)$$
$\tau_i$ can be scheduled without violating its deadline if there exists at least one scheduling point $t \in S_i$ that satisfies
$$\sum_{k=1}^{i} C_k \left\lceil \frac{t}{T_k} \right\rceil \le t. \qquad (5)$$
Note that the left-hand side of the inequality represents the cumulative demand on the processor imposed by $\tau_1, \tau_2, \ldots, \tau_i$. Now, assume that the elements of $S_i$ are sorted in ascending order, and let $S_{ij}$ denote the $j$th element of $S_i$, that is, the $j$th scheduling point of $\tau_i$. Then, for each scheduling point $S_{ij}$, $\tau_i$ just meets its scheduling point if
$$\frac{1}{\eta_{ij}} \sum_{k=1}^{i} C_k \left\lceil \frac{S_{ij}}{T_k} \right\rceil = S_{ij} \qquad (6)$$
where $\eta_{ij}$ is the speed scaling factor for $\tau_i$ at $S_{ij}$. For example, $\eta_{ij} = 1/2$ means that the speed of the processor is reduced by half, so the execution times of tasks are doubled. Solving for $\eta_{ij}$ gives
$$\eta_{ij} = \frac{\sum_{k=1}^{i} C_k \lceil S_{ij} / T_k \rceil}{S_{ij}}. \qquad (7)$$
Because $\tau_i$ is schedulable if it completes its execution at or before any one of its scheduling points, and the minimum possible speed scaling factor is needed for $\tau_i$ for minimum power consumption, the speed scaling factor for $\tau_i$, denoted by $\eta_i$, is given by
$$\eta_i = \min_j \eta_{ij}. \qquad (8)$$
In order to get a feasible task set, all tasks are required to be schedulable. Thus, the speed scaling factor for the task set, denoted by $\eta$, is given by
$$\eta = \max_i \eta_i. \qquad (9)$$
Note that if $\eta$ is larger than 1, the original task set is already infeasible, meaning that it cannot be scheduled with fixed-priority scheduling even at $f_{ref}$ and $V_{ref}$. Hence, $f_{max}$ (and correspondingly $V_{max}$) is obtained by
$$f_{max} = \eta \cdot f_{ref}. \qquad (10)$$
In practice, we should take $\lceil \eta f_{ref} \rceil$ for $f_{max}$ because discrete levels of frequencies are assumed. We also need a clamping operation so that $f_{max}$ falls between 8 MHz and 100 MHz.
For dynamic-priority scheduling, especially for EDF scheduling with $D_i = T_i$, a task set is feasible if and only if the processor utilization is less than or equal to 1 [17]. Thus, $\eta$ is straightforward to compute because it is equal to the processor utilization, given by
$$\eta = \sum_{\forall \tau_i} \frac{C_i}{T_i}. \qquad (11)$$
It should be noted that, if the processor runs throughout at the speed obtained with (11), there are no idle intervals, meaning that the power consumption of the processor is minimized, provided that a fractional value is possible for $f_{max}$ and each task always executes for exactly its WCET. When $D_i < T_i$, we can use $D_i$ instead of $T_i$ in the denominator of the right-hand side of Eq. (11); the sum is then called the total density instead of the processor utilization. Note, however, that $\eta$ obtained in this way is conservative in that the task set is feasible with EDF if the total density is less than or equal to 1, but the converse does not hold.
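The off-line computation can be sketched in Python as follows: the speed scaling factor is evaluated at every scheduling point, the minimum is taken per task and the maximum over the task set, and the result is rounded up to a 1 MHz level and clamped to the 8 to 100 MHz range of the assumed VSP. The function names and the `(T, D, C)` tuple layout are choices made for this sketch, integer timing parameters and $D_i \le T_i$ are assumed, and the task set in the usage note is hypothetical rather than a reproduction of Table I.

```python
from fractions import Fraction
from math import ceil

F_REF_MHZ, F_MIN_MHZ = 100, 8   # frequency range of the assumed VSP


def scheduling_points(tasks, i):
    """Scheduling points of task i (0-based, tasks in priority order).

    Each task is a tuple (T, D, C) of integers. Points beyond D are
    dropped and D itself is added when D differs from T.
    """
    T_i, D_i, _ = tasks[i]
    pts = {k * T_j for T_j, _, _ in tasks[:i + 1]
           for k in range(1, T_i // T_j + 1)}
    if D_i != T_i:
        pts = {t for t in pts if t <= D_i} | {D_i}
    return sorted(pts)


def eta_fixed_priority(tasks):
    """Speed scaling factor under fixed-priority scheduling (min, then max)."""
    def eta_i(i):
        return min(
            Fraction(sum(C * ceil(Fraction(s, T)) for T, _, C in tasks[:i + 1]), s)
            for s in scheduling_points(tasks, i))
    return max(eta_i(i) for i in range(len(tasks)))


def eta_edf(tasks):
    """Utilization for D = T; total density (conservative) for D < T."""
    return sum(Fraction(C, D) for _, D, C in tasks)


def f_max_mhz(eta):
    """eta * f_ref rounded up to a 1 MHz level and clamped to the VSP range.

    A value of eta above 1 means the task set is infeasible even at f_ref;
    the clamp then simply returns F_REF_MHZ.
    """
    return min(F_REF_MHZ, max(F_MIN_MHZ, ceil(eta * F_REF_MHZ)))
```

For the hypothetical task set `[(50, 50, 5), (80, 80, 10), (100, 100, 20)]`, `eta_fixed_priority` returns 1/2, giving 50 MHz, while `eta_edf` returns 17/40, giving 43 MHz after rounding up. Exact fractions are used so that a task set that is only just feasible is not misjudged because of floating-point rounding.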
Example 2 Consider again the three tasks given in Table I with rate monotonic priority assignment. From Eq. (3), the set of scheduling points for each task is given by $S_1 = \{T_1\}$, $S_2 = \{T_1, T_2\}$, and $S_3 = \{T_1, T_2, T_3\}$.
Thus, we can reduce the maximum speed by as much as half, or can use a processor with half the performance (see Fig. 2b).

3.3. Low-power Priority-based Real-time Scheduling
Even if the processor is running at the speed obtained with the method of the previous subsection, there are still idle intervals that arise from two sources (see Example 1). The first source is the idle intervals inherently present in fixed-priority scheduling (so this does not apply to EDF) because the tasks have different periods. The second is the run-time variation of the execution time of each task.
More specifically, although a constant execution time equal to the WCET has to be assumed in the method of the previous subsection, the execution time of each task at run time is not constant due to data-dependent computation, over-estimation of WCET, and so on. To exploit these idle intervals, we propose a power-efficient version of the priority-based real-time scheduling method, which we call lpps for brevity.
The basic mechanism of the proposed scheduling algorithm is based on the implementation model in [23,24]. The scheduler maintains two queues, one called the run queue and the other called the delay queue. The run queue holds tasks that are waiting to run, and the tasks in the queue are ordered by priority. The task that is running on the processor is called the active task. The delay queue holds tasks that have already run in their period and are waiting for their next period to start again. They are ordered by the time at which their release is due. When the scheduler is invoked, it searches the delay queue to see if any tasks should be moved to the run queue. If some of the tasks in the delay queue are moved to the run queue, the scheduler compares the active task to the task at the head of the run queue. If the priority of the active task is lower, a context switch occurs.
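The two-queue mechanism can be sketched with Python's `heapq` module. This is a minimal illustration under stated assumptions, not the implementation of [23,24]: the class and method names are hypothetical, a lower number means a higher priority, and the sketch also dispatches a task when the processor has no active task.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    run_queue: list = field(default_factory=list)    # heap of (priority, name)
    delay_queue: list = field(default_factory=list)  # heap of (release_time, priority, name)
    active: tuple = None                             # (priority, name) or None

    def invoke(self, now):
        """Release due tasks, then preempt if a higher-priority task is ready.

        Returns True when a context switch occurs.
        """
        # Move every task whose release time has arrived to the run queue.
        while self.delay_queue and self.delay_queue[0][0] <= now:
            _, prio, name = heapq.heappop(self.delay_queue)
            heapq.heappush(self.run_queue, (prio, name))
        # Compare the active task with the head of the run queue.
        if self.run_queue and (self.active is None
                               or self.run_queue[0] < self.active):
            nxt = heapq.heappop(self.run_queue)
            if self.active is not None:
                heapq.heappush(self.run_queue, self.active)  # preempted task waits
            self.active = nxt
            return True
        return False
```

Keeping the delay queue ordered by release time means its head always gives the next instant at which the processor will be needed, which is the information a power-aware scheduler relies on.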
Because most information about the tasks is available through the queues and lpps depends on this information, the proposed scheduler can be implemented with a slight modification of the conventional scheduler. Figure 3 shows the pseudo code of the lpps scheduling algorithm. The code lines between L5 and L11 (except L9, to be explained shortly) conform to the behavior of the conventional scheduler. lpps comes into play when the run queue is empty (L12). This is further divided into two cases: one where all tasks have completed their executions in their current periods and are waiting for their next arrival times in the delay queue (L13), and the other where all tasks except the active task have completed their execution (L16). In the first case, we can bring the processor into a power-down mode because there are no tasks that need it. Furthermore, we know how long the processor will be idle because the task at the head of the delay queue is the first one that will require the processor (recall that the delay queue is ordered by the tasks' release times). This is the key ingredient of lpps. Thus, we set a timer to expire at the next release time of the task at the head of the delay queue and then put the processor into the power-down mode. Because there is a delay overhead to wake up from the power-down mode, the timer should actually be set to expire earlier by that amount of delay (L14).
In the second case, we can control the speed of the processor because there is just one task (the active task) to execute, and the processor will be available solely for that task until the earlier of the deadline of the active task and the release time of the task at the head of the delay queue. The amount of time that will be needed by the active task equals its WCET less its already executed time. The executed time is recorded whenever a task is preempted by a request for a higher-priority task during its execution (L8): at that moment, we read the executed time of the task from a timer (L9) that is driven by an external clock and is therefore independent of variations in the processor's speed. Note that we assume the execution of the whole task takes its WCET because, at the time of scheduling, we have no information on whether it will take less than its WCET or not. When the active task completes its execution, the scheduler gets control and increases the speed of the processor to the maximum to prepare for the next arrival of tasks (L1 through L4). This involves a delay for raising the supply voltage and subsequently the clock frequency. Thus, the active task should actually complete its execution earlier by an amount equal to this delay. Considering all these factors, we obtain the ratio of the processor speed needed for the active task to the full speed (L17). From the computed ratio, we find an appropriate clock frequency (L18). In practice, only discrete levels of frequencies are available, and among them we should select a frequency larger than or equal to the computed one to guarantee the timing constraints. All these steps are illustrated in the following example.
Example 3 Consider Figure 2b, that is, the same task set as in Table I with $C_i$ doubled. At time 160, when a request for $\tau_2$ arrives, the status of the queues and the information associated with each task are as shown in Figure 4a. For simplicity of illustration, assume that the delay required to wake up from the power-down mode and the delay required to change the speed of the processor are both 0. Because the run queue is empty and the active task is $\tau_2$, the scheduler computes the desired ratio of speed, which yields $(20 - 0)/(200 - 160) = 0.5$ (see L17 of Fig. 3).
Thus, we can slow down the processor by half. Now, assume that the instance of $\tau_2$ started at time 160 executes at the lowered speed but completes its execution at time 180 instead of 200, meaning that it executes in half its WCET. At this time, the status of the queues becomes that of Figure 4b.
Because all tasks reside in the delay queue, the scheduler brings the processor into a power-down mode (see L14 and L15 of Fig. 3) with the timer set to the next arrival time of $\tau_1$ (200).
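The decision that lpps takes when the run queue is empty can be sketched as a single Python function. This is not the pseudo code of Figure 3; the function name, the dictionary layout of the active task, and the expression of both delays in the same time units as the schedule are assumptions of this sketch.

```python
from math import ceil


def lpps_decide(now, active, next_release, wakeup_delay=0, speedup_delay=0,
                f_max_mhz=100, f_min_mhz=8):
    """Decision taken when the run queue is empty.

    active is None, or a dict with 'wcet' and 'executed' (both measured at
    f_max_mhz) and 'deadline'. next_release is the release time of the task
    at the head of the delay queue.
    Returns ('power_down', wake_time) or ('run', frequency_in_mhz).
    """
    if active is None:
        # All tasks wait in the delay queue: sleep, and wake early enough
        # to absorb the wake-up delay.
        return ('power_down', next_release - wakeup_delay)
    remaining = active['wcet'] - active['executed']
    # The processor belongs to the active task until its deadline or the
    # next release, minus the time needed to raise the speed again.
    window = min(active['deadline'], next_release) - now - speedup_delay
    if window <= 0:
        return ('run', f_max_mhz)
    ratio = min(1.0, remaining / window)
    # Round up to an available 1 MHz level so timing is still guaranteed.
    f = max(f_min_mhz, ceil(ratio * f_max_mhz))
    return ('run', min(f, f_max_mhz))
```

With the numbers of Example 3 and a maximum frequency of 50 MHz, `lpps_decide(160, {'wcet': 20, 'executed': 0, 'deadline': 240}, 200, f_max_mhz=50)` selects 25 MHz, that is, half of the maximum speed; the deadline of 240 is an assumed value, and any deadline of 200 or later gives the same result. At time 180, `lpps_decide(180, None, 200)` powers the processor down with the timer set to 200.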

EXPERIMENTAL RESULTS
To evaluate the proposed method, we perform simulations with several examples and compare the average power consumption with the proposed method against that with conventional priority-based scheduling. In conventional priority-based scheduling, the processor is assumed to execute NOP (no operation) instructions when it is not occupied by any task. The average power consumed by a NOP instruction is assumed to be 20% of that consumed by a typical instruction [25]. We also compare the result with that of [4].
The first three examples are mission-critical applications and the last one is a digital controller for a CNC machine, which is an automatic machining tool that is used to produce user-defined workpieces. For each task comprising an application, three timing parameters ($T_i$, $D_i$, and $C_i$) are given. Because the statistics of the actual execution times of instances of the tasks are not available, it is assumed that the execution time of each instance of a task is drawn from a Gaussian distribution with mean $m = (BCET + WCET)/2$ and standard deviation $\sigma = (WCET - BCET)/6$, where $WCET = C_i$. Then, the BCET is varied from 10% to 100% of the WCET for each task. This choice of $m$ and $\sigma$ ensures that almost all generated values fall between BCET and WCET, because the probability that a Gaussian random variable $x$ takes on a value in the interval $[m - 3\sigma, m + 3\sigma]$ is approximately 99.7%. Setting WCET equal to $m + 3\sigma$ and solving for $\sigma$ with the help of the equation for $m$ gives the equation for $\sigma$. After generating an execution time, we apply a clamping operation so that the generated value does not exceed WCET.
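The generation of execution times described above can be sketched in Python with the standard `random` module. The function name and the `rng` parameter are choices made for this sketch; as in the text, only the upper end is clamped.

```python
import random


def sample_execution_time(wcet, bcet_ratio, rng=random):
    """Draw one execution time for a task whose BCET is bcet_ratio * WCET."""
    bcet = bcet_ratio * wcet
    mean = (bcet + wcet) / 2
    sigma = (wcet - bcet) / 6          # so that WCET = mean + 3 * sigma
    # Clamp so that the generated value never exceeds the WCET.
    return min(wcet, rng.gauss(mean, sigma))
```

Passing a seeded generator, for example `random.Random(0)`, makes a simulation run repeatable. With `bcet_ratio` set to 1.0, the standard deviation is zero and every instance takes exactly its WCET, which corresponds to the worst-case scenario assumed by the off-line method.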
First, $f_{max}$ and $V_{max}$ are obtained for each application using Eqs. (9) and (11); they are summarized in Table II. Clearly, they are smaller with EDF than with FPS, because EDF sets the lower bound for $f_{max}$ and $V_{max}$. In the case of ins, $f_{max}$ with FPS is very close to that with EDF, meaning that very high processor utilization is possible even with FPS. This is because most periods of the tasks in ins are harmonic, that is, the periods are integer multiples of one another.
Next, with the maximum speed of the VSP set to the corresponding value shown in Table II, each task set is simulated with lpps. The results are shown in Figure 5, where lpps/RMS indicates that RMS is used as the basic scheduling algorithm of lpps, and lpps/EDF similarly for EDF. The vertical axis indicates the average power reduction with each method compared to conventional priority-based scheduling (see Fig. 2). Note that the power gain from the off-line method is independent of the horizontal axis because the worst-case scenario is assumed in that method. The power gain from the on-line method increases as the BCET gets smaller (that is, as the variation of execution time gets larger). This is because the chances both for dynamically varying the speed of the VSP and for bringing the VSP into a power-down mode increase as the variation of execution times increases. The largest gain is obtained for cnc. This can be understood from Table II: cnc can be operated at the lowest speed, meaning that its processor utilization at the reference speed is the lowest. Compared to the on-line method alone, we obtain more power saving with the combined off-line and on-line methods.

CONCLUSION
In this paper, we propose a power optimization method for a real-time embedded application on a variable speed processor. The method consists of two components. First, we determine the lowest possible processor speed such that the task set is feasible if the processor runs at that speed throughout, and infeasible if it runs at any lower speed. Then, to exploit execution time variation and idle intervals, we rely on low-power priority-based real-time scheduling, which dynamically varies the speed of the VSP or brings the processor into a power-down mode. Experimental results show that the proposed method obtains a significant power reduction across several applications.

FIGURE 2 A schedule for the example task set. (a) When tasks always run at their WCETs. (b) When tasks always run at their WCETs on a processor with the speed lowered by half. (c) When the execution times of some task instances are smaller than their WCETs.

FIGURE 4 The status of queues and the information associated with each task (a) at time 160 and (b) at time 180.

TABLE II Maximum frequency and voltage computed for each application; $f_{ref}$ = 100 MHz and $V_{ref}$ = 3.3 V