Processor Energy Characterization for Compiler-Assisted Software Energy Reduction

Energy consumption is a fundamental barrier to taking full advantage of today's and future semiconductor manufacturing technologies. This paper presents our recent research activities and results on characterizing and reducing the energy consumption of embedded systems. First, a technique for characterizing the energy consumption of embedded processors during application execution is presented. The technique trains a per-processor linear approximation model by fitting it to the energy consumption of the processor obtained by postlayout simulation. Second, based on this energy model, the paper presents techniques for reducing the energy consumption by optimally mapping program code, stack frames, and data items to the scratch-pad memory (SPM) of the processor memory space.


Introduction
There is a wide consensus that energy consumption is a fundamental barrier to taking full advantage of today's and future semiconductor manufacturing technologies. This paper presents our recent research activities and results in two categories: estimating software energy consumption and reducing the energy required for accessing the memory subsystem. First, the paper presents a technique to estimate the instantaneous energy consumption of embedded processors during application execution. The technique trains a per-processor energy model which receives statistics from the processor instruction-set simulator (ISS) and gives the instantaneous energy consumption. Second, the paper presents techniques for reducing the energy consumption by optimally mapping program code, stack frames, and static data items to the scratch-pad memory (SPM) which is integrated on the processor chip.
In the rest of the paper, Section 2 presents the instantaneous energy estimation technique, Section 3 provides various techniques for reducing the energy consumption of the memory subsystem, and Section 4 summarizes and concludes the paper.

Software Energy Analysis
This section gives an overview of our energy characterization tool, which helps designers develop a fast and accurate energy model for a target processor-based system. Our tool uses a linear model for energy estimation and finds the coefficients of the model using multiple linear regression analysis. For more detailed information on our tool, see [1].

Related Work.
The most accurate and fastest approach to finding the energy consumption of software running on a processor is to measure the power consumption of the actual chip. Tools like PowerScope [2] and Itsy [3] use computer-controlled multimeters or A/D converters to measure energy consumption. The major drawback of PowerScope is that it cannot measure the energy consumption of individual subsystems (e.g., a memory system) separately. Although Itsy overcomes this issue, it cannot measure the energy consumed in a short period of time because the energy consumption is averaged over the entire execution time. In recent years, many instruction-level energy modeling and analysis techniques have been proposed. The idea of instruction-level energy modeling by measuring the power consumption of each instruction while it is executed in a loop was introduced in [4,5]. The accuracy of these methods was improved by accounting for data dependencies, the effects of instruction and data addresses, register file addresses, and operand values [6]. The main drawback of these techniques is that they rely on exhaustive simulation to find the energy consumption of each instruction and the interinstruction effects on the power consumption. The efficiency of the characterization can be improved by performing measurements on a limited subset of instructions and instruction sequences [7,8]. In [9,10], the same average energy is assumed for all instructions, and the average power consumption while running an application program is calculated using the operating voltage of the processor and the clock frequency. In [11], the authors measure the average energy consumed in each pipeline stage of a VLIW processor using a cycle-accurate simulator (e.g., Trimaran [12]) to improve the accuracy. These techniques estimate the average energy consumption over the entire program execution, while many software-level energy optimization techniques need cycle-accurate energy estimation. The technique described in [13] estimates the power consumption of the target processor cycle by cycle. However, this requires calculating the power consumption of every gate for each instruction, which is very time consuming. Most existing energy estimation techniques, including those presented in [4-11, 13-21], assume a linear approximation model for estimating the energy consumption of software running on a processor. However, none describes how to generate the linear approximation model for accurately characterizing the energy consumption of the processor. In [21], Tan et al. modeled the software energy consumption using a linear equation and discussed the parameters required to accurately estimate the energy consumption. However, they did not provide any method for finding the parameters, the corresponding coefficients, and the test benches required for accurate energy modeling. Our energy characterization framework assists designers in finding an accurate linear model for estimating the energy consumption of a processor-based embedded system. In our framework, designers can find a training bench suitable for the energy characterization of the target embedded system and can optimize the corresponding coefficients through multiple regression analysis. We use parameters which can be extracted through the GNU C debugger, which is provided for almost all types of commercial embedded processors. Therefore, our approach does not need a cycle-accurate ISS, which is not provided for many types of embedded processors and is usually more expensive than a cycle-inaccurate one like a GNU debugger.

A Target System.
We target a processor system which consists of a CPU core, an instruction cache, code and data on-chip scratch-pad memories (SPMs), and SDRAM as an off-chip main memory, as shown in Figure 1.
The following three types of processors have been used in the experiments validating our approach in [1]: (1) M32R-II, a 32-bit RISC microprocessor originally developed by Renesas Technology Corporation, (2) SH3-DSP, a 32-bit RISC microprocessor of Renesas Technology Corporation, and (3) the multiperformance processor [22,23], which is based on the Media embedded Processor (MeP) developed by Toshiba. All three microprocessors have on-chip cache memories and scratch-pad memories (SPMs). The processors are synthesized using a 0.18 μm CMOS standard cell library and an SRAM module library. In this paper, for the sake of concision, only (3), the multiperformance processor, is used in the results section.

Processor Energy Characterization.
The energy consumption of a processor can be estimated using the following linear formula:

$$E_{\mathrm{estimate}} = \sum_{i=1}^{N} c_i \cdot P_i, \qquad (1)$$

where the $P_i$'s, the $c_i$'s, and $N$ are the parameters of the model, the corresponding coefficients, and the number of parameters, respectively. The first step of the modeling is to find the $P_i$'s required for estimating the energy consumption of the target processor system. The $P_i$'s should be parameters whose values can be easily obtained using a fast simulator like an ISS. For example, the $P_i$'s can be the number of load and store instructions executed, the number of cache misses, and so forth. Once the required set of parameters is obtained, the next step is to find a training bench for the energy characterization. Note that the number of cycles simulated for the training bench is much smaller than that for the target application programs. In our tool, the generated training bench is simulated for only about 500,000 cycles, while the full simulation of the target application programs needs billions of cycles. A more detailed explanation of our method for generating the training bench is presented in the following subsection. The final step is to find the coefficients $c_i$ corresponding to the $P_i$'s. This is done using multiple linear regression analysis. The energy consumption $E_{\mathrm{estimate}}$ is then calculated using (1). Figure 2 shows an overview of our energy characterization flow.
To obtain the reference energy values, we simulate the processor system at gate level for a fixed number of instructions. We refer to this number of instructions for a test sequence as the instruction frame. The width is the same for all the instruction frames, as shown in Figure 3. Since we perform gate-level simulation and calculate the energy consumption values for all the instruction frames, this step is time consuming. However, it needs to be done only once for a given processor system.
Next, we produce an instruction trace for each application program using an instruction-set simulator. The traces are divided into small segments, each corresponding to one instruction frame. The parameters $P_i$ are obtained from these instruction traces. Then, for a set of $P_i$'s, we find the coefficients which minimize $|E_{\mathrm{estimate}}(i) - E_{\mathrm{gate\text{-}level}}(i)|^2$ over the instruction frames, where $E_{\mathrm{gate\text{-}level}}(i)$ and $E_{\mathrm{estimate}}(i)$ are the energy consumption values obtained by gate-level simulation and through (1) for the $i$th instruction frame, respectively.

[Figure 2: Overview of the energy characterization flow. Recoverable labels: library, training bench, gate-level simulation, instruction-set simulation, linear programming, linear equation.]
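The coefficient fitting described above can be illustrated with a minimal least-squares sketch. It is not the authors' tool: the parameter choice (load/store count, cache misses, branch misses), the numeric values, and the function names `fit_coefficients` and `estimate_energy` are assumptions made for illustration. The reference energies are generated from known coefficients so that the fit can be checked.

```python
def fit_coefficients(params, energies):
    """Least-squares fit of c in E ~ sum_i c[i] * P[i], via the normal equations."""
    n = len(params[0])
    # Build A = P^T P and b = P^T E.
    a = [[sum(row[i] * row[j] for row in params) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * e for row, e in zip(params, energies)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("parameters are linearly dependent; the fit is ill-posed")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (b[i] - s) / a[i][i]
    return c


def estimate_energy(coeffs, frame_params):
    """Apply the linear model to the parameters of one instruction frame."""
    return sum(c * p for c, p in zip(coeffs, frame_params))


# Hypothetical per-frame parameters: [load/store count, cache misses, branch misses].
frames = [[10, 1, 4], [20, 0, 2], [5, 3, 7], [12, 2, 1], [30, 5, 3]]
# Hypothetical gate-level reference energies, generated from the known
# coefficients (2, 15, 4) so that the result of the fit can be verified.
reference = [51.0, 48.0, 83.0, 58.0, 147.0]

coeffs = fit_coefficients(frames, reference)
print(coeffs)
print(estimate_energy(coeffs, [10, 1, 4]))
```

The `ValueError` branch corresponds to the case where parameters carry no independent information, which is one reason the choice of training bench matters.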

Training Bench Generation.
As described in the previous subsection, the selection of the training bench is the most important step in the energy characterization. Figure 4 shows a motivational example. Both (a) and (b) of Figure 4 show energy estimation results for a JPEG encoder and an MPEG2 encoder run on a target processor system. For the results in the left part of the figure, a file compression program, compress, is used as the training bench and is executed for 500,000 instructions by gate-level simulation for the energy characterization. Through the characterization, a linear equation for estimating the energy consumption of the target processor system is obtained. Then the estimated energy consumption values are compared with the gate-level energy consumption results. As can be seen from the figure, the energy estimation error is very large: 67% on average and more than 1000% in the worst case. There are two major reasons for this estimation error.
(1) The standard deviations of some parameter values are too small. For example, the numbers of cache misses are constant for all the instruction frames. In this case, it is difficult to identify the impact of the cache misses on the energy consumption of the processor.
(2) Some parameters are strongly correlated with each other. For example, the numbers of cache misses and the numbers of branch misses have similar trends over the entire training bench. In this case, it is difficult to distinguish the impact of cache misses from that of branch misses.
If we carefully generate the training bench, the accuracy of the energy estimation can be improved drastically. The right part of Figure 4 shows the estimation results of our approach. The only difference between the left and right results in Figure 4 is the training bench. We generate the training bench using the standard deviation of every parameter value and the correlation factor between any two parameters as criteria. As a result, the estimation error of our approach is 2% on average and 16% in the worst case. Figure 5 shows the flow of our approach for generating the training bench. The generation starts from a training bench template which consists of subroutines that repeatedly execute power-hungry instructions, such as data-access instructions, and produce many cache misses, many read-after-write hazards, and other pipeline stalls. The parameter values are extracted using the ISS. This process takes a few seconds. Then the standard deviation of every parameter value and the correlation factor of any two parameters are evaluated. If the standard deviations of some parameter values are lower than a specific value, or if the correlation factors of some parameters are higher than a specific value, the initial training bench is modified so that the standard deviations and the correlation factors are improved. This process is repeated until both criteria are satisfied.
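The two acceptance criteria can be sketched as a small check over the per-frame parameter values. This is only an illustration: the thresholds `min_std` and `max_corr`, the parameter names, and the sample values are hypothetical, since the text speaks only of "a specific value".

```python
from itertools import combinations
from math import sqrt


def std_dev(values):
    """Population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sx, sy = std_dev(xs), std_dev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # undefined here; such a parameter is caught by the std-dev check
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)


def check_training_bench(columns, min_std, max_corr):
    """Return (parameters with too little variation, too strongly correlated pairs)."""
    low_variation = [name for name, vals in columns.items() if std_dev(vals) < min_std]
    correlated = [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(correlation(columns[a], columns[b])) > max_corr
    ]
    return low_variation, correlated


# Hypothetical parameter values, one entry per instruction frame.
bench = {
    "cache_misses": [3, 3, 3, 3],    # constant: impact cannot be identified
    "branch_misses": [1, 2, 3, 4],
    "loads": [2, 4, 6, 8],           # follows branch_misses exactly
    "stalls": [5, 1, 4, 2],
}
low_variation, correlated = check_training_bench(bench, min_std=0.5, max_corr=0.9)
print(low_variation, correlated)
```

A bench that fails either check would be modified and re-evaluated, as in the iterative flow described above.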

Debugger-Based Energy Estimation.
Once the energy model is developed, the energy consumption of software running on the processor system can be estimated using a cycle-inaccurate instruction-set simulator (ISS) whose speed is 350,000 instructions per second. We used our tool to characterize the energy consumption of three commercial microprocessors with their on-chip caches, SPMs, and an off-chip SDRAM. Experimental results using three benchmark programs demonstrate that the error of our technique is 5% on average compared to the postlayout gate-level estimation results. Figure 6 shows the energy consumption results estimated for an MPEG2 encoder program executed on the MeP processor. The line chart represents the results of postlayout gate-level energy estimation. The colored portion of the graph represents the energy consumption estimated by our approach. The results show that the estimates of our energy model closely follow the gate-level results.
Today's SoC chips are usually implemented with off-the-shelf processor IPs. Even for those SoC chips, our method can accurately model the energy consumption since it does not need to know the detailed internal architecture of the target processor. Another key point of our method is that it works very well even with a cycle-inaccurate simulator like a GNU debugger, which is a de facto standard software debugger. This helps compilers or programmers customize software code to meet customers' needs for low power.

Memory Energy Reduction
SPMs are small on-chip memories which are much faster and consume much less energy than off-chip memories. Compared to a cache, an SPM requires explicit software management but is more deterministic and consumes significantly less energy. In this context, two fully software techniques are presented which place into these SPMs the code and the data that are most often accessed. Both techniques are then merged to achieve a more general and efficient management of the SPM. The first technique, detailed in [24], is applied to single-task applications and assigns frames of the stack to the data SPM. This is motivated by the fact that the stack is usually one of the most often accessed data memory objects. For instance, with the MiBench [25] benchmark suite, stack accesses represent about 60% of the total data memory accesses. The second technique, detailed in [26], shares the spaces of the code and the data SPMs among the static memory objects of several tasks. This second technique is then merged with the stack technique for an efficient usage of the SPMs with the majority of the memory objects, that is, the code, the static data, and the stack.
Both techniques use profile information from the target application to generate integer linear programming (ILP) formulations whose objective functions, to be minimized, model the energy consumption related to memory accesses. For each technique, the solutions of the formulations give the optimal configuration of the respective management for each SPM. Usually, only the static data and the code memory objects are considered since they are allocated at compile time. Some techniques decide at compile time which of those objects are to be placed in the SPM and which are to be left in the main memory (MM) [27-37]. In [38] the previous approaches are extended with the possibility of splitting some arrays so that only their often accessed parts are placed into the SPM. Recent works like [39] also consider the case where the size of the SPM is not known at compile time: they propose a technique which modifies the code at run time, during an initialization phase, in order to take advantage of the discovered SPM size.
Dynamic memory objects like the stack and the heap are less often considered. Nevertheless, a few methods do exist for managing the stack between the SPM and the MM [40-43]. The main idea is to place the most frequently accessed stack variables in the SPM while leaving the others in the MM. Some of them require specific hardware features, like [42,43] which use a memory management unit (MMU) or [44] which presents an SPM controller device. Embedded systems with tight constraints often cannot afford such features; hence fully software methods like [24,40,41] are required. In [40], it is proposed to dynamically manage a circular buffer of stack frames inside the SPM. While interesting, especially because it supports recursive functions, this approach is suboptimal. By contrast, [41] proposes to find at compile time an optimal allocation for the stack variables. Two levels of granularity are studied: the frame level, where the stack frames are considered monolithically, and the variable level, where the stack variables are considered independently of each other. The implementation of the variable level is questionable though, since, when compiled, the local variables are usually assigned to registers, and the content of the frames actually corresponds to the spill code. This is why in [24] we prefer using the frame level only and obtain optimization opportunities by supporting moves of a part or the totality of frames during the execution of the application.
The heap remains the most difficult memory object to manage and place in an SPM. Usually, prefetching and additional hardware are the solutions [43,44]. Among the recent works, [45] proposes a simple API for a semiautomatic management of the heap between the SPM and the MM for the case of a CELL processor.
Techniques which move objects between the SPM and the MM at run time, like [24, 43, 46-50], are complementary to the compile-time ones and can improve the usage of the limited SPM space. They can support code memory objects [46,50], static memory objects [48,49], the stack [24], or all the memory objects [43]. However, these techniques have an overhead for moving the memory objects, which can surpass the gain of using the SPM. This overhead can be bounded, though, when the displacement of the memory objects is fixed at compile time, as is the case with [24, 46, 48-50].
In multitask environments, the SPM is to be shared among several tasks for the placement of their memory objects. Several techniques exist for performing this sharing for static memory objects, that is, code and static variables [31-37]. Although they target multitask systems, some approaches like [31,35] do not use the scheduling as a parameter for optimizing the usage of the SPM. Other approaches target systems with a static scheduling [32,33] or assume the knowledge at compile time of the full execution flow of the tasks [34]. Finally, a few approaches target preemptive systems, which are also the target systems in this paper. For instance, [26,36,37] optimize the SPM usage while using the properties of a static priority-based real-time preemptive system.

Stack Placement and Management.
Contrary to the static memory objects (i.e., code or static variables), the size of the stack changes during the execution of the application. More precisely, when a function starts its execution, it allocates a new frame on top of the stack by decreasing the stack pointer register (SP). This frame is used for placing the function's local variables which are not assigned to registers. Eventually, just before returning, the function destroys this frame by increasing SP. Figure 7 illustrates this mechanism by giving the evolution of the content of the SPM during the execution of function $f_1$, which calls another function $f_2$. A consequence is that the stack can grow larger than the SPM during the execution of the application. Furthermore, as frames are usually not equally accessed, it may be inadequate to assign all of them to the SPM, as the space they use could be more efficiently used by other memory objects.
In this context, the technique presented here manages two substacks, one in the MM and one in the SPM. The frequently accessed frames are allocated on top of the SPM substack while the others are allocated on top of the MM substack. The technique also supports moves of a part or the totality of frames during the execution of the application in order to free space in the SPM to be used by further frames.
Warp and unwarp operations are to be inserted just before and just after the call instructions where the substack of the coming frame is to be different from that of the current top one. This is enough because it is only when entering and leaving a function that the state of the stack changes. A store operation is to be inserted just before a call instruction (or, if present, just before the unwarp operation preceding the call) when the top frame is to be moved (partly or fully) from the SPM to the MM. It is not necessary to insert store operations in other places in the program because it is only there that the stack grows. Symmetrically, a load operation is to be inserted just after a call instruction (or, if present, just after the warp operation following the call instruction) when the top frame is to be restored to the SPM. Figure 8 illustrates the proposed management with the new stack operations. In the figure, the evolution of the content of the SPM is shown while four functions, $f_1$, $f_2$, $f_3$, and $f_4$, are executed. Initially, frames 0 and 1 are assigned to the MM. Then a warp operation in $f_1$ allows frame 2 to be allocated to the SPM. A store operation on frame 2 follows, executed during $f_2$, which frees some SPM space. This space is not immediately required since frame 3 is allocated to the MM, but it is indeed used by frame 4. This possibility of storing a frame ahead of the need is the reason why the store and the load operations have been limited to act only on the respective top frames of the MM and the SPM.
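The insertion rules around a call instruction can be summarized in a short sketch. It rests on assumptions: the function name `operations_around_call` and its parameters are hypothetical, "warp" is taken to switch the stack pointer from the MM substack to the SPM substack and "unwarp" to do the reverse (which the Figure 8 description suggests), and every stored frame is assumed to be reloaded after the call.

```python
def operations_around_call(caller_in_spm, callee_in_spm, store_top_frame=False):
    """Return (ops before the call, ops after the call) as lists of operation names."""
    before, after = [], []
    if store_top_frame:
        if not caller_in_spm:
            raise ValueError("only a top frame located in the SPM can be stored")
        # The store precedes the substack switch, if any.
        before.append("store")
    if caller_in_spm != callee_in_spm:
        if callee_in_spm:
            # Assumption: warp moves the stack pointer to the SPM substack.
            before.append("warp")
            after.append("unwarp")
        else:
            before.append("unwarp")
            after.append("warp")
    if store_top_frame:
        # Assumption: the stored frame is always restored after the call.
        after.append("load")
    return before, after


# Caller frame in the SPM is stored, and the callee frame goes to the MM.
print(operations_around_call(True, False, store_top_frame=True))
# Caller frame in the MM, callee frame in the SPM.
print(operations_around_call(False, True))
```

The first call mirrors the situation of $f_2$ in the Figure 8 description, where frame 2 is stored and frame 3 is allocated to the MM.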

Limitations.
In conventional programs, the stack is located through SP. Yet, this does not prevent accessing it relative to other registers, provided that their values are computed from SP. Therefore, if translation operations are inserted into arbitrary places of the code, the values of several registers and also some memory contents (e.g., in the case of spill code) have to be updated at the same time. This requires both a deep data dependency analysis to identify the places to update and substantial assembly code modifications. Both are complex to carry out and, more importantly, the energy cost of the code modifications can exceed the gain achieved by using the SPM. Moreover, such modifications are likely to change the size of the frames, which could invalidate the inserted stack operations. Consequently, implementing a variable-level approach like [41] is, at the least, very difficult in practice.
This is why the technique proposed in this paper requires that, when accessed, a frame be either fully in the SPM or fully in the MM. Furthermore, when a frame is stored, it needs to be loaded symmetrically unless it is not accessed before the next call.

Difficulties.
Even though the technique is simple, several cases require additional care. First, when the arguments of a function are too numerous, some of them cannot be assigned to registers and are instead placed into the caller's frame. They are then accessed by the called function, but relative to its own frame. This requires both frames to be contiguous; that is, they cannot be assigned to different substacks. However, [24] can allow such frames to be in different substacks by moving the corresponding stack operations before the instructions which put the extra arguments into the stack.
Second, when references to the current frame are passed as arguments to another function, the frame cannot be stored since it would invalidate the reference.
Third, the user often cannot modify the code of library functions and hence cannot insert new stack operations into them. However, it is still possible to assign their frames to the SPM by considering each call to such a library function as a call to a large monolithic function whose frame size is the maximum stack requirement during its execution.
Fourth, when a function $f$ is called from different points in the program, its frame may be assigned to different substacks depending on where it has been called from. This is not yet a problem, as the substack is selected before calling $f$. However, if $f$ calls another function $g$, $g$'s frame too might be assigned to different substacks depending on where $f$ has been called from. Hence, depending on where $f$ has been called from, different stack operations might need to be executed during $f$. A simple solution is to forbid stack operations for this frame. A more refined solution is presented in [24], where predicates computed from the return address of $f$ are inserted for selecting the stack operations to execute.
Last, recursive functions are difficult to handle since the same frame can be assigned several times in sequence. If the number of recursions is known at compile time, a single large frame which will contain all the frames of the recursion can be assigned. Otherwise, the technique cannot support them and they must be assigned to the MM substack. In [24] a more efficient solution is described which allocates on top of the SPM substack a circular buffer (already presented in [40]) for the frames of a recursive function. The code of the function is then updated for managing this buffer so that, when it is full, its oldest frame is evicted to the MM.

Compile-Time Selection of the Stack Operations.
The stack operations to insert are decided through an ILP whose objective function, to be minimized, models the energy consumed while accessing the stack and the extra energy consumed by the stack operations. The constraints ensure that the inserted operations maintain a valid state for the MM and the SPM substacks.
For that purpose, a history of the calls is required to determine which frames exist at each point of the program where new stack operations can be inserted. This history is built from the call and control flow graph (CCFG). This graph is partitioned into subgraphs we call sessions. Sessions are separated by the call instructions and are of two types: a store session is the subgraph of the CCFG starting just before a call instruction, and a load session is the subgraph of the CCFG starting just after a call instruction. By definition, a store operation can be inserted at the beginning of a store session and a load operation at the beginning of a load session.
The ILP formulation can then be built using the frames ($u$) and the sessions ($v$) for indexing the variables and the constants. The variables of the formulation control the insertion of the stack operations and are the following:
$x_{u,v}$: this binary variable is 1 when frame $u$ is fully in the SPM during session $v$;
$\mathit{xn}_{u,v}$: this integer variable is the number of bytes of frame $u$ in the SPM during session $v$;
$\mathit{sn}_{v}$: this integer variable is the number of bytes stored at the beginning of session $v$;
$\mathit{ln}_{v}$: this integer variable is the number of bytes loaded at the beginning of session $v$.
The parameters used for the formulation include characteristics of the processor and metrics extracted from the application code and from profiling information. In the formulation, these parameters are represented by the following constants:
$S_{\mathrm{stack}}$: the maximum size of the SPM substack;
$\mathit{Cspm}_{u,v}$: the energy cost of the total accesses to frame $u$ during all executions of session $v$ if it is in the SPM;
$\mathit{Cmm}_{u,v}$: the energy cost of the total accesses to frame $u$ during all executions of session $v$ if it is in the MM;
$S_{u}$: the size of frame $u$;
$C_{\mathrm{st}}$: the energy cost of storing one byte;
$C_{\mathrm{ld}}$: the energy cost of loading one byte;
$\mathit{Ne}_{v}$: the number of executions of session $v$;
$A_{u,v}$: indicates whether frame $u$ is accessed during session $v$;
$\mathit{Forb}_{v}$: indicates whether the load or the store operation of session $v$ is forbidden;
$\mathit{Sym}_{v}$: indicates whether the load operation of session $v$ must be symmetric with the corresponding store operation.
The objective function is the sum of two parts: the first one, denoted $\mathit{obj}_{\mathrm{stack}}$, represents the energy consumed while accessing the stack, and the second one, denoted $\mathit{obj}_{\mathrm{ops}}$, represents the energy consumed by the stack operations inserted into the assembly code for managing the stack between the SPM and the MM.
The first part of the objective function is the sum of the energy consumed by the application during each of its sessions when accessing the stack. For each frame, the energy consumed depends on whether it is assigned to the MM or to the SPM. The $x_{u,v}$ variables are used for selecting the cost corresponding to the memory used, and $\mathit{obj}_{\mathrm{stack}}$ is then computed as follows:

$$\mathit{obj}_{\mathrm{stack}} = \sum_{u}\sum_{v} \left(\mathit{Cspm}_{u,v} - \mathit{Cmm}_{u,v}\right) \cdot x_{u,v}. \qquad (2)$$

In the above equation, in order to keep the linearity of the objective, only the difference in energy consumption between the accesses to the MM and the SPM is actually represented.
The second part of the objective function is the sum of the store and the load costs (the warp and unwarp operations are neglected in this paper; please refer to [24] for details about how to take them into account):

$$\mathit{obj}_{\mathrm{ops}} = \sum_{v} \mathit{Ne}_{v} \cdot \left(C_{\mathrm{st}} \cdot \mathit{sn}_{v} + C_{\mathrm{ld}} \cdot \mathit{ln}_{v}\right). \qquad (3)$$

Since the size of the SPM substack is bounded, constraints are required for limiting the assignment of frames to it. There is one such constraint per session:

$$\forall v:\quad \sum_{u} \mathit{xn}_{u,v} \le S_{\mathrm{stack}}. \qquad (4)$$

When accessed during a session $v$, a frame $u$ is required to be fully in the MM or fully in the SPM, which means that $\mathit{xn}_{u,v}$ must be either 0 or the size of the frame. This is ensured by the following constraint:

[Equation (5)]

The load and store operations modify the size of the part of the current frame that is in the SPM; hence, they are constrained as follows, where Store and Load are, respectively, the sets of the store and the load sessions and Prev($v$) is the set of the sessions preceding $v$:

[Equation (6)]

These operations are not always possible: the validity of the assembly update can forbid them or force them to be symmetric. Both kinds of limitations are converted to the following respective constraints, where Call($v$) is the session from which the function of frame $u$ has been called:

[Equation (7)]
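To make the role of the energy-difference objective and of the per-session size constraint concrete, the following sketch solves a deliberately simplified instance by exhaustive search instead of an ILP solver. The simplifications are assumptions of this sketch and not part of the formulation above: each frame stays in one memory for the whole run (no store or load operations), and all names, sizes, and costs are hypothetical.

```python
from itertools import product


def best_static_assignment(sizes, live, c_spm, c_mm, s_stack):
    """Choose x[u] in {0, 1} minimizing sum (Cspm - Cmm) * x[u] over live frames,
    such that the frames placed in the SPM fit into s_stack in every session."""
    frames = sorted(sizes)
    best_x, best_cost = None, None
    for bits in product((0, 1), repeat=len(frames)):
        x = dict(zip(frames, bits))
        # Size constraint: one check per session, over the frames alive in it.
        if any(sum(sizes[u] * x[u] for u in alive) > s_stack for alive in live.values()):
            continue
        # Only the SPM/MM energy difference is accumulated.
        cost = sum(
            (c_spm[u, v] - c_mm[u, v]) * x[u]
            for v, alive in live.items()
            for u in alive
        )
        if best_cost is None or cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost


# Hypothetical instance: three frames, three sessions, 128-byte SPM substack.
sizes = {"f1": 64, "f2": 96, "f3": 64}
live = {"v0": ["f1"], "v1": ["f1", "f2"], "v2": ["f1", "f3"]}
c_mm = {("f1", "v0"): 100, ("f1", "v1"): 10, ("f2", "v1"): 300,
        ("f1", "v2"): 10, ("f3", "v2"): 50}
c_spm = {("f1", "v0"): 20, ("f1", "v1"): 2, ("f2", "v1"): 60,
         ("f1", "v2"): 2, ("f3", "v2"): 10}

assignment, cost = best_static_assignment(sizes, live, c_spm, c_mm, s_stack=128)
print(assignment, cost)
```

In this instance, `f2` and `f3` can both be placed in the SPM although their sizes sum to more than 128 bytes, because they are never alive in the same session. Exhaustive search is exponential in the number of frames and is used here only to keep the sketch self-contained.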

Code and Static Data Placement with SPM Sharing among Several Tasks.
In multitask environments, the SPM is to be shared among several tasks when placing their memory objects. The sharing can be done along two dimensions [35,37]: spatial and temporal. With spatial sharing, each task is provided with exclusive access to a part of the SPM for placing its memory objects. This sharing does not require any run-time processing, but each task has access to only a small part of the SPM. Figure 9(a) illustrates this sharing: whatever the executed task may be (i.e., $t_0$, $t_1$, $t_2$, or $t_3$), the assignments to the SPM space do not change. On the contrary, temporal sharing provides each task with the totality of the SPM space, as seen in Figure 9(b). Consequently, this second sharing dimension requires changing the content of the SPM at each context switch: when a task is preempted, this content must first be saved to the MM (if it has been modified), and then the content corresponding to the next active task must be restored from the MM to the SPM. These copies require time and consume energy, which reduces the efficiency of this sharing. The works in [35,37] presented hybrid approaches which use both sharing dimensions in order to achieve better results. Both approaches reserve a part of the SPM for the spatial sharing and the rest for the temporal sharing. In [26] we proposed an approach which also uses these dimensions but in a more uniform fashion, since the full space of the SPM is used for both. It is this approach which is described in this section.

Assumptions.
As there is one code SPM and one data SPM in the target architecture, the technique is applied twice, once for the code and once for the data. In addition, when a task is preempted, the SPM management needs to save the parts of its memory objects assigned to the SPM which will be overwritten by other tasks only if they have been modified. For the code regions, it is assumed here that they are never modified (which is often the case in practice for embedded systems). For the data regions, it is difficult to know whether they have been modified or not. Hence, it is assumed here that they are always modified.

Overview of the Technique.
The proposed technique assigns to each task a block in the SPM for placing its memory objects (code or data depending on the target SPM). Blocks can overlap, but the overlap regions must then be updated at context switches. In order to reduce the cost of such updates, the addresses of the blocks within the SPM are computed so as to reduce the overlaps among the blocks of tasks which often interfere. Figure 10 shows how the addresses of the blocks change the sizes of the overlaps between them for the case of two tasks. In Figure 10(a), both blocks have the same address and the overlap region is the totality of block $\beta_1$, whereas in Figure 10(b) the overlap is reduced to 128 bytes.
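The effect of the block addresses on the overlap size can be checked with a few lines of code. The sketch below is only illustrative: the half-open interval representation of a block and the block sizes (512 and 384 bytes) are assumptions chosen so that the second placement yields a 128-byte overlap; they are not taken from Figure 10.

```python
def overlap_size(b0, e0, b1, e1):
    """Size in bytes of the overlap between blocks [b0, e0) and [b1, e1)."""
    return max(0, min(e0, e1) - max(b0, b1))


# Hypothetical blocks: beta_0 is 512 bytes long, beta_1 is 384 bytes long.
same_address = overlap_size(0, 512, 0, 384)   # beta_1 is entirely overlapped
shifted = overlap_size(0, 512, 384, 768)      # beta_1 moved up by 384 bytes
print(same_address, shifted)
```

With the same starting address the whole of the smaller block is overlapped, whereas shifting it leaves only the last 128 bytes of the larger block shared.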

Importance of the Scheduling.
The scheduling is an important parameter for reducing the number of copies between the SPM and the MM during the context switches. For instance, if a task $t_0$ is never active while another task $t_1$ is ready, their respective blocks can overlap without any copy being required during the context switches for updating the content of the SPM. However, if $t_0$ can be active while $t_1$ is ready, the overlap region between its block and that of $t_1$ will have to be saved to (if data) and restored from the MM. In order to take the scheduling into account, the following is defined.
Preemption pattern: it is the set of the tasks executed between two consecutive active states of a ready task, called the reference task. Figure 11 gives an example of scheduling, and the corresponding preemption patterns are given with their numbers of occurrences in Table 1. In the figure, the up arrows represent the firing of a task and the down ones represent preemptions.
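As an illustration of this definition, the following sketch extracts preemption patterns and their numbers of occurrences from an execution trace. The trace format (a list of executed segments, each flagged as completed or preempted), the function name, and the example schedule are assumptions made for the example; they do not reproduce Figure 11 or Table 1.

```python
from collections import Counter


def preemption_patterns(trace):
    """Count preemption patterns in a trace of (task, completed) segments.

    `completed` is True when the task finishes at the end of the segment and
    False when it is preempted, in which case it stays ready.
    Returns a Counter keyed by (reference task, frozenset of pattern tasks).
    """
    open_patterns = {}  # preempted reference task -> tasks executed since then
    counts = Counter()
    for task, completed in trace:
        if task in open_patterns:
            # The reference task is active again: its pattern is complete.
            counts[(task, frozenset(open_patterns.pop(task)))] += 1
        # The running task belongs to the pattern of every waiting reference.
        for executed in open_patterns.values():
            executed.add(task)
        if not completed:
            open_patterns[task] = set()
    return counts


# Hypothetical schedule: t2 is preempted by t1, which is preempted by t0.
trace = [("t2", False), ("t1", False), ("t0", True), ("t1", True), ("t2", True)]
patterns = preemption_patterns(trace)
```

Here the pattern of reference task t1 is {t0}, and the pattern of reference task t2 is {t0, t1}, each occurring once.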
During the execution of a given preemption pattern, the parts of the reference task's block to save (if data) and replace are the ones which overlap with the blocks of the tasks of the pattern. When several overlaps of a pattern intersect, the common parts must be counted only once in the corresponding context-switch update costs. Figure 12 illustrates such a case, where $\delta_{0,1}$ is the overlap region between $\beta_0$ and $\beta_1$ but is also included in $\delta_{0,2}$, which is the overlap region between $\beta_0$ and $\beta_2$.
As an overlap region needs to be copied to the MM only once for a given pattern occurrence, $\delta_{0,1}$ should not be counted when computing the context-switch cost. For that purpose we define the following.
Effective overlap: it is an overlap region whose context-switch cost is counted for a given scheduling pattern.
For a given pattern, the effective overlaps can be considered as the overlap regions which are not included in other ones obtained from other executed tasks in the pattern, as shown in [26]. However, parts of overlaps can still be counted more than once. For simplicity, these remaining overcounts are neglected in this paper, but [26] does take them into account.
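The inclusion rule for effective overlaps can be sketched as follows. This is a minimal illustration, assuming blocks are given as half-open (begin, end) address intervals; the function names and the addresses are hypothetical and are not those of Figure 12.

```python
def overlap_region(ref, other):
    """Overlap of two (begin, end) half-open blocks, or None if disjoint."""
    begin, end = max(ref[0], other[0]), min(ref[1], other[1])
    return (begin, end) if begin < end else None


def effective_overlaps(ref_block, pattern_blocks):
    """Overlap regions of the reference block not included in another one."""
    # A set keeps a single copy when several overlaps are identical.
    regions = {overlap_region(ref_block, blk) for blk in pattern_blocks}
    regions.discard(None)
    return sorted(
        r for r in regions
        if not any(o != r and o[0] <= r[0] and r[1] <= o[1] for o in regions)
    )


def context_switch_bytes(ref_block, pattern_blocks):
    """Bytes counted for one occurrence of the pattern."""
    return sum(e - b for b, e in effective_overlaps(ref_block, pattern_blocks))


beta0, beta1, beta2 = (0, 512), (384, 640), (256, 768)
nested = context_switch_bytes(beta0, [beta1, beta2])           # delta_0,1 dropped
partial = context_switch_bytes((0, 512), [(0, 300), (200, 512)])
```

In the first case only the larger overlap is counted. In the second case neither overlap contains the other, so their common 100 bytes are counted twice, which is the kind of residual overcount that the text chooses to neglect.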

Memory Objects Placement and Task's Blocks Computation at Compile Time.
An ILP whose objective models the energy consumption related to the accesses to the static memory objects is used for selecting, for each task, which memory objects are to be put into the SPM and for selecting the address of its SPM block. The variables and the constants of the formulation are indexed with $i$ for the preemption patterns, $j$ for the tasks and their respective blocks, and $k$ for the memory objects. The variables are the following:
$x_{j,k}$: this binary variable is 1 when memory object $k$ of task $j$ is in the SPM when this task is active;
$b_j$: this integer variable gives the starting ("begin") address of block $j$;
$e_j$: this integer variable gives the ending ("end") address of block $j$;
$o_{j,j'}$: this integer variable is the size in bytes of the overlap between blocks $j$ and $j'$;
$osel_{j,j'}$: this binary variable is used for selecting the constraint defining the overlap $o_{j,j'}$;
$oe_{i,j,j'}$: this integer variable is the size in bytes of the effective overlap between blocks $j$ and $j'$ for pattern $i$;
$omin_{i,j,j'}$: this binary variable is used for selecting the constraint defining the effective overlap $oe_{i,j,j'}$.
The parameters used for the formulation include characteristics of the processor and metrics extracted from the application code and from the profiling information. In the formulation, these parameters are represented by the following constants:
$Sspm$: it is the total size of the (code or data) SPM;
$R_i$: it is the total number of occurrences of pattern $i$;
$Cspm_{j,k}$: it is the energy cost of the total accesses to memory object $k$ of task $j$ if the object is in the SPM. This cost also includes the copy required when a task is fired;
$Cmm_{j,k}$: it is the energy cost of the total accesses to memory object $k$ of task $j$ if the object is in the MM (which can be cached or not);
$S_k$: it is the size of memory object $k$;
$Ccxt$: it is the average energy spent in context switches for one byte of data or code. When the target is the data SPM, this cost is the average energy required for saving and restoring one byte of data, and when the target is the code SPM, this cost is the one required for restoring one byte of code.
The objective function is the sum of two parts: the first one, denoted $obj_{tasks}$, represents the energy consumed by the tasks while accessing the memories. The second one, denoted $obj_{cxt}$, represents the energy consumed while updating the SPM during the context switches.
The first part of the objective function is the sum of the energy consumed by each task while accessing its memory objects. For each memory object, the energy consumed depends on whether it is assigned to the MM or to the SPM. The $x_{j,k}$ variables are used for selecting the cost corresponding to the used memory, and $obj_{tasks}$ is then computed as follows:
\[ obj_{tasks} = \sum_{j} \sum_{k} x_{j,k} \cdot \left( Cspm_{j,k} - Cmm_{j,k} \right). \tag{8} \]
In the above equation, in order to keep the linearity of the objective, only the difference in energy consumption between the accesses to the MM and the SPM is actually represented.
The second part of the objective function is the sum of the energy consumed by each scheduling pattern for saving and restoring parts of blocks, weighted by their corresponding numbers of occurrences. For that purpose, the effective overlaps $oe_{i,j,j'}$ are used:
\[ obj_{cxt} = \sum_{i} R_i \cdot Ccxt \cdot \sum_{j,j'} oe_{i,j,j'}. \tag{9} \]
The first constraints ensure that each block fits individually into the SPM. For that purpose, the size of a block is computed as the sum of the sizes of the memory objects of the corresponding task which are assigned to the SPM:
\[ \forall j: \quad e_j - b_j = \sum_{k} x_{j,k} \cdot S_k. \tag{10} \]
Finally, a block is forced to fit into the SPM by bounding its upper address as follows (since LP-solver variables are nonnegative by default, it is not necessary to bound $b_j$):
\[ \forall j: \quad e_j < Sspm. \tag{11} \]
Additional constraints are required in order to compute the effective overlaps. First of all, the (noneffective) overlap between two blocks $j$ and $j'$ is computed by considering the four cases of their relative positions when intersecting, that is, $j$ included in $j'$, the ending address of $j$ included in $j'$, the ending address of $j'$ included in $j$, and $j'$ included in $j$. The cases are selected through the four mutually exclusive binary variables $osel_{j,j}$, $osel_{j,j'}$, $osel_{j',j}$, and $osel_{j',j'}$ as follows:
In the above equations, the constants $M_{*,*}$ are used as "big Ms": they have to be large enough to ensure that the right-hand side of each equation is negative when the corresponding $osel_{*,*}$ variable is 0. When there is no overlap, these equations are still valid since then either $e_j - b_{j'}$ or $e_{j'} - b_j$ is negative, so that $o_{j,j'}$ will be set to 0.
These overlaps are used for computing the effective overlaps as described previously. The formulation of each effective overlap $oe_{i,j,j'}$ is expressed using a binary variable $omin_{i,j,j'}$ which is 1 when $o_{j,j'}$ is smaller than $o_{j'',j'}$ and $o_{j,j''}$ and therefore included in them [26] (constants $Mmin_{*,*,*}$ and $Me_{*,*}$ are also "big Ms"):
\[
\begin{aligned}
\left(1 - omin_{i,j,j'}\right) \cdot Mmin_{j,j',j''} &\geq o_{j,j'} - o_{j,j''} + \epsilon_{j,j'} - \epsilon_{j,j''}, \\
\left(1 - omin_{i,j,j'}\right) \cdot Mmin_{j,j',j''} &\geq o_{j,j'} - o_{j'',j'} + \epsilon_{j,j'} - \epsilon_{j'',j'}, \\
oe_{i,j,j'} &\geq o_{j,j'} - Me_{j,j'} \cdot omin_{i,j,j'}.
\end{aligned}
\tag{13}
\]
In the previous equations, the constants $\epsilon_{*,*}$ ensure that at least one effective overlap is not set to 0 when several overlaps are equal. These positive constants are strictly smaller than 1 and are all different from one another. This artificially forces the differences of overlaps in (13) to be different from 0.
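To make the roles of the variables and constants concrete, the sketch below evaluates a candidate solution outside any solver. It assumes that $obj_{tasks}$ sums $x_{j,k} \cdot (Cspm_{j,k} - Cmm_{j,k})$ and that $obj_{cxt}$ weights each effective overlap by $R_i$ and $Ccxt$, which is a reading of the text rather than a transcription of the published equations. The dictionary layout, the function names, and all numeric values are hypothetical.

```python
def obj_tasks(x, c_spm, c_mm):
    """Access-energy term: only the SPM-minus-MM difference is counted."""
    return sum(x[j][k] * (c_spm[j][k] - c_mm[j][k]) for j in x for k in x[j])


def obj_cxt(oe, occurrences, c_cxt):
    """Context-switch term; oe maps (i, j, j2) to effective-overlap bytes."""
    return sum(occurrences[i] * c_cxt * size for (i, _j, _j2), size in oe.items())


def blocks_are_valid(b, e, x, sizes, s_spm):
    """Check the block-size constraint (10) and the SPM bound (11)."""
    for j in x:
        if e[j] - b[j] != sum(x[j][k] * sizes[j][k] for k in x[j]):
            return False  # block size differs from the objects placed in it
        if b[j] < 0 or not e[j] < s_spm:
            return False  # block does not fit into the SPM
    return True


# Hypothetical instance: two tasks, arbitrary energy units.
x = {0: {"a": 1, "b": 0}, 1: {"c": 1}}
c_spm = {0: {"a": 10, "b": 20}, 1: {"c": 5}}
c_mm = {0: {"a": 40, "b": 50}, 1: {"c": 30}}
sizes = {0: {"a": 256, "b": 512}, 1: {"c": 384}}
b, e = {0: 0, 1: 128}, {0: 256, 1: 512}
oe = {(0, 0, 1): 128}        # pattern 0: blocks 0 and 1 share 128 bytes
occurrences = {0: 3}

total = obj_tasks(x, c_spm, c_mm) + obj_cxt(oe, occurrences, c_cxt=2)
valid = blocks_are_valid(b, e, x, sizes, s_spm=1024)
```

An ILP solver would search over $x$, $b$, and $e$ to minimize this total; the sketch only shows how one candidate is scored and checked.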

Run-Time Management of the SPM.
The content of the SPM must be updated at each context switch. For that purpose, a function is added to the context-switch procedure of the operating system. This function uses two tables for determining which regions of the SPM are to be saved (if data) and replaced. These tables describe the content of the SPM partitioned into the set of the areas which are delimited by the starting and ending addresses of the blocks. Figure 13 gives an example of such a partition, where the blocks are denoted $\beta_*$ and the areas $\alpha_*$.
By definition, areas are contiguous, do not overlap, and each block is a set of consecutive areas. The first table, fixed at compile time, is indexed by the blocks, and its two entries are the first and the last of the areas which are covered by the block. The second table evolves at run time and is indexed by the areas. Its entries are, for each area, its start address in the SPM, its size, and the block currently occupying it. At each context switch, the SPM update function iterates over the areas (fetched from the first table) of the next active task. For each of these areas, the second table is looked up to determine which block is occupying it. If it is a block different from the one of the next active task, the area is saved to the MM (if data) and the corresponding content of the new block is copied back to the SPM.
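The two-table update described above can be sketched as follows. This is a minimal model, not the operating-system code used in the experiments: the `Area` record, the `update_spm` function, the `save` and `restore` callbacks, and the three-area partition are all hypothetical names and values introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Area:
    start: int               # start address in the SPM
    size: int                # size in bytes
    occupant: Optional[int]  # block currently occupying the area, None if empty


def update_spm(next_block, block_areas, areas, is_data, save, restore):
    """Make the SPM hold next_block; return the number of bytes restored."""
    first, last = block_areas[next_block]       # first table (compile time)
    restored = 0
    for index in range(first, last + 1):
        area = areas[index]                     # second table (run time)
        if area.occupant == next_block:
            continue                            # content is already in place
        if is_data and area.occupant is not None:
            save(area.occupant, area.start, area.size)    # write back to the MM
        restore(next_block, area.start, area.size)        # copy from the MM
        area.occupant = next_block
        restored += area.size
    return restored


# Hypothetical partition: three areas, two blocks sharing the middle one.
areas = [Area(0, 128, 0), Area(128, 128, 0), Area(256, 256, None)]
block_areas = {0: (0, 1), 1: (1, 2)}
log = []
restored = update_spm(
    1, block_areas, areas, True,
    save=lambda blk, start, size: log.append(("save", blk, start, size)),
    restore=lambda blk, start, size: log.append(("restore", blk, start, size)),
)
```

Only the shared middle area is saved, since the last area was empty, and calling the function again for the same block copies nothing because every area already belongs to it.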

SPM with Code, Static, and Stack Data in Multitask Systems.
Since the stacks are dynamic memory objects, they cannot be handled natively by the proposed technique for sharing the SPM among several tasks (nor, to our knowledge, by any of the existing sharing techniques). Yet, if the maximum size of the stacks is known at compile time, they can still be considered as usual static objects whose size is this maximum.
The same can be done with the proposed stack management technique by considering each SPM substack as a static memory object whose access cost is the resulting value of the corresponding objective function and whose size is chosen small enough to leave room in the SPM for the other data memory objects. However, the efficiency of the stack management depends highly on the size of the SPM substack, as can be seen from the results of the next section. Fortunately, solving the ILP formulation for this technique proved to be very fast (less than one second for all the applications of our experiments) and reasonably fast for the SPM sharing technique (15 seconds were required for the worst cases with [51]). But the number of iterations to perform is exponential in the number of sizes for the SPM stacks. For instance, with five SPM substack sizes and nine tasks, $9^5$ iterations are required. It is difficult in practice to afford such a huge number of iterations of both techniques for finding the best sizes for each SPM substack. Yet, since it is affordable to iterate several times with the stack technique alone, we propose to extend the SPM sharing technique by adding several substacks of different sizes per task. Each of these substacks is considered as a static memory object whose access cost is the value of the objective function of the respective stack management's ILP formulation. Equation (8) is therefore modified as follows:
\[ obj_{tasks} = \sum_{j} \left( \sum_{k} x_{j,k} \cdot \left( Cspm_{j,k} - Cmm_{j,k} \right) + \sum_{m} x_{stk,j,m} \cdot C_{stk,m} \right). \]
In the equation, $m$ is the index for the SPM substack case and $C_{stk,m}$ is its corresponding energy cost. Besides, variable $x_{stk,j,m}$ is 1 when the SPM substack case $m$ is used and 0 otherwise.
As one and only one SPM substack is required for each task, the following new constraints are added:
\[ \forall j: \quad \sum_{m} x_{stk,j,m} = 1. \]
Finally, the sizes of the substacks must also be taken into account and (10) becomes
\[ \forall j: \quad e_j - b_j = \sum_{k} x_{j,k} \cdot S_k + \sum_{m} S_{stk,m} \cdot x_{stk,j,m}. \]
As a result, the solution of this new formulation includes, for the content of each block, the selected SPM substack in addition to the static memory objects. While the number of variables increases, the constraints of (16) are efficiently used by LP-solvers so that the solving time does not increase significantly: it took 20 seconds for our LP-solver [51] to solve the problem with the largest task set of our experiments.

Experimental Results.
We applied our technique to a multiperformance processor (based on the MeP) which includes an 8 KB data SPM, an 8 KB code SPM, and a 4-way instruction cache. Energy characteristics of this architecture are given in Table 2.
We used Toshiba's MeP Integrator (MPI) tool chain [52] for compiling and simulating the applications. Compilations were performed with the -O2 level of optimization. Executions were performed for a static-priority, rate-monotonic preemptive system with a processor utilization of about 60%. Both the multitask and the stack techniques were applied to the tasks of the sets given in Table 3. The tasks come from the EEMBC [53] and the MiBench [25] benchmarks.
Figure 14 gives the results of the proposed stack management technique (ours) applied to monotask cases, compared to a management which neither stores nor loads frames (static) [42] and to one which uses a circular buffer of frames (circular) [40]. As ours subsumes static and circular, it is not surprising that it always achieves better results than them. On average, for the respective 1/3 and 2/3 SPM substack sizes, static achieves about 39% and 84% of stack-related energy consumption reduction, circular about 41% and 74%, and ours does better in both cases with about 49% and 85%.
The variable level of granularity mentioned in [41] is complex to formulate and implement in practice, but it can be qualitatively compared to our technique. The low granularity of the frame level is usually compensated by run-time partial loads and stores whose overhead compares favorably with the variable-level overhead for accessing the dispatched variables of a frame. However, for the cases where a frame is larger than the remaining SPM space, even after store operations, there would indeed be more energy reduction if the often accessed variables of such a frame could be separated from the others at reasonable cost.
Figure 15 gives the results obtained when sharing the data SPM among the tasks of each set of Table 3 with the technique proposed in this paper (block) and with state-of-the-art techniques including a fully off-line spatial sharing one (space), a temporal sharing one (time) (both are described in [35]), and a hybrid spatial and temporal one (hybrid) [37]. The figure also includes the results obtained when the sizes of the SPM substacks are selected by the merging technique presented in Section 3.4 (merged). For fairness of the comparison, the techniques other than merged also include the stacks, but considered as static memory objects whose sizes are the corresponding maximum stack requirements. For the case of a 1 KB SPM, on average, space achieves about 40% of energy consumption reduction, time about 45%, hybrid about 45%, block about 46%, and merged does much better with about 69%. If the SPM size is 4 KB, these techniques achieve, respectively, about 72%, 79%, 81%, 83%, and 84% of average energy reduction. In all the cases, merged performs at least as well as block, block at least as well as hybrid, and hybrid at least as well as time and space. The systematic advantage of hybrid, block, and merged over space and time is due to the fact that the two latter are part of the solutions explored by the three former techniques. Moreover, these three techniques also include the overhead in their ILP formulations so that it remains under control.
Figure 16 gives the results of the same experiments for the code SPM (the MM is then cached), with space, time, hybrid, and block (for the code, merged is identical to block). As the cache consumes much less energy than the external memory, more SPM is required for achieving a significant energy consumption reduction. For the case of a 4 KB SPM, on average, space achieves about 16% of energy consumption reduction, time about 42%, hybrid about 42%, and block about 43%. For the case of an 8 KB SPM, these techniques achieve, respectively, about 26%, 60%, 60%, and 63%.
Although it is not the focus of the paper, the execution speed also benefits from the proposed techniques because accessing the MM requires much more time than accessing the SPM. With the processor configuration used for the experiments, accessing the MM requires about 20 cycles on average and there is no data cache. Therefore, the speedup related to the data memory accesses is significant. For instance, when the SPM size is 2/3 of what the stack requires, the proposed stack management technique (ours) achieves 77% of speedup related to the memory accesses (including the overhead of the management). Similarly, the proposed technique for sharing the SPM among tasks (block) achieves about 80% of data-access-related speedup (including the overhead) with a 4 KB SPM. For the code

Conclusions
The paper presented our recent research activities and results on characterizing and reducing the energy consumption in embedded systems. Our main focus is on software-directed approaches for estimating and reducing the energy consumption of embedded real-time systems. As the demands of system integration, performance, and low-power operation have pushed chip vendors down to 65 nm or below, NRE (nonrecurring engineering) costs and design complexity have increased significantly. A remedy for the NRE explosion is to reduce the number of developments and sell tens of millions of chips under a fixed hardware design. In such a situation, embedded software plays a much more important role than today. This paper covered our approach for fast and accurate energy estimation of software on a given processor system and software-directed techniques for reducing the energy required for accessing the memory subsystems in the embedded processor.
As future work, we also plan to place parts of the heap into the SPM for achieving more energy consumption reduction. We also plan to refine the stack approach proposed in this paper with a grain finer than the frame level while still being implementable in practice.

Figure 1: A target processor system model.

Figure 6: Results for MPEG2 encoder run on MeP.

3.1. Related Work. Several works use the SPM for reducing the energy consumption of memory accesses.

Figure 10: Effect of the blocks' address on overlaps.

Table 1: Example of preemption patterns.

Table 2: Average energy consumptions for the multiperformance processor.

Table 3: Task sets used for the experiments.