An Instruction-Level Power Analysis Model with Data Dependency

Power constraints are becoming a critical design issue in the field of portable microprocessor systems. The impact of software on overall system power is becoming increasingly important as more and more digital applications are implemented as embedded systems, partly in hardware (ASICs) and partly in software, in which a specific application is executed on a processor. In this paper a data-dependent instruction-level power analysis model is presented and compared with the average cost model proposed by Tiwari et al. [1] in terms of both estimation accuracy and characterisation time. The data-dependent model can be generalised for application to generic RISC processors. Application of the data-dependent model we propose considerably reduces the error in estimating software power consumption per clock cycle: in the case of the ST20-C2P core it remained lower than 10%.


INTRODUCTION
The spread of portable microprocessor systems (mobile phones, PDAs, MP3 players) that use a battery as a source of energy is making power consumption a key issue. Power consumption constraints are now system design specifications as important as memory area and performance. The reasons for this sudden interest are not only the evolution of technological processes (which makes it possible to integrate millions of transistors on a 1 cm² die) but also the development of the System-On-Chip concept, whereby a processing system with all its peripherals can be implemented on a single integrated circuit, thus allowing us to hold in the palm of our hand what up to a few years ago took up a whole desk. Competition on the portable electronics market concerns three different aspects: autonomy, functionality and volume. Only an optimal combination of these three indexes is the winning solution. This is not a simple task, as the three variables are closely correlated and enhancing one is often to the detriment of another. If, for example, the aim is to increase autonomy, one possibility is to use a larger battery, but this increases the volume; to increase functionality, more complex software is needed (the functions of an embedded system are usually implemented by the software): this generates a greater processing workload, which means more power consumption and consequently less autonomy.
Current research in the field of low-power devices focuses on two issues: methods to design low-consumption architectures, and the definition of techniques to analyse power consumption right from the earliest design stages. The former include low-power synthesis techniques [5, 6], which map a technology-independent circuit onto a technology-dependent one using a power metric to choose gates from a library; power management techniques applied to the individual blocks of a circuit, which allow blocks whose result will not be used to be turned off (clock gating, operand isolation, etc.) [7-10]; and encode/decode techniques to minimise switching on high-capacitance buses (e.g., those interfacing the memory) [13-19]. Among the latter we have methods operating at a higher level of abstraction which search through all possible system configurations and choose the one that optimises a function usually depending on the variables area, performance and power. In [21, 22] the possible configurations of a reference system comprising a CPU, a hierarchy of memories and a certain number of ASICs and peripherals are explored for varying memory hierarchy configurations (size, associativity, cache block size) and bus features (number of lines, encoding techniques). In [23, 24] the architecture of more complex, highly parametrised systems is explored, defining a wide range of possible configurations.
Optimisation in terms of power is only possible if estimation methods and power analysis tools are available to measure the impact of any modifications made. Power consumption analysis often focuses on the impact of the hardware components of a system, without taking into account the impact of software on overall system power. Software is, however, becoming increasingly important as more and more digital applications are implemented as embedded systems, partly in hardware (ASICs) and partly in software, in which a specific application is executed on a processor. Although it is clearly of fundamental importance to assess the impact of software on power consumption, most research efforts have focused on defining analysis methods at a low level of abstraction, i.e., at the circuit and gate level, and little has been done at a higher level, in particular the software level. Analysis at lower levels gives the best results in terms of accuracy, but it takes a very long time to estimate the consumption of a system on which a program is executed. Estimating the power consumption of a program with millions of instructions is not realistically possible with low-level analysis tools. A higher-level, i.e., software-level, methodology is less accurate but has the advantage of taking much less time.
An approach specifically defined to estimate the amount of power absorbed by a processor when it executes a software program was presented by Tiwari et al. in [1]; it consists of characterising the instruction set of a processor in terms of power by assigning an average cost to the various instructions. Once the instruction set has been characterised in terms of power, it is possible to calculate the amount of power absorbed by the processor when it executes any program with a known trace, by summing the power contributions of each instruction executed. This model, however, does not take into account the contribution of data to power consumption. While this contribution is low for highly complex processors, in the case of very simple processors (frequently used as the core of embedded systems) it may be significant, and considerable errors may be made in estimating power consumption.
In this paper we present a software power consumption analysis model that can capture its dependence not only on instructions but also on data. The cost model is characterised by means of a circuit-level analysis tool. Characterisation using very low-level analysis tools is the only type possible when the processor silicon is not available but only an HDL model or the netlist of the core, as happens in the design of embedded systems. Although the instruction set is characterised at a low level, this is not a limiting factor because it only needs to be done once. Once the characterisation stage has been completed, software power consumption can be estimated by executing the program on a simulator of the processor and summing the contributions made by the instructions and the data involved at each clock cycle. This means that it is not necessary to know the details of the processor architecture to estimate software power consumption, but it is necessary to have the power cost model provided by the manufacturer.
Application of the data-dependent model we propose considerably reduces the error in estimating software power consumption per clock cycle. In the case of the ST20-C2P core, which we took as a case study, the error per cycle remained lower than 10%, whereas the mean error over a whole test was below 5%.
In Section 2 we will describe the average cost model proposed by Tiwari et al. [1] and apply it to the ST20-C2P. As will be seen, the strong data-dependency prevents an instruction from being labelled with an average cost. To overcome this problem, in Section 3 we propose a data-dependent model which measures certain activity indexes to obtain more precise estimates for all architectures in which data makes a significant contribution to power consumption. In Section 4 we will discuss the strategy used to design a high-level tool to estimate software power consumption on the basis of a trace generated from a functional model of the processor. Finally, Section 5 will give our conclusions and discuss future developments.

AN AVERAGE MODEL
In [1] Tiwari et al. proposed a method for estimating the power absorbed by a processor executing any program, starting from a trace of the instructions it executes. The authors state that by measuring the power absorbed by a processor X repeating certain instructions or short sequences of instructions it is possible to obtain much of the information needed to calculate the power dissipated when the processor X executes the instructions of any program Y. Note that we will use the terms power and energy interchangeably, since for a processor power is just the energy per cycle times the clock frequency.
Given a processor, and indicating the code of a generic program as P, an estimate of the amount of power absorbed by the processor during the execution of P is given by the following equation:

$$E_P = \sum_i B_i \cdot N_i + \sum_{i,j} O_{i,j} \cdot N_{i,j} + \sum_k E_k \qquad (1)$$

where the basic cost, $B_i$, of the instruction $i$ (i.e., the average amount of power absorbed when all the stages of the pipeline are processing the instruction $i$) is weighted by the number of times, $N_i$, the instruction is executed. To this it is necessary to add the overhead, $O_{i,j}$, due to the change in system configuration (since the instruction $i$ is preceded by the instruction $j$), weighted by the number of times, $N_{i,j}$, this pair occurs during execution of the program. Finally, the power consumed by any stalls or cache misses that occur during execution of the program, $E_k$, is added. The terms $B_i$, $O_{i,j}$, $E_k$ refer to the processor being investigated and are determined in a preliminary characterisation stage in which the basic cost of each instruction in the instruction set and the cost of changing instructions for each pair of instructions (or each pair of classes of instructions) are defined. In Section 2.1.2 we will discuss the techniques used to determine these values.
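As an illustration, the following Python sketch applies Eq. (1) to an instruction trace. It is a minimal sketch of ours, not part of the original tool chain: the function name and the layout of the cost tables are assumptions, and the tables themselves would come from the characterisation stage described below.

    # Sketch of the ACM estimate of Eq. (1); costs in mA (average current per cycle).
    def acm_energy(trace, base, overhead, extra):
        """trace: list of executed opcodes; base[i]: basic cost B_i;
        overhead[(j, i)]: inter-instruction cost O_i,j for the pair (j, i);
        extra: costs E_k of stalls and cache misses."""
        energy = sum(base[i] for i in trace)                    # sum_i B_i * N_i
        energy += sum(overhead.get((j, i), 0.0)                 # sum_i,j O_i,j * N_i,j
                      for j, i in zip(trace, trace[1:]))
        energy += sum(extra)                                    # sum_k E_k
        return energy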
Model (1) was applied to three different processors [3]: an Intel 486DX2 (a CISC processor based on the x86 architecture) [1], a Fujitsu SPARClite MB86934 (a 32-bit RISC processor based on the SPARC architecture) [4] and a proprietary Fujitsu DSP [2]. For the first two the model was simplified, representing the contribution $O_{i,j}$ for any pair of instructions with a single constant term. This was because the overhead (with respect to the basic cost) of the power absorbed during execution of the instruction $i$ when it is preceded by the instruction $j$ is negligible compared with the average amount of power absorbed during the execution of a program. These contributions range from 5 to 30 mA, as compared with 300-400 mA during execution of a program on the Intel 486DX2, and are below 20 mA on the Fujitsu SPARClite, as compared with a range of 250-400 mA. With the DSP, on the other hand, the overhead is not negligible and has to be taken into account.

Application to the ST20-C2P
Equation (1) was adapted for application to the ST20-C2P, which is a core processor, i.e., a processor that does not exist as a single component but forms the core of a microprocessor-based embedded system. It is used in a range of applications from the cheapest portable embedded systems to more exacting applications with the performance requirements typical of DSPs [25]. It is a 32-bit processor, the control unit is micro-programmed, it has a two-stage pipeline (the first stage performing instruction fetch and decoding, the second execution and write back), and it is manufactured in HCMOS7 technology, 0.25 μm, 2.5 V. The analysis was made considering separately the two main blocks making up the CPU: the fetch unit and the execution unit. The fetch unit features an average power absorption per cycle of about 6.6 mA with an average deviation of ±2 mA. The execution unit absorbs on average 30.3 mA ± 15 mA. It was therefore decided to label the fetch unit with a fixed cost of 6.6 mA and to treat the execution unit separately, as it is the main power consumer.
In general, the average amount of power absorbed by a processor in a given clock cycle depends on the instructions processed by the various pipeline stages in that clock cycle. If we take the ST20-C2P as a reference and define the clock cycle in which the instruction $I_i$ is executed as the cycle in which $I_i$ is processed by the last stage in the pipeline, we can write:

$$Cost(I_i) = base(I_i) + (I_{i-1} \rightarrow I_i)_{back} + (I_i \rightarrow I_{i+1})_{forw} \qquad (2)$$

$Cost(I_i)$ indicates an estimate of the average amount of power absorbed during execution of the instruction $I_i$. The term $base(I_i)$ represents the basic cost of the instruction $I_i$, i.e., the average amount of power absorbed by the processor during execution of $I_i$ when the latter is preceded by the same instruction $I_i$ or, more generally, when all the stages in the pipeline are processing the instruction $I_i$. The terms $(I_{i-1} \rightarrow I_i)_{back}$ and $(I_i \rightarrow I_{i+1})_{forw}$, which we will call the backward and forward costs respectively, represent the inter-instruction contribution, i.e., an excess added to the basic cost due to the fact that the instruction currently being executed is preceded and/or followed by a different instruction. The former is the inter-instruction cost in the execution stage, while the latter is the inter-instruction cost in the fetch and decoding stage (see Section 3.4). We will refer to this model as the ACM (average cost model).
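For a single cycle, Eq. (2) reduces to a table lookup. The following sketch (our own; the table names are assumptions) makes the decomposition explicit:

    # Sketch of Eq. (2) for one clock cycle; table contents are illustrative.
    def acm_cost(prev, curr, nxt, base, back, forw):
        """prev/curr/nxt: instructions I_{i-1}, I_i, I_{i+1};
        base[I]: basic cost; back[(J, I)], forw[(I, K)]: inter-instruction costs."""
        return base[curr] + back.get((prev, curr), 0.0) + forw.get((curr, nxt), 0.0)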

Explanation of Inter-instruction Costs
When an instruction $I_i$ is preceded by an instruction $I_{i-1}$ ($I_i \neq I_{i-1}$) or followed by an instruction $I_{i+1}$ ($I_i \neq I_{i+1}$), its cost exceeds that of $base(I_i)$. This excess is called the inter-instruction cost.
It can be seen as the sum of two contributions: a backward cost and a forward cost. To see how much the inter-instruction cost affects the basic cost of an instruction, consider Table I. The first column shows the instruction executed, the second column the average amount of power absorbed by the execution unit, the third the average amount of power absorbed by the micro-controller block alone, and the fourth the percent increase in the power absorbed when an instruction is followed or preceded by a different instruction as compared to when it is preceded or followed by the same instruction. As can be seen, an stl [25] followed and preceded by the same stl causes an absorption of about 7 mA in the execution unit, 2.6 mA of which is absorbed by the micro-controller alone. The first stl preceded by an ldnlp, on the other hand, absorbs about 13 mA, causing an overhead of 85% as compared with the previous case, due to the change in configuration. The last stl followed by an add causes an absorption of about 32 mA, with an overhead of 353% as compared with the first case discussed. In this last case, the excess is all concentrated in the micro-controller block (86% of the power absorbed by the execution unit in this case is concentrated in the micro-controller). Later on it will be seen in the case study that the average amount of power absorbed by the execution unit during execution of an instruction greatly depends on the data it is processing. The values given in Table I are therefore not absolute values but can vary considerably along with variations in the data involved. The values given were measured in particular conditions of null activity (the concept of null activity will be dealt with in further detail below).
A great inter-instruction effect was also observed in [3] for a proprietary Fujitsu DSP, in which the overhead was in some cases over 25 mA when the variation of the power absorbed during execution of a program was distributed in a range between 20 and 60 mA. For more complex architectures like the Intel 486DX2 and the Fujitsu SPARClite, these effects were considered to be negligible as compared with the basic cost: in the worst case the excess was no higher than 10% of the basic cost [3, 4].
As in many microprocessors, in the ST20-C2P functioning at the circuit level is controlled by a micro-programmed sequence of instructions stored in a ROM in the CPU. Retrieval of the instructions in machine language and their decoding causes the execution of a micro-program. The latter controls all the activity of the signal lines and all the data transfers or operations in the CPU during execution of the instruction. By way of example, let us refer to the execution of three instructions, $I_{i-1}$, $I_i$, $I_{i+1}$, and consider the clock cycle in which the last stage is processing instruction $I_i$. In this case, $(I_{i-1} \rightarrow I_i)_{back}$ is due to the change in the configuration of the execution stage, which was processing instruction $I_{i-1}$ in the previous clock cycle. $(I_i \rightarrow I_{i+1})_{forw}$, on the other hand, is due to the change in the configuration of the fetch/decode stage, which was processing instruction $I_i$ during the last cycle, while it is currently processing instruction $I_{i+1}$.

Determination of Costs
Application of the model described by Eq. (2) requires the basic and inter-instruction costs to be determined. The basic cost of instruction $I_i$ is determined by constructing a test entailing repetition of instruction $I_i$, making it operate on data uniformly distributed within the admissible range of variation. To clarify this concept, let us assume that the instruction being investigated is Inst n, operating on an explicit 32-bit operand n. The test to calculate the cost base(Inst) will be a sequence of instances Inst $n_1$, Inst $n_2$, ..., Inst $n_n$, where the operands $n_i$ are uniformly distributed. Having obtained the amount of current absorbed by each instance of the instruction Inst n (say $c_1, c_2, \ldots, c_n$), we have:

$$base(Inst) = \frac{1}{n} \sum_{k=1}^{n} c_k$$

To determine the inter-instruction costs $(I_{i-1} \rightarrow I_i)_{back}$ and $(I_i \rightarrow I_{i+1})_{forw}$ it is sufficient to construct a test in which a sequence of $I_i$ is embedded between a pair of instructions $I_{i-1}$ and $I_{i+1}$. The first occurrence of $I_i$ will contribute towards determining the cost $(I_{i-1} \rightarrow I_i)_{back}$, while the last will determine the cost $(I_i \rightarrow I_{i+1})_{forw}$.
To make this even clearer, let us consider a code fragment in which the sequence $I_{i-1}, I_i, \ldots, I_i, I_{i+1}$ is repeated N times ($k \in \{1, 2, \ldots, N\}$), keeping the operands of $I_i$ uniformly distributed. Let $c_i^k$ be the power absorbed by the i-th instance of the sequence at the k-th repetition. We therefore define the backward cost as the average excess of the first instance over the basic cost, and the forward cost as the average excess of the last instance:

$$(I_{i-1} \rightarrow I_i)_{back} = \frac{1}{N} \sum_{k=1}^{N} c_1^k - base(I_i), \qquad (I_i \rightarrow I_{i+1})_{forw} = \frac{1}{N} \sum_{k=1}^{N} c_n^k - base(I_i)$$

A basic hypothesis for characterisation of the instruction set using the ACM is the possibility of creating sequences formed by the same instruction (the one to be characterised) operating on uniformly distributed data. In the architectures taken as references in [1-4, 11] most of the instructions have an explicit argument, in which case the test to determine the basic cost of a generic instruction Inst can simply be a repetition of Inst $n_i$, where the $n_i$ are uniformly distributed. The ST20-C2P has a load/store stack-based architecture with a stack comprising three registers. Most of the instructions work on the data stored in the registers, so it is necessary to alternate register loading with the repetition of the instructions. As the execution of an instruction causes stack rotation, making the last register indefinite, the maximum length of the sequence is three instances: the measurement is made during execution of the second instance, while the first and third instances isolate it from the inter-instruction effects. The test to determine the basic cost of instruction Inst is therefore structured as three register loads followed by three instances of Inst, with the measurement taken on the middle (framed) instance. In this case the amount of power absorbed has to be measured for the framed instruction alone, so we need measurement tools that make it possible to conduct an analysis at the instruction level. From an operational point of view, the use of a simulation tool provides a cycle-accurate analysis, but the simulation takes a long time. In this case, in fact, to determine the basic cost of an instruction it is necessary to take N (in [11], for example, N = 100) measures, which require the execution of N sequences of this kind. Characterisation of the model (i.e., determination of the basic and inter-instruction costs) requires a tool to measure the amount of power absorbed by the processor during execution of a software test. The approach adopted in [1-4] was to use an ammeter in series between the power supply pin of the processor and the power source. Although this method has the advantage of allowing measurements to be taken rapidly, it presents a number of problems: it supplies average information concerning a time window in which various instructions are executed, and it requires the availability of the processor as an integrated component. In many cases, as in the hardware/software codesign of an embedded system for example, the silicon of the core alone does not exist and the only way to measure the power absorbed by the core is to use a simulation tool for power analysis. As the ST20-C2P is a core processor, i.e., a processor that does not exist as a single component, a transistor-level simulator, PowerMill, was used as a measurement tool.
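The extraction of the inter-instruction costs from the measured per-instance currents can be sketched as follows. This is our own illustration; the data layout is an assumption, not the paper's tool chain.

    # Minimal sketch: extract backward/forward costs from measured currents.
    def characterise(measurements, base_i):
        """measurements: list of N repetitions, each a list [c_1, ..., c_n] of
        currents absorbed by the n instances of I_i embedded between I_{i-1}
        and I_{i+1}; base_i: basic cost base(I_i), measured separately."""
        n_rep = len(measurements)
        first = sum(rep[0] for rep in measurements) / n_rep    # E{c_1}
        last = sum(rep[-1] for rep in measurements) / n_rep    # E{c_n}
        back = first - base_i    # (I_{i-1} -> I_i)_back
        forw = last - base_i     # (I_i -> I_{i+1})_forw
        return back, forw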
Figure 1 shows the flow of operations performed to construct the database of basic and inter-instruction costs, to estimate the code and to make a comparison with the results of the electrical simulation. The program being analysed is used as input for VHDL simulation, to generate stimuli for simulation with PowerMill. The VHDL simulation also provides the signals needed to generate a cycle-accurate trace of the flow of instructions executed. The results of the PowerMill simulation are combined with the trace of the program to generate an instruction trace file including power information. This file is processed to extract the basic and inter-instruction costs, which are stored in a database. Together with the trace of the instructions executed, this represents the input for the estimation program, the results of which are compared with those obtained in the PowerMill simulation.

Results of Estimation
In [1] the ACM gave good results when used to estimate the power consumption of two commercially available processors, the Intel 486DX2 and the Fujitsu SPARClite 934, in which a low data-dependency was observed and the hypothesis of labelling instructions with an average value was confirmed. In [2] the same authors then applied the ACM to a DSP, noticing that in this case the data-dependency is greater and the inter-instruction effect more marked. These effects are always present, but their contribution in terms of the fraction of power absorbed grows in inverse proportion to the architectural complexity of the processor. In DSPs with simple architectures, these effects are much more evident. In [1] the range of variation of the basic cost as the data varied always remained below 5% for the Intel 486DX2. In [3], on the other hand, a greater variation was observed for a proprietary Fujitsu DSP, but it remained lower than 10%.
With the ST20-C2P core we observed that consumption strongly depends on the data. To highlight the contribution of data to power consumption, Figure 2 shows the average amount of power absorbed by the execution unit cycle by cycle during execution of a sequence of add instructions operating on null arguments (dashed-and-dotted curve) and on gradually increasing arguments (continuous curve). The average amount of power absorbed per cycle ranges between 7 and 20 mA, a band of 13 mA that corresponds to 185% of the minimum absorption, while in terms of power there is an increase of 78% entirely due to the effect of the data. Use of the ACM to estimate this code (dashed curve) would mean a mean error per cycle of 98% in the low-activity test estimate and 20% in the increasing-activity estimate.
In our case study, the data-dependency of power consumption is far from negligible. The range of variation of the amount of power absorbed during execution of an instruction with varying data is on average 150% more than the lower end of the range.

A DATA-DEPENDENT MODEL
In Section 2.1.3 it was shown that use of the ACM to estimate the average amount of power absorbed by an instruction leads to considerable error with simple processor architectures. It is therefore not appropriate to label an instruction with an average value: we need a model that is data-dependent as well [12]. In the following sections we will present a data-dependent estimation model as a generalisation of the ACM, simply separating the basic and inter-instruction costs from the contribution made by data. The model will be adapted in order to apply it to the ST20-C2P and will be validated by estimating some software tests and comparing the results with those obtained using the ACM.

Formulation
We will initially refer to the simple architecture of the core processor which represents our case study. Architectures of this kind are commonly used in embedded systems, in which memory area and power consumption constraints are more important than performance. The model will then be extended to consider RISC architectures with a greater pipeline depth.
Given the sequence of instructions $I_{i-1}$, $I_i$ and $I_{i+1}$, the average amount of power absorbed during execution of the instruction $I_i$ can be estimated using the following equation. Considering the power absorbed by the fetch unit constant ($base_{fetch}$), and indicating the inter-instruction costs in the execution and fetch stages as $(I_{i-1} \rightarrow I_i)^{(0)}_{back}$ and $(I_i \rightarrow I_{i+1})^{(0)}_{forw}$ respectively, we have:

$$Cost(I_i) = base^{(0)}(I_i) + (I_{i-1} \rightarrow I_i)^{(0)}_{back} + (I_i \rightarrow I_{i+1})^{(0)}_{forw} + f(data) \qquad (7)$$

The estimation function is seen as the sum of three contributions: one depending on the instruction being examined; the inter-instruction contribution (the sum of the effect of the change in configuration and of the determination of the subsequent state), depending on the instructions preceding and following the one being executed; and lastly the contribution made by the data. The three addends are filtered of their mutual effects: the basic and inter-instruction costs are determined in a state of null activity, and the contribution of the data is filtered of the basic and inter-instruction costs.

Basic Cost in a State of Null Activity
The meaning of the terms appearing in Eq. (7) is made clearer by Figure 3, which shows part of the datapath of a generic processor, comprising three components: the registers, the datapath buses and the combinational logic blocks. The first contain the data to be processed, while the second act as channels to transfer the data to the combinational blocks. Figure 3 shows N registers connected to the two datapath buses (xbus and ybus) by three-state buffers. xbus and ybus are connected to the ALU, the sub-blocks of which (adder, multiplier, shifter, comparator, etc.) perform arithmetical and logical functions on the data transported by xbus and ybus. All the blocks have the same inputs and all output a result. A multiplexer selects only the output of the correct block, loading the resbus with the result of the operation requested.
The basic cost of the instruction $I_i$ in a null-activity state, $base^{(0)}(I_i)$, is defined as the amount of power absorbed during execution of the instruction $I_i$ when it is preceded and followed by the same instruction $I_i$ and null activity is maintained on the buses and registers. As execution of an instruction translates into the appropriate mapping of the registers on xbus and ybus, by keeping the activity on the registers null, the activity on the buses is also null.
For the ST20-C2P the basic null-activity cost was around 7.2 mA per instruction. The reason for this behaviour is mainly the absence of operand isolation techniques: during execution of an instruction all the blocks are working and producing a result; the only difference between two different instructions is the selection of the output of one block rather than that of another. If, for example, we focus on the ALU block during execution of an add instruction, all the blocks (adder, comparator, shifter, etc.) receive the same data and all output a result, but only the adder output is selected. This accounts for the similarity between the basic costs. Table II gives the basic null-activity costs for 36 instructions. As expected, the null-activity basic costs for the various instructions are pretty much the same. The average deviation of the null-activity basic costs from the average value is less than 7%.

Null-activity Inter-instruction Cost
The cost $(I_{i-1} \rightarrow I_i)^{(0)}_{back}$ is defined as the overhead on the basic null-activity cost of instruction $I_i$ when the latter is preceded by the instruction $I_{i-1}$ and followed by the same instruction $I_i$, while activity on the registers and buses remains null. This excess is exclusively due to the effects of the change in configuration. Likewise, the cost $(I_i \rightarrow I_{i+1})^{(0)}_{forw}$ is defined as the overhead on the basic null-activity cost of the instruction $I_i$ when it is preceded by the same $I_i$ and followed by the instruction $I_{i+1}$, while activity on the registers and buses remains null. This excess is exclusively due to the micro-controller that determines the subsequent state or, more generally, to the decoding logic.
Tables III and IV give the backward and forward costs for a number of pairs of instructions. The variation of the backward costs with respect to the basic cost is on average less than 25%. For forward costs, on the other hand, the deviation is much greater: on average over 100%. Note also that the symmetry hypothesis put forward in [2] is not always valid in our case. The lack of symmetry in the forward costs is due to glitches that occur at the micro-controller ROM inputs for particular pairs of instructions. See, for example, the forward costs for the pairs (and, add), (and, sub) and (and, rev), where this asymmetry is quite evident.

Data Contribution
The term indicated in Eq. (7) as f(data) takes into account the contribution of data towards the amount of power absorbed during execution of the instruction $I_i$. Execution of this instruction means activating the appropriate combinational blocks, the power absorbed by which depends on how their respective inputs vary. Let us indicate by the term activity a metric whereby we can measure these variations. The average power absorbed in a cycle by the generic block $Block_j$ depends on the activity of its inputs:

$$Current_{Block_j} = function(activity_j) \qquad (8)$$

Let $Block_1, \ldots, Block_m$ and $activity_1, \ldots, activity_m$ respectively be the blocks activated during execution of the instruction $I_i$ and the activity at the inputs of these blocks; f(data) is then the sum of the currents absorbed by the activated blocks. In the ST20-C2P, which has no operand-isolation functions, execution of any instruction $I_i$ activates all the blocks; therefore, substituting in (8), we get:

$$f(data) = \sum_j Current_{Block_j}(activity_j) \qquad (9)$$

Going back to Figure 3, it is clear that the activity at the inputs of a block depends on the activity on the datapath buses (which represent the inputs of the combinational blocks):

$$activity_j = function(activity_{xbus}, activity_{ybus})$$

The activity induced on the datapath buses depends on the register mapping of the current and previous instructions. If we indicate the registers mapped on xbus and ybus when instruction $I_i$ is executed as $RegX_i$ and $RegY_i$ respectively, we have $activity_{xbus} = function(RegX_{i-1}, RegX_i)$ and $activity_{ybus} = function(RegY_{i-1}, RegY_i)$. So, in conclusion:

$$f(data) = function(RegMap_{i-1}, RegMap_i) \qquad (10)$$

where $RegMap_j = (RegX_j, RegY_j)$ represents the registers that instruction $I_j$ maps on xbus and ybus. Equation (10) can be generalised to account for other activity indexes, such as the activity on the address and data lines interfacing the memory. Proceeding in the same way, if we indicate the registers mapped on the memory address and data buses when instruction $I_j$ is executed as $RegMA_j$ and $RegMW_j$ respectively, Eq. (10) still holds.

Application to a Case Study
The model defined by Eq. (7) was applied to the ST20-C2P.

Definition of f(data)
It was found that the activity induced on the datapath buses (xbus and ybus) and on the memaddr bus was the greatest power consumer in the ST20-C2P. These buses were therefore chosen as activity indexes. To measure this activity, the Hamming distance between two successive configurations of the index being considered was chosen. This reflects the hypothesis that the power absorbed depends on the number of variations and is independent of the lines on which the variations occur. To define f(data) a linear model was chosen in which the activity of the various indexes is weighted with a different coefficient. Given these hypotheses, Eq. (10), as adapted for the C2P, becomes:

$$f(data) = Hamm(xbus_{i-1}, xbus_i) \times weight_{xbus} + Hamm(ybus_{i-1}, ybus_i) \times weight_{ybus} + Hamm(memaddr_{i-1}, memaddr_i) \times weight_{memaddr} \qquad (11)$$

We will now describe the method used to determine the weights and confirm the linearity hypothesis.
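As a minimal sketch of Eq. (11), assuming illustrative weight values (the real ones come from the least-squares fit described next):

    # Sketch of Eq. (11): f(data) as a weighted sum of Hamming distances.
    def hamming(a, b, width=32):
        """Number of differing bits between two bus configurations."""
        return bin((a ^ b) & ((1 << width) - 1)).count("1")

    def f_data(prev, curr, w_xbus=0.4, w_ybus=0.3, w_memaddr=0.5):
        """prev/curr: dicts with the 32-bit words on xbus, ybus and memaddr
        in two successive cycles; weights in mA per transition (assumed)."""
        return (hamming(prev["xbus"], curr["xbus"]) * w_xbus
                + hamming(prev["ybus"], curr["ybus"]) * w_ybus
                + hamming(prev["memaddr"], curr["memaddr"]) * w_memaddr)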

Determination of Weights
By suitably loading the registers and knowing the register/activity-index mapping, it is possible to construct sequences of instructions so as to induce guided activity on the activity indexes.
To calculate the weights appearing in Eq. (11) a test was constructed in such a way as to keep the activity on two of the indexes fixed, making only one of them variable. In this way the contribution of the activity on one index is isolated and it is possible to define a relation that links the activity on the index with the power absorbed.
Let " c P `96 Â 1 be a column vector split into 3 components with 32 elements: the elements of the ®rst component " cc1 X 32 represent the power absorbed by the execution unit when there are 1,2, F F F , 32 transitions on xbus and zero transitions on the remaining activity indexes.The elements of the second component of " cc33 X 64 represent the power absorbed by the execution unit for 1,2, F F F , 32 transitions on ybus and zero transitions on the remaining activity indexes.Likewise for the third component, memaddr.
Given the matrix M of the imposed activities, the measures satisfy

$$\bar{c} = M \times \overline{weight} \qquad (12)$$

Equation (12) is a system of 96 equations in the three unknowns $weight_{xbus}$, $weight_{ybus}$ and $weight_{memaddr}$. By applying the least squares method it is possible to obtain the vector $\overline{weight}$ minimising the quadratic error:

$$\overline{weight} = (M^T M)^{-1} M^T \bar{c} \qquad (13)$$

A more general method is to use a lookup table for each activity index. Each table is addressed by measuring the activity on the activity index being examined and gives the average amount of power absorbed.
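Equation (13) is the standard least-squares solution; with NumPy it can be sketched as follows (the measured vector is replaced here by a random placeholder):

    # Sketch of Eq. (13): M stacks the 96 activity rows, c the measured
    # currents; lstsq solves the normal equations (M^T M) w = M^T c.
    import numpy as np

    M = np.zeros((96, 3))
    M[0:32, 0] = np.arange(1, 33)    # 1..32 transitions on xbus, others null
    M[32:64, 1] = np.arange(1, 33)   # 1..32 transitions on ybus
    M[64:96, 2] = np.arange(1, 33)   # 1..32 transitions on memaddr
    c = np.random.rand(96)           # placeholder for the measured currents

    weight, *_ = np.linalg.lstsq(M, c, rcond=None)
    # weight = [weight_xbus, weight_ybus, weight_memaddr]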

Linearity Approximation
The adequacy of the linearity approximation linking the amount of power absorbed to the activity on the activity indexes is evident in Figure 4, which shows the power absorbed by the execution unit with varying amounts of activity on xbus and ybus. As can be seen, the activity induced on xbus leads to a greater power dissipation. The difference can be accounted for by the fact that xbus has more branches than ybus and propagates its activity to a greater number of blocks, thus entailing a higher cost per transition. It was also observed that the weight of 0→1 transitions was different from that of 1→0 transitions. Table V gives the costs per transition determined by the least squares method, discriminating between 0→1 and 1→0 transitions. Equation (11) was therefore extended so as to treat 0→1 and 1→0 transitions differently:

$$f(data) = Hamm_{0\rightarrow1}(xbus_{i-1}, xbus_i) \times weight^{0\rightarrow1}_{xbus} + Hamm_{1\rightarrow0}(xbus_{i-1}, xbus_i) \times weight^{1\rightarrow0}_{xbus} + Hamm_{0\rightarrow1}(ybus_{i-1}, ybus_i) \times weight^{0\rightarrow1}_{ybus} + Hamm_{1\rightarrow0}(ybus_{i-1}, ybus_i) \times weight^{1\rightarrow0}_{ybus} + Hamm(memaddr_{i-1}, memaddr_i) \times weight_{memaddr} \qquad (14)$$

where the function $Hamm_{x\rightarrow y}(\bar{a}, \bar{b})$ gives the number of transitions $x \rightarrow y$ involved in passing from configuration $\bar{a}$ to configuration $\bar{b}$. The activity indexes will foreseeably shift from the combinational elements towards the interconnection buses as the scale of integration increases. The wire-to-gate capacitance ratio has gone from 3 for old technologies to 100 for new technologies. This means that the interconnection buses are of greater importance, in relation to power consumption, than combinational logic, and so it is of fundamental importance to focus on the power dissipated due to activity on the buses.
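Counting rising and falling transitions separately, as Eq. (14) requires, reduces to two bitwise operations; a small helper of ours (not the authors' code):

    # Directional transition counts between two bus configurations.
    def transitions(prev, curr, width=32):
        mask = (1 << width) - 1
        rising = bin(~prev & curr & mask).count("1")   # 0 -> 1 transitions
        falling = bin(prev & ~curr & mask).count("1")  # 1 -> 0 transitions
        return rising, falling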

Determination of Glitches
To apply Eq. (14) it is necessary to have precise knowledge of the transitions on the activity indexes. Let us consider the following code fragment:

    ldc 0x00000000
    ldc 0xffffffff
    ldc 0x00000000
    add
    ldc 0xffffffff
    ldc 0x00000000

Let us assume that we wish to determine the contribution of data, in terms of power absorbed, following execution of the instruction ldc 0xffffffff. To determine the activity on xbus and ybus, Table VI shows the mapping of the registers on the datapath buses for the instructions ldc and add. From Table VII the Hamming distances between the successive bus configurations are null, e.g., $Hamm_{1\rightarrow0}(ybus_{add}, ybus_{ldc}) = 0$, from which we conclude that according to Eq. (14) the data contribution is null.
The power measured during execution of the ldc 0xffffffff is 62 mA, while the estimated amount is 11 mA, giving an underestimation error of 82%. The estimation error is due to the way in which the transitions were counted. The enable signals of the three-state buffers for the mapping of the registers on the datapath buses come from the micro-controller. As the lengths of the paths are different, these selection signals arrive after the contents of the registers have changed.
Figure 5 shows the real transitions on xbus and ybus. During execution of the instruction add, xbus and ybus are loaded with the contents of the registers AReg and BReg respectively. On the rising edge of the next cycle the contents of the registers are updated with the result of the add operation and the datapath buses are loaded with the new contents of the registers AReg and BReg, generating 32 0→1 transitions on xbus and 32 1→0 transitions on ybus. After a delay T following the rising edge of the clock, the selection signals map the registers ConstantReg and 0Reg on xbus and ybus, thus determining another 32 1→0 transitions on xbus and 32 0→1 transitions on ybus.
Indicating as $RegX^{i}_{i-1}$ the contents of the register that instruction $I_{i-1}$ maps on xbus during execution of the instruction $I_i$, the total number of transitions on xbus when instruction $I_i$, preceded by instruction $I_{i-1}$, is executed is the sum of the transitions due to the update of the register contents and of those due to the delayed remapping of the registers on the bus:

$$transX(I_{i-1}, I_i) = Hamm(RegX^{i-1}_{i-1}, RegX^{i}_{i-1}) + Hamm(RegX^{i}_{i-1}, RegX^{i}_{i})$$

and likewise for ybus. To show the influence of data on power consumption, a test was prepared comprising a sequence of 50 add instructions. This test was performed in two modes: low activity (sum_low) and high activity (sum_high). In the low-activity mode the 50 add instructions operate on null arguments, while in the high-activity mode the arguments gradually increase (see Section 2.1.3). In Figure 6 the solid line shows the real average power absorbed per cycle in the case of high and low activity, while the dashed line shows the estimated power absorption.
To test the robustness of the DDM to inter-instruction effects, a test (rand_inst) was built comprising a sequence of about 500 instructions extracted at random from the instruction set.
The aim of this test was to create a preponderant inter-instruction effect which, as we have seen, is the greatest contributor to power consumption. The random distribution of the instructions avoids the presence of repeated sequences which might falsify the estimation results. Figure 7 gives the percent error distribution over the cycles using the ACM (a) and the DDM (b).
Figures 8 and 9 show the percent error distributions using the ACM and the DDM for tests which execute the sum (mtx_sum) and the product (mtx_mul) of two 3 × 3 matrices formed by random integer elements.
Figure 10 shows the real and estimated power absorbed by the execution unit in the first 80 cycles of execution of a test performing a digital Fourier transform (dft) on 16 signal samples. Figure 11 gives the percent error distributions over the cycles using the ACM (a) and the DDM (b).
Table VIII summarises the results of the tests described in this section. r and s respectively indicate the vectors of the real and estimated power per cycle. E{x} indicates the average of the elements of the vector x. For each test carried out, the table gives the average amount of power absorbed during execution, the percent estimation error and the average of the absolute error values obtained cycle by cycle. In all cases the error made using the DDM was lower than 6% in estimating the average power and less than 10% in the estimate per cycle. The accuracy of the ACM is not so easy to quantify, as the error varies over a very wide range. For some highly data-dependent tests, the average error goes from 97.77% to 5.58% for varying input data. In the tests in which the average amount of power absorbed has a narrow range of variation for varying data, the average ACM estimation error is between 7% and 20% for the average amount of power absorbed and between 15% and 30% for the cycle-based estimate.

Generalisation to RISC Processors

The model described by Eq. (7) was specifically defined for our case study. In this section we will show that it is possible to generalise the model to apply it to generic RISC processors. Let us refer to the following simplifying hypotheses: there are N pipeline stages (indicated as $S_1, S_2, \ldots, S_N$); each stage processes an instruction in a single clock cycle; each instruction is processed by all the stages. We will consider the pipeline configuration shown in Figure 12.
The average power absorbed in the clock cycle $CK_M$ is:

$$Cost(CK_M) = \sum_{S=1}^{N} \left[ base^{(0)}_S(I_S) + int^{(0)}_S(J_S, I_S) + F_S(activity_S, I_S) \right] \qquad (17)$$

where $I_S$ and $J_S$ are the instructions processed by stage S in the current and previous cycle, and the terms $weight_1$ and $weight_2$ weight the activity according to the instruction being executed. In architectures with no operand isolation we will have $weight(I) = const$ for every instruction I. By constructing a test formed by the repetition of an instruction I, so that the pipeline configuration is the one shown in Figure 13, it is possible to obtain the basic costs $base^{(0)}_S(I)$ for all $S \in \{1, 2, \ldots, N\}$. Likewise, by constructing a test in which the sequence of instructions I and J is repeated, so that the pipeline configuration is the same as the one in Figure 14, we obtain the costs $int^{(0)}_S(J, I)$ for all $S \in \{1, 2, \ldots, N\}$.
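A minimal sketch of Eq. (17) follows; it is our own illustration, in which the per-stage cost tables and data-contribution functions are assumed inputs produced by the characterisation just described.

    # Sketch of Eq. (17) for an N-stage pipeline: the per-cycle cost is the
    # sum of the per-stage null-activity, inter-instruction and data terms.
    def cost_cycle(stages, base0, int0, F, activity):
        """stages: list of (prev_instr, curr_instr) pairs, one per stage S;
        base0[S][I], int0[S][(J, I)]: characterised costs;
        F[S]: data-contribution function for stage S; activity[S]: its index."""
        total = 0.0
        for S, (J, I) in enumerate(stages):
            total += base0[S][I] + int0[S].get((J, I), 0.0) + F[S](activity[S], I)
        return total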

Determination of F_S(activity_S, I)
There remains to be determined the contribution of the data, represented in Eq. (17) by the terms of type $F_S(activity_S, I)$. First it is necessary to identify the activity indexes for each stage. It can reasonably be assumed that buses transferring data towards large combinational blocks, or buses interfacing the memory (with large capacitive loads), represent general activity indexes. This will be made clearer by referring to a specific case. Let us consider the classic 5-stage pipeline architecture of a MIPS or DLX. We will assume two activity indexes in the execution stage, the registers at the ALU inputs (the interface between the decode and execution stages), and define an activity metric (e.g., the Hamming distance between the configurations of these registers in two subsequent clock cycles). It is possible to construct a sequence of instructions in such a way as to induce a fixed activity on the activity indexes. To determine the relation between the activity on one index and the amount of power absorbed it is necessary to construct a test in which the activity on one index is made to vary continuously while that on the other is kept constant. The points obtained can be interpolated, or approximated with a straight line using the least squares method, to establish a relation between the activity on the index concerned and the power absorbed. Let us consider, for example, the instruction add r_d, r_s1, r_s2, the effect of which is to load the activity indexes with the contents of the registers r_s1 and r_s2. Construct a sequence of the type:

    li r1, X1a        ; r1 <- X1a
    li r2, X2a        ; r2 <- X2a
    li r3, X1b        ; r3 <- X1b
    li r4, X2b        ; r4 <- X2b
    add r5, r1, r2    ; Oper1 <- r1, Oper2 <- r2
    add r5, r3, r4    ; Oper1 <- r3, Oper2 <- r4

When the second add is processed by the execution stage, the first activity index will pass from the configuration X1a to the configuration X1b, while the second index will pass from X2a to X2b. If we define the average amount of power absorbed by the execution stage and the activity on the activity indexes as $C_{ex}$ and $activity_{ex}$ respectively, it is necessary to link $C_{ex}$ and $activity_{ex}$ mathematically, i.e., to determine a function f such that $C_{ex} = f(activity_{ex}) + err$, so as to minimise err. The function f can be determined in various ways. If the activity index under examination is a high-capacitance bus, the activity can be measured via the Hamming distance between two subsequent configurations, and the activity can be linked with the power absorbed through a proportionality constant. If the activity index is due to variation in the inputs of a combinational block, it can be characterised using the power model proposed in [20] or by using a neural network to obtain the relation between the input switching activity and the power absorbed.

Characterisation: ACM vs. DDM
As we have shown, the characterisation of both models (ACM and DDM) on the core processors of embedded systems requires the use of software analysis tools. Although the use of accurate simulation tools during characterisation guarantees better results in the estimation phase, it requires great computational resources and long computation times (the lower the level at which they operate, the more accurate they are).
In this section the ACM and DDM will be compared in relation to the complexity of characterisation when, as in our case study, the silicon of the core to be characterised is not available and the only way to take the measurements is to use a software tool for power analysis. As a specific case we will consider characterisation of the instruction set of the ST20-C2P. Determination of base(Inst) requires the construction of a test in which the sequence

    ldc random
    ldc random
    ldc random
    Inst
    Inst
    Inst

is repeated a significant number, N, of times to characterise the average. The ldc instructions are executed in a number of cycles proportional to the length in nibbles of the operand:

$$cycles(ldc\ n) = 1 + \frac{nibble(n) - 1}{2}$$

If the data is uniformly distributed, a ldc random will be executed on average in 2 cycles. Let $C_{Inst}$ be the number of cycles needed to execute the instruction Inst. The total number of cycles needed to calculate the basic cost of instruction Inst is therefore $N(3 \cdot 2 + 3 C_{Inst})$. To determine the inter-instruction costs, by repeating N times the sequence

    ldc random
    ldc random
    ldc random
    Inst_j
    Inst_i
    Inst_i

the cost $(Inst_j \rightarrow Inst_i)_{back}$ is determined, whereas by repeating N times the sequence

    ldc random
    ldc random
    ldc random
    Inst_i
    Inst_i
    Inst_j

$(Inst_i \rightarrow Inst_j)_{forw}$ is determined. Let M be the number of instructions to be characterised; since the inter-instruction costs must be measured for every ordered pair of instructions, the number of cycles needed to determine all the basic and inter-instruction costs for the ST20-C2P grows as $O(M^2 N)$. We follow the same reasoning when the DDM is used. Determination of $base^{(0)}(Inst)$ requires the construction of a test of the same kind, in which the values A, B and C loaded into the registers are fixed in such a way that execution of the framed instruction takes place in a state of null activity. $base^{(0)}(Inst)$ will be the average power absorbed during execution of the framed instruction. If we indicate the number of cycles needed to execute instruction Inst as $C_{Inst}$, the total number of cycles needed to calculate $base^{(0)}(Inst)$ has the same form as for the ACM. To determine the inter-instruction costs with null activity, analogous sequences embedding $Inst_i$ after $Inst_j$ (and vice versa) are used, determining the costs $(Inst_j \rightarrow Inst_i)^{(0)}_{back}$ and $(Inst_i \rightarrow Inst_j)^{(0)}_{forw}$, and from these the number of cycles needed to determine all the basic and inter-instruction costs with null activity. This requires some clarification. The number of cycles needed to characterise the ACM was calculated without including the cycles needed to characterise the term f(data). In our case study only two activity indexes were taken into consideration, so the number of cycles needed to characterise them is negligible as compared to those required to characterise the other terms. In addition, the absence of operand isolation mechanisms makes it possible to render f(data) instruction-independent (see Section 3.2.2, in which about 96 measures were required to determine the costs per transition).
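A back-of-the-envelope comparison of the characterisation effort follows; the per-sequence cycle counts are assumptions based on the text above, not the paper's exact formulas.

    # Rough count of characterisation cycles for the ST20-C2P instruction set.
    M = 36          # instructions to characterise (Table II)
    N = 100         # repetitions per measurement, as in [11]
    ldc_cycles = 2  # average cycles of a `ldc random`
    inst_cycles = 1 # assumed single-cycle instruction

    per_seq = 3 * ldc_cycles + 3 * inst_cycles   # three loads + three instances
    basic = M * N * per_seq                      # one test per instruction
    inter = M * (M - 1) * N * per_seq            # one test per ordered pair
    print(basic, inter)                          # pair tests dominate: O(M^2 * N)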

A TOOL FOR SOFTWARE POWER CONSUMPTION ESTIMATION
Application of the model of Eq. (16) requires knowledge of the state of the processor cycle by cycle. By the state of a processor in a generic clock cycle we mean the contents of the registers, the logical levels of the lines of the buses interfacing the memory and of the datapath, and the state of the finite state machine implementing the micro-controller. Such detailed knowledge of the state of the processor can be obtained via VHDL simulation of the code to be estimated. The manufacturer of the processor will obviously not provide the VHDL models of the processor, so the client cannot run a VHDL simulation of the code to obtain a cycle-accurate trace and apply the model of Eq. (16).
A CPU emulator is an application that is always present in the package of development tools a CPU manufacturer provides to its clients. A CPU emulator is a functional model of the CPU with which it is possible to trace the flow of instructions executed by the processor during the execution of a program. For each instruction executed, it gives the state of the registers and any memory access operations. An instruction is generally executed in a certain number of cycles, passing through a sequence of micro-states that is usually not fixed.

Binary Tree Representation
The finite state machine implementing the CPU micro-controller determines the subsequent state according to the current one and the value of certain signals. The VHDL code implementing the FSM of the micro-controller can be generalised as in Figure 15. So, given an initial state, it is possible to reach several final states according to the value of certain signals. We can represent the states that can be reached from a certain state by means of a binary tree in which the nodes represent the conditions and the leaves the states reached. In the case described, the binary tree representing the state state_i is made up of four nodes and four leaves. The nodes represent the conditions of the IF constructs, while the leaves represent the possible subsequent states (Fig. 16). The path through the tree depends on whether the conditions defined by the nodes are met or not: having fixed a conditional node, the left-hand path is taken if the condition is met, the right-hand one if it is not. If, for instance, the signals signal_i and signal_j are both high, the subsequent state will be state_j.
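A small sketch of the conditional-tree walk: nodes hold a signal condition, with the left child taken if the condition is met and the right child otherwise, and leaves name the next micro-state. The tree structure and the state names other than state_j are illustrative assumptions based on Figures 15-16.

    # Conditional-tree traversal to determine the subsequent micro-state.
    class Node:
        def __init__(self, cond=None, left=None, right=None, state=None):
            self.cond, self.left, self.right, self.state = cond, left, right, state

    def next_state(tree, signals):
        node = tree
        while node.state is None:                 # descend until a leaf
            node = node.left if signals[node.cond] else node.right
        return node.state

    # e.g., with signal_i and signal_j both high the walk reaches state_j
    tree = Node("signal_i",
                left=Node("signal_j", left=Node(state="state_j"),
                          right=Node(state="state_k")),   # state_k: hypothetical
                right=Node(state="state_h"))              # state_h: hypothetical
    assert next_state(tree, {"signal_i": True, "signal_j": True}) == "state_j"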

Analysis of Conditions
Splitting up the instructions into their component micro-states can be achieved if it is possible to evaluate the conditions (at the nodes of the conditional tree) using the information supplied by the CPU emulator. These conditions refer to the contents of the registers, the datapath buses and the signals regulating the memory access protocol. The conditions depending on the register contents can be evaluated directly, as the trace file generated by the processor emulator provides information about the register contents. Information about the state of the datapath bus lines is not given directly in the trace file. It is, however, possible to obtain it through knowledge of the current state and the register contents. The state, in fact, fixes the mapping between the registers and the datapath buses, so it is possible to obtain the word mapped on xbus and ybus. For instructions that do not map a register on xbus and ybus the contents remain constant, due to the presence of the bus keepers. The problem of evaluating the conditions depending on the memory speed is solved by implementing a memory model that operates according to the specified protocol.

Generation of a Trace of the Micro-states
As we have illustrated, it is possible to split an instruction up into its component micro-states just by using the register contents and state information. Evaluation of the conditions makes it possible to visit the conditional tree relating to the current state and reach the leaf representing the subsequent state. If this algorithm is applied to all the instructions appearing in the trace of a program, it is possible to generate a micro-state-by-micro-state trace from the instruction-by-instruction trace.
Figure 17 shows the estimation flow starting from the source code of the program. This is compiled to generate an image of the ROM, and is then executed via the CPU emulator, which supplies a trace of the instructions executed together with the state information. Splitting the instructions up into their component micro-states does not mean generating a cycle-accurate trace. Fetch unit stalls may introduce waiting states that are difficult to pick up from the instruction trace. A waiting state is imposed when the fetch unit has stalled and the execution unit is ready to execute a new instruction in the next cycle. Picking up these effects from the instruction-by-instruction trace is a very complex operation that essentially depends on two factors: the memory speed, which determines the number of cycles needed to fill the instruction buffer with the new instructions to be executed, and the alignment of the words of the machine code of the instructions in memory. Estimation of the average amount of power absorbed during these waiting states also depends on a number of factors. As we have seen, a great contribution to the power absorbed in the ST20-C2P is made by activity at the micro-controller ROM inputs. In fetch unit stall cycles, the micro-controller ROM is stimulated with spurious data on account of the direct connection between the fetch unit predec output and the micro-controller ROM inputs.

Validation of the Method
The tool was validated by executing some estimation tests and comparing the results with those obtained using PowerMill. In all the cases examined, both the error in estimating the average amount of power absorbed and that in estimating the total power required were always below 5%. It was not possible to assess the tool in terms of error per cycle, since stalls during the execution of the tests led to a loss of synchronisation between the trace extracted via the VHDL simulation and the one generated by the tool. It was only possible to perform this analysis on short sequences of code in which there were no stalls. The accuracy achieved is shown in Figure 18.

Best and Worst Case
The energy absorbed in executing any one program varies according to the data involved. A data-dependent model can be used to determine the minimum and maximum power absorbed in executing the program. These operational extremes can be estimated by exploiting the model of Eq. (7). Having fixed the software, the minimum power required can be estimated by neglecting the data contribution (i.e., setting f(data) = 0 for each instruction executed), whereas the maximum power required can be estimated by maximising the term f(data), i.e., maximising the activity on the activity indexes involved during execution of each instruction. An example of the possible variation in the amount of power absorbed by a single program when the data varies is given in Table IX. For each program the table gives the estimate of the average power absorbed when the term f(data) is neglected and when it is maximised. As can be seen, the data contribution alone can in principle cause a 100% difference in the average amount of power absorbed per cycle.
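A minimal sketch of these bounds follows; it is our own illustration, in which cost_no_data and f_data_max are assumed lookup tables derived from the characterised model.

    # Best/worst-case bounds from Eq. (7): the lower bound drops f(data),
    # the upper bound replaces it with the maximum achievable activity.
    def power_bounds(trace, cost_no_data, f_data_max):
        """trace: executed instructions; cost_no_data[I]: base plus
        inter-instruction terms of Eq. (7); f_data_max[I]: f(data) with
        all activity indexes saturated."""
        low = sum(cost_no_data[I] for I in trace)                   # f(data) = 0
        high = sum(cost_no_data[I] + f_data_max[I] for I in trace)  # f(data) max
        return low / len(trace), high / len(trace)   # average per cycle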

CONCLUSIONS
We have proposed a data-dependent model for estimating the average amount of power a processor consumes cycle by cycle during execution of a program. Application of the model requires a trace of the program as input, i.e., the flow of instructions executed and the state of the system as defined by the contents of the registers and information about memory accesses, if any. To determine the trace of the program it is necessary to execute it using, for example, a functional model of the processor. The need to execute the program to obtain the trace is not a limiting factor: generally, any model for software power consumption estimation requires knowledge of the flow of instructions executed. As the sequence of instructions executed is highly data-dependent (consider, for example, conditional jumps), it is necessary to execute the program to be evaluated.
The software power consumption estimation models proposed in the literature aim at minimising the mean estimation error over a time window which extends to the duration of the program. Our point of view, on the other hand, was to minimise the estimation error per clock cycle: in this case the data contribution cannot be overlooked (compare Figs. 2 and 6). The results obtained confirm the validity of the model: in all the tests carried out, the average error per cycle remained below 10%. The advantage of a data-dependent model over a data-independent one [1] or an instruction-independent one [11] is without doubt the greater accuracy of the estimate.
All three models give the same estimate of the total amount of power consumed during execution of a whole general-purpose program, but where a data-dependent model shows its strength is in the estimation of small fragments of code or highly input-data-dependent applications (mathematical routines, DSP applications).
A data-dependent model can be used to characterise a given software application in power terms for varying input data. An example is given by a function applying an image filter: it can be characterised via a power cost depending on the graphical filter used. Another field of application for a data-dependent estimation model is determining the best-case and worst-case conditions for a program in terms of power: the model of Eq. (7) can be applied to each instruction in the program trace, minimising or maximising the term f(data).

FIGURE 1 Flow of operations performed to construct the database of basic and inter-instruction costs.

FIGURE 2 Average amount of power absorbed by the execution unit cycle by cycle during execution of a sequence of add instructions operating with null arguments and gradually increasing arguments.

FIGURE 3 Part of the datapath of a generic processor.
:" c M Â weight12Equation (12) is a system of 96 equations with the 13 POWER ANALYSIS MODEL I207T001050 .207 T001050d.207three unknown values weight xbus , weight ybus and weight memaddr .By applying the least squares method it is possible to obtain the vector weight minimising the quadratic error.

FIGURE 4 Current absorbed by the execution unit with varying amounts of activity on xbus in (a) and ybus in (b).

$$f(data) = transX_{0\rightarrow1}(I_{i-1}, I_i) \times weight^{0\rightarrow1}_{xbus} + transX_{1\rightarrow0}(I_{i-1}, I_i) \times weight^{1\rightarrow0}_{xbus} + transY_{0\rightarrow1}(I_{i-1}, I_i) \times weight^{0\rightarrow1}_{ybus} + transY_{1\rightarrow0}(I_{i-1}, I_i) \times weight^{1\rightarrow0}_{ybus} + Hamm(memaddr_{i-1}, memaddr_i) \times weight_{memaddr} \qquad (15)$$

Estimation Tests

Given the sequence of instructions $I_{i-1}$, $I_i$, $I_{i+1}$, the model used to estimate the average amount of power absorbed during execution of instruction $I_i$ is:

$$Cost(I_i) = base^{(0)}(I_i) + (I_{i-1} \rightarrow I_i)^{(0)}_{back} + (I_i \rightarrow I_{i+1})^{(0)}_{forw} + transX_{0\rightarrow1}(I_{i-1}, I_i) \times weight^{0\rightarrow1}_{xbus} + transX_{1\rightarrow0}(I_{i-1}, I_i) \times weight^{1\rightarrow0}_{xbus} + transY_{0\rightarrow1}(I_{i-1}, I_i) \times weight^{0\rightarrow1}_{ybus} + transY_{1\rightarrow0}(I_{i-1}, I_i) \times weight^{1\rightarrow0}_{ybus} + Hamm(memaddr_{i-1}, memaddr_i) \times weight_{memaddr} \qquad (16)$$

This model will be referred to as the DDM (data-dependent model). The DDM was validated and compared with the ACM by performing some software estimation tests. These were divided into two classes: tests constructed ad hoc to highlight the features of the proposed model, and tests extracted from fragments of the code of real applications. Each test was carried out using both the ACM and the DDM, calculating the average amount of power absorbed by the execution unit (real and estimated) per cycle (for the first 80 clock cycles) and the distribution of the mean percent error over the estimation cycles.

FIGURE 6  Real and estimated average power absorbed per cycle in the case of high and low activity for a test comprising a sequence of 50 add instructions.

FIGURE 7 Percent error distribution over the cycle using the ACM (a) and the DDM, (b) for a test comprising a sequence of about 500 instructions extracted at random from the instruction set.

FIGURE 8 Percent error distribution over the cycle using the ACM (a) and the DDM, (b) for a test which executes the sum of two 3 Â 3 matrices.

FIGURE 9 Percent error distribution over the cycle using the ACM (a) and the DDM, (b) for a test which executes the product of two 3 Â 3 matrices.

FIGURE 11 Percent error distributions over the cycle using the ACM (a) and the DDM, (b) of a test performing a digital Fourier transform.

FIGURE 13 Pipeline configuration to obtain the basic costs $base^{(0)}_S(I)$ for all $S \in \{1, 2, \ldots, N\}$.

FIGURE 14 Pipeline configuration to obtain the costs $int^{(0)}_S(I, J)$ for all $S \in \{1, 2, \ldots, N\}$.

FIGURE 16 Conditional tree representing the state state_i.

FIGURE 18 Real and estimated power absorbed in the first 32 cycles of execution of a test performing the sum of two 3 × 3 matrices.

TABLE I

TABLE VII

FIGURE 5 Real transitions on xbus and ybus.