Architectural Power Estimation Based on Behavior Level Profiling

High level synthesis is the process of generating register transfer (RT) level designs from behavioral specifications. High level synthesis systems have traditionally taken into account such constraints as area, clock period and throughput time. Many high level synthesis systems [1] permit generation of many alternative RT level designs meeting these constraints in a relatively short time. If the power consumption of RT level designs can be estimated accurately, then a low power design can be selected from among these alternatives. In this paper, we present an accurate power estimation technique for register transfer level designs generated by high level synthesis systems. The technique has four main aspects: (1) Each RT level component used in high level synthesis is characterized for average switched capacitance per input vector. This data is stored in the RT level component library. (2) Using user-specified stimuli, the given behavioral description is simulated and event activities of various operators and carriers are measured. Then, the behavioral specification is submitted to the synthesis system and a number of alternative RTL designs meeting speed, space and throughput rate constraints are generated. (3) The event activity of each component in an RT level design is estimated using the event activities measured at the time of behavior level profiling and the structure of the RTL design itself. (4) The event activities so obtained are then used to modulate the average switched capacitances of the respective RT level components to obtain an estimate of the total switched capacitance of each component. Detailed power estimation procedures for the three different parts of RTL designs, namely, data path, controller and interconnect, are presented. Experimental results obtained from a variety of designs show that the power estimates are within 3% to 10% of the actual power measured by simulating the transistor level designs extracted from mask layouts.


INTRODUCTION
Due to the increasing demand for portable applications and the rapidly growing complexity, power consumption has become one of the main issues in the realization of VLSI chips. There have been major efforts [2] to reduce the power consumption at all levels of abstraction in the design flow. Accurate power estimation techniques are the key to the success of these efforts. Although accurate power estimation is possible at the lower levels of abstraction, it is very time consuming.
Hence, the focus has recently shifted to the higher levels of abstraction, including the register transfer (RT) level and above [3]. In this paper, we present a power estimation technique for automatically synthesized RT level designs. This technique is based on behavior level profiling.
A high level synthesis system accepts a behavioral specification written in a hardware description language such as VHDL, a module library, and design constraints such as area and delay constraints. The module library consists of RT level modules such as adders, multipliers, registers and multiplexors. The output of the synthesis system is an RT level design satisfying the user-specified constraints. The synthesis time is usually quite small compared to logic synthesis or layout synthesis. Hence, it is possible to synthesize many constraint-satisfying RT level designs in a relatively short time.
RT level designs are composed of two interacting parts: datapath and controller. The datapath consists of execution units such as adders and multipliers, storage units such as registers and RAMs, and interconnect units such as multiplexors and buses. Since the structure of the design is completely known, accurate power estimation is feasible. In addition, since the modules are at a sufficiently high level of abstraction, such estimation should be time efficient. At the higher levels of abstraction, such as the behavioral level, accurate power estimation is difficult due to the lack of sufficient implementation detail. On the other hand, at lower levels of abstraction, such as the logic and layout levels, even though sufficient implementation detail is available, the estimation time is discouraging. Hence, we are motivated to estimate power at the RT level of abstraction. For a given RT level design and for a given set of input vectors, we estimate the total capacitance switched in the design. We use "power" and "switched capacitance" synonymously. Our estimation technique is set in the context of a high level synthesis system known as the Profile-Driven Synthesis System (PDSS).
Our power estimation procedure requires the following inputs: (1) A module library characterized for the average intrinsic switched capacitance (ISC) per input vector. (2) Profile data for various carriers and operators in the data flow graph of the behavioral specification. This data is obtained by simulating the behavioral specification using user-specified stimuli. (3) An RT level design generated by the synthesis system. (4) Binding information of the operators and carriers in the data flow graph to the module instances in the RT level design.
The high level synthesis process introduces certain RT level module instances, such as temporary registers and multiplexors, for which the profile data is not known, since these modules have no direct correspondence with the operators and carriers at the behavior level. Profile data for such data path units is derived using the profile data at their inputs, which in turn is obtained from the profile data measured at the behavior level. The switched capacitance for each module instance is estimated as the product of its profile data and its intrinsic switched capacitance obtained from the module library. The total switched capacitance in the datapath is the sum of the estimated switched capacitances over all instances.
The switched capacitance estimation for the controller, assumed to be implemented as a PLA, is as follows: A parameterized PLA characterization table for average switched capacitance per clock cycle is obtained as explained in Section 5.
Given the controller size, the switched capacitance for the controller is estimated by determining the closest point in the PLA table.
The power estimated for the entire design is the sum of the estimated switched capacitances of the datapath and the controller. Experimental results show that the power estimated in the RT level datapaths and controllers is within 15% of the actual power measured at the layout level.
Section 2 presents a brief survey of power estimation techniques at the architectural and other levels of abstraction. Section 3 discusses various issues involved in power estimation. Section 4 discusses the concept of behavioral profiling. Section 5 discusses the module library characterization and the PLA characterization for the average switched capacitance per unit vector. Section 6 discusses the power estimation technique. Section 7 presents the results obtained for several examples. Section 8 discusses the results and presents concluding remarks.

The Power Factor Approximation (PFA) method was suggested in [4] for characterizing each module in a module library consisting of functional blocks. The method provides different gate equivalent models for blocks such as multipliers, adders, etc. Each functional block is associated with a PFA proportionality constant and a hardware complexity constant. The PFA constant captures the intrinsic internal activity of the module. Purely random inputs are applied when deriving the PFA constant. The power dissipation in a chip is the sum of the power dissipation in all blocks of the chip. The power contributed by a block in the chip is simply the product of the above two constants and the block's activity frequency. The activity frequency of a functional block is the frequency at which the function is performed.
Powell et al. [5] present an algorithm level power dissipation model for a class of DSP algorithms known as MA-based (Multiply-Add) DSP algorithms. The major sources of power dissipation in MA-DSP systems were identified to be memory operations, computations and I/O operations. The impact of the number of available processing elements, the complexity of processing elements, memory organization and the type of arithmetic on power dissipation was discussed. This model relates power dissipation to high level algorithmic and architectural parameters.
Chandrakasan et al. [6,7] described a high-level synthesis system, HYPER-LP, which uses a variety of architectural and computational transformations to minimize power consumption in application-specific datapath-intensive CMOS circuits.
Landman et al. [8] presented a methodology for low-power design-space exploration at the architectural level. Black-box power models for the architectural-level components were generated [9] and used to estimate power while preserving the accuracy of gate or circuit level estimation. The power analysis tool was set in the context of HYPER [10], a high level synthesis system. The key differences between our approach and Landman's approach are as follows. (1) Our synthesis system, known as PDSS (Profile Driven Synthesis System) [11], is targeted towards control-dominated ASIC applications. The behavioral specifications can contain complex control constructs such as nested loops, conditionals and subprograms. On the other hand, HYPER primarily targets straight-line DSP-style specifications. (2) Our approach is based solely on behavioral profiling. Landman's estimation is based on behavioral profiling or RT level profiling. For large designs with a large set of inputs, the latter approach is time consuming and hence design space exploration is difficult. (3) Our characterization of the module library is based on purely random inputs, that is, the Uniform White Noise (UWN) model. Landman, on the other hand, proposed the DBT (Dual Bit Type) model to take into account the input activity. Our power dissipation model based on the UWN model is simpler than Landman's and yet yields reasonably accurate estimates.
Renu et al. [12] proposed a behavior level power estimation technique based on a combination of analytical and stochastic methods. Based on this, a design space exploration tool is presented which is used to examine the effect of different design steps such as transformations and algorithms. These techniques have also been implemented in the HYPER synthesis environment [10].
Anand et al. [14] present a behavioral synthesis system known as Genesis for synthesizing low power datapath-intensive CMOS circuits. During the allocation phase, (1) the physical capacitance is reduced by minimizing the number of functional modules, registers and multiplexors; and (2) the transition activity for a given module is reduced by selecting a proper sequence of operations for that module. The controller is optimized so as to generate control signals which will reduce the transition activity in the datapath. This is achieved by introducing don't-cares in the state table of the controller. If a datapath module is idle for a particular cycle, then the control signal driving that module is assigned a don't-care, thus avoiding unnecessary clocking of the module. Anand et al. [15] present a simulation-based method to measure intra- and inter-iteration effects of hardware sharing on switched capacitance. During the simulation, information is gathered which is used to formulate allocation as an ILP problem with the total switched capacitance in the datapath as the objective function. The solution to the problem yields the optimal allocation for the given model.
A detailed discussion about power consumption in CMOS digital designs can be found in [16]. Techniques for low power operation are presented which use the lowest possible supply voltage coupled with architectural, logic, circuit, and technology optimizations. An excellent literature survey on power estimation techniques at the logic and lower levels of abstraction can be found in [17].
In [11,18], we proposed a behavior level profiling based technique to estimate switching activity and switching capacitance in a design. The estimation is carried out in the scheduling and performance analysis phase of the synthesis. For a given input specification, various schedules can be generated satisfying the user-given constraints. The schedule with the least estimated switching capacitance is further synthesized. The estimation technique adopts an analytical approach at the design level and a statistical approach for the module library characterization. One of the drawbacks of the approach is that the interconnect estimation is somewhat inaccurate at the scheduling level, resulting in inaccurate power estimation. In the present work, the estimation is at the RT level and is based on the behavior level profiling of the input specification. Since the interconnect structure is known completely, power estimation in the interconnect is more accurate compared to that obtained at the end of the scheduling step. In the present approach, the error in the power estimate is in the range of 3% to 10%.

ISSUES IN POWER ESTIMATION
In a CMOS digital circuit, the power consumed is given by the following equations [19,16]:

$$P_{consumed} = P_{switching} + P_{sc} + P_{leakage}$$
$$P_{switching} = \sum_i C_i \, V_{supply}^2 \, f_i$$
$$P_{sc} = I_{sc} \, V_{supply}$$
$$P_{leakage} = I_{leakage} \, V_{supply}$$

$P_{switching}$ is known as the switching component of power consumption, which arises due to charging a node with a load capacitance of $C_i$ which is clocked at a frequency $f_i$. $P_{sc}$, the short-circuit component, arises when the PMOS and NMOS transistors are switching simultaneously, resulting in a short-circuit path from the voltage supply to ground. For a very short period of time, current is drawn from the voltage supply to ground, which results in power dissipation. $P_{leakage}$ is due to the leakage current, $I_{leakage}$, which arises due to substrate injection and subthreshold effects.
The dominant term is $P_{switching}$. This term is dependent on the architectural parameters and is relatively amenable to estimation at higher levels of abstraction. It is well known that the static power consumption in digital CMOS circuits is negligible compared to the dynamic power consumption. Hence $P_{leakage}$, which is static in nature, is negligible. $P_{sc}$ can be kept within 15-20% of $P_{switching}$ [20] by proper design methodology. Thus, it is sufficient to estimate $P_{switching}$ to estimate the average power consumed by a design.
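As a small numeric illustration of the switching term, the following Python sketch sums $C_i V_{supply}^2 f_i$ over a list of nodes. The function name and the node values are hypothetical and chosen only for illustration; the sketch assumes the standard form of the switching power expression and ignores the short-circuit and leakage terms.

```python
def switching_power(nodes, v_supply):
    """Return sum_i C_i * V_supply^2 * f_i in watts.

    nodes: iterable of (capacitance_in_farads, switching_frequency_in_hz).
    """
    return sum(c * v_supply ** 2 * f for c, f in nodes)


# Hypothetical node data, chosen only for illustration.
example_nodes = [(10e-15, 1e6), (20e-15, 2e6)]
example_power = switching_power(example_nodes, v_supply=5.0)
```

For a fixed supply voltage and clocking scheme, this quantity scales with the capacitance switched, which is why the rest of the paper can work with switched capacitance alone.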
Dynamic power consumption is strongly dependent on the stream of inputs applied to the circuit [17]. Without any information about the input stream, it is impossible to accurately estimate the power consumption of a design. Thus, for a power estimation technique it is necessary to provide actual or statistical information about the input behavior.
Different power estimation techniques make different assumptions about the input vectors. These techniques are based on statistical, stochastic, probabilistic, or analytical approaches. For any technique two broad steps can be identified: (1) Characterization of the circuit components for power and storing relevant information about the components in the form of statistical models, parameterized tables, equations, etc. This is usually done only once for all the components used in the circuit. (2) Estimation of average power for a given design by combining the input behavior information specific to a design with the module library information using a statistical, stochastic, probabilistic or analytical approach or a mix of these approaches.
In our approach for power estimation at the RT level, the input vector behavior is indirectly specified by the user by providing a sequence of typical input vectors, known as the profiling stimuli. These vectors denote typical usage of the digital system being synthesized. These vectors are used to simulate the behavior level specification, during which event activities of various behavior level operators and carriers are monitored and recorded. Collectively this information is known as the profile data.
For a given set of inputs to a digital circuit, the capacitance switched in the circuit is a measure of the power consumed by the circuit. We adopt this indirect approach for power estimation. Thus, in this paper, we use "power" and "switched capacitance" synonymously.
The module library is precharacterized for average switched capacitance per input vector as explained in detail in Section 5. RT level designs contain three subunits: datapath, controller, and interconnect. Detailed procedures to estimate the switched capacitances in each of these units are presented in Section 6.

BEHAVIORAL PROFILING
The concept of profiling a given program to gather various statistics is not new. A well-known technique for measuring program performance is to insert monitoring code into the program and execute the modified program. Program profiling counts the number of times each basic block is executed and the number of times each control-flow path is traversed. Profiling is widely used to measure instruction set utilization, identify program bottlenecks and estimate program execution times for code optimization [25,26,27,28,29]. Techniques to insert monitoring code to optimally and efficiently profile programs exist in the literature [30,31,32].
Behavioral level profiling is similar to program profiling. For profile data to make sense in the case of high level synthesis, one needs to understand the correspondence between the constructs (variables, operations, loops etc.) in the behavior representation and the elements in the resulting hardware. Understanding this correspondence helps in determining the data to be gathered during profiling. The profiling strategy is mainly dependent on how different synthesis tasks go about synthesizing the target design.
Consider the behavior description written in VHDL as shown in Figure 1. One possible RTL data path synthesized from the specification is shown in Figure 2. The correspondences between elements of the specification and the elements of the RTL design are also shown by the line number annotations. Each register is associated with a carrier in the description; for example, register a corresponds to the port a in the specification.
The profile data obtained by behavioral profiling should indicate the usage of different hardware elements. For example, the profile data of an assignment statement in the behavioral description gives an estimate of the excitation frequency of the corresponding path in the hardware. In our example, if line number (18) has a profile data of 10, it means that the corresponding path from the output of the subtracter through the multiplexor to the input of the register c is excited ten times.
RTL designs generated by high level synthesis systems contain temporary registers and interconnect units which have no direct correspondence with constructs in the behavior level specification.
Profile data for such RTL components, which do not explicitly appear in the specification, has to be calculated by some indirect means.
In order to profile a behavioral specification, the profiler inserts monitoring code in the specification. This code typically declares, initializes and increments various counters to measure various types of event activity. The modified program is then simulated to determine the profile data.
The behavior profiler takes the CDFG representation of the specification and generates an equivalent VHDL program with probes (counters and similar monitoring variables) to gather various event activities. We need to profile the CDFG rather than the original specification since the CDFG representation exposes all the operations and carriers (edges in the CDFG) that will be bound to hardware resources.
The generated VHDL program is simulated using input vectors, called the profiling stimuli, supplied by the user. Profiling stimuli should represent typical usage of the design being synthesized. Since profiling stimuli will decide the event activity in the design, the user should take extreme care in preparing this data. Some suggestions on how to prepare profiling stimuli for different classes of designs are given in [11].
For the given profiling stimuli, the profile data of the specification constitutes the following information associated with the CDFG nodes and edges.
The event activity of a CDFG node op is the number of times that node is executed and is denoted by $E_{op}$. The transaction activity of an edge e is the number of times the edge is traversed during the execution and is denoted by $T_e$. The event activity of an edge e is the number of times the value on the edge changed and is denoted by $E_e$.
Note that $E_e \le T_e$, since a value can change on an edge only when the edge is traversed. Probes are inserted by the profiler to measure the profile data.
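The distinction between transaction activity and event activity on an edge can be sketched in Python as follows. The function name `edge_activity` is hypothetical, and the sketch assumes that the first value written to an edge counts as an event, which the text does not state explicitly.

```python
def edge_activity(values):
    """Return (transaction_activity, event_activity) for one CDFG edge.

    values: the sequence of values carried by the edge, one entry per traversal.
    Assumption: the first traversal counts as an event because the edge
    has no previous value to compare against.
    """
    transactions = len(values)
    events = 0
    previous = None
    for index, value in enumerate(values):
        if index == 0 or value != previous:
            events += 1
        previous = value
    return transactions, events


# Hypothetical trace: six traversals, three distinct value changes.
trace_counts = edge_activity([3, 3, 5, 5, 5, 2])
```

By construction the event count can never exceed the transaction count, which is the inequality noted above.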
5. POWER CHARACTERIZATION OF RTL MODULES AND PLAs

Module Library Characterization
The RTL module library contains parameterized modules such as n-bit registers, n-bit adders and n-bit m-to-1 multiplexors. Modules are parameterized with respect to the bit-width of each input and, where applicable, the number of inputs. For each module in the library, its interface description and parameters such as area, delay and average intrinsic switched capacitance (ISC) characteristics are stored in the library. The area, delay and ISC characteristics are expressed as a function of parameter variables such as bit-width, word length etc. and are in the form of either equations or tables. If the data cannot be fit into an equation, then it is stored as a table. For tables, linear interpolation or extrapolation is assumed whenever the characteristic is not available for a given value of the parameter variable.
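A table lookup with linear interpolation and extrapolation, as described above, might look like the following Python sketch. The function name `isc_from_table` and the table values are hypothetical; the sketch assumes a single parameter (bit-width) and at least two table entries with distinct bit-widths.

```python
def isc_from_table(table, width):
    """Look up ISC for a bit-width, interpolating or extrapolating linearly.

    table: list of (bit_width, isc) pairs with distinct bit-widths.
    """
    points = sorted(table)
    if len(points) < 2:
        raise ValueError("need at least two table entries")
    # Pick the first segment whose right end covers the width. If the width
    # is below the table, the first segment is used; if it is above, the loop
    # finishes and the last segment is used (extrapolation in both cases).
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if width <= x1:
            break
    return y0 + (y1 - y0) * (width - x0) / (x1 - x0)


# Hypothetical ISC table for illustration only.
example_table = [(4, 1.0), (8, 2.0), (16, 4.5)]
isc_12_bit = isc_from_table(example_table, 12)
```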
For a given library module, area, delay and ISC values are determined by generating layouts for different parameter values. Linear regression by the method of least squares is used to find an equation which determines the area, delay or ISC characteristic given the bit-width parameter value.
If the standard error is too high, then the data is entered as a table, assuming the use of linear interpolation between the data points. Determination of area and delay parameters for layout instances is straightforward. Area can be directly measured from the layout and delay can be determined through simulation or a timing analysis program such as Crystal [34]. Determination of ISC, which depends on input patterns, is more involved and is described below.
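The choice between an equation and a table can be sketched in Python as follows. The names `fit_line` and `characterize` and the threshold `max_std_err` are hypothetical; the text does not say what value of standard error counts as too high, so the threshold is left as a caller-supplied assumption.

```python
import math


def fit_line(points):
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept, std_err)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    # Standard error of the regression with n - 2 degrees of freedom.
    std_err = math.sqrt(residual / (n - 2)) if n > 2 else 0.0
    return slope, intercept, std_err


def characterize(points, max_std_err):
    """Store an equation if the fit is good enough, otherwise fall back to a table."""
    slope, intercept, std_err = fit_line(points)
    if std_err <= max_std_err:
        return ("equation", slope, intercept)
    return ("table", sorted(points))


# Hypothetical measurements: the first set is exactly linear, the second is not.
linear_result = characterize([(4, 10.0), (8, 18.0), (16, 34.0)], max_std_err=0.1)
table_result = characterize([(1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0)], max_std_err=0.1)
```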
We define the average intrinsic switched capacitance (ISC) of a module instance as the average capacitance that is expected to switch when an input event (change of logic values on the input lines) takes place. The ISC of a module instance is determined by extracting a switch level model from its layout, simulating the switch level model using a very long stream of randomly generated input patterns, and monitoring the capacitance switched per pattern until convergence occurs as discussed below. The capacitance measurements are carried out by IRSIM-CAP [37], which is a version of the IRSIM [38] switch level simulator modified for better capacitance measurements.
Let $C_k$ be the total capacitance charged after applying k random input patterns without reinitialization between successive patterns. $Z_k = C_k / k$ denotes the average capacitance per input pattern after applying k patterns. $\delta_k = |Z_k - Z_{k-1}| / Z_{k-1}$ denotes the variation in the average capacitance between the (k-1)th and kth patterns. We continue to apply random input patterns until $\delta_k$ remains less than 0.001 over 1000 consecutive input pattern applications. At this point we say that the average switching capacitance estimation has converged and accept the value of $Z_k$ after the last input pattern is applied. This value is the ISC of that instance of the module. A similar procedure is used to determine the ISC of various instances of the module and the results are expressed as an equation or table. Figure 3 shows the ISC characteristics of a library module. Figure 4 shows ISC plots with respect to the bit-width for three modules, namely, adder, register and two-input multiplexor. Table I shows the ISC characteristics of some PDSS library modules. For the RAM component, there are two parameters, namely, the select size and the word size. The ISC value shown for RAM is the average capacitance switched for either a Read or a Write operation.
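The convergence loop described above can be sketched in Python as follows. The callback `switched_cap` is a hypothetical stand-in for the switch level simulator; the toy model `toggled_bits_model`, which charges a fixed capacitance per toggled input bit, is an assumption made only so that the sketch runs on its own.

```python
import random


def characterize_isc(switched_cap, num_inputs, tol=0.001, window=1000,
                     seed=0, max_patterns=1_000_000):
    """Apply random patterns until delta_k < tol for `window` consecutive patterns.

    switched_cap(prev_pattern, pattern) returns the capacitance switched by
    moving from one input pattern to the next (hypothetical simulator hook).
    Returns (Z_k, k) at convergence.
    """
    rng = random.Random(seed)
    total = 0.0          # C_k: total capacitance switched so far
    prev_avg = None      # Z_{k-1}
    stable = 0           # consecutive patterns with delta_k below tol
    prev_pattern = rng.getrandbits(num_inputs)
    for k in range(1, max_patterns + 1):
        pattern = rng.getrandbits(num_inputs)
        total += switched_cap(prev_pattern, pattern)
        prev_pattern = pattern
        avg = total / k  # Z_k
        if prev_avg is not None and prev_avg > 0 and abs(avg - prev_avg) / prev_avg < tol:
            stable += 1
        else:
            stable = 0
        if stable >= window:
            return avg, k
        prev_avg = avg
    raise RuntimeError("ISC estimate did not converge")


def toggled_bits_model(prev_pattern, pattern):
    """Toy model (assumption): 10 units of capacitance per toggled input bit."""
    return bin(prev_pattern ^ pattern).count("1") * 10.0


toy_isc, patterns_used = characterize_isc(toggled_bits_model, num_inputs=8)
```

With uniformly random 8-bit patterns, the toy model toggles four bits on average, so the converged value should lie close to 40 units.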

PLA Characterization
The controller is a finite-state machine which we assume is implemented as a PLA structure. The PLA structure consists of an input plane, an output plane, and I/O buffers. We assume that the PLA is implemented using dynamic CMOS with pre-charged product and output lines [19]. The product and output lines are selectively discharged based on the input conditions and are controlled by two non-overlapping clocks.
A PLA is characterized by three parameters: (1) the input size, $I$; (2) the output size, $O$; and (3) the number of states, $S$. The ISC for any controller is a function of these parameters. By varying $I$, $O$ and $S$, random PLAs are generated and characterized as follows: The switch level model of the controller, extracted from the layout, is simulated using random input vectors. Simulation is carried out until the capacitance switched per clock step (as opposed to per input pattern in the case of the modules in the library) converges in a fashion similar to the one described for module library characterization.

PDSS (Fig. 6) accepts specifications in a behavioral subset of VHDL and user-specified constraints in terms of clock period and area. It generates an RT level design satisfying the given constraints. PDSS consists of four main modules: scheduling and performance estimation, register optimization, interconnect optimization and controller generation. A more detailed discussion of PDSS appeared in [11,33].
The RT level design produced by PDSS consists of four major subunits from the power estimation viewpoint: Datapath, Controller, Interconnect and System Clock. The power consumed in the design is given by

$$P_{design} = P_{dp} + P_{con} + P_{inter} + P_{clock}$$

where $P_{dp}$, $P_{con}$, $P_{inter}$ and $P_{clock}$ are the power consumed in the datapath, controller, interconnect and system clock respectively.
Our RT level power estimator needs the following inputs, as shown in Figure 6:
1. Profile Data: This is obtained from the behavioral profiling of the high level specification given as input to PDSS. For each operator and each edge in the CDFG, a count of the total event activity that occurred on the operator/edge is recorded.
2. Binding Information: One of the synthesis tasks is to bind each operator and edge in the CDFG to an instance of one of the modules in the module library. It also binds the temporary variables introduced during synthesis to hardware registers in the module library.
3. Module Library: The module library is precharacterized for ISC as explained in Section 5.
4. RT Level Output: This is the structural implementation containing instantiations of modules from the module library. The controller is a finite state machine description. The datapath and controller interact with each other to form the entire design.
To estimate the average power of a given RT level design, the power estimator goes through the following phases: (1) Pre-processing stage; (2) Profile data computation of hardware resources introduced during synthesis; and (3) Power estimation of the design.

Pre-processing Stage
The power estimator is initialized with the ISC values of all the modules obtained from the library characterization. The binding information provided by the synthesis tool is used to build a list of instances (inst_list) of modules. Each instance is initialized with the sum of the profile data of all the operators (or edges) in the CDFG which are bound to that particular instance. Note that the profile data of some instances is not known, as they are introduced during synthesis. The profile data of such instances is computed in the next phase.
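The pre-processing stage can be sketched in Python as follows. The function name `init_instance_profiles`, the dictionary layout and the instance names are hypothetical; instances with no bound operator or edge are marked with `None` to indicate unknown profile data.

```python
def init_instance_profiles(instances, binding, activity):
    """Build the initial profile data for every module instance.

    instances: names of all module instances in the RT level design.
    binding:   maps each CDFG operator or edge to the instance it is bound to.
    activity:  maps each CDFG operator or edge to its measured event activity.
    Instances introduced during synthesis have no binding and stay None.
    """
    profile = {inst: None for inst in instances}
    for element, inst in binding.items():
        profile[inst] = (profile[inst] or 0) + activity[element]
    return profile


# Hypothetical design: two additions share one adder, and a temporary
# register and a multiplexor were introduced by synthesis.
initial_profile = init_instance_profiles(
    instances=["adder0", "reg_a", "tmp_reg0", "mux0"],
    binding={"add1": "adder0", "add2": "adder0", "a": "reg_a"},
    activity={"add1": 10, "add2": 5, "a": 7},
)
```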

Profile Data Computation
Algorithm Compute_profile() in Figure 9 is used to compute the profile data of the temporary registers and the interconnect units introduced during the synthesis.
Procedure Build_dependency_list() builds a dependency list of the instances. It goes through each instance inst in the instance list inst_list and, if the profile data of the instance is unknown, adds the instances at its inputs to inst's dependency list.
Consider Figure 7, in which there is a feedback from the output of the multiplexor inst(j) to the input of the register inst(i). Such a configuration gives rise to dependency cycles. Procedure Remove_cycles() removes such dependency cycles in the following way. Let two instances i and j be in a dependency cycle. If the profile data on all inputs of an instance, other than the input which gives rise to the dependency cycle, is known, then we say that the profile data of that instance is known. Otherwise, the profile data of the instance is said to be unknown. The following three possibilities can occur:
Case 1: The profile data of both instances is known. The profile data of each of the instances is equal to the sum of the profile data of both instances.
Case 2: The profile data of only one of the instances (say i) is known. We remove the edge from j to i. Assuming that instance j is not in a dependency cycle with any other instance, the profile data of j can be computed, which is the sum of the profile data of all the instances (including i) at its inputs. Since there was an edge from j to i, instance i has event activity from the output of instance j. Thus the new profile data of i is the old profile data plus the computed profile data of instance j.
Case 3: The profile data of both instances is not known. Both the edges in the cycle are removed, and the profile data of i and j are computed based on the profile data of the instances at their other inputs. The new profile data of each instance is the sum of the profile data of both instances.
FIGURE 7 An example of dependency cycle.
To illustrate Case 1, consider Figure 8. Two instances inst(i) and inst(j) are in a dependency cycle. The profile data on inputs A and B of inst(i) are 10 and 20 respectively. Similarly, the profile data on the inputs of inst(j), namely C and D, are 15 and 12 respectively. We make a conservative assumption that the inputs of a multiplexor are not switching simultaneously. Thus, the profile data on the output of a multiplexor is the sum of the profile data on all the inputs. The equations to compute the profile data on the outputs of both instances are:

$$P(X) = P(Y) + P(A) + P(B)$$
$$P(Y) = P(X) + P(C) + P(D)$$

where P(X) is the profile data on the output of inst(i) and P(Y) is the profile data on the output of inst(j). P(X) appears on the right hand side of the P(Y) equation and vice versa. The above set of equations cannot be solved unless we remove the dependency cycle. Since P(A), P(B), P(C) and P(D) are known, the example belongs to Case 1 as discussed above. With the dependency cycle removed, the profile data of X and Y are P'(X) = P(A) + P(B) = 30 and P'(Y) = P(C) + P(D) = 27. With the dependency cycle included, the new profile data for both X and Y are P(X) = P(Y) = P'(X) + P'(Y) = 57. If P(A) or P(B) is unknown to start with, then the example belongs to Case 2. If P(A) or P(B), and P(C) or P(D), are not known, then the example belongs to Case 3.
FIGURE 8 An example to illustrate profile data computation in presence of a dependency cycle.
After removing the cycles, for each instance i whose profile data is unknown, it is calculated as the sum of the profile data of all the instances at i's inputs. After computing the profile data for all the instances, the data path power can be computed as follows.
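The Case 1 example can be replayed with a short Python sketch. The function names `compute_profile` and `merge_cycle` and the dictionary layout are hypothetical; the sketch assumes that the cycle edges have already been removed from the dependency lists and that the remaining dependencies are acyclic.

```python
def compute_profile(profile, dependencies):
    """Fill in unknown profile data as the sum of the profile data at the inputs.

    profile:      maps instance name to its profile data, or None if unknown.
    dependencies: maps an instance name to the instances at its inputs,
                  with dependency-cycle edges already removed (assumption).
    """
    def resolve(name):
        if profile[name] is None:
            profile[name] = sum(resolve(dep) for dep in dependencies.get(name, []))
        return profile[name]

    for name in profile:
        resolve(name)
    return profile


def merge_cycle(profile, inst_i, inst_j):
    """Cases 1 and 3: both instances in the cycle get the sum of their profile data."""
    total = profile[inst_i] + profile[inst_j]
    profile[inst_i] = profile[inst_j] = total
    return profile


# Values from the example: P(A)=10, P(B)=20, P(C)=15, P(D)=12.
example_profile = {"A": 10, "B": 20, "C": 15, "D": 12, "X": None, "Y": None}
acyclic_deps = {"X": ["A", "B"], "Y": ["C", "D"]}
compute_profile(example_profile, acyclic_deps)
without_cycle = (example_profile["X"], example_profile["Y"])
merge_cycle(example_profile, "X", "Y")
with_cycle = (example_profile["X"], example_profile["Y"])
```

Here `without_cycle` corresponds to P'(X) and P'(Y), and `with_cycle` to the final P(X) and P(Y).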

Power Estimation in Data Path
The data path consists of execution units such as adders and multipliers and storage units such as latches and shift registers. The power consumed by the datapath, $P_{dp}$, is computed by lines 2-4 of the procedure Estimate_Power(). $P_{dp}$ is given by:

$$P_{dp} = \sum_{op} E_{op} \times ISC_{op}$$

where $E_{op}$ is the event activity (or profile data) of the operator (or register) and $ISC_{op}$ is the average switched capacitance value of the hardware module instance to which the operator node op is bound.
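The datapath sum can be sketched in Python as follows. The function name and the instance values are hypothetical; each instance is represented by its profile data and the ISC value looked up from the module library.

```python
def datapath_switched_capacitance(instances):
    """Return the sum over instances of event activity times ISC.

    instances: iterable of (event_activity, isc) pairs, one per datapath instance.
    """
    return sum(events * isc for events, isc in instances)


# Hypothetical instances: (profile data, ISC in arbitrary capacitance units).
example_datapath = [(30, 1.5), (57, 0.8)]
example_c_dp = datapath_switched_capacitance(example_datapath)
```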

    Algorithm Compute_profile()
    begin
      T = Build_dependency_list();
      Remove_cycles(T);
      for each I in inst_list do
        for each J in I.dependency_list do
          I.profile_data += J.profile_data
        end for
      end for
    end

FIGURE 9 Algorithm for the computation of profile data.

Power Consumed by System Clock
As the system clock controls all the clocked components in the data path, it is loaded by a large amount of capacitance. The power consumed by the system clock is estimated in Algorithm Estimate_Power(), shown in Figure 10. Lines 8-10 estimate the load capacitance Cclock on the system clock. In clocked components such as registers and latches, the load capacitance on the clock line is approximately 50 fF per bit of width. Thus, the total capacitive load on the system clock is the sum of the clock capacitances of all instances. The total capacitance switched by the clock in a design is given by the product of the number of input vectors (Nv), the total number of clock cycles required to process an input vector (Ttotal) and the clock capacitance (Cclock).
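The clock estimate can be sketched as follows, assuming the roughly 50 fF per bit figure quoted in the text applies uniformly to every clocked instance. The function names and the example bit widths, vector count and clock step count are hypothetical.

```python
# Illustrative sketch; names and example values are hypothetical.

CLOCK_CAP_PER_BIT_FF = 50  # approximate value quoted in the text, in fF


def clock_load_ff(clocked_bit_widths):
    """Sum the clock-line capacitance of every clocked instance."""
    return sum(CLOCK_CAP_PER_BIT_FF * width for width in clocked_bit_widths)


def clock_switched_capacitance_ff(n_vectors, t_total, c_clock_ff):
    """Product Nv * Ttotal * Cclock described in the text."""
    return n_vectors * t_total * c_clock_ff


# Three registers of 8, 8 and 16 bits: 32 bits in total.
c_clock = clock_load_ff([8, 8, 16])
print(c_clock)  # 1600

# 100 input vectors, 5 clock steps per vector.
p_clock = clock_switched_capacitance_ff(100, 5, c_clock)
print(p_clock)  # 800000
```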

Power Estimation in Controller
The controller is a finite state machine implemented as a PLA. Any PLA is characterized by three parameters: the number of inputs, I, the number of outputs, O, and the number of states, S. The module library already contains a PLA characterization table, which was explained in detail in Section 5. From the table, we can obtain the average intrinsic switching capacitance (ISC) of a PLA of a given size. Interpolation or extrapolation is used wherever values are not available for a given set of parameter values. The ISC value so obtained is the average capacitance that switches per clock step in a PLA of size (I, O, S).
    14. Let I, O and S be the controller size.
    15. Ttotal = Estimate_clock_steps()
    16. Ccon = ISC(I, O, S)
    17. Pcon = Nv * Ttotal * Ccon
    18. Pclock = Nv * Ttotal * Cclock
    20. Ptotal = Pdp + Pcon + Pinter + Pclock
    end

Let Nv be the total number of profiling stimuli applied, and let the CDFG be scheduled in Nc control steps. In the module library, the number of clock steps needed to process an input vector is stored for each module as a function of its parameters, such as bit-width and word size. The total number of clock steps required to process an input vector is the sum, over all control steps, of the maximum number of clock steps needed in each control step.
Algorithm Estimate_clock_steps() estimates the number of clock steps needed by the design to process an input vector (Ttotal). The power consumed in the controller is given by the product of Nv, Ttotal and ISC(I, O, S), as given in the algorithm shown in Figure 10.
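The two steps described above, summing the per-control-step maxima to obtain Ttotal and then forming the product with Nv and the PLA's ISC, can be sketched in Python. The representation of a schedule as a list of lists, the function names, and all numeric values are assumptions for illustration; the lookup and interpolation of the ISC value in the characterization table are not modeled here.

```python
# Illustrative sketch; names, schedule layout and values are hypothetical.

def estimate_clock_steps(control_steps):
    """control_steps: one list per control step, holding the number of
    clock steps needed by each module active in that control step."""
    return sum(max(step) for step in control_steps)


def controller_switched_capacitance(n_vectors, t_total, isc_pla):
    """Product Nv * Ttotal * ISC(I, O, S) described in the text."""
    return n_vectors * t_total * isc_pla


# Three control steps; the slowest module in each needs 2, 1 and 3 clocks.
t_total = estimate_clock_steps([[1, 2], [1], [3, 1]])
print(t_total)  # 6

# 100 profiling stimuli and an assumed ISC of 1.5 (arbitrary units).
p_con = controller_switched_capacitance(100, t_total, 1.5)
print(p_con)  # 900.0
```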

Power Estimation in Interconnect
The profile data for the interconnect units has already been calculated in the profile data computation phase. The interconnect units are not present in the CDFG; they arise from operator sharing, register sharing and interconnect sharing. In this work, we consider only multiplexor-based designs.
The profile data of a multiplexor is computed as the sum of the event activities on all its inputs. This is a conservative estimate of the total number of events to which the multiplexor is subjected.
The power consumed in the interconnect is calculated in the same way as for the data path. It is given by:

$$P_{inter} = \sum_{i} E_i \cdot \mathrm{ISC}(\mathrm{MUX},\, i.\mathrm{size})$$

where i is an instance of a multiplexor of size i.size, Ei is its event activity, and ISC(MUX, i.size) is the average intrinsic switching capacitance of a multiplexor of size i.size.
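A short Python sketch of the interconnect sum follows. Each multiplexor's activity is taken as the sum of its input activities, as stated in the text, and is then weighted by the ISC for its size. The function names, the size-indexed dictionary standing in for the module library, and the numeric values are hypothetical.

```python
# Illustrative sketch; names and library values are hypothetical.

def mux_activity(input_activities):
    """Conservative estimate: inputs are assumed not to switch together."""
    return sum(input_activities)


def interconnect_switched_capacitance(muxes, isc_mux):
    """muxes: list of (size, input_activities) pairs.
    isc_mux: maps multiplexor size to its ISC value."""
    return sum(mux_activity(acts) * isc_mux[size] for size, acts in muxes)


isc_mux = {2: 0.25, 4: 0.5}  # assumed ISC per multiplexor size
muxes = [(2, [10, 20]), (4, [15, 12, 3, 0])]
p_inter = interconnect_switched_capacitance(muxes, isc_mux)
print(p_inter)  # 22.5
```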

RESULTS
In this section we present experimental results for six designs. Table III shows the behavioral specification data. The PDSS system is implemented in C++ on Sun Sparcstation platforms.
Each register level design produced by PDSS is processed by the Lager IV silicon compiler [36] to generate mask layouts. The designs generated use a two-phase non-overlapping clocking scheme. Although the designs are generated in a scalable CMOS technology, all results for this paper are obtained using a 2 micron feature size. Switch level models are extracted from the layouts and simulated using the IRSIM-CAP [37] switch level simulator. Table IV shows the synthesized design data at the layout level.
Table V shows the estimated and actual powers in the data path and interconnect of the six designs. The estimated power is computed at the RT level and the actual power is determined by switch level simulation of the synthesized designs.
As shown in the table, the percentage error in estimation for the data path is in the range 2.51%-12.58%, with the average deviation from the actual value being 6.25%.
Table VI shows the comparison of powers for the controller. The estimation error is in the range 3.53%-15.22%, with an average deviation of 10.51%. Table VII shows the comparison of the power dissipated due to the system clock. The estimation error is in the range 18.59%-30.69%, with an average deviation of 22.32%. Table VIII shows the power values for the entire design, which is the sum of the power in the data path (Pdp), interconnect (Pinter), system clock (Pclock), and controller (Pcon). The percentage error is in the range 4.26%-9.35% and the average deviation is 6.51%. This shows reasonable correlation between the estimated and actual values, not only for the entire design but also for the data path and controller separately.
Electronics Directorate of the Wright Laboratory of the US Air Force under contract number F33615-91-C-1811 and by the Advanced Research Projects Agency under order no. 7056, monitored by the Federal Bureau of Investigation under contract no. J-FBI-89-094.

DISCUSSION AND CONCLUSIONS
The following are some of the factors which have not been taken into account during the power estimation:
1. Effect of placement and routing.
2. The random characterization of the RTL module library and PLAs gives rise to an inherent estimation error. This can be remedied by taking the activity on the inputs into account in the estimation procedure.
3. Glitch power consumption.
4. PLA characterization based only on inputs, outputs and states is not sufficient. The state table information has to be taken into account.
5. In the estimation of power in multiplexors, we assumed that the activities on the inputs add up to give the activity of the multiplexor. This rests on the very conservative assumption that the inputs do not switch simultaneously, and it is another source of error.
In this work, we presented an accurate power estimation technique based on profile data obtained at the behavior level. The estimation technique is implemented in the framework of a high level synthesis system. Compared to estimation techniques at lower levels of abstraction, the technique is faster in execution time. For the six examples considered, the estimation error at the design level is within 10% (6.51% on average), which demonstrates that the estimation technique is reliable.
FIGURE 1 A Behavioral Specification in VHDL.

FIGURE 2 An RTL Data path Synthesized from the Specification in Figure 1.

FIGURE 5 PLA Characterization with size of Inputs I = 5.

TABLE II
A portion of the PLA Characterization table

TABLE V
Comparison of the power in the Data path and Interconnect

TABLE VIII
Comparison of the total power for the Entire Design