An Integrated Approach to Data Path Synthesis for Behavioral-level Power Optimization

This paper presents an integrated approach to data path synthesis which solves three important design problems: scheduling, allocation, and hardware partitioning, with power minimization as a key design objective. Based on the rules of thumb introduced in prior work on synthesis for low power, we derive an integer programming formulation for solving the problems. Based on this formulation, we then develop an efficient algorithm which performs scheduling, allocation and hardware partitioning simultaneously, so that their effects on power consumption are exploited more fully and effectively. Our experimental results show that the algorithm is effective: relative to designs optimized for hardware cost alone, it reduces power consumption by 43.4% on average and by up to 52.6%.


INTRODUCTION
Power consumption in VLSI circuits has become an important consideration in circuit design in recent years [1]. In many application domains, we need to use low-power circuits in order to lower the packaging and cooling costs and to extend the battery life. In designing low-power circuits, a number of techniques for reducing internal power dissipation have been proposed. These techniques often focus on reducing the dominant term in the equation for power dissipation in CMOS digital circuits [1]:

$$P = C_L V_{DD}^2 f_p \qquad (1)$$

where $P$ denotes the power dissipated in charging and discharging the output capacitive load $C_L$, $V_{DD}$ is the supply voltage and $f_p$ is the output switching frequency. One way to reduce power consumption is to lower the total capacitive load $C_L$. In general, there are different types of functional modules which perform the same computation but have different areas, speeds, capacitive loads and power consumptions. Consequently, for power-efficient design it is important to make suitable choices among the different types of functional modules available. Another way to reduce power consumption is to reduce $f_p$ by inhibiting unnecessary circuit switching activity. For example, in the design of the PowerPC 603 the clock signal to a module was disabled during periods of inactivity in the module, thereby inhibiting switching [2].
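The dependence of the dominant power term on its three factors can be checked numerically. The following is a minimal sketch, treating the relation (capacitive load times supply voltage squared times switching frequency) as an equality; the operating point (10 pF, 3.3 V, 50 MHz) is a made-up assumption for illustration, not a figure from this paper.

```python
def dynamic_power(c_load_farads, v_dd_volts, f_switch_hz):
    """Dominant dynamic power term: P = C_L * V_DD^2 * f_p (in watts)."""
    return c_load_farads * v_dd_volts ** 2 * f_switch_hz

# Hypothetical operating point: 10 pF load, 3.3 V supply, 50 MHz switching.
baseline = dynamic_power(10e-12, 3.3, 50e6)
# Suppressing half of the clock pulses halves the effective switching frequency.
gated = dynamic_power(10e-12, 3.3, 25e6)
print(f"baseline = {baseline * 1e3:.3f} mW, gated = {gated * 1e3:.3f} mW")
```

Because the term is linear in both the capacitive load and the switching frequency, module selection and clock gating each reduce it proportionally.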
Previous research efforts [3][4][5] in high level synthesis have mostly focused on speed and/or area optimization using a global periodic clock signal and a single type of functional module for each operation. Much work on power optimization has focused on the logic level. The power trade-off between different types of adders and multipliers was studied in [6]. In [7], power was minimized by modifying the function of each node in the circuit. In [8], a re-encoding technique using gated clocks was employed to reduce power in sequential circuits. A high-level synthesis system, HYPER-LP, presented in [9] uses a variety of architectural and arithmetic transformations to optimize the power dissipation.
In this paper, based on the observations from the prior work on synthesis for low power [6, 8, 9, 11], we design a new high-level synthesis algorithm which performs the tasks of scheduling, allocation, and hardware partitioning in an integrated fashion for low power design. Specifically, for a given unscheduled data flow graph, we are to (1) select functional modules from a given general library, (2) schedule the operations on the selected functional modules so that groups of functional modules with similar activity patterns can be deactivated, and (3) allocate registers to the variables and partition the registers so that groups of registers can be deactivated. Our objectives are to minimize the total hardware cost as well as power consumption. Our algorithm employs the following techniques to reduce power consumption: functional module selection, selective shutdown of functional modules, and selective shutdown of registers.

Functional Module Selection
In general, an operation can be executed on one of several different types of functional modules which have different execution times, areas and power consumptions. For example, Table I shows that a 32-bit addition operation can be executed on a ripple-carry adder (RCA) in 20 ns which consumes 22.7 mW or on a carry lookahead adder (CLA) in 10 ns which consumes 37.3 mW, and a 32-bit multiplication operation can be executed on a Booth multiplier (BOOTH) in 160 ns which consumes 84.0 mW or on an array multiplier (ARR) in 100 ns which consumes 295.6 mW [10]. To reduce power consumption, utilization of functional modules which consume less power is clearly desirable.

As an example, Figure 1 shows a given unscheduled data flow graph. Suppose the duration of a control step is 120 ns, so that to carry out a multiplication operation BOOTH takes 2 control steps and ARR takes 1 control step. Schedule A in Figure 2 shows a schedule when only ARR is available. Schedule B in Figure 3 shows a schedule when both ARR and BOOTH are available. Schedule A uses 2 ARRs and consumes 4138.4 mW. On the other hand, Schedule B uses 1 ARR and 2 BOOTHs, and consumes only 2657.2 mW. Consequently, Schedule B yields a 35% saving in power consumption.
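The control-step counts and the saving quoted in this example follow from simple arithmetic. The sketch below reproduces them from the delays and power totals stated in the text; the function name `control_steps` is introduced here for illustration only.

```python
import math

CONTROL_STEP_NS = 120  # control-step duration assumed in the example

# Multiplier delays quoted in the text (ns)
delay_ns = {"BOOTH": 160, "ARR": 100}

def control_steps(delay, step=CONTROL_STEP_NS):
    """Number of whole control steps a module occupies for one operation."""
    return math.ceil(delay / step)

steps = {name: control_steps(d) for name, d in delay_ns.items()}

# Total power figures quoted for Schedule A and Schedule B (mW)
power_a, power_b = 4138.4, 2657.2
saving = (power_a - power_b) / power_a
print(steps, f"saving = {saving:.1%}")
```

The computed saving is slightly under 36%, which the text quotes as 35%.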

Selective Shutdown of Functional Modules
Recent studies [11] indicate that the clock signal in a digital circuit consumes between 15% and 45% of the total power. The clock signal is distributed globally to all functional modules and memory modules. However, in most cases, some portions of the circuit are inactive during some control steps (the functional modules are not performing any active computation) and the clock signal is not needed for these functional modules during these control steps. Since every module is driven by the clock signal, there will be switching activity even in modules that are not performing any active computation, leading to unnecessary power consumption in both the clock signal and the modules. In CMOS circuits, a clocked functional module has registers at its inputs. (From now on, we assume all functional modules are clocked functional modules.) When the clock signal is on, the contents of the registers are loaded into the functional unit. In other words, if the clock signal is off during a time period, there will be no dynamic power consuming activity during that time period [11]. It is therefore desirable that a functional module be driven by a clock signal which is on only during control steps in which the functional module is active.
In order to save power in the clock signal as well as in the functional modules, we can regenerate the clock signal so that clock pulses are present only when they are needed. Figure 4 shows a clock regenerator that regenerates the clock signal using a clock gating signal which is designed according to the active/inactive control steps of the functional modules. (An active control step of a functional module is a control step in which the module is performing active computation. Similarly, an inactive control step is one in which the functional module is not performing active computation.) Note that power consumption is reduced both in the functional modules during the inactive control steps and in the clock signal when the corresponding clock pulses are suppressed.
From a power consumption point of view, each functional module should have its own regenerated clock signal. That is, a functional module will be deactivated in all the control steps in which it is not performing any computation. However, in this case, there will be too many distinct clock regenerators and the control logic for the clock regenerators will become very complicated. Correspondingly, there will also be too many distinct clock gating signals. Consequently, the clock regenerators and their control logic might become dominant factors in the total power consumption.
A more practical approach is to use the same regenerated clock signal for a group of functional modules so that not only will the number of clock regenerators be reduced, but also the control logic for clock gating will not become excessively complicated.
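The effect of sharing one regenerated clock among a group can be sketched as follows: the shared clock has to stay on in every control step in which at least one module of the group is active, so the gating signal is the logical OR of the individual activity patterns. The activity patterns below are hypothetical and are not taken from the figures.

```python
def gating_signal(active_steps_per_module, total_steps):
    """Clock-gating signal shared by a group of modules.

    The shared clock is on in every control step in which at least one
    module of the group is active. Control steps are numbered 1..total_steps.
    """
    on = set().union(*active_steps_per_module)
    return [step in on for step in range(1, total_steps + 1)]

# Hypothetical activity patterns for a group of two modules over 6 control steps.
group = [{1, 2}, {2, 3}]
shared = gating_signal(group, 6)
suppressed = shared.count(False)  # clock pulses suppressed for the whole group
print(shared, suppressed)
```

Grouping modules with similar activity patterns keeps the number of suppressed pulses close to what individual gating would achieve, while using a single regenerator.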
In our approach, we use a regenerated clock signal for each type of functional module. Figure 5 shows the clock regenerator logic used in our algorithm. Figure 6 shows the active and inactive control steps of ARR and BOOTH for Schedule B.

Selective Shutdown of Registers
In a dynamic register, information is stored in the form of electric charge which leaks gradually over time. Thus dynamic registers need to be refreshed periodically, usually in each control step. Consequently, power is consumed in both the refresh circuit and the clock signal. However, data stored in a register might become obsolete, and it is unnecessary to refresh a register once the data stored in it will no longer be used. As in the case of functional modules, by clock gating we can turn off both the refresh circuit and the clock signal that drives the refresh circuit during control steps in which a register no longer needs to be refreshed. Again, it is not practical for each register to have its own regenerated clock signal. Consequently, we shall partition the registers into groups and let each group be driven by its own regenerated clock signal.
To partition the registers, we first partition the variables into groups. Each group of variables is then assigned to registers, which form a group of registers. The partition is to be carried out in such a way that the total number of active control steps in the registers is minimized. (An active control step of a register is a control step in which the register contains the value of a live variable. Similarly, an inactive control step is one in which the register contains data which is obsolete.) Figure 7(a) shows the lifetimes of 9 variables from Schedule B in Figure 3. In Partition A, there is only one group of 3 registers. They have only one (common) inactive control step. Thus, the total number of active control steps in the registers is 15. In Partition B, there are 2 groups of registers, P1 and P2. In group P1, there are 2 registers and one (common) inactive control step. In group P2, there is 1 register and 2 (common) inactive control steps. The total number of active control steps in the registers is 14. In Partition C, there are also 2 groups of registers, P1 and P2. In group P1, there are 2 registers and 2 (common) inactive control steps. In group P2, there is 1 register and 4 (common) inactive control steps. The total number of active control steps in the registers is 10. In fact, Partition C achieves a saving of 33% in register power consumption compared with Partition A.
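The active-step totals for the three partitions can be recomputed from the register counts and inactive steps quoted above. This is a small sketch; the schedule length of 6 control steps is an assumption inferred from Partition A (3 registers, 1 common inactive step, 15 active steps).

```python
TOTAL_STEPS = 6  # assumption: inferred from Partition A (3 registers, 1 inactive step, 15 active)

# Each group: (number of registers, number of common inactive control steps), as quoted in the text.
partitions = {
    "A": [(3, 1)],
    "B": [(2, 1), (1, 2)],
    "C": [(2, 2), (1, 4)],
}

def active_steps(groups, total_steps=TOTAL_STEPS):
    """Sum over groups of registers * (steps in which the group's clock stays on)."""
    return sum(n_regs * (total_steps - inactive) for n_regs, inactive in groups)

totals = {name: active_steps(groups) for name, groups in partitions.items()}
saving_c_vs_a = 1 - totals["C"] / totals["A"]
print(totals, f"{saving_c_vs_a:.0%}")
```

The totals come out as 15, 14 and 10, and the reduction from 15 to 10 active steps is the 33% saving attributed to Partition C.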
The three techniques mentioned above are closely inter-related. Our algorithm embodies these techniques by performing the tasks of functional module selection, scheduling, allocation and partitioning simultaneously. We are given a general library which contains several types of functional modules for each type of operation, and the total number of control steps within which all operations in the data flow graph are to be executed. We are to (1) determine an execution schedule for the operations, (2) select the type of functional module for each operation, (3) partition the variables into groups, (4) allocate functional modules and registers and (5) determine the regenerated clock signal for each type of functional module and each group of registers. The hardware cost, the total power consumption, and the total number of active control steps in functional modules and registers are to be minimized. We propose a polynomial-time approximation algorithm for solving an integer programming (IP) formulation of this problem.
Previous research efforts [3][4][5] in high level synthesis have mostly focused on speed and/or area optimization using a global periodic clock signal and a single type of functional module for each operation. In [12, 13], the problem of selecting one type of functional module for each type of operation from a given library was studied. In [14], the scheduling problem for general libraries was studied, combining both integer linear programming (ILP) and list scheduling techniques. In [11], reduction in power consumption in registers using a partitioning approach was studied.

NOTATIONS
Our algorithm accepts as input a VHDL description of a data path. We are given the total number of control steps, $T$, within which execution is to be scheduled. From the VHDL description, a set of operations and a precedence relation over the operations, $\prec$, are derived. Let $\mathcal{O} = \{op_1, op_2, \ldots, op_N\}$ denote the set of operations. The relation $op_i \prec op_j$ means that the execution of $op_i$ must be completed before the commencement of the execution of $op_j$.
A library of functional module types is provided. Each type is specified by its cost, power consumption, execution time, and the operation types which can be executed on such functional modules. Let $\mathcal{L} = \{1, 2, \ldots, L\}$ denote the library of functional module types. For a type-$k$ functional module, let $e_k$ denote its execution time in terms of the number of control steps, which is an integer $\ge 1$, $c_k$ its cost, and $d_k$ its power consumption per control step.
The set of operations that can be executed on a type-$k$ functional module is denoted $\mathcal{O}(k)$, i.e., $\mathcal{O}(k) = \{op_i \mid op_i \text{ can be executed on a type-}k \text{ functional module}\}$. For a given operation $op_i$, let $\mathcal{H}(op_i)$ denote the set of types of functional modules on which $op_i$ can be executed, i.e., $\mathcal{H}(op_i) = \{k \mid op_i \text{ can be executed on a type-}k \text{ functional module}\}$. The time frame of an operation $op_i$, $[s_i, t_i]$, indicates the control steps within which the operation is to be executed in order to satisfy the timing constraints (so that execution of all operations will be completed within $T$ control steps).
Let $\mathcal{V} = \{var_1, var_2, \ldots, var_M\}$ denote the set of variables, where $var_i$ is the output produced by operation $op_i$. Let $\mathcal{C}(var_i)$ be the set of operations which have $var_i$ as an input, i.e., $\mathcal{C}(var_i) = \{op_j \mid var_i \text{ is an input to operation } op_j\}$. The value of each variable is stored in a register. We are given the cost of a register, $c_r$, and its power consumption per control step, $d_r$.
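The inputs described in this section (a module library, a set of operations, and a precedence relation) could be represented as in the following sketch. All names here (`ModuleType`, `LIBRARY`, `OPS`, `PRECEDES`, `module_types_for`, `ops_for`) are hypothetical; the per-step powers follow the figures quoted earlier in the text, while the cost values and the tiny data flow graph are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModuleType:
    name: str
    exec_steps: int      # execution time in control steps (an integer >= 1)
    cost: float          # hardware cost
    power: float         # power consumption per control step
    op_kinds: frozenset  # kinds of operation this module type can execute

# Hypothetical library, indexed by position k = 0, 1, 2.
LIBRARY = [
    ModuleType("ARR", 1, 10.0, 295.6, frozenset({"mul"})),
    ModuleType("BOOTH", 2, 6.0, 84.0, frozenset({"mul"})),
    ModuleType("RCA", 1, 2.0, 22.7, frozenset({"add"})),
]

# Operations with their kind, and precedence pairs (a, b): op a must finish before op b starts.
OPS = {1: "mul", 2: "mul", 3: "add"}
PRECEDES = {(1, 3), (2, 3)}

def module_types_for(op_kind, library=LIBRARY):
    """Indices k of the module types able to execute an operation of this kind."""
    return {k for k, m in enumerate(library) if op_kind in m.op_kinds}

def ops_for(k, ops=OPS, library=LIBRARY):
    """Operations executable on a type-k module."""
    return {i for i, kind in ops.items() if kind in library[k].op_kinds}

print(module_types_for("mul"), ops_for(2))
```

The two lookup functions are inverse views of the same compatibility relation between operations and module types.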

PROBLEM FORMULATION
The problem we want to solve is a very complex one. Our major contribution is an efficient approximation algorithm (Section 4) which is developed from an integer programming (IP) formulation of the scheduling, allocation and partitioning problem. In this section we provide the details of the formulation. For simplicity, we assume that the registers are to be partitioned into two groups. Our formulation can easily be extended to partitioning into more than two groups.
We first introduce the definitions of several sets of variables. First of all, for each operation $op_i$, we have a 0-1 integer variable $o_{ijkp}$, $1 \le i \le N$, $s_i \le j \le t_i - e_k + 1$, $k \in \mathcal{H}(op_i)$, $p \in \{1, 2\}$:

$$o_{ijkp} = \begin{cases} 1 & \text{if } op_i \text{ begins its execution at control step } j \text{ on a type-}k \text{ functional module and the output variable } var_i \text{ is stored in a register in Group-}p \\ 0 & \text{otherwise} \end{cases}$$

Second, for each variable $var_i$, we have a 0-1 integer variable $v_{ijp}$, $1 \le i \le M$, $s_i \le j \le t_i$, $p \in \{1, 2\}$:

$$v_{ijp} = \begin{cases} 1 & \text{if } var_i \text{ is alive at control step } j \text{ and is stored in a register in Group-}p \\ 0 & \text{otherwise} \end{cases}$$

Third, to determine the total number of functional modules of each type, we introduce the integer variables $f_{kj}$ and $F_k$, $1 \le k \le L$, $1 \le j \le T$. $f_{kj}$ is the number of type-$k$ functional modules used in control step $j$ and $F_k$ is the total number of type-$k$ functional modules used throughout all control steps.
Fourth, to determine the total number of registers in each group, we have two sets of integer variables, $r_{pj}$ and $R_p$, $p \in \{1, 2\}$ and $1 \le j \le T$. $r_{pj}$ is the number of variables which are alive at control step $j$ in Group-$p$. In other words, $r_{pj}$ is the number of registers needed at control step $j$ in Group-$p$. $R_p$ is the number of registers in Group-$p$, which is the maximum of $r_{pj}$ for $j = 1, \ldots, T$. It should be noted that the variables in Group-$p$ are guaranteed to fit into $R_p$ registers when the left edge algorithm [15] is used, since there are at most $R_p$ variables that are alive in each control step.
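The guarantee mentioned above can be illustrated with a small register-assignment sketch. The code below is a first-fit variant in the spirit of the left edge algorithm (variables are processed by increasing birth time and placed in the first free register); it is not taken from [15], and the lifetimes are hypothetical.

```python
def left_edge(lifetimes):
    """Assign variables to registers, processing them by increasing left edge.

    lifetimes maps a variable name to (first, last): the inclusive range of
    control steps in which the variable is alive. Returns {variable: register index}.
    """
    busy_until = []  # last occupied control step of each register
    assignment = {}
    for var, (first, last) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for reg, end in enumerate(busy_until):
            if end < first:              # register is free again: reuse it
                busy_until[reg] = last
                assignment[var] = reg
                break
        else:                            # no free register: open a new one
            busy_until.append(last)
            assignment[var] = len(busy_until) - 1
    return assignment

def max_alive(lifetimes):
    """Largest number of variables alive in any single control step."""
    steps = range(min(f for f, _ in lifetimes.values()),
                  max(l for _, l in lifetimes.values()) + 1)
    return max(sum(f <= j <= l for f, l in lifetimes.values()) for j in steps)

# Hypothetical lifetimes for five variables of one register group.
group = {"v1": (1, 2), "v2": (1, 3), "v3": (3, 4), "v4": (4, 6), "v5": (5, 6)}
regs = left_edge(group)
print(regs, max(regs.values()) + 1, max_alive(group))
```

In this example the number of registers used equals the maximum number of simultaneously live variables, which is the quantity that bounds the register count of a group.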
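Finally, to represent the active/inactive control steps of each type of functional module and each group of registers, we have the 0-1 integer variables $a_{kj}$ and $b_{pj}$, $1 \le k \le L$, $p \in \{1, 2\}$, $1 \le j \le T$:

$$a_{kj} = \begin{cases} 1 & \text{if any type-}k \text{ functional module is active at control step } j \\ 0 & \text{otherwise} \end{cases}$$

Similarly,

$$b_{pj} = \begin{cases} 1 & \text{if any register in Group-}p \text{ is active at control step } j \\ 0 & \text{otherwise} \end{cases}$$

We have the following constraints. First, each operation must be scheduled for execution exactly once. Thus, for each $i$ (corresponding to operation $op_i$), $1 \le i \le N$, we have

$$\sum_{k \in \mathcal{H}(op_i)} \; \sum_{j = s_i}^{t_i - e_k + 1} \; \sum_{p \in \{1, 2\}} o_{ijkp} = 1 \qquad (1)$$

Second, the precedence relations over the operations are represented as linear constraints. Let

$$\mathcal{F}(i, j) = \{ o_{ij'kp} \mid s_i \le j' \le j - e_k + 1,\ k \in \mathcal{H}(op_i),\ p \in \{1, 2\} \}$$

Thus,

$$\sum_{o \in \mathcal{F}(i, j)} o = \begin{cases} 1 & \text{if the execution of } op_i \text{ is completed on or before control step } j \\ 0 & \text{otherwise} \end{cases}$$

Similarly, let

$$\mathcal{Z}(i, j) = \{ o_{ijkp} \mid k \in \mathcal{H}(op_i),\ p \in \{1, 2\} \}$$

Thus,

$$\sum_{o \in \mathcal{Z}(i, j)} o = \begin{cases} 1 & \text{if the execution of } op_i \text{ is initiated at control step } j \\ 0 & \text{otherwise} \end{cases}$$

For two operations $op_a$ and $op_b$ such that $op_a \prec op_b$ we have the following constraints: for each $j$, $s_a \le j \le t_a - 1$,

$$\sum_{o \in \mathcal{F}(a, j)} o \;\ge\; \sum_{o \in \mathcal{Z}(b, j + 1)} o \qquad (2)$$

which ensures that the execution of $op_b$ is initiated at control step $j + 1$ only if the execution of $op_a$ is finished on or before control step $j$.
Finally, we have the following set of equations. First, for each variable $var_i$ we need to determine whether $var_i$ is alive at control step $j$ and the group to which $var_i$ belongs. $var_i$ is alive at control step $j$ and is included in Group-$p$ if operation $op_i$ completes its execution before control step $j$ and its output is included in Group-$p$, and any one of the operations which has $var_i$ as an input has not been initiated at control step $j$. For each $p \in \{1, 2\}$, let

$$\mathcal{F}_p(i, j) = \{ o_{ij'kp} \mid s_i \le j' \le j - e_k + 1,\ k \in \mathcal{H}(op_i) \}$$

Thus,

$$\sum_{o \in \mathcal{F}_p(i, j)} o = \begin{cases} 1 & \text{if the execution of } op_i \text{ is completed on or before control step } j \text{ and its output variable } var_i \text{ is included in Group-}p \\ 0 & \text{otherwise} \end{cases}$$

Let

$$\mathcal{Z}^+(i, j) = \{ o_{ij'kp} \mid j \le j' \le t_i,\ k \in \mathcal{H}(op_i),\ p \in \{1, 2\} \}$$

Thus,

$$\sum_{o \in \mathcal{Z}^+(i, j)} o = \begin{cases} 1 & \text{if the execution of } op_i \text{ is initiated at or after control step } j \\ 0 & \text{otherwise} \end{cases}$$

So we have the following equations:

$$v_{ijp} = \mathcal{F}_p(i, j) \cap \bigcup_{op_{i'} \in \mathcal{C}(var_i)} \mathcal{Z}^+(i', j + 1) \qquad (3)$$

where each set stands for the 0-1 value of the sum of its variables, and $\cap$ and $\cup$ denote logical AND and OR of these values.
After the determination of the values of the variables $o_{ijkp}$ and $v_{ijp}$, we can compute the values of $f_{kj}$, $F_k$, $r_{pj}$ and $R_p$ as:

$$f_{kj} = \sum_{op_i \in \mathcal{O}(k)} \; \sum_{p \in \{1, 2\}} \; \sum_{j' = j - e_k + 1}^{j} o_{ij'kp} \qquad (4)$$

$$F_k = \max_j f_{kj} \qquad (5)$$

$$r_{pj} = \sum_{i} v_{ijp} \qquad (6)$$

$$R_p = \max_j r_{pj} \qquad (7)$$

Finally, after the values of $f_{kj}$ and $r_{pj}$ have been determined, we know the active/inactive control steps of each type of functional module and each group of registers. Therefore the values of $a_{kj}$ and $b_{pj}$ can be determined as:

$$a_{kj} = \begin{cases} 0 & \text{if } f_{kj} = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (8)$$

$$b_{pj} = \begin{cases} 0 & \text{if } r_{pj} = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (9)$$

In other words, if any type-$k$ functional module is used in control step $j$ ($f_{kj} \ge 1$), all type-$k$ functional modules will be active at control step $j$ and $a_{kj}$ equals 1. $b_{pj}$ is computed similarly.
The objective function is $F = \alpha F_f + \beta F_r$, where

$$F_f = \gamma \sum_k c_k F_k + \delta \sum_k \sum_j a_{kj} d_k F_k + \omega \sum_k \sum_j a_{kj}$$

$$F_r = \gamma \sum_p c_r R_p + \delta \sum_p \sum_j b_{pj} d_r R_p + \omega \sum_p \sum_j b_{pj}$$

and $\alpha$, $\beta$, $\gamma$, $\delta$ and $\omega$ are weighting factors. $F_f$ is a weighted sum of hardware cost, power consumption and clock regenerators for the functional modules. Similarly, $F_r$ is the corresponding weighted sum for the registers.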
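To make the functional-module part of the objective concrete, the sketch below evaluates a weighted sum of hardware cost, power in active control steps, and the number of active control steps for a given schedule. It rests on stated assumptions: the number of type-k modules is the peak number busy in any step, a type is active in a step when at least one of its modules is busy, and the weights (here called `gamma`, `delta`, `omega`), the cost values and the two schedules are illustrative only.

```python
from collections import defaultdict

def functional_module_term(schedule, library, total_steps, gamma, delta, omega):
    """Weighted sum of cost, power and active steps for the functional modules.

    schedule: list of (start_step, module_type) pairs, one per operation.
    library:  {module_type: (exec_steps, cost, power_per_step)}.
    """
    # busy[k][j]: number of type-k modules busy in control step j
    busy = defaultdict(lambda: defaultdict(int))
    for start, k in schedule:
        for j in range(start, start + library[k][0]):
            busy[k][j] += 1
    total = 0.0
    for k, (_, cost, power) in library.items():
        steps = range(1, total_steps + 1)
        n_modules = max((busy[k][j] for j in steps), default=0)   # peak usage
        n_active = sum(1 for j in steps if busy[k][j] >= 1)        # active steps
        total += gamma * cost * n_modules + delta * n_active * power * n_modules + omega * n_active
    return total

# Hypothetical costs; the per-step powers follow the multiplier figures quoted in the text.
library = {"ARR": (1, 10.0, 295.6), "BOOTH": (2, 6.0, 84.0)}
only_arr = [(1, "ARR"), (1, "ARR"), (2, "ARR"), (3, "ARR")]
mixed = [(1, "ARR"), (1, "BOOTH"), (2, "ARR"), (3, "BOOTH")]
cost_arr = functional_module_term(only_arr, library, 4, 1, 1, 1)
cost_mixed = functional_module_term(mixed, library, 4, 1, 1, 1)
print(cost_arr, cost_mixed)
```

With these made-up inputs the mixed schedule scores lower, because it needs fewer array multipliers and spends fewer active steps on the high-power module type.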

APPROXIMATION ALGORITHM
This section presents an approximation algorithm for solving the IP problem formulated in Section 3, which yields an approximate solution to the problem of module selection, scheduling, allocation and register partitioning.
In a feasible solution to the IP, each 0-1 variable $o_{ijkp}$ will assume the value 0 or 1. In our approximation algorithm, we determine the values of the 0-1 variables in a step-by-step fashion by examining intermediate solutions in which some of the variables $o_{ijkp}$ assume non-integral values.
At the beginning, we set the value of one of these variables $o_{ijkp}$ to 1. Such a choice might lead to the determination of the 0-1 values of other 0-1 variables because of the constraints in (1) and (2). Let us use the example in Figure 8 to illustrate the idea.
We have the following constraints when the total number of control steps is given to be 5.
After the value of a 0-1 variable is set to 1, and the values of some of the other 0-1 variables are determined according to the constraints (1) and (2), the remaining 0-1 variables $o_{ijkp}$ are related by [...] The variables in the equalities will be assigned equal fractional values. For the example in Figure 8, after setting o11 to [...] After determining the values of the variables $o_{ijkp}$, the values of the other variables $f_{kj}$, $F_k$, $r_{pj}$, $R_p$, $a_{kj}$ and $b_{pj}$ are computed according to Eqs. (3) to (9). We then compute the value of the objective function $F$. For the example in Figure 8, we obtain the values of $F$ corresponding to all possible choices of setting one of the variables $o_{ijkp}$ to 1. (The values of the weighting factors in $F$ are chosen to be $\alpha = 1$, $\beta = 2$, [...])
On the basis of the value of $F$, the value of one of the variables $o_{ijkp}$ will be set to 1. In other words, among all variables $o_{ijkp}$, the one that produces the minimum value of $F$ will be set to 1, which together with the values of the other variables assigned accordingly constitutes an intermediate solution. For the example in Figure 8, $o_{6511}$ is set to 1. In this case, we obtain an intermediate solution: [...]
The step can now be repeated. Among all 0-1 variables that assume non-integral values in the intermediate solution, one of them is set to 1. The corresponding values of the other variables and the value of $F$ are then computed. Finally, a solution is obtained when all 0-1 variables assume 0-1 values (and all other variables assume integer values). For the example in Figure 8, we have [...], which corresponds to the schedule shown in Figure 9.
A summary of the approximation algorithm is shown below. The algorithm has complexity $O((LTNP)^2)$ in the worst case, where $L$ is the number of types of functional modules in the library, $T$ is the number of control steps, $N$ is the number of operations and $P$ is the number of groups the registers are partitioned into.
Algorithm IP_solver( ):
  repeat
    /* Determine an intermediate solution corresponding to each */
    /* variable o_ijkp whose value is not integral */
    for each variable o_ijkp whose value is not integral
      Set o_ijkp = 1;
      Compute the values of the other variables according to Eqs. (1) to (9);
      Compute the value of the objective function F;
    end for
    Select the intermediate solution that yields a minimum value of F;
  until (all variables assume integral values)

FIGURE 9 The resultant schedule for the example.
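The greedy structure of this loop (try every open choice, evaluate the objective, keep the cheapest, repeat) can be sketched in a few lines. The code below is a simplified stand-in and not the algorithm as implemented by the authors: it evaluates the objective only over the choices already fixed instead of assigning fractional values to undecided variables, it ignores register groups, and the function name `greedy_schedule`, the candidate lists and the power-only objective are all assumptions made for illustration.

```python
def greedy_schedule(ops, candidates, precedes, exec_steps, objective):
    """Fix one (start step, module type) choice per iteration, cheapest first.

    ops:        list of operation ids
    candidates: {op: [(start, module_type), ...]} allowed choices per operation
    precedes:   set of (a, b) pairs: op a must finish before op b starts
    exec_steps: {module_type: execution time in control steps}
    objective:  function mapping a partial assignment {op: (start, type)} to a cost
    """
    def feasible(assign):
        for a, b in precedes:
            if a in assign and b in assign:
                start_a, type_a = assign[a]
                if start_a + exec_steps[type_a] > assign[b][0]:
                    return False
        return True

    assign = {}
    while len(assign) < len(ops):
        best = None
        # Try fixing every still-open choice and keep the cheapest feasible one.
        for op in ops:
            if op in assign:
                continue
            for choice in candidates[op]:
                trial = {**assign, op: choice}
                if not feasible(trial):
                    continue
                cost = objective(trial)
                if best is None or cost < best[0]:
                    best = (cost, op, choice)
        if best is None:
            raise ValueError("no feasible choice left")
        assign[best[1]] = best[2]
    return assign

# Hypothetical three-operation example: two multiplications feeding one addition.
POWER = {"ARR": 295.6, "BOOTH": 84.0, "RCA": 22.7}
STEPS = {"ARR": 1, "BOOTH": 2, "RCA": 1}

def power_only(assign):
    """Stand-in objective: per-step power times occupied steps, summed over fixed operations."""
    return sum(POWER[t] * STEPS[t] for _, t in assign.values())

result = greedy_schedule(
    ops=[1, 2, 3],
    candidates={1: [(1, "ARR"), (1, "BOOTH")],
                2: [(1, "ARR"), (1, "BOOTH"), (2, "ARR")],
                3: [(3, "RCA")]},
    precedes={(1, 3), (2, 3)},
    exec_steps=STEPS,
    objective=power_only,
)
print(result)
```

Each outer iteration fixes one choice after examining all remaining ones, which is the source of the quadratic dependence on the number of candidate choices.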

EXPERIMENTAL RESULTS
We tested our program on a number of benchmark examples. Our algorithm described in Section 4 was implemented in C and executed on a Sun Sparc20 workstation. Example df.5 is the differential equation from [3] and the given number of control steps is 5. Example df.7 is the same differential equation example except that the given number of control steps is 7. Examples df2.10 and df2.13 are obtained by unrolling the differential equation example twice and setting the number of control steps to 10 and 13, respectively. Examples ar.8 and ar.11 are the AR-Lattice Filter from [16] with the given number of control steps set to 8 and 10, respectively. Examples ewf.18 and ewf.20 are the elliptic wave filter from [16] with the given number of control steps set to 18 and 20, respectively.
Table II summarizes the results. The column for the number of active steps in the functional modules gives the sum of the numbers of functional modules in each active control step. The columns cost and Pfm give the total cost and total power consumption of the functional modules. The columns R1 and R2 give the number of registers in each group after the registers are partitioned into two groups. The column # of active steps in Registers gives the sum of the numbers of registers in each active control step. The columns cost and Preg give the total cost and total power consumption of the registers. The column F gives the value of the objective function F. The column time gives the CPU time in seconds.
We also compare the results for different choices of values for the weighting factors in the objective function. The results are summarized in Table III. For the case Hardware (functional_modules + registers), we set $\alpha = 1$, $\beta = 1$, $\gamma = 1$ and $\delta = \omega = 0$. Consequently, the result minimizes the cost of the functional modules and registers only and does not take power consumption into consideration.
For the case Functional_module_only, we set $\alpha = 1$, $\beta = 0$, $\gamma = 2$ and $\delta = \omega = 1$. Consequently, the result minimizes the cost and power consumption in the functional modules only and does not consider the cost and power consumption in the registers. Similarly, for the case Register_only, we set $\alpha = 0$, $\beta = 1$, $\gamma = 2$ and $\delta = \omega = 1$. Consequently, the result minimizes the cost and power consumption in the registers only. For the case Power+Hardware, we set $\alpha = 1$, $\beta = 1$, $\gamma = 2$, $\delta = 1$ and $\omega = 1$. Consequently, the result minimizes the cost and power consumption in both functional modules and registers. The column Pfm gives the power consumption in the functional modules and the column Preg gives the power consumption in the registers. The column Ptotal gives the total power consumption in the functional modules and the registers. The column save shows the percentage reduction in power consumption when compared with the case Hardware.
As an illustration, let us examine the results for the example ar.8 closely. If we want to minimize the total hardware cost, we obtain a design which consumes 544 mW in the functional modules and 336 mW in the registers. The total hardware cost is 88. If we want to minimize the power consumption in the functional modules only, we obtain a design which consumes only 252 mW in the functional modules and 336 mW in the registers. The total hardware cost has risen to 94. If we want to minimize the power consumption in the registers only, we obtain a design which consumes 288 mW in the functional modules but only 252 mW in the registers. The total hardware cost is 88. In Power+Hardware, when we take everything into consideration, we obtain a 42.8% saving in power consumption. The hardware cost is 57. Indeed, as is shown in Table III, we can achieve up to 52.6% reduction in power consumption and 43.4% on average.
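The savings for the ar.8 designs can be recomputed from the power figures quoted above. This sketch assumes that the save column is the relative reduction in total power with respect to the Hardware case; the percentages it prints for Functional_module_only and Register_only are derived here and are not quoted in the text, which states only the 42.8% figure for Power+Hardware.

```python
# Power figures quoted in the text for example ar.8 (mW): (functional modules, registers)
designs = {
    "Hardware": (544, 336),
    "Functional_module_only": (252, 336),
    "Register_only": (288, 252),
}

def total_power(name):
    """Total power of a design: functional modules plus registers."""
    p_fm, p_reg = designs[name]
    return p_fm + p_reg

baseline = total_power("Hardware")
savings = {name: 1 - total_power(name) / baseline for name in designs}
for name, s in savings.items():
    print(f"{name}: {total_power(name)} mW, saving {s:.1%}")
```

Optimizing either part in isolation already saves roughly a third of the total power in this example, and optimizing both together is reported to save more.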
In order to compare the result produced by our approximation algorithm with the optimal result, we would need to solve the IP in Section 3. However, since the formulation is cast as a non-linear integer programming problem, the computational effort will be substantial even for small problem instances. Hence, we first generate an integer linear programming (ILP) formulation from the integer programming (IP) formulation in Section 3. Since the objective function F is a quadratic function, we approximate it by a linear form using Lagrange first-order conditions. We then used the LINDO package on an IBM 3081 to solve the ILP problem. Table IV shows a comparison of the results for example df.5 produced by our algorithm and by ILP. For df.5, our algorithm produced the same results as LINDO while taking far less time (2.15 seconds versus 4208 seconds). It should be noted that we could only test example df.5, since the ILP formulations for the other examples are too large and LINDO was not able to generate any solution.

CONCLUSIONS
We presented an integrated approach to the problem of scheduling, allocation and hardware partitioning with power consumption as one of the key design objectives. We first proposed an integer programming formulation for solving the problem, from which we derived an efficient approximation algorithm. Unlike previous approaches for low power in which scheduling and allocation are performed independently, our approach combines scheduling, allocation and partitioning to exploit their effects on power consumption more effectively. The experimental results confirmed that our algorithm is quite effective and robust.

Figure 7(b) shows three different ways to partition the variables (which are then assigned to registers).

FIGURE 7 Three possible partitionings.

TABLE II
Results over the library lib

TABLE IV
Comparison of our algorithm and ILP on df.5