Memory Partitioning and Other Applications

Sleep mode operation, and exploiting it to minimize the average power consumption, is of great importance in modern VLSI circuits. In general, sleep mode refers to the mode in which part(s) of the system are idle. In this paper, we study the problem of partitioning a circuit according to the activity patterns of its elements such that circuit elements with similar activity patterns are packed into the same partition. A partition can then be placed in sleep mode during the time intervals in which all elements contained in that partition are idle. We formulate the partitioning problem to exploit sleep mode operation and show that the problem is NP-complete. We present polynomial time algorithms for practical classes of the problem. Applications of the problem to memory and module partitioning and clock gating are discussed. The experimental data confirm that careful partitioning allows up to 40% more sleep time, which could be exploited to minimize the average power consumption.


INTRODUCTION
Advances in VLSI and packaging technologies have increased the average transistor count in a chip by about one-hundredfold every decade [2], allowing much more complex functionality. Moreover, the advent of portable and mobile communication and computing services has stirred a great deal of interest in both the commercial and research areas. The dissipation of the heat generated by highly integrated circuits is a crucial factor because virtually all failure mechanisms are boosted at higher temperatures [2]. The minimization of power consumption in modern circuits is therefore of great importance. Due to this importance, there has been a considerable shift of attention in the logic and layout synthesis areas [14, 17, 20, 22, 23] and more recently in high-level synthesis [4, 5, 15] from the delay and area minimization issues towards low power design.
Previous research for low power synthesis of digital circuits has focused on issues such as activity-driven technology decomposition and mapping [17, 20, 22], low-power state assignment [12, 21], architectural transformation and reduction of power supply voltage [4], wire and driver sizing [6, 18], and reversible and adiabatic computing [7, 26]. For a survey of these techniques see [8].
Transition density, or average switching rate at different sites in a circuit, is introduced in [16] as a quantity to measure the circuit activity, which can be used to estimate the average power consumption in a digital circuit. Recent studies [1] indicate that the clock signal and the memory unit in digital computers each consume between 15 and 45 percent of the total power. This suggests good opportunities for savings in the power consumed by these sources. Exploiting sleep mode operation is an attempt to do so. In general, the term sleep mode refers to the mode in which there is no activity in part(s) of the system during certain periods of time. The sleep mode issue can be studied at different levels, e.g., behavioral level, register-transfer level (RTL), logic level and transistor level.
In this paper we study the partitioning problem to exploit sleep mode for power minimization in digital circuits. The general problem can be viewed as partitioning a set of circuit elements such that the savings in power consumption achieved by switching each partition as a whole into sleep mode are maximized. A partition can be switched into sleep mode during time interval $I = (l, r)$ if all the elements in that partition are idle during $I$. The set of intervals during which an element $m$ is idle is referred to as the idle set of $m$. We present a general formulation for the problem and study its complexity. The problem finds many applications in low power design, e.g., the following (see Fig. 1): memory segmentation, partitioning to power down portions of the design, and clock tree construction.
We assume we have synthesis (simulation-based) or statistical data on the idle times of the data items (in case of memory segmentation) or the idle times of the modules (for the two other cases). We present a general formulation for this problem, propose polynomial time algorithms to solve special classes of the problem optimally, and show that the general problem is NP-complete. The rest of this paper is organized as follows: Section 2 presents the necessary background. Section 3 briefly describes how to obtain the idle times for a set of memory or clocked elements in a design. Section 4 presents the problem formulation. The complexity of the problem is discussed in Section 5. Exact algorithms to solve the general problem are presented in Section 6. Some special classes of the problem are discussed in Section 7 and polynomial time algorithms are presented to solve these classes. Section 8 focuses on some generalizations of the problem. Experimental results for memory segmentation are provided in Section 9, and Section 10 concludes the paper, summarizing the key features of this study and providing directions for further research in this area.

BACKGROUND
There are three sources of power consumption in CMOS circuits: the charging and discharging of capacitive loads during transitions at gate outputs, the short circuit current which flows during output transitions, and the leakage current. The last two sources should be dealt with and optimized using proper device and circuit design techniques [24], hence the design automation community has focused on the minimization of the first source, which is frequently referred to as the switching power or dynamic power. The average dynamic power consumption for a CMOS gate $g$ with load capacitance $C_g$ is given by: $P_{av}(g) = 0.5\, C_g V_{dd}^2 D(g)$, where $D(g)$ and $V_{dd}$ represent the transition density (footnote 3) of the signal at the output of $g$, and the voltage of the power supply, respectively. This suggests that a signal has a high contribution to the dynamic power consumption if it has either a relatively large load capacitance or a relatively high transition density. Both are true of the clock signal in moderately sized synchronous digital systems. Recent studies [1] indicate that the clock signal and memory each consume between 15 and 45 percent of the total power in digital computers. Hence, it is worthwhile to study the mechanisms and approaches through which the power consumption due to these sources can be optimized. Exploiting sleep mode is an attempt to do so.
Consider a scenario in which the access times to a set of dynamic memory elements are known. If we can partition these memory elements such that for long periods of time either of the partitions contains no data, then we can turn off the memory refresh circuitry for that partition during these periods and thus reduce the power consumption. A similar partitioning approach can be applied for clock-tree construction when the activity patterns of the clocked elements are known. The clock signal destinations with close activity patterns should be partitioned into the same subtree to allow maximum savings in power consumption via clock gating (see Fig. 1).
Clearly, there is some overhead involved, caused by the extra control logic needed to switch the partitions in and out of sleep mode and by the power that switching in and out of sleep mode itself consumes. This overhead depends mainly on the switching pattern and switching frequency of the partitions in and out of sleep mode.
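As a rough numerical illustration of the dynamic power expression $0.5\, C_g V_{dd}^2 D(g)$, the following sketch compares a clock-like net with a data-like net. The function name and all capacitance, voltage, and transition density values are assumptions made up for this example; they are not taken from the paper or from [1].

```python
def dynamic_power(c_load, vdd, density):
    """Average dynamic power 0.5 * C * Vdd^2 * D, in watts.

    c_load  -- load capacitance in farads
    vdd     -- supply voltage in volts
    density -- transition density in transitions per second
    """
    return 0.5 * c_load * vdd ** 2 * density


# Hypothetical values, for illustration only.
clock_net = dynamic_power(c_load=50e-12, vdd=3.3, density=100e6)
data_net = dynamic_power(c_load=0.1e-12, vdd=3.3, density=5e6)
```

With numbers of this kind, the net with both the larger load and the higher transition density dominates, which is the situation the text attributes to the clock signal.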
To have a general formulation, we talk about elements. Depending on the application, an element may refer to a memory element, a clocked element, or a module in the circuit. Given the activity patterns of a set of elements, the question is how to partition this set to maximize the savings in power consumption achievable through sleep mode, and how much power this technique would save. We believe that there is a high potential for savings in power consumption using this technique, and our paper is an attempt to study this problem.

OBTAINING IDLE SETS
In this section, we briefly describe methodologies to obtain the activity patterns and the idle sets of the memory and clocked elements in our design. The availability of these activity patterns is vital for the partitioning algorithm to be applicable.

Idle Sets for Memory Elements
Let $M = \{m_1, m_2, \ldots, m_r\}$ represent the set of dynamic memory elements (MEs) in an application. Assume that the access sequence for each ME $m_i \in M$ during a whole run cycle is given as a sequence of ordered pairs, each of the form $(t_i, A_i)$, where $t_i$ corresponds to the access time and $A_i \in \{R, W\}$ represents the type of access, read (R) or write (W) (see Fig. 2). Given the access sequence for all the MEs, we can use the following rules to generate the set of intervals for each ME $m_i$ during which $m_i$ need not be refreshed (see Fig. 2), and thus obtain the idle set for each ME. We say ME $m_i$ is idle during interval $I$ if it need not be refreshed during $I$. Therefore ME $m_i$ is idle: (1) after its final access time, and (2) before each write access, back to the closest read access (or the start of computation). To obtain the access sequence for the MEs, we can use simulation-based tools that take as input an application program and produce statistics on the resource utilization over time and space.
Footnote 3: Average number of transitions per unit time.

Idle Sets for Clocked Elements
Consider the description of a design after the scheduling and allocation steps have been performed. We assume that the functional units (FUs) have registers at their inputs. This means that if an FU $\mathcal{M}$ is not used for a consecutive set of cycles, then we can gate the clock signal to the registers feeding this FU during this idle time, which will reduce the power consumption due to the clock tree. Furthermore, it guarantees that there would be no dynamic power consuming activity during this time in $\mathcal{M}$. From the scheduled and allocated design we can say that if FU $\mathcal{M}$ is assigned to a control step $c$, then it is active during $c$. Otherwise, it is idle during this time. This allows us to generate the idle sets for each of the FUs in our design. Figure 3 shows how to obtain the idle sets from a design that solves a differential equation of the form $y'' + 3zy + 3y = 0$ after scheduling and allocation have taken place. The design contains the following functional units: three multipliers M1, M2, M3, two adders A1, A2, one subtractor S1, and one comparator C1. The idle sets for the registers at the inputs of each FU are computed from the Control-Data Flow Graph (CDFG) after scheduling and allocation are done. These idle sets are shown at the bottom next to the names of their corresponding FUs. Note that the multiplier units take 2 control steps to execute. However, they are clocked only during the first of the two control steps. This assumes that the multiplier units are purely combinational and require no clock signal during their operation. In other contexts, the multipliers (or other multi-step FUs) may need to be clocked during their whole execution cycle. The idle times should be computed according to these requirements.

PROBLEM FORMULATION
We say that element $m$ is idle during time interval $I = (l, r)$, $l < r$, if $m$ can be switched into sleep mode during $I$. We say that interval $I = (l, r)$ contains point $p$, or $p$ is contained in $I$, if $l < p < r$. Intervals $I_1 = (l_1, r_1)$ and $I_2 = (l_2, r_2)$ are non-overlapping if $l_1 \ge r_2$ or $l_2 \ge r_1$. A set of intervals is non-overlapping if its members are pairwise non-overlapping. The idle set $N_m$ of $m$ is a set of non-overlapping intervals, or NIS (Non-overlapping Interval Set), during all of which $m$ is idle. We assume that the idle sets of the elements in $M$ are given as a set $S = \{N_1, N_2, \ldots, N_r\}$, where $N_i = \{I_{i1}, I_{i2}, \ldots, I_{in_i}\}$ is the idle set of $m_i$ (see Fig. 4).
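To make the notion of an idle set concrete, the sketch below derives one NIS for a single memory element from its access sequence, following the two rules given earlier for memory elements (idle after the final access, and idle before each write back to the closest read or the start of computation). The function name `idle_set`, the tuple encoding of accesses, and the explicit `t_start`/`t_end` run boundaries are assumptions introduced for illustration; touching intervals are merged so that the result is a set of maximal non-overlapping intervals.

```python
def idle_set(accesses, t_start, t_end):
    """Return the idle set of one memory element as a list of (l, r) intervals.

    accesses -- list of (time, kind) pairs sorted by time, kind in {"R", "W"}
    t_start, t_end -- assumed boundaries of the whole run
    """
    raw = []
    prev = t_start  # time of the previous access, or the start of the run
    for t, kind in accesses:
        if kind == "W":
            # The old contents are never read again before being overwritten.
            raw.append((prev, t))
        prev = t
    raw.append((prev, t_end))  # idle after the final access

    merged = []
    for l, r in raw:
        if l >= r:
            continue  # drop zero-length pieces
        if merged and merged[-1][1] == l:
            merged[-1] = (merged[-1][0], r)  # glue touching pieces together
        else:
            merged.append((l, r))
    return merged
```

For instance, under these assumptions an element written at time 2, read at 4 and 6, written again at 9 and read at 10 in a run spanning 0 to 12 is live only during (2, 6) and (9, 10).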
The notation $\emptyset$ denotes an empty interval. Given intervals $I_1 = (l_1, r_1)$ and $I_2 = (l_2, r_2)$, we say $I_1$ covers $I_2$ if either $l_1 \le l_2$ and $r_2 \le r_1$, or $I_2 = \emptyset$ (that is, all intervals cover the empty interval). The length $L(I)$ of an interval $I = (l, r)$ is defined as the quantity $r - l$ (or 0 if $I = \emptyset$). The intersection of two intervals $I_1$ and $I_2$, denoted as $I_1 \cap I_2$, is defined as the longest interval covered by both $I_1$ and $I_2$ (or empty if the two intervals do not overlap). The intersection of more than two intervals is defined similarly. It is easy to see that interval intersection is commutative and associative, hence no parentheses are needed.
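The interval operations just defined translate almost directly into code. The following is a minimal sketch, assuming an interval is represented as a `(l, r)` tuple and the empty interval as `None`; these representations and the function names are choices made for this example only.

```python
EMPTY = None  # stands for the empty interval


def length(iv):
    """L(I): r - l, or 0 for the empty interval."""
    return 0 if iv is EMPTY else iv[1] - iv[0]


def intersect(a, b):
    """Longest interval covered by both a and b, or EMPTY if they do not overlap."""
    if a is EMPTY or b is EMPTY:
        return EMPTY
    l, r = max(a[0], b[0]), min(a[1], b[1])
    return (l, r) if l < r else EMPTY


def covers(a, b):
    """True if a covers b; every interval covers the empty interval."""
    return b is EMPTY or (a is not EMPTY and a[0] <= b[0] and b[1] <= a[1])
```

Because `intersect` only takes a maximum of left endpoints and a minimum of right endpoints, the order in which several intervals are intersected does not matter.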
The intersection of two NISs $N_1 = \{I_{11}, I_{12}, \ldots, I_{1n_1}\}$ and $N_2 = \{I_{21}, I_{22}, \ldots, I_{2n_2}\}$, denoted as $N_1 \cap N_2$, is defined as the NIS formed of the non-empty pairwise intersections of the intervals, one picked from $N_1$ and the other picked from $N_2$, that is: $N_1 \cap N_2 = \{I_{1i} \cap I_{2j} \mid I_{1i} \in N_1,\ I_{2j} \in N_2,\ I_{1i} \cap I_{2j} \ne \emptyset\}$. The intersection of more than two NISs is defined similarly. As for the intervals, the intersection of multiple NISs is a commutative and associative operation, and parentheses can be omitted without causing ambiguity. Given NISs $N_1$, $N_2$, we say $N_1$ covers $N_2$ if $N_1 \cap N_2 = N_2$. The endpoint set $E_N$ of a NIS $N$ is defined as the set of endpoints of the intervals in $N$, that is: $E_N = \{p \mid \exists q: (p, q) \in N \text{ or } (q, p) \in N\}$. The duration $D(N)$ of a NIS $N = \{I_1, I_2, \ldots, I_n\}$ is defined as the sum of the lengths of the intervals contained in it, that is: $D(N) = \sum_{i=1}^{n} L(I_i)$. Given a set $S = \{N_1, N_2, \ldots, N_r\}$ of NISs, the internal-intersection $\cap(S)$ of $S$ is defined as the intersection of all the NISs in $S$, that is: $\cap(S) = N_1 \cap N_2 \cap \cdots \cap N_r$.
For example, Figure 4 shows 6 memory elements $m_1, \ldots, m_6$. A partitioning of the memory is shown which dictates a corresponding partitioning $(S_1, S_2)$ of their idle sets, where $S_1 = \{N_1, N_2, N_4\}$ and $S_2 = \{N_3, N_5, N_6\}$. The internal-intersection of each partition is shown as a set of shaded regions for each partition.
The endpoint set $E_S$ of $S$ is defined as the union of the endpoint sets of the NISs in $S$, that is: $E_S = \{p \mid \exists N \in S: p \in E_N\}$. Given a set $S$, $(S_1, S_2)$ is a bi-partitioning for $S$ if: $S_1, S_2 \subset S$, $S_1 \cap S_2 = \emptyset$, and $S_1 \cup S_2 = S$. Each of $S_1$ and $S_2$ is called a partition of $S$. The density of a partition at a given point $p$ is the number of NISs in that partition that contain some interval containing $p$. The bi-partitioning $(S_1, S_2)$ is b-balanced if $|S_1| \ge b$ and $|S_2| \ge b$, where the notation $|S_i|$ denotes the cardinality of set $S_i$. The gain $G_a(S_1, S_2)$ of a b-balanced bi-partitioning $(S_1, S_2)$ is defined as:
$G_a(S_1, S_2) = t_1 + t_2 - a \times (sw_1 + sw_2)$ (4)
where $t_i = D(\cap(S_i))$ and $sw_i = |\cap(S_i)|$ for $i = 1, 2$. In (4), the term $t_1 + t_2$ accounts for the savings in power consumption due to sleep mode operation of partitions $S_1$, $S_2$, and the term $a \times (sw_1 + sw_2)$ accounts for the overhead resulting from the extra control circuitry needed to supervise sleep mode operation. Parameter $a$ is introduced to control the relative significance of the savings and overhead terms. Figure 4 shows an example of memory partitioning to exploit sleep mode.
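The definitions of NIS intersection, duration, internal-intersection, and gain can be sketched as follows. This is a minimal illustration, assuming each NIS is a list of `(l, r)` tuples and each partition is a non-empty list of NISs; the function names are hypothetical, and the gain follows the form of (4) with the sleep time taken as the duration of the internal-intersection and the switching as its number of intervals.

```python
from functools import reduce


def nis_intersect(n1, n2):
    """Non-empty pairwise intersections of the intervals of two NISs."""
    out = []
    for l1, r1 in n1:
        for l2, r2 in n2:
            l, r = max(l1, l2), min(r1, r2)
            if l < r:
                out.append((l, r))
    return sorted(out)


def duration(nis):
    """D(N): total length of the intervals in a NIS."""
    return sum(r - l for l, r in nis)


def internal_intersection(partition):
    """Intersection of all NISs in a non-empty partition."""
    return reduce(nis_intersect, partition)


def gain(s1, s2, a):
    """t1 + t2 - a * (sw1 + sw2) for the bi-partitioning (s1, s2)."""
    i1, i2 = internal_intersection(s1), internal_intersection(s2)
    return duration(i1) + duration(i2) - a * (len(i1) + len(i2))
```

With `a = 0` the gain reduces to the total exploitable sleep time; a positive `a` charges each separate sleep interval for the cost of entering and leaving sleep mode.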
Note that many problems can be formulated as a decision or an optimization problem, and that if the decision version of a problem P is NP-complete then its optimization version is also NP-complete, and if its optimization version is polynomially solvable then its decision version can also be solved in polynomial time. We now formulate our problem as a decision problem:
P1: Instance: Ordered quadruple $(a, b, c, S)$, where $a$ is a non-negative number, $b$, $c$ are positive integers, and $S = \{N_1, N_2, \ldots, N_r\}$ is a set of NISs. Objective: Determine whether there exists a b-balanced bi-partitioning $(S_1, S_2)$ of $S$ such that $G_a(S_1, S_2) \ge c$.
The attributes $t_1$ and $t_2$ are called the sleep times of partitions $S_1$ and $S_2$, and the attributes $sw_1$ and $sw_2$ are called the switchings of partitions $S_1$ and $S_2$, respectively.
Observation 1. The sleep time of a partition consists of the summation of the lengths of the maximal (footnote 4) intervals during which the density of the partition equals the size of the partition. Furthermore, the switching of a partition is equal to the number of such intervals.
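Observation 1 suggests a sweep over the endpoints as an alternative way to obtain the sleep time and the switching of one partition. The sketch below is an illustration under stated assumptions: a partition is a non-empty list of NISs, each NIS is a list of `(l, r)` tuples, and intervals inside one NIS do not touch each other (otherwise two adjacent full-density segments would have to be counted separately). The function name is hypothetical.

```python
def sleep_time_and_switching(partition):
    """Return (sleep_time, switching) of a partition via its density profile."""
    size = len(partition)
    points = sorted({p for nis in partition for iv in nis for p in iv})
    sleep, switches, prev_full = 0, 0, False
    # Between two consecutive endpoints the density is constant.
    for l, r in zip(points, points[1:]):
        density = sum(any(a <= l and r <= b for a, b in nis) for nis in partition)
        full = density == size
        if full:
            sleep += r - l
            if not prev_full:
                switches += 1  # a new maximal full-density interval starts here
        prev_full = full
    return sleep, switches
```

Under the stated assumption, the two returned values agree with the duration and the number of intervals of the internal-intersection of the partition.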

NP-COMPLETENESS
In this section we discuss the complexity of P1 and show that it is NP-complete. We present a transformation from the MIN-CUT INTO BOUNDED SETS problem [11], which we will denote as MCP (Min-Cut Problem). This problem can be stated as follows (footnote 6):
MCP: Instance: Ordered triple $(G = (V, E), B, K)$, where $G$ is a graph with vertex set $V$ and edge set $E$, and $B$ and $K$ are the bounds on the partition size and on the cut, respectively. Objective: Determine whether there exists a B-balanced bi-partitioning $(V_1, V_2)$ of $V$ such that: $|V_1| \ge B$, $|V_2| \ge B$, and $C(V_1, V_2) \le K$. That is, the size of each partition is lower bounded by $B$, and the number of edges in $E$ with one endpoint in $V_1$ and the other endpoint in $V_2$ is no more than $K$. We will refer to the number of such edges as the cost of the bi-partitioning and denote it as $C(V_1, V_2)$.
Footnote 4: An interval is maximal with respect to a property $\mathcal{P}$ if it has the property $\mathcal{P}$, but no interval containing it as a proper sub-interval has property $\mathcal{P}$. Here, the property $\mathcal{P}$ is that the density of the partition during this interval is equal to the size of the partition.
Footnote 6: Note that we are using a special formulation of the MIN-CUT INTO BOUNDED SETS problem which is still NP-complete. This special formulation is used to simplify our NP-completeness proofs for problem P1.
Given a partitioning $(V_1, V_2)$ of $V$, we define an attribute $c_i$ for each edge $e_i = (v_{i1}, v_{i2})$ in $E$, referred to as the cost of that edge under partitioning $(V_1, V_2)$, as follows:
$c_i = 0$ if $v_{i1}, v_{i2}$ belong to the same partition, and $c_i = 1$ otherwise. (7)
If $c_i = 1$ we say that edge $e_i$ is cut by the partitioning, otherwise $e_i$ is not cut, or uncut. It is straightforward to show that: $C(V_1, V_2) = \sum_{i=1}^{|E|} c_i$.
LEMMA MCP is polynomial-time transformable to P1.
Proof Given an instance $(G = (V, E), B, K)$ of MCP, with $V = \{v_1, v_2, \ldots, v_{|V|}\}$ and $E = \{e_1, e_2, \ldots, e_{|E|}\}$, we construct the corresponding instance $(a, b, c, S)$ of P1 as follows: The set $S = \{N_1, N_2, \ldots, N_{|V|}\}$ consists of a set of NISs, where each NIS $N_i \in S$ corresponds to a vertex $v_i \in V$. Let $\{e_{i_1}, e_{i_2}, \ldots, e_{i_{n_i}}\}$ represent the set of edges incident to $v_i$, let $X_i = \{i_1, i_2, \ldots, i_{n_i}\}$ represent the set of indices of the edges incident to $v_i$, and let $X = \{1, 2, \ldots, |E|\}$. Then the NIS $N_i$ of the P1 instance corresponding to vertex $v_i$ of the MCP instance will be defined as follows:
$N_i = \{(3x, 3x + 1) \mid x \in X_i\} \cup \{(3x, 3x + 2) \mid x \in X - X_i\}$ (9)
That is, each NIS $N_i$ corresponding to a vertex $v_i$ consists of $|E|$ intervals, one corresponding to each edge $e \in E$. The remaining parameters are set to $a = 0$, $b = B$, and $c = 3|E| - K$. Figure 5 shows a given instance of MCP and the constructed instance of P1. Notice that each NIS $N_i$ in the constructed instance of P1 consists of exactly $|E|$ intervals, $|X_i|$ of which are of length 1 and the rest of length 2.
Main Idea: In the construction of the P1 instance, the following issues have been taken into account: A partitioning $(V_1, V_2)$ of vertices in the MCP instance should correspond to the partitioning $(S_1, S_2)$ of the corresponding NISs in the P1 instance and vice versa, e.g., the partitioning $(\{v_1, v_3\}, \{v_2, v_4\})$ of $V$ corresponds to the partitioning $(\{N_1, N_3\}, \{N_2, N_4\})$ of $S$.
The gain $G_a$ of the partitioning $(S_1, S_2)$ of $S$ in the P1 instance should be a decreasing function of the cost $C$ of the corresponding partitioning $(V_1, V_2)$ of $V$ in the MCP instance.
Having established such a relationship between the MCP and P1 instances, it is easy to see that the cost of a partitioning in the MCP instance is minimized if and only if the gain of the corresponding partitioning of the P1 instance is maximized. Then, by selecting the parameters $a$, $b$, $c$ properly, we can show that the answer to the MCP instance is YES if and only if the answer to the constructed P1 instance is YES. The following elaborates further on these arguments:
General Properties: As shown in Figure 5, the P1 instance is constructed such that corresponding to each edge $e_i = (v_j, v_k)$ in the MCP instance there are $|V|$ intervals, one in each of the NISs, and they are all overlapping. Let $Z_i = \{I_{i1}, I_{i2}, \ldots, I_{i|V|}\}$ represent the set of these intervals, where $I_{ip}$ is the interval corresponding to $e_i = (v_j, v_k)$ in $N_p$. Among these intervals, $I_{ij}$ and $I_{ik}$, the two in $N_j$, $N_k$ (the NISs corresponding to vertices $v_j$, $v_k$, the two ends of edge $e_i$), extend from $3i$ to $3i + 1$, and the rest extend from $3i$ to $3i + 2$. Consider a bi-partitioning $(S_1, S_2)$ of $S$. Note that this partitioning induces a corresponding partitioning $(Z_{i1}, Z_{i2})$ on each of the $Z_i$'s. We can make the following observations from the construction of the P1 instance:
Observation 2. Two intervals in the constructed P1 instance corresponding to two distinct edges $e_i$, $e_j$ of the MCP instance do not overlap each other.
Observation 3. Since $a = 0$, we have: $G_a(S_1, S_2) = t_1 + t_2$.
From Observation 2 and the definition of $t_1$ and $t_2$, it becomes clear that $t_1$ and $t_2$ can each be computed as the summation of $|E|$ terms. Each of these $|E|$ terms corresponds to the contribution of the intervals corresponding to one of the edges in the MCP instance. Therefore we can write: $t_1 = \sum_{i=1}^{|E|} t_{1i}$ and $t_2 = \sum_{i=1}^{|E|} t_{2i}$, where:
$t_{1i} = L(\bigcap_{I \in Z_{i1}} I)$; $\forall i \in \{1, 2, \ldots, |E|\}$ (14)
$t_{2i} = L(\bigcap_{I \in Z_{i2}} I)$; $\forall i \in \{1, 2, \ldots, |E|\}$ (15)
This means that for a cut edge $e_i$ in the partitioning $(V_1, V_2)$ of $V$ in the MCP instance, there is a contribution $t_{1i} + t_{2i} = 2$, and for an uncut edge $e_i$ there is a contribution $t_{1i} + t_{2i} = 3$, due to the intervals in $Z_i$, to the gain $G_a(S_1, S_2)$. Therefore we have:
$G_a(S_1, S_2) = 3|E| - C(V_1, V_2)$ (16)
An example is shown in Figure 6. The partitioning $(V_1, V_2)$ of the MCP instance and the corresponding partitioning $(S_1, S_2)$ of the constructed P1 instance are shown, and the values $C(V_1, V_2)$ and $G_a(S_1, S_2)$ are computed.
FIGURE 6 Corresponding partitionings of the P1 and MCP instances and the computation of gain and cost.
Now we can prove the lemma by showing that the answer to the given MCP instance is YES if and only if the answer to the constructed P1 instance is YES.
(If) Suppose that the answer to the constructed P1 instance $(a, b, c, S)$ is YES. Then there exists a bi-partitioning $(S_1, S_2)$ of $S$ such that $|S_1| \ge b = B$, $|S_2| \ge b = B$, and:
$G_a(S_1, S_2) \ge c = 3|E| - K$ (17)
Now consider the corresponding bi-partitioning $(V_1, V_2)$ of $V$ in the MCP instance. We have $|V_1| = |S_1| \ge b = B$ and $|V_2| = |S_2| \ge b = B$. On the other hand, from (16) and (17) we get: $C(V_1, V_2) \le K$. This means that the answer to the given MCP instance is YES.
(Only if) Suppose that the answer to the given instance of MCP is YES. Then there exists a bi-partitioning $(V_1, V_2)$ of $V$ such that $|V_1| \ge B$, $|V_2| \ge B$, and:
$C(V_1, V_2) \le K$ (18)
Consider the corresponding bi-partitioning $(S_1, S_2)$ of $S$ in the constructed P1 instance. We have $|S_1| = |V_1| \ge B = b$ and $|S_2| = |V_2| \ge B = b$. On the other hand, from (16) and (18) we get: $G_a(S_1, S_2) \ge 3|E| - K = c$. Therefore the answer to the constructed P1 instance is YES.
It is easy to see that P1 is in NP. A non-deterministic polynomial time algorithm just needs to guess a bi-partitioning $(S_1, S_2)$ of $S$ and then check in polynomial time that the gain $G_a(S_1, S_2)$ of this bi-partitioning satisfies $G_a(S_1, S_2) \ge c$. Hence, we have the following result:

THEOREM 1
The problem P1 is NP-complete.
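The transformation used in the proof can be checked mechanically on small graphs. The sketch below builds the NISs of (9) from an edge list and evaluates the gain with $a = 0$, so that the relation between gain and cut cost stated in the proof, $G_a(S_1, S_2) = 3|E| - C(V_1, V_2)$, can be tested for every bi-partitioning with two non-empty sides. The function names, the 0-based vertex numbering, and the list-of-tuples encoding are assumptions made for this illustration.

```python
from functools import reduce


def build_p1_instance(num_vertices, edges):
    """One NIS per vertex; edge number x (1-based) contributes one interval to each NIS."""
    nis = [[] for _ in range(num_vertices)]
    for x, (u, v) in enumerate(edges, start=1):
        for w in range(num_vertices):
            # Endpoints of the edge get the short interval, all others the long one.
            right = 3 * x + (1 if w in (u, v) else 2)
            nis[w].append((3 * x, right))
    return nis


def nis_intersect(n1, n2):
    out = []
    for l1, r1 in n1:
        for l2, r2 in n2:
            l, r = max(l1, l2), min(r1, r2)
            if l < r:
                out.append((l, r))
    return out


def gain_a0(s1, s2):
    """Gain with a = 0: total duration of the two internal-intersections."""
    total = 0
    for part in (s1, s2):
        total += sum(r - l for l, r in reduce(nis_intersect, part))
    return total


def cut_cost(edges, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(side[u] != side[v] for u, v in edges)
```

A usage sketch: build the instance for a small graph, pick any assignment `side` of vertices to 0 or 1 with both sides non-empty, and compare `gain_a0` on the two induced groups of NISs with `3 * len(edges) - cut_cost(edges, side)`.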

EXACT ALGORITHMS
The fact that P1 is NP-complete rules out the possibility of a polynomial time algorithm for P1 unless P = NP [11]. The general strategy in such circumstances is to work on two fronts: towards the theoretical end, the complexity of special sub-classes of the general problem that are potentially solvable in polynomial time is studied. Pinning down such sub-classes, of course, is not always an easy task. Towards the practical end, heuristic approaches are developed to solve the problem sub-optimally but in polynomial time.
Occasionally, it has been observed that the formulation of an exact solution to a general NP-complete problem, despite its exponential running time, provides valuable insights on how to design practical heuristic algorithms for the problem. Such exact solutions may also help in understanding some special sub-classes of the general problem that are optimally solvable in polynomial time.
In this section we present two algorithms, PARTITION_EXACT1 and PARTITION_EXACT2, to solve P1. The outlines of these algorithms are shown in Figure 8.
PARTITION_EXACT1: The first algorithm (Fig. 8a), which is also the trivial one, is to try all possible ways to b-balance bi-partition $S$, compute the gain for each, and report the partitioning with maximum gain value $G_a(S_1, S_2)$. The running time of this algorithm would be $O\big(\big[\binom{r}{b} + \binom{r}{b+1} + \cdots + \binom{r}{r-b}\big] f_1(r, b)\big)$, where $r = |S|$ is the number of NISs in $S$, $\binom{r}{x}$ is the number of ways to pick $x$ elements out of a set with $r$ elements, and $f_1(r, b)$ is the time required to compute the gain of a b-balanced partitioning $(S_1, S_2)$ of $S$. Since in practice $b$ is not a constant, this approach results in an exponential running time.
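The exhaustive strategy of PARTITION_EXACT1 can be sketched in a few lines. Since Figure 8 is not reproduced here, the following is only an illustration of the idea described in the text, not the authors' listing: it enumerates every subset of size $b$ to $r - b$ as $S_1$, evaluates the gain, and keeps the best. The function names and the data encoding (NIS as a list of `(l, r)` tuples) are assumptions, and `b >= 1` is assumed so that both partitions are non-empty.

```python
from functools import reduce
from itertools import combinations


def nis_intersect(n1, n2):
    out = []
    for l1, r1 in n1:
        for l2, r2 in n2:
            l, r = max(l1, l2), min(r1, r2)
            if l < r:
                out.append((l, r))
    return out


def gain(s1, s2, a):
    """t1 + t2 - a * (sw1 + sw2) for two non-empty lists of NISs."""
    i1, i2 = reduce(nis_intersect, s1), reduce(nis_intersect, s2)
    sleep = sum(r - l for l, r in i1) + sum(r - l for l, r in i2)
    return sleep - a * (len(i1) + len(i2))


def partition_exact1(S, a, b):
    """Return (best_gain, indices_of_S1) over all b-balanced bi-partitionings, or None."""
    r = len(S)
    best = None
    for k in range(b, r - b + 1):              # |S1| = k, |S2| = r - k >= b
        for s1 in combinations(range(r), k):
            chosen = set(s1)
            part1 = [S[i] for i in s1]
            part2 = [S[i] for i in range(r) if i not in chosen]
            g = gain(part1, part2, a)
            if best is None or g > best[0]:
                best = (g, s1)
    return best
```

The two nested loops visit exactly the $\binom{r}{b} + \cdots + \binom{r}{r-b}$ subsets counted in the running-time expression, which is why this approach is only practical for very small instances.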
PARTITION_EXACT2: The second algorithm (Fig. 8b), which forms the foundation of the polynomial time algorithms that we will present in later sections to solve special classes of P1, works as follows: it tries all possible ways to select a pair of NISs $N$, $M$ that are potential internal-intersections of the two partitions $S_1$ and $S_2$ of a b-balanced bi-partitioning of $S$, and reports the pair $N$, $M$ which results in maximum gain. Let $p = |E_S|$ represent the cardinality of the endpoint set of $S$; then there are $\frac{p(p-1)}{2} + 1 = O(p^2)$ intervals (including the empty interval) with endpoints picked from $E_S$. Therefore there are no more than $2^{p^2}$ ways to choose either of $N$ and $M$. Thus the algorithm PARTITION_EXACT2 would have the time complexity $O(2^{2p^2} f_2)$, where $f_2$ is the time needed to perform steps 6 and 7 of the algorithm. Figure 9 presents an implementation of steps 6 and 7 of the algorithm. Substituting the running time of this implementation for $f_2$, we achieve time complexity $O(2^{2p^2}(s + rp))$ for the PARTITION_EXACT2 algorithm. The following observation can be used to bound the search space in the PARTITION_EXACT2 algorithm:
Observation 4. Let $I_{max}$ represent the longest interval in a given P1 instance $(a, b, c, S)$, and let $(S_1, S_2)$ be a bi-partitioning of $S$; then no interval in the internal-intersection of $S_1$ or $S_2$ could possibly be longer than $I_{max}$.

SOME POLYNOMIAL TIME SUB-CLASSES
In this section, we focus on some sub-classes of P1 for which we can present polynomial time algorithms. The polynomial time algorithm presented for each of these sub-classes is obtained by slight modifications of the algorithm PARTITION_EXACT2.

Single Interval NISs
This section addresses the following sub-class of P1, denoted as P2:
P2: Instance: Ordered quadruple $(a, b, c, S)$, where $a$ is a positive number, $b$ and $c$ are integers, and $S = \{N_1, N_2, \ldots, N_r\}$ is a set of NISs of the form $N_i = \{I_i\}$, that is, each NIS consists of a single interval. Objective: Determine whether there exists a b-balanced bi-partitioning $(S_1, S_2)$ of $S$ such that $G_a(S_1, S_2) \ge c$.
We can use the basic algorithm PARTITION_EXACT2 to solve P2; however, the following observation allows us to achieve a much faster algorithm.
Observation 5. Let $P = \{N_1, N_2, \ldots, N_k\}$ be a set of NISs, each containing a single interval. Then the internal-intersection of $P$ is a NIS that consists of either a single interval or no interval.
This observation tells us that no matter how we partition the set of NISs $S$ of a P2 instance into $S_1$ and $S_2$, the internal-intersection of either of the partitions $S_1$, $S_2$ consists of at most a single interval. That is, we do not need to spend time on multiple-interval NISs for $N$ and $M$, since such NISs cannot possibly be the internal-intersection of partitions for a bi-partitioning $(S_1, S_2)$ of $S$. Therefore, to solve P2 we can use algorithm PARTITION_EXACT2 with the for loops modified such that only single-interval NISs are picked for $N$ and $M$. This leads to $f_2 = O(s + r) = O(s)$ and a time complexity of $O(sp^4)$, where $s$ is the number of intervals in the problem instance and $p$ is the cardinality of the endpoint set of $S$, and hence we have the following theorem:
THEOREM 2 The problem P2 can be solved in polynomial time.
Observation 6. Let $I_{min}$ and $I_{med}$ represent the intervals in the P2 instance with the smallest and b-th largest lengths, respectively. Then it is easy to show that for one of the partitions we only need to enumerate intervals of lengths no more than $I_{med}$, and for the other partition we only need to enumerate intervals of lengths no more than $I_{min}$. This observation allows limiting the solution space to be searched during the execution of the algorithm. However, it does not improve the asymptotic time complexity of the algorithm.
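The restriction to single-interval candidates can be sketched as follows. This is an illustration in the spirit of the modified PARTITION_EXACT2 described in the text, not the authors' listing (Figures 8 and 9 are not reproduced here). It assumes $a = 0$, as in the experimental setup, so that the gain is just the total sleep time, and it assumes $b \ge 1$. The function name `solve_p2` and the encoding of the empty interval as `None` are hypothetical. For every pair of candidate intervals with endpoints from the endpoint set, it checks whether the elements can be split so that each element covers the candidate of its partition while both partitions keep at least $b$ elements.

```python
from itertools import combinations


def solve_p2(intervals, b):
    """Best b-balanced bi-partitioning of single-interval idle sets, assuming a = 0.

    intervals -- list of (l, r) pairs, one idle interval per element
    Returns (gain, indices_of_S1, indices_of_S2), or None if no partitioning exists.
    """
    r = len(intervals)
    points = sorted({p for iv in intervals for p in iv})
    # Candidate internal-intersections: the empty interval (None) plus every
    # interval whose endpoints are taken from the endpoint set.
    candidates = [None] + list(combinations(points, 2))

    def covers(iv, cand):
        return cand is None or (iv[0] <= cand[0] and cand[1] <= iv[1])

    def length(cand):
        return 0 if cand is None else cand[1] - cand[0]

    best = None
    for A in candidates:
        in_a = [covers(iv, A) for iv in intervals]
        for B in candidates:
            in_b = [covers(iv, B) for iv in intervals]
            if not all(x or y for x, y in zip(in_a, in_b)):
                continue  # some element fits neither partition
            only_a = [i for i in range(r) if in_a[i] and not in_b[i]]
            only_b = [i for i in range(r) if in_b[i] and not in_a[i]]
            both = [i for i in range(r) if in_a[i] and in_b[i]]
            k = max(len(only_a), b)  # smallest admissible size of S1
            if k > len(only_a) + len(both) or r - k < b:
                continue  # the split cannot be b-balanced
            g = length(A) + length(B)
            if best is None or g > best[0]:
                take = k - len(only_a)
                best = (g, only_a + both[:take], only_b + both[take:])
    return best
```

With $O(p^2)$ candidates per partition and $O(s)$ work per pair, the sketch stays within the $O(sp^4)$ bound stated above. Handling $a > 0$ would additionally require evaluating the true internal-intersections of the constructed partitions, which is omitted here.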
It should be noted that the solution to problem P2 suggests a heuristic algorithm for the general problem P1. The idea would be to devise a function mapping the set $S$ of multi-interval NISs in the given instance of P1 onto a set $S'$ of single-interval NISs, and thus generate a P2 instance. Then the optimal partitioning solution for $S'$ yields a heuristic partitioning solution for $S$.
The choice of this mapping could affect the quality of the heuristic solution, and is to be studied. A reasonable candidate maps an NIS $N$ to NIS $N'$, where $N' = \{I\}$ and $I$ is the longest interval in $N$.
Figure 10 demonstrates the idea.
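The longest-interval mapping mentioned above is simple to state in code. The sketch below assumes each NIS is a non-empty list of `(l, r)` tuples; the function name is hypothetical, and ties between equally long intervals are broken in favor of the first one, which is an arbitrary choice.

```python
def longest_interval_map(S):
    """Map each multi-interval NIS to a single-interval NIS holding its longest interval."""
    return [[max(nis, key=lambda iv: iv[1] - iv[0])] for nis in S]
```

The resulting set can be handed to any solver for P2, and the partitioning it returns can then be evaluated on the original idle sets. Since every mapped interval is part of the original idle set, with $a = 0$ the sleep time computed on the mapped instance never exceeds the sleep time the same partitioning achieves on the original instance.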

Bounded Number of Switchings
In practice, switching the partitions in and out of sleep mode is itself a power consuming activity which should be minimized. Moreover, as the number of such switchings is increased, the complexity of the extra control logic needed to supervise the sleep mode is also increased. As a solution to this problem, the following (restricted) version of P1, called P3, is introduced, in which the summation of the switchings of the partitions is upper bounded by an input parameter $d$.
P3: Instance: Ordered quintuple $(a, b, c, d, S)$, where $a$ is a positive number, $b$, $c$, $d$ are integers, and $S = \{N_1, N_2, \ldots, N_r\}$ is a set of NISs. Objective: Determine whether there exists a b-balanced bi-partitioning $(S_1, S_2)$ of $S$ such that $G_a(S_1, S_2) \ge c$ and $sw_1 + sw_2 \le d$, where $sw_i = |\cap(S_i)|$ is the number of intervals in the internal-intersection of $S_i$.
One may think that by increasing parameter $a$ in a P1 instance we can control $sw_1 + sw_2$ in the final solution. However, this does not affect the time complexity of the algorithm PARTITION_EXACT2. On the other hand, by upper-bounding $sw_1 + sw_2$ in problem P1 we can achieve an $O(sp^{2d})$ algorithm to solve the optimization version of P3 optimally, which is a pseudo-polynomial algorithm (footnote 7). To do this we restrict the algorithm PARTITION_EXACT2 to testing only those combinations of $N$, $M$ that satisfy $|N| + |M| \le d$. Note that $|N| = sw_1$ and $|M| = sw_2$. The following theorem is an immediate result:
Footnote 7: It is polynomial in $s$, $p$, and exponential in $\log d$, the minimum size needed to express $d$ in the problem instance.
THEOREM 3 The problem P3 can be solved in (pseudo)polynomial time.

GENERALIZATION
In this section we briefly mention a couple of generalizations of P1 (and its counterparts P2, P3). This is intended to suggest that the basic formulation is easily adaptable to cover a broader range of optimization problems. We discuss two generalizations: multi-way partitioning and weighted partitioning. Note that multi-way and weighted partitioning could also be combined.
Weighted Partitioning: This version of the problem is applicable in circumstances where only statistical analysis is possible to obtain the idle times for each element.In such cases there is a weight wi associated with each interval Ii.The weight of an interval can represent the probability that the corresponding element is idle during that interval.The weighted version is also useful to model a circuit where different sub-circuits have different power attributes due to the fact that they result in various savings in power consumption even if they are switched into sleep mode for the same period of time.In that case, each NIS Ni has a weight wi associated with it.
Multi-way Partitioning: This is a straightforward generalization. To solve this problem we can either perform a recursive application of the algorithms presented for the corresponding bi-partitioning problem, or enumerate the potential internal-intersection for each of the partitions using $m$ nested loops that would replace the two nested loops in steps 4 and 5 of algorithm PARTITION_EXACT2. The trade-off is between the quality of results and the running time of the algorithm. The recursive application of the bi-partitioning approach is faster; however, it generates results of inferior quality. Note that this problem is especially useful in partitioning for clock tree construction to maximize savings in power consumption by clock gating (see Fig. 1c).

EXPERIMENTAL RESULTS
The algorithm PARTITION_EXACT2 and its modifications to optimally solve P2 are implemented in C and tested. Because of the unavailability of test data due to the novelty of the problem and its formulation, a set of randomly generated data with controlled parameters was used as test cases. The results of the experiments are shown in Table I. To simplify the comparison, the following settings are made for all the test cases: $|S| = 100$ ($S$ is the set of elements). Factor $a$ is set to 0 ($a$ is the penalty factor for the total number of switchings). This makes sense because the switching of either of the partitions is in the range {0, 1}, hence the sleep mode control circuitry will cause negligible overhead on the area or power consumption.
T width of the time window 50 (See Fig. 4).
To apply this algorithm to the general case, one can use a pre-processing step that takes as input the idle sets of the CEs and generates as output a single idle interval for each CE. The generated single idle interval for a CE can simply be the longest interval in the idle set of that CE, or it can be obtained using a more complicated strategy.

The parameter min-len shows the length of the shortest interval in each problem instance. For each value of min-len, 10 random inputs are generated and tested with the algorithm. The minimum, maximum and average values of the ratio (t1 + t2)/T resulting from our partitioning algorithm and from a random partitioning algorithm are shown, where t1 and t2 are the exploitable sleep times of the partitions in the resulting bi-partitioning. The higher this ratio is, the greater the savings in power consumption would be if we place the corresponding partitions in sleep mode. Note that if we do not consider the idle times in a partitioning scheme (as has been done so far), the result is essentially equivalent to a random partitioning. However, by partitioning the set of elements according to their idle times we can maximize the ratio (t1 + t2)/T and minimize the power consumption by exploiting sleep mode. It can be observed that as the length of the minimum idle times (min-len) is increased to cover the whole time window, the results of the two approaches get closer. Note that since the computation window has width T = 50, the practical range for min-len is 5 to 25. These cases are shown in bold face in the first column of Table I. In such cases, our algorithm produces superior results, with an average of 7 to 40% more sleep time compared to random partitioning.
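The pre-processing step and the evaluation metric can be illustrated with a short sketch. It assumes half-open integer intervals and takes the exploitable sleep time of a partition to be the length of the intersection of its elements' single idle intervals; the function names and the random baseline random_bipartition are hypothetical and are not taken from the C implementation.

```python
import random
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open idle interval [start, end)


def longest_interval(idle_set: Sequence[Interval]) -> Interval:
    """Pre-processing: reduce an idle set to its longest interval."""
    return max(idle_set, key=lambda iv: iv[1] - iv[0])


def common_idle_time(intervals: Sequence[Interval]) -> int:
    """Length of the time span during which every element is idle.
    An empty partition is counted as contributing no sleep time."""
    if not intervals:
        return 0
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return max(0, end - start)


def sleep_ratio(part1: Sequence[Interval], part2: Sequence[Interval], T: int) -> float:
    """The ratio (t1 + t2)/T for a bi-partitioning."""
    return (common_idle_time(part1) + common_idle_time(part2)) / T


def random_bipartition(intervals: Sequence[Interval], b: int,
                       rng: random.Random) -> Tuple[List[Interval], List[Interval]]:
    """Baseline: random split with at least b elements on each side."""
    shuffled = list(intervals)
    rng.shuffle(shuffled)
    cut = rng.randint(b, len(shuffled) - b)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    T = 50
    idle_sets = {"A": [(0, 5), (10, 30)], "B": [(12, 28)],
                 "C": [(30, 48), (2, 6)], "D": [(33, 50)]}
    single = {name: longest_interval(s) for name, s in idle_sets.items()}
    grouped = sleep_ratio([single["A"], single["B"]], [single["C"], single["D"]], T)
    mixed = sleep_ratio([single["A"], single["C"]], [single["B"], single["D"]], T)
    print(grouped, mixed)
```

In this small example, grouping elements with similar idle intervals gives a ratio of 31/50, while the mixed grouping leaves no common idle time at all.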
10. DISCUSSION AND CONCLUSION
In this paper we studied the circuit partitioning problem to exploit sleep mode operation for minimization of the average power consumption. The motivation is to de-activate the memory refresh circuitry, apply power down, or simply disable the clock signals during the inactive periods of operation of the corresponding circuit elements. The idea is to partition the set of elements such that elements with close activity patterns are grouped into the same partition, so that each partition can be switched into sleep mode during the time intervals in which all of its elements are idle. We formulated the problem and showed that it is NP-complete. We also discussed some special classes of the problem that are solvable in polynomial time.
Experiments were conducted to show the effectiveness of the presented algorithms. The results show the possibility of significant savings in power consumption if the sleep mode is exploited properly. Recently, a more realistic set of experiments has been reported in [9] for memory segmentation on a number of DSP and numerical applications, with considerable reduction in the estimated power consumption of the memory unit, using an iterative improvement partitioning technique. To obtain the idle sets for memory elements in the work reported in [9], the applications were run on an emulator with a profiling tool that kept track of different resource utilizations over time and space. The idle sets were then calculated from the access sequence provided by the profiling tool, using an idea similar to the one mentioned in Section 3.1. Further work on gated clock tree design has been reported in [3,19].
The following provides directions for further research in this area:
Improving the time complexity of the algorithms. Although the algorithms presented for special cases P2 and P3 are polynomial time algorithms, the growth rate of the running time with problem size limits the applicability of this approach.
Having shown that exploiting sleep mode can lower the power consumption, new problems arise in high-level synthesis, namely how to perform the scheduling and allocation tasks such that the potential savings in power consumption achievable by exploiting sleep mode operation are maximized. It is noteworthy that the register allocation step in high-level synthesis tends to minimize the sleep time of the registers in order to reduce the number of registers required in the design. This brings up the trade-off between area and power consumption in high-level synthesis, which calls for further investigation.
We mentioned earlier that by using a mapping function F to obtain single-interval NISs from the multi-interval NISs in S, we can construct a P2 instance from a given P1 instance. The constructed P2 instance can then be solved optimally to yield a heuristic partitioning solution for the original P1 instance. Further theoretical and experimental studies can be pursued to identify suitable choices for the mapping function F.
It would be worthwhile to devise heuristics that perform the partitioning sub-optimally, but fast. These could serve as a design aid for low power design, giving the designer quick feedback on how design modifications or decisions made at higher levels would affect the sleep times of the partitions. The geometric flavor of the problem calls for carefully designed algorithms that exploit the geometric features of the problem to achieve good solutions. Hence it is worthwhile to study this problem from a geometric viewpoint in search of fast approximation or heuristic algorithms for the general or special classes of the problem.
Another interesting question is whether or not P1 can be formulated as a (hyper)graph partitioning problem, or in general as any (hyper)graph problem at all. Our attempts indicate that such a formulation is unlikely to exist, although we have no formal proof to present for it. Further research is in order to show whether or not such a formulation is possible. If the answer is positive, the existing algorithms for the (hyper)graph formulation can be applied to solve P1.
As P1 is formulated as a set partitioning problem, it would be interesting to see how well the existing heuristics for MCP, e.g., Kernighan-Lin [13], Fiduccia-Mattheyses [10], Ratio-Cut [25], etc., can be modified to operate on P1 instances, how fast they can be implemented, and how well they perform.
A crucial assumption in this paper was the availability of the activity patterns (idle times) as input to our problem. It is of particular interest to categorize the designs for which such patterns can be generated efficiently. Furthermore, in cases where such patterns cannot be generated as a set of exact idle sets, statistical approaches could be employed to generate a weighted version of the idle sets in which the weights represent the probabilities of being idle during different periods. It is therefore worthwhile to formulate and study the weighted version of the problem.
A generalization of the problem would be to allow multi-way partitioning, and perhaps to compute the optimal number of partitions as well as the contents of each partition.
It is also useful to investigate other areas in which problem (P1) could find applications.
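Since all of the above assumes idle sets as input, the sketch below illustrates one plausible way to derive an idle set for a memory element from a profiled access sequence. It is not the procedure of Section 3.1 or of [9]; it assumes discrete time steps 0, ..., T-1, that each access occupies exactly one time step, and half-open intervals, and the name idle_intervals is hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open idle interval [start, end)


def idle_intervals(access_times: Iterable[int], T: int) -> List[Interval]:
    """Return the gaps between accesses inside the window [0, T).
    Assumes every access time t satisfies 0 <= t < T."""
    idle: List[Interval] = []
    prev_end = 0  # first time step not yet covered by an access or a gap
    for t in sorted(set(access_times)):
        if t > prev_end:
            idle.append((prev_end, t))
        prev_end = t + 1
    if prev_end < T:
        idle.append((prev_end, T))
    return idle


if __name__ == "__main__":
    print(idle_intervals([3, 4, 10], 15))
```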
FIGURE 1 Circuit partitioning to exploit sleep mode operation.

FIGURE 2 Computing idle sets for memory from the memory access sequence.

FIGURE 3 Computing the idle sets for a scheduled and allocated design.

FIGURE 4 Example of memory partitioning for sleep mode.

FIGURE 5 An MCP instance and its corresponding P1 instance.

FIGURE 7 Categorization of edges in MCP instance.
[Algorithm listing garbled in extraction: a partition-construction step that checks partition sizes against the balance factor b and otherwise goes to the next iteration of the loop at step 5.]

FIGURE 10 Using P2 to obtain a heuristic algorithm for P1.
Balance factor b = 40 (each partition should contain at least 40 elements).
A single interval per NIS (complying with a P2 instance).

TABLE I Comparison of our partitioning algorithm and random partitioning