Computation Reordering: A Novel Transformation for Low Power DSP Synthesis

A novel architectural transformation for low power synthesis of inner product computational structures is presented. The proposed transformation reorders the sequence of evaluation of the multiply-accumulate operations that form the inner products. Information related to both coefficients, which are statically determined, and data, which are dynamic, is used to drive the reordering of computation. The reordering reduces the switching activity not only at the inputs of the computational units but inside them as well, leading to a reduction in power consumption. Different classes of algorithms requiring inner product computation are identified and the problem of computation reordering is formulated for each of them. The target architecture to which the proposed transformation applies is based on a power optimal memory organization and is described in detail. Experimental results for several DSP algorithms show that the proposed transformation leads to significant savings in net switching activity and thus in power consumption.


INTRODUCTION
The recent rapid advances in wireless communications and multimedia technology have made available a large number of portable battery-operated systems such as cellular phones, pagers, wireless modems, portable videophones and hand-held digital video cameras. All these systems make extensive use of digital signal processing (either one- or two-dimensional). Since power consumption is the overriding issue in the design of portable systems, low power digital signal processing has become an increasingly important research area.
One of the most common operations performed in digital signal processing applications is the inner product computation. An N-point inner product between an N-point vector of data $D = [d_0, d_1, \ldots, d_{N-1}]$ and an N-point vector of coefficients $C = [c_0, c_1, \ldots, c_{N-1}]$ is described by the following equation:
$$y = \sum_{i=0}^{N-1} c_i d_i \qquad (1)$$
There are many ways of realizing inner product structures in hardware as part of digital signal processing systems. One way is to use dedicated hardware multipliers and adders (pure custom hardware approach). A second alternative is the use of an instruction set processor (either a general purpose processor or a digital signal processor) and its computational units. In instruction set digital signal processors in particular, the main computational unit is the multiply accumulator. Finally, a multiply accumulator based custom hardware architecture that allows sharing of hardware units (the adders and multipliers forming the multiply accumulators) and offers low-level programmability and flexibility can be used. In all cases data and coefficients are stored in buffers, either in background memories or in foreground memories.
The power consumption in digital CMOS circuits is due to three sources [1]: the dynamic (or switching), the short circuit, and the leakage power dissipation:
$$P_{avg} = P_{switching} + P_{shortcircuit} + P_{leakage} = \alpha_{0 \to 1} C_L V_{DD}^2 f_{clk} + I_{sc} V_{DD} + I_{leakage} V_{DD} \qquad (2)$$
where $\alpha_{0 \to 1}$ is the node transition activity factor (the average number of times the node makes a power consuming transition in one clock period), $C_L$ is the load capacitance, $f_{clk}$ is the clock frequency and $V_{DD}$ is the supply voltage. $I_{sc}$ is the average short-circuit current and $I_{leakage}$ is the average leakage current, which depends on the fabrication technology.
*Corresponding author. Tel.: (+) 30 61 997324, Fax: (+) 30 61 994798, e-mail: masselos@ee.upatras.gr
Optimization for power can be achieved at all levels of the design flow. Various techniques have been proposed in the past to minimize the sources of power dissipation. Since the switching component is often the most significant source of power dissipation, most of the proposed power optimization techniques aim at reducing the switching (dynamic) component. Furthermore, the switching component of power dissipation can be optimized at the high levels of abstraction, where the most significant power savings can be achieved [1].
A large number of data path synthesis techniques for power optimization have been proposed. Transformation based data path synthesis techniques for low power are of significant importance and have been addressed in [2][3][4][5]. In the area of low power digital signal processing several techniques for power optimization of FIR filters have been proposed. In [6,7] the transformation of the binary representation of the filter coefficients is proposed, while the technique described in [8] perturbs the filter coefficients to decrease the number of operations required to obtain the filter output while preserving the desired filter characteristics. In [9] a technique for low power realization of FIR filters using differential coefficients is presented, while in [10] a transformation for power reduction in FIR filters, through minimization of the number of multiplications, is described. In [11, 12] low power architectures for the Discrete Cosine Transform (DCT) and the Discrete Fourier Transform (DFT) are presented respectively. General methodologies for low power digital signal processing are described in [13][14][15][16].
In this paper a novel architectural transformation for power consumption reduction in realizations of inner product computational structures present in digital signal processing algorithms is described. The main idea is to reorder the computation so as to reduce the switching activity at the inputs of the computational units. Information related to both data and coefficients is taken into account during the reordering. The Hamming distance is used as the metric of switching activity. Different cases are identified based on the computational structures of digital signal processing algorithms. Formulations of the computation reordering problem for the different classes of algorithms as well as the target architecture template are described.
The rest of the paper is organized as follows: In Section 2 an example illustrating the proposed computation reordering transformation is offered. The different classes of algorithms are presented in Section 3, while in Section 4 the target architecture template is described in detail. The proposed transformation is described in detail in Section 5. Experimental results for several digital signal processing algorithms are presented in Section 6. In Section 7 the generalization of the proposed methodology is described and finally in Section 8 conclusions are offered.

MOTIVATING EXAMPLE
Let EXAMPLE_DSP be a digital signal processing algorithm requiring evaluation of 4-point inner products between 4-point data and coefficient vectors. Let the coefficient vector $[c_0, c_1, c_2, c_3]$ be equal to [13, 2, 9, 6] and assume that the data vector $[d_0, d_1, d_2, d_3]$ at a specific instance of the algorithm is equal to [1, 3, 14, 2]. EXAMPLE_DSP requires computation of 4-point inner products y using the following equation:
$$y = \sum_{i=0}^{3} c_i d_i = \sum_{i=0}^{3} p_i \qquad (3)$$
where $p_i$ denotes the partial product $c_i \times d_i$. The Hamming distance between two partial products is defined as the sum of the Hamming distance between their coefficients and the Hamming distance between their data. The Hamming distances between all possible pairs of partial products for the specific instance of EXAMPLE_DSP are given in Table I. If the inner product is computed in the order defined in EXAMPLE_DSP ($p_0, p_1, p_2, p_3$) on a single multiply accumulator, the total Hamming distance at the inputs of the multiply accumulator is equal to 17. This case is illustrated in Figure 1.
If Table I is examined in detail, the sequence of evaluation of the partial products that results in minimum switching activity at the inputs of the multiply accumulator can be derived. In fact the sequence of partial products $p_2, p_0, p_1, p_3$ (as well as the sequence $p_1, p_3, p_0, p_2$) results in a total Hamming distance at the inputs of the multiply accumulator equal to 12. This means that if data were known before implementation, the partial products could be scheduled as described above, leading to a switching activity reduction equal to 5. However, in real life applications data are dynamic, i.e., not known before system realization. This dynamic behavior of data is a major problem in power estimation, since power consumption depends on the input data.
Based on the coefficient information, which is available before realization, the sequence of evaluation of the partial products that minimizes the activity at the coefficient input of the multiply accumulator can be determined. The sequence of coefficients $c_0, c_2, c_1, c_3$ (as well as the sequence $c_1, c_3, c_0, c_2$) results in the minimum activity at the coefficient input of the multiply accumulator. The activity of the sequence $c_0, c_2, c_1, c_3$ is 5, while the activity of the original sequence of coefficients is 11. If the inner product is computed following the sequence of evaluation of the partial products that corresponds to the minimum activity sequence of coefficients, the total activity at the inputs of the multiply accumulator equals 13, i.e., it is slightly worse than the optimum (12, achieved when data information is also used to schedule the evaluation of the partial products). This case is illustrated in Figure 2.
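The figures quoted in this example can be checked with a short script. The following is a minimal sketch, not part of the original work: the helper names (`hamming`, `pair_cost`, `sequence_cost`, `coeff_cost`) are hypothetical, the Hamming distance is taken over the plain unsigned binary representations, and the search is an exhaustive enumeration of all 24 orderings, which is only practical for very small inner products.

```python
from itertools import permutations

def hamming(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")

COEFFS = [13, 2, 9, 6]  # c0..c3 from the example
DATA = [1, 3, 14, 2]    # d0..d3 for the specific instance

def pair_cost(i, j, use_data=True):
    """Hamming distance between partial products p_i and p_j."""
    cost = hamming(COEFFS[i], COEFFS[j])
    if use_data:
        cost += hamming(DATA[i], DATA[j])
    return cost

def sequence_cost(order, use_data=True):
    """Total Hamming distance over successive pairs of an evaluation order."""
    return sum(pair_cost(a, b, use_data) for a, b in zip(order, order[1:]))

def coeff_cost(order):
    """Activity at the coefficient input only (static information)."""
    return sequence_cost(order, use_data=False)

original = (0, 1, 2, 3)
best_full = min(permutations(range(4)), key=sequence_cost)
best_coeff = min(permutations(range(4)), key=coeff_cost)

print(sequence_cost(original))   # 17: original order, coefficients and data
print(sequence_cost(best_full))  # 12: best order when the data are known
print(coeff_cost(original))      # 11: original order, coefficients only
print(coeff_cost(best_coeff))    # 5: best coefficient-only order
print(sequence_cost(best_coeff)) # 13: real activity of the coefficient-only order
```

The exhaustive search returns one of several orderings with the same minimum cost, so the specific permutation it reports may differ from the one named in the text while having the same total Hamming distance.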
Although data cannot be known before the algorithm's run time, typical input data may be known before realization. Assuming that typical input data of the EXAMPLE_DSP algorithm are available, simulation of these data can be used to approximate the Hamming distances between all pairs of data elements of the 4-point data vectors. Table II gives the result of the statistical analysis of the typical input data of EXAMPLE_DSP, i.e., the average Hamming distances for all pairs of data in the 4-point vector. Combining this data related information with the already available coefficient information, the Hamming distances for all pairs of partial products can be evaluated. This information is included in Table III. Based on the values of Table III, the minimum activity sequence of evaluation of the partial products is $p_2, p_0, p_1, p_3$ with an activity value equal to 13.2. The real activity of this sequence is 12, i.e., the real minimal activity. In this last case, where the average Hamming distances determined by simulation are taken into account, the Hamming distances between pairs of partial products (as well as between pairs of data) are real numbers. This case is illustrated in Figure 3.
The results of the reordering of the evaluation of the partial products are summarized in Table IV. They demonstrate the significant savings in switching activity at the inputs of the computational units that can be obtained by the proposed transformation, i.e., by changing the schedule of partial product evaluation. This reduction of the input switching activity results in switching activity and power consumption reduction within the computational units as well. It must be noted that the use of the data related information determined by simulation does not always guarantee a globally optimal solution of the reordering problem.
Two classes of DSP algorithms requiring inner product computation are identified:
(a) First class: Convolution type algorithms (FIR filtering, Wavelets).
(b) Second class: Transformation type algorithms (DCT, DFT, FFT).

First Class of Algorithms
The main type of computation in the algorithms of this class is the convolution between data and coefficient vectors. A typical example belonging to this category is Finite Impulse Response (FIR) filtering, which is one of the most common DSP applications. An N-tap FIR filter performs the following convolution:
$$y_n = \sum_{i=0}^{N-1} c_i x_{n-i} \qquad (4)$$
where the $c_i$ are the coefficients of the filter (forming an N-point coefficient vector) and $x_n$, $y_n$ are the nth terms of the input and output sequences respectively. From the above equation it is clear that the evaluation of one point of the output requires computation of an N-point inner product between data and coefficient vectors. The main characteristic of the computation performed by the algorithms of this class is that the output terms are evaluated using the same coefficient vector and different data vectors (for successive output terms the data vectors overlap by N-1 terms and differ in one term). The computation required for one output point (as described by Eq. (4)) is defined as the basic computation for this class of algorithms. Another algorithm belonging to this class is the Discrete Wavelet Transform (forward and inverse), whose main task is FIR filtering.

Second Class of Algorithms
The main type of computation in the algorithms of this class is the matrix-vector multiplication between a coefficient matrix and a data vector. Typical examples of algorithms belonging to this category are the common digital signal processing transformations like the Discrete Cosine Transform (DCT), the Discrete Fourier Transform (DFT), and the Fast Fourier Transform (FFT). An $N \times M$ transformation is described by the following equation:
$$Y = C \cdot X \qquad (5)$$
where Y is the N-point output data vector (elements $y_i$), C is the $N \times M$ coefficient matrix (elements $c_{ij}$), and X is the M-point input data vector (elements $x_j$). Evaluation of each term $y_i$ of the output column vector requires computation of an M-point inner product between the ith row of the coefficient matrix and the input data vector. It must be noted that the above-described computation is present in the one-dimensional versions of the transformations. Since the two-dimensional versions of the transformations are computed using the row-column decomposition, the main computational structure in the two-dimensional versions of the algorithms is also the one described in Eq. (5). The main characteristic of the basic computational structure of the algorithms of the second class is that the coefficient vector used for the computation of the output points is not always the same. The number of different sets of coefficients is equal to N, i.e., the first transformation dimension. On the other hand, the same data vector is used for N output points of the transformation. The computation required for N output points as described by Eq. (5) is defined as the basic computation for the second class of algorithms.

TARGET ARCHITECTURE
The structure of the proposed architecture is shown in Figure 4. The application data are stored in a background memory (either on-chip or off-chip). This memory may be quite large, especially for multidimensional signal processing applications like image and video processing [17]. For both classes of algorithms a set of foreground registers is used to store the data terms required for each inner product computation. This memory hierarchy is introduced to exploit the data reuse present in both classes of algorithms and reduce the memory related power consumption [18]. This extra foreground memory does not introduce a significant penalty, since in submicron technologies the area and power costs of a register are very small.
For the algorithms of the first class, computation of N-point convolutions requires N accesses (reads) to the background data memory for each convolution output if no hierarchy is introduced. However, the computation of two successive output terms of the convolution requires N-1 common input data terms and only one different term (overlap by N-1). If a set of N foreground registers is used to store the input data terms for each convolution, the number of background memory accesses is heavily reduced. Specifically, only one read from the background memory and N+1 accesses (N reads and 1 write) to the foreground registers (which are far less power expensive than background memory accesses) are required for each convolution computation. After the nth output term of the convolution is computed, the data term in each register is shifted to the previous register, the new data term is transferred from the background memory to the first register (register[n]), and the computation of the (n+1)th output term of the convolution may start.
An algorithm of the second class requiring computation of an $N \times M$ transformation accesses the background memory $N \times M$ times if no memory hierarchy is introduced. Introducing M registers, to which the M input data terms required for the computation of an output vector are transferred in a first step, reduces the number of background memory transfers significantly. The presence of registers leads to M accesses to the background memory and $N \times M + M$ accesses to the foreground registers for the computation of an N-point output vector.
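The access counts for the first class can be illustrated with a small behavioural model. This is a sketch under stated assumptions rather than a description of the actual hardware: the function name `fir_with_registers` and the counter names are hypothetical, the register bank is modelled as a Python `deque` that starts filled with zeros, and the shift of the whole bank is counted as a single register write, following the accounting used in the text (N reads and 1 write per output).

```python
from collections import deque

def fir_with_registers(samples, coeffs):
    """FIR filtering with N foreground registers; returns outputs and access counts."""
    n_taps = len(coeffs)
    # regs[0] holds the newest sample, regs[n_taps - 1] the oldest one.
    regs = deque([0] * n_taps, maxlen=n_taps)
    counts = {"background_reads": 0, "register_reads": 0, "register_writes": 0}
    outputs = []
    for sample in samples:
        counts["background_reads"] += 1  # one new data term per output
        regs.appendleft(sample)          # shift; the oldest term drops out
        counts["register_writes"] += 1
        acc = 0
        for k in range(n_taps):          # N register reads per output
            acc += coeffs[k] * regs[k]
            counts["register_reads"] += 1
        outputs.append(acc)
    return outputs, counts

outputs, counts = fir_with_registers([1, 2, 3, 4, 5], [1, 0, 2])
print(outputs)  # [1, 2, 5, 8, 11]
print(counts)   # 5 background reads, 15 register reads, 5 register writes
```

For the five outputs of this 3-tap example the model performs one background read and N+1 = 4 foreground accesses per output, compared with N = 3 background reads per output when no register bank is used.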
The register based foreground memory organization clearly leads to significant power savings and is beneficial even if no computation reordering is performed (implementation based on the original specification). Since the research under consideration aims at power consumption reduction, the above described architecture model is adopted and is used for the comparison of implementations based on the original computation ordering with implementations based on the derived computation ordering.
The proposed computation reordering transformation does not introduce any addressing penalty, although the order in which the data terms are transferred to the inputs of the functional units changes. The data are transferred to the foreground registers (either from the background memory or directly from the inputs of the circuit) in the same order as in the original case but are read from them according to the new order of computation. The foreground storage elements (registers or register files) are addressed directly by the control unit. Thus no address generation overhead is introduced by the reordering of computation. As far as the activity of the address lines (in the case of register files) is concerned, activity reduction techniques like Gray encoding can be used, as in the case where no computation reordering is performed.
The coefficients are stored in a coefficient memory (usually a ROM). In some cases coefficients are stored in the same memory block as the data. The computation of the output terms of the convolution is performed on a multiply accumulator based computational unit. Multiply accumulator modules, generated as complete functional blocks, may be used to achieve a higher operation speed than computational units consisting of a specific number of adders and multipliers [19]. A multiply accumulator based custom hardware architecture for the implementation of the discrete wavelet transform has been proposed in [20]. The multiply accumulators can be either pipelined or not pipelined. The number of multiply accumulators (number of hardware resources) used is in general determined by the performance requirements of the application. The number of register sets required depends on the number of basic computations performed in parallel (number of different output terms computed in parallel). One register set is required for each of the basic computations performed in parallel.

THE PROPOSED TRANSFORMATION

General
The proposed transformation reorders the sequence of evaluation of the partial products $c_i d_i$ that constitute an inner product. The criterion for the reordering of the computation is the minimization of the sum of Hamming distances between successive pairs of partial products of the form $c_i d_i$ and $c_{i+1} d_{i+1}$. The Hamming distance between two partial products $c_i d_i$ and $c_j d_j$ is defined as:
$$HD(c_i d_i, c_j d_j) = HD(c_i, c_j) + HD(d_i, d_j) \qquad (6)$$
where HD(a, b) is the Hamming distance of the binary representations of the numbers a, b. Minimization of the sum of Hamming distances between successive pairs of partial products results in switching activity reduction at the inputs of the computational units that perform the inner product computation. The reduction of the switching activity at the inputs of the computational units leads to switching activity and power consumption reduction inside them as well. To simplify analysis, in this section it is assumed that the computation is performed on a single multiply accumulator.
The evaluation of the Hamming distance between coefficients is straightforward, since in digital signal processing algorithms coefficients are statically determined, i.e., known before realization (hardware implementation or software compilation). On the other hand, the evaluation of the Hamming distance between two samples of data (belonging to the data vector of an inner product) is quite hard, since data are dynamic, i.e., not known before run-time. An estimate of the Hamming distance between two data samples can be obtained by simulation of typical data of a specific application. Consider a digital signal processing algorithm that requires computation of N-point inner products. The simulation can be performed in the following way: for every possible pair of samples $d_i$, $d_j$ that belong to the inner product data vector D, the Average Hamming Distance ($AHD_{ij}$) is evaluated. A sequence of typical application data can be used for simulation. From this sequence the inner product data vectors are extracted (according to the algorithm) and the average Hamming distances for all possible pairs of data that belong to the data vector are computed. The simulation stops when a stopping criterion on $AHD_{ij}(M)$ is satisfied, where $AHD_{ij}(M)$ is the average Hamming distance between the ith and jth data elements of the inner product data vector after M simulation cycles (i.e., different sets of values for the inner product data vector).
In this way information related to both data and coefficients can be used to drive the reordering of computation. Alternatively, only the static information (i.e., the values of the coefficients) can be used to determine the final schedule of the computation of the partial products. In such a case the optimality (in terms of switching activity reduction) of the derived schedule is traded off for an increased speed of the reordering procedure.
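The simulation step described above can be sketched as follows. This is an illustrative sketch only: the function name `average_hamming_distances` is hypothetical, the data vectors are assumed to be sliding windows over the sample sequence (as in a convolution type algorithm), and the averaging simply runs over all available windows, which is an assumption standing in for the stopping criterion of the text.

```python
def hamming(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")

def average_hamming_distances(samples, n_points):
    """Return AHD[i][j] averaged over all n_points-long sliding windows.

    Element i of a window plays the role of the ith element of the
    inner product data vector.
    """
    windows = [samples[s:s + n_points]
               for s in range(len(samples) - n_points + 1)]
    ahd = [[0.0] * n_points for _ in range(n_points)]
    for i in range(n_points):
        for j in range(n_points):
            total = sum(hamming(w[i], w[j]) for w in windows)
            ahd[i][j] = total / len(windows)
    return ahd

# Hypothetical "typical" data: two overlapping 4-point data vectors.
ahd = average_hamming_distances([1, 3, 14, 2, 7], 4)
print(ahd[0][1])  # 2.0, the mean of HD(1, 3) = 1 and HD(3, 14) = 3
```

The resulting matrix is symmetric with a zero diagonal, and its entries are in general real numbers, which is why the edge weights of the reordering problem are real valued when data information is included.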

First Class of Algorithms - Mathematical Model and Problem Formulation
As already stated, the basic task performed by the algorithms of the first class is the convolution between data and coefficients. An N-point convolution requires evaluation of an N-point inner product for each term of the output sequence. To simplify the presentation of the proposed transformation it is assumed that the basic computation for the algorithms of this class (as defined in the previous section) is performed on a single multiply accumulator. The convolution operation performed by the algorithms of this class for the evaluation of an output term is described by the following equation:
$$y_n = \sum_{k=0}^{N-1} c_{f(k)} x_{n-f(k)} \qquad (8)$$
where $y_n$ is the nth term of the output sequence, C is the coefficient vector and X is the vector of the input data. The computation is performed according to the ordering function $f(k) = k$, $k = 0, \ldots, N-1$. The total Hamming distance of the sequence of evaluation of the partial products as determined by the ordering function f(k) is given by the following equation:
$$\text{Total HD}(f(k)) = \sum_{k=0}^{N-2} HD(p_{f(k)}, p_{f(k+1)}) + HD(p_{f(0)}, p_{f(N-1)}) \qquad (9)$$
where $p_{f(k)}$ is the partial product $c_{f(k)} x_{n-f(k)}$. The Hamming distance for a pair of partial products $HD(p_k, p_l)$ is given as
$$HD(p_k, p_l) = HD(c_k, c_l) + AHD(x_{n-k}, x_{n-l}) \qquad (10)$$
The average Hamming distance (AHD) between the two data samples is determined by simulation as described above. In Eq. (9) the term $HD(p_{f(0)}, p_{f(N-1)})$, i.e., $HD(p_0, p_{N-1})$ for the original ordering, denotes the Hamming distance between the first and the last partial product (according to the ordering function f(k)). This term is included because after the computation of the last partial product ($p_{N-1}$) for the output term $y_n$, the first partial product ($p_0$) for the term $y_{n+1}$ must be computed.
The aim of the proposed methodology is to derive a new ordering function $g(k)$, $k = 0, \ldots, N-1$, such that the total Hamming distance of the convolution computation as determined by g(k) (Total HD(g(k)), given by Eq. (9) when f(k) is replaced by g(k)) is minimal and the inner product value is computed according to the equation
$$y_n = \sum_{k=0}^{N-1} c_{g(k)} x_{n-g(k)} \qquad (11)$$
Since there are no data dependencies between partial products, and taking into consideration that the basic computation is performed on a single multiply accumulator, the problem of finding the minimum cost (in terms of total input switching activity) ordering function is equivalent to finding the minimum cost schedule of partial products on the multiply accumulator. The cost function driving the reordering of the computation, i.e., the derivation of the new ordering function g(k), may include only the static information (related to coefficients) to speed up the reordering procedure.
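The effect of replacing the ordering function can be sketched numerically. The snippet below is an illustration under stated assumptions, not the authors' implementation: the names `inner_product` and `total_hd` are hypothetical, the list `window` holds the data terms so that `window[k]` plays the role of $x_{n-k}$, the values are those of the motivating example, and the Hamming distance of a single data instance is used as a stand-in for the simulated AHD. The cost is cyclic: it includes the term between the last and the first partial product.

```python
def hamming(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")

def inner_product(coeffs, window, order):
    """Accumulate c[k] * x[k] in the given evaluation order."""
    acc = 0
    for k in order:
        acc += coeffs[k] * window[k]
    return acc

def total_hd(coeffs, window, order):
    """Cyclic cost: successive pairs plus the last-to-first transition."""
    n = len(order)
    total = 0
    for k in range(n):
        a, b = order[k], order[(k + 1) % n]
        total += hamming(coeffs[a], coeffs[b]) + hamming(window[a], window[b])
    return total

coeffs = [13, 2, 9, 6]
window = [1, 3, 14, 2]
f = (0, 1, 2, 3)  # original ordering function f(k) = k
g = (2, 0, 1, 3)  # a reordered evaluation sequence

print(inner_product(coeffs, window, f), inner_product(coeffs, window, g))  # 157 157
print(total_hd(coeffs, window, f), total_hd(coeffs, window, g))            # 22 18
```

With integer (fixed-point) arithmetic the accumulated value is independent of the evaluation order, so only the switching cost changes. The cyclic cost of the original order (22) exceeds the open-sequence value of 17 quoted in the motivating example by exactly the wrap-around term.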
The problem of computation reordering for the first class of algorithms is formulated as a Travelling Salesman Problem (TSP). The graph $G(V, E)$ of the problem consists of the set V of the N vertices, which are the partial products required for the computation of an inner product, and the set E of edges, which model the unconstrained transition from one partial product to another. The problem's graph is complete, i.e., each vertex pair is connected by an edge. Each edge of the graph is assigned a weight, which is the HD between the two partial products that the edge connects, as defined by Eq. (10). A closed path over all vertices (partial products) must be found, without passing through a vertex more than once, resulting in a minimum HD cost. The weights of the edges between the vertices are non-negative real values. Their lower bound is zero and their upper bound equals the sum of the number of bits used for the representation of the coefficients and the number of bits used for the representation of the data terms. If the simpler cost function (including only the static information related to coefficients) is used, the costs of the edges are well-bounded non-negative integers. Their lower bound is zero and their upper bound equals the number of bits used for the representation of the coefficients.
Several algorithms have been proposed for the solution of the TSP. If the size of the problem is relatively small, an exact solution can be found in a short time. A large number of exact algorithms have been proposed for the TSP, which can be best understood and explained in the context of Integer Linear Programming (ILP) [21]. The NP-completeness of the problem motivated research into heuristic algorithms. Christofides [22] proposed a heuristic that uses a minimum-cost matching algorithm and requires $O(n^4)$ time to find a nearly optimal solution.
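The TSP formulation can be sketched on the weight matrix of the motivating example. The code below is a hedged illustration: `exact_tour` is a brute-force search that is only usable for small N, and `nearest_neighbour_tour` is a simple greedy heuristic chosen here for brevity; it is not the Christofides heuristic cited above and carries no optimality guarantee. All function names are hypothetical.

```python
from itertools import permutations

def tour_cost(weights, tour):
    """Cost of the closed path that visits the vertices in tour order."""
    n = len(tour)
    return sum(weights[tour[k]][tour[(k + 1) % n]] for k in range(n))

def exact_tour(weights):
    """Exhaustive search with vertex 0 fixed as the start (small N only)."""
    n = len(weights)
    best = min(permutations(range(1, n)),
               key=lambda rest: tour_cost(weights, (0,) + rest))
    return (0,) + best

def nearest_neighbour_tour(weights, start=0):
    """Greedy heuristic: always move to the cheapest unvisited vertex."""
    tour = [start]
    remaining = set(range(len(weights))) - {start}
    while remaining:
        last = tour[-1]
        nxt = min(sorted(remaining), key=lambda v: weights[last][v])
        tour.append(nxt)
        remaining.remove(nxt)
    return tuple(tour)

# Pairwise Hamming distances of the partial products of the motivating
# example (coefficients [13, 2, 9, 6], data [1, 3, 14, 2]).
W = [[0, 5, 5, 5],
     [5, 0, 6, 2],
     [5, 6, 0, 6],
     [5, 2, 6, 0]]

optimal = exact_tour(W)
greedy = nearest_neighbour_tour(W)
print(optimal, tour_cost(W, optimal))  # (0, 1, 3, 2) 18
print(greedy, tour_cost(W, greedy))    # (0, 1, 3, 2) 18
```

On this small instance the greedy tour happens to match the exact one; in general a heuristic tour can only be guaranteed to cost at least as much as the optimum.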

Second Class of Algorithms - Mathematical Model and Problem Formulation
Two different cases are identified for the algorithms of the second class.
Case 1
In a high level description (for example a C description) of a digital signal processing algorithm, the computation described in Eq. (5) is typically expressed as described in Figure 5:
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            y[i] += c[i][j] * x[j];
Figure 5. Loop structure of the basic computation of the second class (the loop body is reconstructed from Eq. (12)).
This piece of code computes sequentially (one after the other) the output points of an algorithm of the second class by multiplying a row of the coefficient matrix with the input data vector.
The computation of the output points of a transformation of size $N \times M$ is described by the following equation
$$y_i = \sum_{j=0}^{M-1} c_{ij} x_j, \quad i = 0, 1, \ldots, N-1 \qquad (12)$$
where $y_i$ is the ith output point, $c_{ij}$ are the M coefficients of the ith row of the coefficient matrix and $x_j$ are the M elements of the input data vector. Evaluation of one output term of an $N \times M$ transformation requires computation of an M-point inner product. To simplify analysis it will be assumed that the basic computation, as defined in Section 3.2, is performed on a single multiply accumulator. Since the coefficient structure is two-dimensional, two ordering functions are required to describe the matrix-vector computation of the second class of algorithms. Using ordering functions the computation described in Eq. (12) can be expressed as
$$y_{f(i)} = \sum_{g(i,j)=0}^{M-1} c_{f(i)g(i,j)} x_{g(i,j)}, \quad f(i) = 0, 1, \ldots, N-1 \qquad (13)$$
The computation is performed according to the ordering functions $f(i) = i$, $g(i,j) = j$, $i = 0, 1, 2, \ldots, N-1$, $j = 0, 1, 2, \ldots, M-1$. The ordering function f(i) corresponds to the rows of the coefficient matrix and the ordering functions g(i,j) (one per row of the coefficient matrix) correspond to the columns of the coefficient matrix. The function f(i) determines the order in which the inner products are computed (the order in which output terms are evaluated), while the function g(i,j) determines the order in which the partial products that form an inner product are computed and is different for each inner product. The total Hamming distance of the sequence of evaluation of the partial products that constitute the N inner products, as determined by the ordering functions f(i), g(i,j), is given by the following equation:
$$\text{Total HD}(f, g) = \sum_{f(i)=0}^{N-1} \left[ \sum_{g(i,j)=0}^{M-2} HD(p_{f(i),g(i,j)}, p_{f(i),g(i,j)+1}) + HD(p_{f(i),M-1}, p_{f(i)+1,0}) \right] \qquad (14)$$
where $p_{f(i),g(i,j)}$ is the partial product $c_{f(i)g(i,j)} x_{g(i,j)}$.
The Hamming distance for a pair of partial products is given as
$$HD(p_{f(i)g(i,j)}, p_{f(k)g(k,l)}) = HD(c_{f(i)g(i,j)}, c_{f(k)g(k,l)}) + AHD(x_{g(i,j)}, x_{g(k,l)}) \qquad (15)$$
The average Hamming distance (AHD) between two consecutive data samples is determined by simulation as earlier. In Eq. (14) the term $HD(p_{f(i),M-1}, p_{f(i)+1,0})$ denotes the Hamming distance between the last partial product of the inner product f(i) and the first partial product of the inner product f(i)+1. This term is included because after the computation of the last partial product ($p_{f(i),M-1}$) for the output term $y_{f(i)}$, the first partial product ($p_{f(i)+1,0}$) for the term $y_{f(i)+1}$ must be computed.
The aim of the proposed methodology is to derive new ordering functions p(i), r(i,j), such that the total Hamming distance of a matrix-vector product computation as described in Eq. (14) (when f, g are replaced by p, r) is minimized and the inner product value is computed according to the equation
$$y_{p(i)} = \sum_{r(i,j)=0}^{M-1} c_{p(i)r(i,j)} x_{r(i,j)}, \quad p(i) = 0, 1, \ldots, N-1 \qquad (16)$$
The cost function driving the reordering of the computation, i.e., the derivation of the new ordering functions p, r, may again include only the static information (related to coefficients) to speed up the reordering procedure.
No data dependencies exist among inner products or among the partial products forming an inner product. Furthermore it is assumed that the basic computation is performed on a single multiply accumulator. Thus the problem of deriving a minimum switching cost ordering function for the inner products is equivalent to deriving a minimum switching cost schedule for the inner products on the multiply accumulator. The problem of deriving minimum cost ordering functions for the evaluation of the partial products forming the inner products is equivalent to deriving minimum switching cost schedules for the partial products of each inner product on the multiply accumulator.
The derivation of new ordering functions will lead to a non-sequential order of appearance of the output points at the outputs of the computational units. However, this is also the usual case in fast transformation algorithms, which typically require some kind of post-processing.
The formulation of the computation reordering problem is harder for the algorithms of the second class, because the problem is two-dimensional in this case. In a first step the minimum cost ordering of the inner products $IP_i$ must be determined. In a second step the minimum cost ordering of the partial products $p_{ij}$ ($j = 0, 1, \ldots, M-1$) that constitute each inner product must be found. The problem is tackled in the following way: for all possible pairs of inner products $IP_i$, $IP_j$ ($i, j = 0, 1, \ldots, N-1$) the minimum cost (switching activity) connection is determined. The minimum cost connection is defined as the pair of partial products $p_{ik}$ (belonging to the inner product $IP_i$, $k = 0, 1, \ldots, M-1$) and $p_{jl}$ (belonging to the inner product $IP_j$, $l = 0, 1, \ldots, M-1$) that minimizes the Hamming distance over all possible pairs of partial products, as described by Eq. (17). The complexity of this procedure is $M^2 \cdot N \cdot (N-1)$.
Minimum cost connection for $IP_i \to IP_j$:
$$(p_{ik}, p_{jl}), \quad i, j = 0, 1, \ldots, N-1, \quad k, l = 0, 1, \ldots, M-1 : \quad HD(p_{ik}, p_{jl}) \le HD(p_{im}, p_{jn}) \ \text{for all} \ m, n = 0, 1, \ldots, M-1 \qquad (17)$$
Once the minimum cost connection has been determined for all possible pairs of inner products, the graph $G(V, E)$ of the problem can be constructed, where V is the set of N vertices and E is the set of edges. The graph is complete; the N vertices correspond to the N inner products, while the edges model the unconstrained transitions from one inner product to another. Each edge is assigned as its cost the minimum cost connection (determined previously) of the two inner products (vertices) that it connects.
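The construction of the minimum cost connections can be sketched as follows. The code is illustrative only: the function name `min_cost_connections`, the small 3 x 2 coefficient matrix and the AHD values are hypothetical, and ties are resolved in favour of the smallest (k, l) pair, a choice the text does not specify.

```python
def hamming(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")

def min_cost_connections(coeff_matrix, ahd):
    """For every ordered pair (i, j) of inner products, i != j, return
    (cost, k, l): the cheapest transition from partial product p_ik of
    IP_i to partial product p_jl of IP_j."""
    n, m = len(coeff_matrix), len(coeff_matrix[0])
    connections = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            connections[(i, j)] = min(
                (hamming(coeff_matrix[i][k], coeff_matrix[j][l]) + ahd[k][l], k, l)
                for k in range(m) for l in range(m))
    return connections

C = [[1, 2],   # coefficients of IP_0
     [3, 0],   # coefficients of IP_1
     [7, 4]]   # coefficients of IP_2
AHD = [[0.0, 1.5],
       [1.5, 0.0]]  # assumed average Hamming distances between x_0 and x_1

conn = min_cost_connections(C, AHD)
print(conn[(0, 1)])  # (1.0, 0, 0)
print(conn[(0, 2)])  # (2.0, 0, 0)
print(len(conn))     # 6 ordered pairs, each found by scanning M * M = 4 candidates
```

The six ordered pairs times four candidate connections give 24 distance evaluations, in line with the $M^2 \cdot N \cdot (N-1)$ complexity stated above. The cost component of each entry is what would be assigned to the corresponding edge of the inner product graph.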
The determination of a minimum cost ordering of the inner products is formulated as a Travelling Salesman Problem (TSP) on this graph. A closed path over all vertices (inner products) must be found, without passing through any node more than once, that results in a minimum HD cost. For the solution of the TSP the strategies proposed for the algorithms of the first class can be used. As soon as the minimum switching ordering of the inner products is determined, the partial products that constitute each inner product must be ordered to minimize the switching required for its computation. In fact, each vertex (inner product) of the problem's graph is itself a graph whose vertices correspond to the partial products that constitute the inner product represented by the initial vertex. The Hamming distances between the partial products are assigned as costs to the edges of the inner product graphs. However, the ordering of the inner products performed in the first step determines which partial products are evaluated first and last for each inner product. Thus the derivation of the minimum cost ordering of the partial products of an inner product can be formulated as a restricted minimal spanning tree problem. It is a minimal spanning tree problem since only an open path over the vertices of the inner product graph (partial products) must be found. It is restricted since the starting and ending vertices of the path are determined by the inner product ordering (first step). An exact solution can be found if the number of vertices of the inner product graph is relatively small. Two widely known algorithms for the solution of the minimal spanning tree problem are those of Kruskal [23] and Prim [23]. In all cases the costs of the edges are well-bounded non-negative real numbers. Their lower bound is zero and their upper bound equals the sum of the number of bits used for the representation of the coefficients and of the data terms. If the simpler cost function is used, the costs of the edges are well-bounded non-negative integers. Their lower bound is zero and their upper bound equals the number of bits used for the representation of the coefficients.
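The second step, ordering the partial products of one inner product between a fixed first and a fixed last vertex, can be illustrated with a greedy nearest-neighbour heuristic. Note that this is a swapped-in heuristic chosen for brevity, not the exact search or the Kruskal/Prim algorithms mentioned above, and it does not guarantee a minimum cost path. The simpler coefficient-only cost function and the names `MAX_PP`, `hd16` and `order_partial_products` are assumptions made for this sketch.

```c
#include <stdint.h>

#define MAX_PP 32  /* hypothetical upper limit on partial products per inner product */

/* Hamming distance between two 16-bit coefficient words. */
static int hd16(uint16_t a, uint16_t b)
{
    uint16_t x = (uint16_t)(a ^ b);
    int count = 0;
    while (x) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}

/* Builds an open path over the m coefficient words c[0..m-1] that starts at
 * index `first` and ends at index `last` (both fixed by the inner product
 * ordering of the first step). Intermediate vertices are chosen greedily:
 * always move to the closest unvisited coefficient.
 * order[] receives the visiting sequence.
 * Returns the total Hamming distance of the path, or -1 on invalid input. */
static int order_partial_products(const uint16_t *c, int m,
                                  int first, int last, int *order)
{
    int used[MAX_PP] = {0};

    if (m < 1 || m > MAX_PP || first < 0 || first >= m || last < 0 || last >= m)
        return -1;
    if (m > 1 && first == last)
        return -1;

    int total = 0;
    int cur = first;
    order[0] = first;
    used[first] = 1;
    used[last] = 1;  /* reserve the fixed end point */

    for (int pos = 1; pos < m - 1; pos++) {
        int best = -1;
        int best_d = 17;  /* larger than any 16-bit Hamming distance */
        for (int v = 0; v < m; v++) {
            if (used[v])
                continue;
            int d = hd16(c[cur], c[v]);
            if (d < best_d) {
                best_d = d;
                best = v;
            }
        }
        order[pos] = best;
        used[best] = 1;
        total += best_d;
        cur = best;
    }

    if (m > 1) {
        order[m - 1] = last;
        total += hd16(c[cur], c[last]);
    }
    return total;
}
```

For the small vertex counts the text refers to, an exhaustive search over all permutations of the intermediate vertices could replace the greedy loop to obtain an exact solution.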
Case 2
This case is described by the piece of C code illustrated in Figure 6. This piece of code was derived by applying a loop interchange [24] to the nested for statement of Case 1. The loop interchange transformation increases the locality of data access [24]. The locality of data access favours the reduction of power consumption [25] since it reduces the transfers of data from background memories to computational units through highly capacitive global buses.
According to the piece of code described in Figure 6, the N output points of the transformation are computed simultaneously. Only one partial product $c_{ij} x_j$ for each output point is computed and accumulated in each iteration of j. For simplification it is assumed that the basic computation is performed on a single multiply accumulator. An implementation corresponding to the description of Case 2 requires only M accesses to the background memory (one access per input data term x) as a result of the increased locality of access created by the loop transformation. The penalty of an implementation based on the Case 2 description is the introduction of N registers to accumulate the partial products of each output term, as well as a number of accesses to these registers. Clearly, the loop interchange and the introduction of the foreground registers lead to a significant reduction in power consumption by reducing the number of background memory accesses.
[Figure 6, code fragment as extracted: for(j=0;j<M;j++) for(i=0;i<N;i++) output_vector (remainder not recoverable)]
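Since the code of Figure 6 is only partly legible here, the following C sketch shows what a loop-interchanged (Case 2) matrix-vector product of this kind could look like. It is not the authors' code: the row-major coefficient layout, the 16-bit operand and 32-bit accumulator widths, and the name `transform_case2` are all assumptions.

```c
#include <stdint.h>

/* Case 2 (loop-interchanged) matrix-vector product sketch.
 * c : N x M coefficient matrix, row-major (assumed layout)
 * x : M input data terms
 * y : N accumulators, playing the role of the foreground registers */
static void transform_case2(const int16_t *c, const int16_t *x,
                            int32_t *y, int n, int m)
{
    for (int i = 0; i < n; i++)
        y[i] = 0;

    for (int j = 0; j < m; j++) {
        /* one background-memory read per input data term */
        int16_t xj = x[j];
        for (int i = 0; i < n; i++) {
            /* one partial product per output point and per value of j */
            y[i] += (int32_t)c[i * m + j] * xj;
        }
    }
}
```

With this loop order each input term is fetched once and reused for all N outputs, which is the source of the M background memory accesses mentioned above.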
The computation performed by the algorithms of class 2, if the Case 2 description is adopted, for a transformation of length N is described by the following equation:
$$\text{for } j = 0, 1, \ldots, M-1: \quad \text{for all output points } y_i: \quad y_i = y_i + c_{ij} x_j \tag{18}$$
where $y_i$ is the ith output point, $c_{ij}$ is the coefficient in the ith row and the jth column of the coefficient matrix and $x_j$ is the jth element of the input data vector. The data terms are assumed to be stored in successive memory locations and to be accessed sequentially. Evaluation of the partial products corresponding to the jth input data element for all the output terms of an N-point transformation requires computation of N partial products. For the complete computation of the output terms of the transformation, computation of $N \times M$ partial products is required. Using ordering functions the computation described in Eq. (18) can be expressed as
$$\text{for } j = 0, 1, \ldots, M-1: \quad \text{for all output points } y_{f(i,j)}: \quad y_{f(i,j)} = y_{f(i,j)} + c_{f(i,j),g(j)}\, x_{g(j)} \tag{19}$$
The computation is performed according to the ordering functions $f(i,j) = i$, $g(j) = j$, with $i = 0, 1, \ldots, N-1$ and $j = 0, 1, \ldots, M-1$. The f ordering functions correspond to the rows of the coefficient matrix (one per column of the coefficient matrix) and the g ordering function corresponds to the columns of the coefficient matrix. The f functions determine the order in which the partial products (using a specific input data term) of the output terms are computed for each column (the order in which output terms are evaluated). The g function determines the order in which the input data are read from the memory.
The total Hamming distance of the sequence of evaluation of the $N \times M$ partial products required for the N output terms, as determined by the ordering functions $f(i,j)$ and $g(j)$, is given by the following equation:
$$\text{Total } HD = \sum_{g(j)=0}^{M-1} \left[ \sum_{f(i,j)=0}^{N-1} HD\left(p_{f(i,j),g(j)},\, p_{f(i,j)+1,g(j)}\right) + HD\left(p_{N-1,g(j)},\, p_{0,g(j)+1}\right) \right] \tag{20}$$
where $p_{f(i,j),g(j)}$ is the partial product $c_{f(i,j),g(j)}\, x_{g(j)}$. The Hamming distance for a pair of partial products is given as
$$HD\left(p_{f(i,j),g(j)},\, p_{f(k,l),g(l)}\right) = HD\left(c_{f(i,j),g(j)},\, c_{f(k,l),g(l)}\right) \tag{21}$$
Data related activity information (AHD between two consecutive data samples) is not included since the data term remains the same for N consecutive partial products (j is constant for $i = 0, 1, \ldots, N-1$, and the partial products corresponding to a specific data term are computed for all the output terms). In Eq. (20) the term $HD(p_{N-1,g(j)}, p_{0,g(j)+1})$ denotes the Hamming distance between the last partial product of column $g(j)$ and the first partial product of column $g(j)+1$. This term is included because after the computation of the last partial product ($p_{N-1,g(j)}$) for the output term $y_{N-1}$, the first partial product ($p_{0,g(j)+1}$) for the term $y_0$ must be computed. The sequence in which the data terms are read from the memory and the corresponding partial products are computed is determined by the fact that the data terms are accessed sequentially from the background memory. Under this assumption the $g(j)$ ordering function is always equal to j and no reordering with respect to the input data can be performed.
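As an illustration of Eqs. (20) and (21), the following C sketch accumulates the coefficient Hamming distances along a Case 2 evaluation sequence. It assumes $g(j) = j$, a row-major coefficient matrix, and per-column orderings supplied in a flat array; the names `hd16` and `total_hd_case2` are hypothetical. Transitions that would reference an element beyond the last row of the last column are simply not counted.

```c
#include <stdint.h>

/* Hamming distance between two 16-bit coefficient words. */
static int hd16(uint16_t a, uint16_t b)
{
    uint16_t x = (uint16_t)(a ^ b);
    int count = 0;
    while (x) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}

/* c     : N x M coefficient matrix, row-major (assumed layout)
 * order : order[j * n + t] is the row evaluated at step t of column j,
 *         i.e. one ordering function per column
 * Returns the sum of Hamming distances between consecutive coefficients,
 * including the transition from the last partial product of a column to
 * the first partial product of the next column. */
static long total_hd_case2(const uint16_t *c, const int *order, int n, int m)
{
    long total = 0;
    int have_prev = 0;
    uint16_t prev = 0;

    for (int j = 0; j < m; j++) {
        for (int t = 0; t < n; t++) {
            uint16_t cur = c[order[j * n + t] * m + j];
            if (have_prev)
                total += hd16(prev, cur);
            prev = cur;
            have_prev = 1;
        }
    }
    return total;
}
```

Evaluating this function for the natural ordering and for a candidate reordering gives a direct comparison of the coefficient input activity of the two schedules.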
The aim of the proposed methodology is to derive new ordering functions $p(i,j)$ for each of the columns of the coefficient matrix, such that the total Hamming distance of a Case 2 matrix-vector product computation as described in Eq. (20) is minimized (when $f(i,j)$ is replaced by $p(i,j)$) and the inner product value is computed according to the equation
$$\text{for } j = 0, 1, \ldots, M-1: \quad \text{for all output points } y_{p(i,j)}: \quad y_{p(i,j)} = y_{p(i,j)} + c_{p(i,j),j}\, x_j \tag{22}$$
No data dependencies exist among partial products using the same input data term. Furthermore, it is assumed that the basic computation is performed on a single multiply accumulator. The problem of deriving minimum cost ordering functions for the evaluation of the partial products is equivalent to deriving minimum switching cost schedules for the partial products using the same input data term on the multiply accumulator.
The formulation of the computation reordering problem in this case is similar to the Case 1 formulation. The inner products of Case 1 are replaced by sets of partial products using the same input data term. The order in which the sets of partial products are evaluated is determined by j (under the assumption of sequential background memory accesses). The costs assigned to the edges of the graphs corresponding to the sets of partial products are well-bounded non-negative integers and not real numbers, because no data related information is included this time. Their lower bound is zero and their upper bound equals the number of bits used for the representation of the coefficients.

Analysis of Power Savings
In this section the effect of the proposed transformation on the power consumption of implementations of digital signal processing algorithms is summarized.
The proposed transformation derives an activity optimal schedule for the evaluation of the partial products that form the inner products required for the evaluation of the output terms. In this way the switching activity at the inputs of the computational units is significantly reduced. The switching activity of a circuit depends on the internal node structure and on the switching at the inputs. At the high levels of abstraction the detailed internal architecture (gate level implementation) of the computational units (adders, multipliers) is not known and the units are treated as black boxes. It can be assumed that the switching activity inside the computational units, and thus the power consumption, is proportional to the input switching activity. In general, at the high levels of design abstraction the power consumption is estimated based on the total input switching activities [26]. Thus the computation reordering transformation leads to an important power reduction in the computational units.
The proposed transformation reduces the switching activity at the coefficient input of the computational units. This in turn reduces the switching activity on the bus connecting the coefficient memory and the computational units (very important, since the high bus capacitance makes the bus a major contributor to power consumption) and in the periphery of the coefficient memory as well. The savings in these circuits are equal to the savings achieved at the coefficient inputs of the computational units.

AND ANALYSIS
The proposed transformation was applied to several digital signal and image processing algorithms belonging to the classes described in Section 3. Word parallel implementations on multiply accumulator based data paths were generated for the algorithms. Two's complement fixed point arithmetic was assumed in all cases. Real application data were used for determining the average Hamming distances between data terms of the input vectors and for simulation. The results for the different algorithms are described in the following sub-sections. The power consumption reduction is approximated by the reduction of the total input switching activity and of the internal switching activity of the computational units.

FIR Filters
The results of the application of the proposed methodology to several FIR filters are included in Table V.
In the second column the type of data used for the simulation is indicated. In the third and fourth columns the changes in switching activity (after the application of the proposed transformation) at the coefficient and data inputs of the computational units are given, respectively. In the fifth column the total change at the inputs of the computational units is given, and finally column six gives the switching activity changes inside the computational units. The reduction of switching activity inside the computational units is directly proportional to the power consumption reduction, since the other factors affecting power consumption (supply voltage, frequency of operation, and capacitive load) are not changed by the application of the proposed methodology. In all cases a minus sign (-) denotes savings in switching activity resulting from the application of the proposed transformation, while a plus sign (+) denotes penalties introduced by the proposed transformation. The first two filters (63 and 14 taps) are typical band pass filters and were simulated using random (Ran.) data. The coefficients were represented using 16 bits (15 bits dynamic range) while 12 bits represented the data. For the outputs 16 bits were used. Because of the random type of the data, no information related to data (average Hamming distances between data terms of the input vector) could be derived and used for the reordering of computation. Only the static information related to coefficients was used for determining the optimal order of computation. The effect of the proposed transformation on the activity of the data input is random. In one case a penalty is introduced while in the other savings are achieved. Both savings and penalty are very small. The remaining filters (15, 11, 9, 8 and 6 taps) are low-pass wavelet filters.
For these filters real image data were used for the simulation. For the representation of the data 8 bits were used, while for the coefficients and the outputs 16 bits were used. The dynamic range of the representation of the coefficients is 15 bits. In this case the effect of the proposed methodology on the data activity is clearer. The proposed transformation introduces a significant penalty although information related to the application data was extracted (as described in Section 3) and used for the reordering of computation. This is because the reordering of computation causes a reordering of the data terms, which destroys their correlation and increases the data activity. However, the penalty introduced is much smaller than the savings in coefficient activity, leading to significant total switching and power savings. The above results assume computation of each output term (inner product) on the same multiply accumulator.

Two Dimensional Wavelet Transforms
The proposed methodology was applied to the typical three-level two-dimensional wavelet transform.
The EPIC filters are described in [27], the Daubechies filters are described in [20], while Johnston's filters are described in [28]. The simulation results for two grayscale test images (Lena and Man) and different sets of filters (high-pass and low-pass) are included in Tables VI and VII. For the coefficients, a 16-bit representation (15 bits dynamic range) was used. The input data were represented by 8 bits while for the representation of the wavelet coefficients 16 bits were used. Implementation on a single multiply accumulator was assumed. The switching activity at the coefficient inputs of the computational units is significantly reduced (13.3% worst case, 51.8% best case).
The effect of the computation reordering is different at the data inputs of the computational units. In general the computation reordering introduces a penalty (worst case 12%) in the switching activity at these inputs, because the reordering destroys data correlation. Specifically, the reordering of computation, and thus of data, introduces a switching activity penalty at the inputs of the three levels of the wavelet decomposition, because at these points the correlation of data is high. In the intermediate and output stages of the decomposition levels data are less correlated and thus the result of the reordering is more random (it introduces either small savings or small penalties in switching activity). However, the penalty in data activity is much smaller than the savings achieved, making the use of the proposed methodology advantageous. Two basic parameters affecting the result of the reordering on the switching activity at the inputs of the computational units are: (a) the filter length and (b) the symmetry of the filter. From Tables VI and VII it can be seen that as the length of the filter increases the savings in coefficient activity increase, but the penalty in data activity increases as well. In general, as the length of the filter increases the total activity required for the wavelet decomposition increases, with or without computation reordering. For symmetrical filters ($h[n] = h[N-1-n]$, where N is the filter length) the savings in coefficient activity from the application of computation reordering are higher.

Fourier Transform
The proposed transformation was applied to the one dimensional 8-point brute force Discrete Fourier Transform (DFT) and to three common one dimensional Fast Fourier Transforms (FFT) [29]: the 9-point PTL FFT, the 7-point Singleton FFT and the 9-point Swift FFT. Real image data (test images Lena, Man, Camera and Lax) were used for the simulation. The bit-width of the input data was 8, while 16 bits were used for the representation of the output Fourier data. The simulation results for both Cases 1 and 2 of description (presented in Section 3) and for several coefficient bit-widths (C = 16, 14, 12, 10, 8) are shown in Tables VIII-XV.
As far as the Case 1 descriptions are concerned, a penalty in data switching activity is introduced in the brute force DFT. However, this penalty is much smaller than the large savings in coefficient activity achieved by the application of the proposed methodology. The FFTs effectively reduce the length of the input data vector through a preprocessing stage in order to reduce the number of operations required. For the 9-point PTL and Swift FFTs the input data vector used for the computation of the Fourier data is a 4-point one, while for the 7-point Singleton FFT the input data vector is reduced to a 3-point one. This leads to small changes (either savings or penalties) in the data switching activity from the application of the computation reordering. Although the length of the FFTs is relatively small, the reordering of computation introduces significant savings at the coefficient inputs of the computational units. For the Case 2 descriptions large switching activity savings are achieved at the data inputs of the computational units. The savings at the coefficient inputs of the computational units are of the same order as for the Case 1 descriptions. Larger savings at the inputs and inside the computational units are achieved in comparison to the Case 1 descriptions. The savings percentage at the inputs of the computational units is not heavily affected by changes of the coefficient bit-width. The activity savings percentage inside the computational units remains almost unchanged as the coefficient bit-width changes.

Discrete Cosine Transform
The reordering of computation was also applied to both the 1-D and the 2-D Discrete Cosine Transform. Specifically, the proposed methodology was applied to the 8-point and 16-point 1-D DCT. The technique for fast DCT computation that decomposes an N-point input vector into two N/2-point sequences [28] was used. Real speech data were used for the simulation, specifically two sequences (Speech1, Speech2). The simulation results for both Cases 1 and 2 of description and for several coefficient bit-widths (C = 16, 14, 12, 10, 8) are shown in Tables XVI-XXI.
The proposed methodology was also applied to the 2-D DCT computed using the row-column decomposition and the fast technique described in [28]. For simulation the test images Lena and Man were used. For the representation of the coefficients 16 bits were used while the input data were represented using 8 bits. For the representation of the output terms 16 bits were used. The simulation results are presented in Table XX.
Conclusions similar to those drawn in the previous sections can be derived. As the effective length of the input data vector increases, the penalty introduced by the reordering of computation increases (row-column 8-point DCT).
The reduction of the effective length of the input data vector by the preprocessing step of a fast algorithm leads to smaller changes (either savings or penalties) of the switching at the data inputs of the computational units after the application of the proposed transformation. The results for the 1-D DCTs show that the savings in input switching activity do not always increase as the transform length increases. The same is true for the coefficient bit-width: an increase of the coefficient bit-width does not always lead to a higher savings percentage. Changes of the coefficient bit-width lead to small changes of the savings percentage in the total activity at the inputs of the functional units, while the activity savings percentage in the computational units remains almost unchanged. For the 2-D DCT the behavior of the outputs of the row DCTs (row-column decomposition is always assumed for DCT computation, even when a fast algorithm is used) is random, since these data are less correlated. This leads to smaller penalties introduced by the computation reordering transformation at the data inputs of the computational units during computation of the column DCTs.

GENERALIZATION OF COMPUTATION REORDERING METHODOLOGY
So far it was assumed (to simplify the analysis) that all the inner product computations required by the algorithms were performed on a single multiply accumulator. Within inner product computations there are no data dependencies, meaning that the partial products forming the inner products can be evaluated independently of each other and thus in parallel. Under these assumptions the problem of computation reordering was formulated as a minimum cost (input switching activity) scheduling of a number of independent operations (partial products). For the second class of algorithms it was assumed that the inner products (Case 1) or the sets of partial products (Case 2) were computed one after the other in the order specified by the first ordering function.
However, in many real-life cases the number of available hardware resources (multiply accumulators) is larger and is determined by the performance requirements of the specific application.
Two cases exist:
(a) Each one of the basic computations included in the algorithm is performed on a single multiply accumulator and is not distributed over a number of hardware resources. The available multiply accumulators are used to perform different basic computations in parallel. In this case the formulation described in Section 3 still applies.
(b) Every basic computation is divided into a number of parts that are assigned to an equal number of hardware resources. The computational parts are executed in parallel. Modification of the formulation described in Section 3 is required for this case.
For the first case experimental results showed that the savings (%) from the application of the computation reordering are independent of the number of available hardware resources and remain the same as in the case where the total computation is performed on a single multiply accumulator. As far as the second case is concerned, experimental results showed that the switching activity savings from the application of the computation reordering depend on the number of resources to which the basic computations are assigned; specifically, they slightly decrease as the number of resources increases. This is illustrated in the following section.
In this case the basic computations are divided into equal parts, assigned to the available multiply accumulators and performed in parallel. The number of parts into which the basic computations are divided is equal to the number of available multiply accumulators. Since no dependencies exist between the partial products of the basic computations, for both classes of algorithms, the assignment of the computational parts to the hardware resources can be performed randomly and the only restriction is load balancing.
The formulation of the computation reordering problem, when the basic computations are distributed over a number of multiply accumulators, is as follows for the two classes of algorithms.

First Class
Assuming computation of convolutions of size N and M available multiply accumulators, the basic computation must be divided into parts of N/M partial products. (When $N \bmod M \neq 0$ some multiply accumulators may perform fewer than N/M operations.) The assignment of the partial products to the multiply accumulators can be done in a random way. Another solution is to reorder the basic computation as described in Section 3 and then assign $\lceil N/M \rceil$ consecutive (with respect to the new ordering) partial products to each one of the available hardware resources. In this way the correlation of the partial products assigned to the same multiply accumulator is improved in comparison to the random assignment. In both cases the partial products assigned to each multiply accumulator can be thought of as forming a "virtual" basic computation (of reduced size in comparison to the original one). The formulation described in Section 3.1 can then be applied to each virtual computation, and a minimum input switching activity schedule of the partial products on the multiply accumulator can be derived.
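The assignment of consecutive blocks of the reordered sequence to the multiply accumulators can be sketched as below. This is a minimal illustration only; the function name `assign_to_macs` and the flat output array are assumptions, and the reordering itself is taken as already done.

```c
/* Assigns each position of an already reordered sequence of n partial
 * products to one of num_macs multiply accumulators, in consecutive
 * blocks of ceil(n / num_macs) positions.
 * mac_of[pos] receives the index of the multiply accumulator for the
 * partial product at position pos of the new ordering.
 * Returns the block size, or -1 on invalid input. */
static int assign_to_macs(int n, int num_macs, int *mac_of)
{
    if (n < 1 || num_macs < 1)
        return -1;

    int chunk = (n + num_macs - 1) / num_macs;  /* ceil(n / num_macs) */
    for (int pos = 0; pos < n; pos++)
        mac_of[pos] = pos / chunk;
    return chunk;
}
```

When n is not a multiple of the number of multiply accumulators, the last units receive fewer partial products (or none), which corresponds to the remark above about unequal loads. Each block can then be scheduled independently as a "virtual" basic computation.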

Second Class
Assuming computation of transforms of size $N \times M$ and K available multiply accumulators, the basic computation must be divided into parts of $\lceil N M / K \rceil$ partial products. The assignment can again be performed either in a random way or by first reordering the computation as described in Section 3, in order to increase the correlation of the partial products assigned to the same multiply accumulator. In the first step complete inner products are assigned to the available multiply accumulators. Inner products are split across different multiply accumulators only for load balancing reasons. The partial products assigned to each multiply accumulator form a "virtual" basic computation. Two cases exist: (a) The partial products assigned to a multiply accumulator belong to the same inner product of the initial computation. In such a case the virtual basic computation is considered to be of the first class (convolution type) and the formulation described in Section 3.1 can be applied to derive a minimum input switching activity schedule of the partial products on the multiply accumulator. (b) The partial products assigned to a multiply accumulator belong to n different inner products of the initial computation. In such a case the virtual basic computation is considered to be of the second class (transformation type) and the formulation described in Section 3.2 can be applied to derive a minimum input switching activity schedule of the partial products on the multiply accumulator.
To demonstrate the effect of computation distribution on the savings achieved by the proposed methodology, the architecture with four parallel multiply accumulators for the computation of the two-dimensional 8-point DCT described in [19] is used. The fast DCT that decomposes the 8-point input data vectors into two 4-point vectors (u, v) is used. For the computation of the 8 output points for a specific 8-point input data vector, eight 4-point inner products must be computed. These inner products are assigned to the 4 available multiply accumulators (2 inner products on each multiply accumulator). Having determined the minimum connection costs between all pairs of inner products as described in Section 3, the pairs of inner products with the smallest minimum connection costs are selected and assigned to each multiply accumulator. Each multiply accumulator then has to compute a $2 \times 4$ transform (two 4-point inner products). The problem of finding the best computation ordering for each multiply accumulator can be tackled using the formulation proposed for the algorithms of class 2.
Simulations using the test images Lena and Man were performed. The bit-width of the input data was 8, while 16 bits were used for the representation of the coefficients and the output terms. The results are described in Table XXI. The savings in coefficient activity, total input activity and activity inside the computational units are slightly decreased in comparison to the results of Table XX, which were obtained under the assumption that the basic computation is performed on a single multiply accumulator. The distribution of the inner products restricts the possibilities for computation reordering, thus leading to a slight decrease of the activity savings.
The proposed transformation can also be applied to a more general data path structure (i.e., not necessarily multiply accumulator based). The sets of partial products are assigned to the available multipliers (instead of multiply accumulators) and computed in parallel. The formulation of the computation reordering problem in this case is the same as the one described in the previous section (i.e., for multiple multiply accumulators), except that the multiply accumulators are replaced by multipliers.
To demonstrate this, the 4 parallel multipliers architecture [28] for the computation of the DCT is used. In this architecture an 8-point DCT can be computed using the fast DCT technique described in the previous section on four multipliers and one accumulator/adder. The structure of the architecture imposes that the 4 partial products required for the evaluation of an inner product (a transform's output point) are computed in parallel, each one on one of the multipliers. A heuristic strategy was adopted for the scheduling of the inner products and the assignment of the partial products to the multipliers. For all pairs of inner products the Hamming distances between all possible pairs of partial products (one partial product from each inner product) are computed. The minimum cost connection on the specific architecture is then determined for every pair of inner products. The connection cost between two inner products is defined as the sum of the Hamming distances of the partial products (belonging to the two inner products) assigned to the same multiplier. Having determined the minimum connection cost for all possible pairs of inner products, the problem of finding the optimal schedule, in terms of switching activity at the multiplier inputs, is formulated as a Travelling Salesman Problem on a graph whose vertices are the inner products. The costs of the graph's edges are the Hamming distances of the minimum cost connections of the inner products corresponding to the nodes. Thus an optimal scheduling of the inner products on the available hardware can be determined.
[Table XXII row fragment as extracted: Man -673366 (-17%) 1245161 (-35%) - 24 18]
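One possible reading of the connection cost on the four-multiplier architecture is sketched below in C: the partial products of the first inner product stay on their multipliers, every assignment of the second inner product's partial products to the four multipliers is tried, and the assignment with the smallest sum of per-multiplier Hamming distances is kept. The exhaustive search over the 24 permutations, the coefficient-only cost, and the names `hd16`, `NUM_MULT` and `connection_cost` are assumptions of this sketch, not details given in the paper.

```c
#include <stdint.h>
#include <limits.h>

#define NUM_MULT 4  /* number of parallel multipliers in the architecture */

/* Hamming distance between two 16-bit coefficient words. */
static int hd16(uint16_t a, uint16_t b)
{
    uint16_t x = (uint16_t)(a ^ b);
    int count = 0;
    while (x) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}

/* a[t] is the coefficient currently on multiplier t (first inner product).
 * b[]  holds the four coefficients of the following inner product.
 * perm_out[t] receives the index into b[] placed on multiplier t.
 * Returns the minimum sum of per-multiplier Hamming distances. */
static int connection_cost(const uint16_t a[NUM_MULT],
                           const uint16_t b[NUM_MULT],
                           int perm_out[NUM_MULT])
{
    int best = INT_MAX;

    for (int p0 = 0; p0 < NUM_MULT; p0++) {
        for (int p1 = 0; p1 < NUM_MULT; p1++) {
            if (p1 == p0)
                continue;
            for (int p2 = 0; p2 < NUM_MULT; p2++) {
                if (p2 == p0 || p2 == p1)
                    continue;
                int p3 = 6 - p0 - p1 - p2;  /* the remaining index of 0..3 */
                int cost = hd16(a[0], b[p0]) + hd16(a[1], b[p1])
                         + hd16(a[2], b[p2]) + hd16(a[3], b[p3]);
                if (cost < best) {
                    best = cost;
                    perm_out[0] = p0;
                    perm_out[1] = p1;
                    perm_out[2] = p2;
                    perm_out[3] = p3;
                }
            }
        }
    }
    return best;
}
```

The costs returned for all pairs of inner products would then serve as the edge weights of the TSP graph described above.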
Simulations for the test images Lena and Man were performed. The bit-width of the input data was 8, while 16 bits were used for the representation of the coefficients and the output terms. The results are described in Table XXII.
The results show that significant savings can be obtained at the data input of the computational units by the application of the proposed methodology.

CONCLUSIONS
In this paper a novel architectural transformation for power consumption reduction in hardware realizations of digital signal processing algorithms requiring inner product computation was presented. Two different classes of algorithms requiring inner product computation between data and coefficients were identified. A low power target architecture for both classes of algorithms was described. It is based on a set of foreground registers as a form of memory hierarchy that allows data reuse and reduces the background memory related power consumption.
The transformation reorders the computation to reduce the switching activity at the inputs of, and thus inside, the computational units. Reduction of the switching activity at the inputs and inside the computational units is equivalent to power consumption reduction. The reordering of computation aims at deriving a schedule (and an assignment, in cases with more than one hardware unit available) of the partial products constituting the inner products required by the algorithms on the computational units that is optimal in terms of switching activity at the inputs of the computational units. Information related both to the algorithm's coefficients (which are statically determined, i.e., known before run time) and to the application data is used for this purpose. The problem of computation reordering does not have the same structure for the different classes of algorithms. Formulations of the reordering problem are proposed for the two classes of algorithms. In both cases a

FIGURE 3 Computation of the 4 point inner product after reordering of computation wrt both coefficients and results of the simulation of typical data.

N x M transformation performs the following computation:

FIGURE 4 Target architecture model.

FIGURE 5 Typical piece of code for Case 1.

FIGURE 6 Typical piece of code for Case 2.

7.1. Distribution of Basic Computation to a Number of Multiply Accumulators

7.2. Application of the Proposed Methodology in General Architectures

TABLE XXII Simulation results for the 8-point 2-D DCT realized on the 4 parallel multipliers architecture
Columns: Image | Change in data act. | Change in coeff. act. | % change in total input | % change in internal

TABLE Hamming
Computation of the 4 point inner product after reordering of computation wrt coefficients.

TABLE IV Results of the reordering of the evaluation of the partial products

TABLE V Experimental results for FIR filters

TABLE X Simulation results for 9-point PTL FFT and Case 1 description

TABLE XII Simulation results for 7-point Singleton FFT and Case 1 description

TABLE XV Simulation results for 9-point Swift FFT and Case 2 description

TABLE XVI Simulation results for 16-point 1-D DCT and Case 1 description

TABLE XVIII Simulation results for 8-point 1-D DCT and Case 1 description