A Combined Coefficient Segmentation and Block Processing Algorithm for Low Power Implementation of FIR Digital Filters

A combined coefficient segmentation and block processing algorithm for low power implementation of FIR digital filters is described in this paper. The algorithm processes data and coefficients in blocks of fixed sizes. During the manipulation of each block, coefficients are segmented into two primitive components. The cumulative effect of processing a sequence of blocks and of segmentation results in up to 80% reduction in power consumption in the multiplier circuit compared to conventional filtering. The paper describes the implementation of the algorithm, its constituent components, and the power evaluation environment developed. Simulations are performed using eight practical digital filter examples with various filter orders and data/coefficient wordlengths. In addition, the algorithm is compared with conventional filtering implementations and those using block processing and coefficient segmentation algorithms alone.


INTRODUCTION
The demand for high performance portable systems incorporating multimedia capabilities has elevated low power design to the forefront of design requirements in order to maintain reliability and provide longer hours of operation [1]. Multimedia applications demand real-time signal processing, which consists of intensive multiply and multiply-accumulate operations unique to digital signal processing (DSP) algorithms, such as filtering, fast Fourier transforms, and discrete cosine transforms. For this reason, the multiplication procedure plays a key role in achieving low power implementation of these algorithms. In fact, in many DSP algorithms, such as filtering, the multiplier lies in the critical delay path and ultimately determines the performance of the algorithm. These DSP algorithms are implemented mainly on CMOS-based VLSI devices, which could be dedicated ASICs or DSP processors. Such devices typically integrate parallel multipliers as the central units to handle the computational burden [2].
It can be shown that the most significant factor affecting power consumption in a CMOS VLSI device is the switching power, which is expressed by the following equation [3]:
\[ P = \tfrac{1}{2}\, k\, C\, V_{dd}^{2}\, f \]
where $C$ is the physical capacitance, $V_{dd}$ is the supply voltage, $f$ is the frequency and $k$ is the switching activity factor, which is defined as the average number of times that a gate makes a logic transition (1 → 0 or 0 → 1) in each clock cycle. Therefore, one or more of these parameters must be targeted in order to reduce the power consumption of a circuit.
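As a worked illustration of why a reduction in switched capacitance translates directly into a reduction in power, assume only that switching power is proportional to $k\,C\,V_{dd}^{2}\,f$ and that the supply voltage and clock frequency are the same for a reference implementation and a modified one. The subscripts "ref" and "new" are labels introduced here for illustration and do not appear in the original text.

```latex
\[
\frac{P_{\text{new}}}{P_{\text{ref}}}
  = \frac{(kC)_{\text{new}}\, V_{dd}^{2}\, f}{(kC)_{\text{ref}}\, V_{dd}^{2}\, f}
  = \frac{(kC)_{\text{new}}}{(kC)_{\text{ref}}}
\]
```

Under these assumptions, a given percentage reduction in the switched capacitance $kC$ corresponds to the same percentage reduction in switching power, independent of any constant factor in the power equation.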
A number of techniques have been developed for low power implementation of digital filters. The authors in Ref. [4] use differences of various orders between coefficients, together with stored intermediate results, rather than the coefficients themselves, to compute the partial products in the FIR equation.
Another approach, used in Ref. [5], is to optimise the wordlengths of input/output data samples and coefficient values. This involves a general search based methodology, which is based on statistical precision analysis and the incorporation of cost/performance/power measures into an objective function through wordlength parameterisation. In Ref. [6], Mehendale et al. present an algorithm for optimising the coefficients of an FIR filter so as to reduce the power consumption of its implementation on a programmable digital signal processor. Power reduction in the algorithm is achieved in two stages. In the first stage, all coefficients are scaled uniformly by a scaling factor, chosen such that the total Hamming distance between successive scaled coefficients is minimised. The second stage is an iterative procedure in which coefficients are selected and incremented/decremented slightly in order to achieve a reduction in the total Hamming distance. The iterative procedure maintains the desired filter characteristics. The authors in Ref. [7] developed a dynamic programming algorithm to assist in the search for an optimal coefficient set whose member coefficients are restricted to the set $\{-1, 0, 1\}$. The authors in Ref. [8] have investigated the use of the primitive operator technique in area/power reduction. In Refs. [9,10], we have presented techniques that utilise various folded/unfolded filter realisation structures in conjunction with coefficient ordering algorithms for minimising power consumption in FIR filters. Other techniques used by researchers include the use of multirate architectures [11,12], and dynamic adjustment of filter order for adaptive filters [13].
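The total Hamming distance between successive coefficients, used above as a proxy for switching activity at the coefficient input, can be sketched as follows. This is a minimal illustration that assumes coefficients are already quantised to signed integers in two's complement; the function names are hypothetical and are not taken from Ref. [6].

```python
from typing import Sequence


def to_twos_complement(value: int, bits: int) -> int:
    """Return the unsigned bit pattern of a signed integer in `bits`-bit two's complement."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)


def hamming_distance(a: int, b: int, bits: int) -> int:
    """Number of bit positions in which the two's complement words of a and b differ."""
    diff = to_twos_complement(a, bits) ^ to_twos_complement(b, bits)
    return bin(diff).count("1")


def total_hamming_distance(coeffs: Sequence[int], bits: int) -> int:
    """Sum of Hamming distances between each pair of successive coefficients."""
    return sum(hamming_distance(a, b, bits) for a, b in zip(coeffs, coeffs[1:]))
```

For example, `total_hamming_distance([3, 5, -3], 8)` is smaller than `total_hamming_distance([3, -3, 5], 8)`, which illustrates how the order and values of coefficients affect the number of bit toggles presented to the multiplier.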
We have proposed block processing [14] and coefficient segmentation [15] algorithms for low power implementation of FIR filters. Both algorithms reduce power by reducing the amount of switched capacitance within the DSP hardware platform utilised for filter implementation. Block processing reduces switched capacitance within the multiplier unit and the data/coefficient buses, whereas coefficient segmentation reduces switched capacitance within the multiplier unit and, in addition, reduces the effective wordlength of the coefficient input to the multiplier.
In this paper, we present an algorithm that utilises block processing and coefficient segmentation in a hierarchical framework to combine their advantages for a further reduction in power. The algorithm processes data and coefficients in blocks of fixed sizes. During the manipulation of each block, coefficients are segmented into two primitive components. The cumulative effect of processing a sequence of blocks and of segmentation results in up to 80% reduction in power consumption compared to conventional filtering. Power reduction is achieved through a reduction in the amount of switched capacitance within the multiplier section and on both data and coefficient buses. This in turn is due to fewer data and coefficient memory access operations and reduced switching activity at the multiplier inputs.

IMPLEMENTATION
A typical single multiplier DSP processor architecture for the implementation of FIR filters consists of input/output units, data and coefficient memories, a multiplier-accumulator (MAC) unit, and a control unit [9]. In the direct form realisation of the filter a new data sample, $x(n)$, and the corresponding coefficient, $h(k)$, are multiplied at each clock cycle. For this reason, each time a multiplication is performed both inputs of the multiplier receive new data. This continuous change at both inputs of the multiplier leads to a relatively high overall switching activity within the multiplier and hence a correspondingly high power consumption. Therefore, any multiplication strategy which could reduce the switching activity for this realisation is highly desirable. Another source of power consumption in DSPs is the activity on data and address buses. Since a new data sample is multiplied with a new coefficient each time, both data and address buses experience high switching activity. This has significant power overheads since bus capacitances are usually several orders of magnitude higher than those of the internal gates of a circuit. Consequently, a considerable amount of power can be saved by reducing the number of memory accesses.
The coefficient segmentation algorithm reduces the switched capacitance by decomposing individual coefficients into two less complex sub-components. The decomposition, performed using a heuristic approach, separates a given coefficient into a part that can be implemented using a single shift operation and a part with reduced wordlength that is applied to the inputs of the hardware multiplier. This results in a significant reduction in the amount of switched capacitance and consequently in power consumption. The flow chart in Fig. 1 shows the main stages (indicated in circles) of the algorithm developed. Given the coefficient set $H = (h_0, h_1, \ldots, h_{L-1})$, where $L$ is the filter order, the algorithm proceeds through the coefficients sequentially. For a given coefficient $h_k$, the algorithm aims to divide it such that $h_k = s_k + m_k$, where $s_k$ is the component to be implemented using a shift operation and $m_k$ is the data to be applied to the hardware multiplier. In order to reduce the switched capacitance of the multiplier, consecutive values of $m_k$ must be of the same polarity, to minimise switching activity at the multiplier inputs, and have the smallest value possible, to minimise effective wordlength. These criteria can be met by careful selection of $s_k$ and consequently $m_k$ values. This selection procedure is the pivot of the stages shown in Fig. 1. For a small positive $m_k$, $s_k$ must be the largest power-of-two number closest to $h_k$. For this reason, stage 1 is an iterative procedure which aims to find the smallest power-of-two number, $2^i$, greater than or equal to $|h_k|$. Stage 2 deals with coefficients which are already power-of-two numbers, in which case the complete coefficient is realised using a single shift operation (i.e. $s_k = h_k$ and $m_k = 0$). In stage 3, the polarity of $h_k$ is monitored. If $h_k$ is a positive number then $s_k$ is chosen as the largest power-of-two number smaller than $h_k$ (i.e. $s_k = 2^{i-1}$).
On the other hand, if $h_k$ is negative, $s_k$ is chosen to be the smallest power-of-two number larger than $|h_k|$ (i.e. $s_k = -2^i$). In both cases, $m_k$ is equal to $(h_k - s_k)$.
The block processing algorithm requires a number of accumulators and a bank of registers (often called a register file) that can be used as operands for arithmetic and multiplication operations [14]. This facility is available in a number of DSPs, e.g. Texas Instruments TMS320C54x, NEC µPD7701x, Zoran ZR3800x, AT&T DSP16xx, and Motorola DSP5600x. Data block processing commences when a coefficient and $L$ data samples are fetched from the memory and stored into registers in the register file. Next, these data samples are presented to the multiplier through the registers and multiplied with the same coefficient one after the other, and their products are added to their respective accumulators. This is repeated for each coefficient, with only one data sample in the block being replaced with a new one each time. This reduces the switching activity at the coefficient inputs of the multiplier, since the same coefficient is used for all data samples in the block. In addition, fewer accesses to both data and coefficient memories are required since coefficient and data samples are obtained through registers. It is well known that register operations consume less power than memory operations [16]. Figure 2 shows the scheme with a block size of 3 ($L = 3$), for an example of a 6-tap filter ($N = 6$), at a given instant in time ($n = 3$).
When the block processing and coefficient segmentation algorithms are combined, a more powerful algorithm emerges in which multiplications are processed in blocks of fixed sizes, leading to a reduction in switched capacitance within the multiplier and on the data and coefficient memory buses. Each individual multiplication operation is in turn segmented for a further reduction in switched capacitance within the multiplier circuit.
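The segmentation rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes coefficients have already been quantised to signed integers, the name `segment` is hypothetical, and the handling of a zero coefficient (returning zero for both parts) is an assumption, since the text does not discuss that case.

```python
from typing import Tuple


def segment(h: int) -> Tuple[int, int]:
    """Split an integer coefficient h into (s, m) with h == s + m.

    s is zero or a signed power of two (realised by a shift) and m is the
    non-negative remainder applied to the hardware multiplier.
    """
    if h == 0:
        return 0, 0  # assumption: the zero coefficient is not covered in the text

    magnitude = abs(h)

    # Stage 1: iterate to the smallest power of two >= |h|, i.e. 2**i.
    power = 1
    while power < magnitude:
        power <<= 1

    # Stage 2: h is already a power of two, so a single shift realises it.
    if power == magnitude:
        return h, 0

    # Stage 3: choose s by polarity so that m = h - s is always positive.
    s = power >> 1 if h > 0 else -power  # 2**(i-1) for h > 0, -(2**i) for h < 0
    return s, h - s
```

For instance, `segment(100)` gives `(64, 36)` and `segment(-100)` gives `(-128, 28)`. In both cases the multiplier part is positive and smaller in magnitude than the original coefficient.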
Using the combined algorithm, the filtering commences by fetching $s_k$ and $m_k$ values and applying these to the shifter and multiplier inputs, respectively. Next, a block of $L$ data samples $(x_0, x_1, \ldots, x_{L-1})$ is fetched from the data memory and stored in the register file. This is followed by applying the first data sample, $x_0$, in the register file to both shifter and multiplier units. The resulting values from the shifter and multiplier units are then summed together and the final result is added to the first accumulator. This is repeated with the second data sample, $x_1$, in the register file and the final result of this is added to the second accumulator. All data samples in the register file are processed in a similar manner. Next, new $s_k$ and $m_k$ values are fetched and processed with almost the same data samples in the register file, in a manner similar to the above. The contents of the register file are updated with the addition of a single new data entry which replaces the first entry of the previous cycle. This procedure reduces the switching activity at the coefficient inputs of the multiplier for the following reasons: (a) the same coefficient is used for all data samples in the block, (b) the wordlength of the segmented coefficient ($m_k$) is less than that of the original coefficient ($h_k$), (c) the Hamming distance between consecutive coefficients ($m_k$ values) is reduced. In addition, fewer accesses to both data and coefficient memories are required since coefficient and data samples are obtained through internal registers.
The sequence of steps for the algorithm can be summarised as follows:
1. Clear all accumulators ($ACC_0$ to $ACC_{L-1}$).
2. Get the multiplier part, $m(N-1)$, of the coefficient $h(N-1)$ and apply it to the coefficient input of the multiplier.
3. Get the shifter part, $s(N-1)$, of the coefficient $h(N-1)$ and apply this to the control inputs of the shifter.
4. Get data samples $x[n-(N-1)], x[n-(N-2)], \ldots, x[n-(N-L)]$ and store these into data registers $R_0, R_1, \ldots, R_{L-1}$, respectively. This will form the first block of data samples.
5. Apply $R_0$ to both the multiplier and the shifter units. Add their results and the content of accumulator $ACC_0$ together and store the final result into accumulator $ACC_0$. Repeat this for the remaining data registers, $R_1$ to $R_{L-1}$, this time using accumulators $ACC_1$ to $ACC_{L-1}$, respectively.
6. Get the multiplier part, $m(N-2)$, and the shifter part, $s(N-2)$, of the next coefficient, $h(N-2)$, and apply these to the multiplier and the shifter inputs, respectively.
7. Update the data block formed in step (4) by getting the next data sample, $x[n-(N-L-1)]$, and storing it in data register $R_0$, overwriting the oldest data sample in the block.
8. Process the updated data block as in step (5). However, start processing with $R_1$, followed by $R_2, \ldots, R_{L-1}$, and $R_0$, in a circular manner. During this procedure use accumulators in the same order as data registers, e.g. first $ACC_1$, followed by $ACC_2, \ldots, ACC_{L-1}$, and finally $ACC_0$.
9. Process the remaining multiplier and shifter parts as in steps (6) to (8).
10. Get the first block of filter outputs, $y(n), y(n-1), \ldots, y(n-L)$, from $ACC_0, ACC_1, \ldots, ACC_{L-1}$, respectively.
11. Increment $n$ by $L$ and repeat steps (1) to (10) to obtain the next block of filter outputs.
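The steps above can be sketched functionally as follows. This is a simplified model under stated assumptions, not the authors' implementation: coefficients and data are signed integers, samples before the start of the input are taken as zero, and the circular register and accumulator ordering of the hardware is replaced by direct indexing into the input list. All function names are hypothetical, and `direct_fir` is included only to check the result against conventional filtering.

```python
from typing import List, Sequence, Tuple


def segment(h: int) -> Tuple[int, int]:
    """Split h into a shift part s (0 or a signed power of two) and m = h - s >= 0."""
    if h == 0:
        return 0, 0  # assumption: zero coefficients are not covered in the text
    power = 1
    while power < abs(h):
        power <<= 1
    if power == abs(h):
        return h, 0
    s = power >> 1 if h > 0 else -power
    return s, h - s


def shift_product(x: int, s: int) -> int:
    """Multiply x by s using a shift, where s is 0 or a signed power of two."""
    if s == 0:
        return 0
    shifted = x << (abs(s).bit_length() - 1)
    return shifted if s > 0 else -shifted


def block_fir(h: Sequence[int], x: Sequence[int], block_size: int) -> List[int]:
    """Compute the FIR output block by block, one coefficient at a time."""
    parts = [segment(c) for c in h]
    y: List[int] = []
    for n0 in range(0, len(x), block_size):
        size = min(block_size, len(x) - n0)
        acc = [0] * size                       # step 1: clear the accumulators
        for k in range(len(h) - 1, -1, -1):    # steps 2, 3, 6, 9: one coefficient per pass
            s_k, m_k = parts[k]
            for j in range(size):              # steps 5, 8: reuse s_k and m_k across the block
                idx = n0 + j - k
                sample = x[idx] if idx >= 0 else 0
                acc[j] += shift_product(sample, s_k) + m_k * sample
        y.extend(acc)                          # step 10: read out the block of outputs
    return y


def direct_fir(h: Sequence[int], x: Sequence[int]) -> List[int]:
    """Conventional direct-form FIR, used here as a reference."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]
```

With, say, `h = [3, -5, 8, 0, -7, 12]` and any integer input sequence, `block_fir(h, x, 3)` returns the same samples as `direct_fir(h, x)`. The arithmetic result is unchanged, and only the order of operations and the values presented to the multiplier differ.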

SIMULATIONS AND RESULTS
To demonstrate our results, we have implemented a two's complement (Baugh-Wooley) array multiplier, which was selected as an example of a commonly used multiplier in DSP implementation. Multipliers of 8 × 8, 16 × 16 and 24 × 24 bits were implemented using the Cadence VLSI suite with 0.7 µm CMOS technology. The multipliers were constructed using AND, OR, XOR and INVERTER gates only. Coefficient sets were obtained by designing eight practical FIR filters with the Remez exchange algorithm developed by Parks and McClellan [17]. These are:
(1) A low-pass filter with a filter order of $N = 24$.
(2) A band-pass filter with $N = 32$.
(3) A band-pass filter with $N = 50$, in which unequal weighting is used in the two stop-bands. Thus the peak error in the upper stop-band is ten times smaller than the peak error in the lower stop-band.
(4) A band-stop filter for $N = 31$ and with equal weighting in both pass-bands.
(5) A five-band filter for $N = 55$ with three stop-bands and two pass-bands. The weighting in each of the stop-bands is different, making the peak approximation error differ in each of these bands.
(6) A full band differentiator for $N = 32$ and the peak approximation error = 0.0062.
(7) A Hilbert transformer for $N = 20$ and the peak approximation error = 0.02, where the upper cutoff frequency is 0.5 and the lower cutoff frequency is 0.05.
(8) A band-pass filter with an arbitrary weighting function and for $N = 128$.
These benchmark examples were obtained from Ref. [17]. The coefficient sets were quantised to 8, 16, and 24 bits. This was followed by generating zero mean uniformly distributed data samples for each filter. Next, the coefficient sets were processed by the segmentation algorithm in order to produce $s_k$ and $m_k$ values for each coefficient $h_k$. This was followed by generating input simulation files for the Cadence Verilog-XL digital simulator, in which the generated input data samples were associated with the corresponding $m_k$ values for a given filter.
Verilog-XL uses the gate level netlist of the multiplier circuit for the simulation procedure. For each simulation the number of signal transitions was monitored. Capacitive information (wiring and loading capacitances) for each gate was extracted by performing a layout of the multiplier circuit. Both capacitive information and the switching activity figure were used to obtain the switched capacitance of each gate. This was then accumulated to give an overall figure for the switched capacitance of the multiplier (see Fig. 3).
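The accumulation of switched capacitance described above can be sketched as follows. This is an illustration under assumptions rather than the evaluation environment itself: the per-gate switching activity is taken as the number of transitions divided by the number of simulated clock cycles, and the gate names, function names and data layout are hypothetical.

```python
from typing import Dict


def switched_capacitance(transitions: Dict[str, int],
                         capacitance: Dict[str, float],
                         cycles: int) -> float:
    """Sum over gates of (transitions per clock cycle) x (gate capacitance in farads)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return sum(transitions[gate] / cycles * capacitance[gate] for gate in transitions)


def reduction_percent(reference: float, improved: float) -> float:
    """Percentage reduction of `improved` relative to `reference`."""
    return 100.0 * (reference - improved) / reference
```

The first function would be evaluated once per simulation run, for example once for conventional filtering and once for the combined algorithm, and the second would then express the difference as a percentage reduction of the kind reported in the tables.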
The average results for the eight filter examples are shown in Table I. In the case of the 8 × 8-bit multiplier, block processing alone achieves a maximum of 33.15% reduction for a block size of 2. On the other hand, coefficient segmentation alone achieves 53.59% reduction. However, when block processing is used together with coefficient segmentation the reduction is increased to 71.82% for a block size of 2. The power reduction profile for the 16 × 16 and 24 × 24-bit multipliers is similar to that of the 8 × 8-bit multiplier case, where the best reductions (56.50 and 39.29%, respectively) are achieved using block processing together with coefficient segmentation.
The above simulations were repeated after swapping the multiplier inputs in order to examine the effect on power reduction. The results are shown in Table II. It can be seen that the reductions in switched capacitance are increased in all cases. Specifically, the best reductions for the 8 × 8, 16 × 16 and 24 × 24-bit multiplier cases are increased to 80.00, 67.09 and 51.51%, respectively. Table III illustrates the effect of swapping multiplier inputs on the switched capacitance. It shows that, in the cases of the 8 × 8 and 16 × 16-bit multipliers, the switched capacitance increased for both conventional filtering and block processing, whereas it decreased, to varying degrees, for both coefficient segmentation alone and block processing combined with coefficient segmentation. In the case of the 24 × 24-bit multiplier, switched capacitance decreased in all cases. Our analysis revealed that this could be due to the fact that switching activity at the coefficient input bits of the multiplier is not uniform. This can clearly be seen in Fig. 4, where in the upper half of the coefficient word switching activity is higher for conventional filtering, and lower for both segmentation alone and block processing combined with segmentation. The figure also demonstrates the reductions both in switching activity and effective wordlength resulting from the use of our algorithm.
There is an overhead associated with the algorithm. This is mainly due to the added shift operations imposed by coefficient segmentation. However, it can be shown that this overhead is typically under 4% [15].

CONCLUSIONS
The authors present a combined block processing and coefficient segmentation algorithm for low power implementation of FIR filters. Low power consumption is achieved through a reduction in the amount of switched capacitance within the multiplier circuit and data/coefficient memory buses. This reduction in switched capacitance is achieved within a hierarchical framework in which coefficients, processed in fixed-size blocks, are segmented into sub-components that are less computationally complex. The algorithm is compared to conventional filtering implementations and those using block processing and coefficient segmentation alone. Results, obtained with different block and multiplier sizes, indicate up to 80% reduction in power consumption.