Low Power VLSI Implementation of the DCT on Single Multiplier DSP Processors *

A generic multiplication scheme for the low power VLSI implementation of the DCT is described in this paper. The scheme concurrently processes blocks of cosine coefficient and pixel values during the multiplication procedure, with the aim of reducing the total switched capacitance within the multiplier circuit. The cosine coefficients, within each block, are manipulated such that some are processed using shift operations only. The remaining coefficients are presented to the multiplier inputs as a sequence, ordered according to bit correlation between successive cosine coefficients. The paper describes the multiplication scheme and the power evaluation environment used, and presents results, with a number of standard benchmark examples, demonstrating up to 50% power saving.


INTRODUCTION
Currently there is considerable interest in the low power implementation of the Discrete Cosine Transform (DCT). This is mainly due to the DCT being the computational bottleneck of standards such as JPEG and MPEG [1]. Most research work considering low power implementation of the DCT has targeted reducing the computational complexity of the design or modifying it for operation under a lower supply voltage [1,2]. Both these techniques have a limited effect on power reduction. Another major contribution to power consumption is due to the effective switched capacitance [3,4]. Only a few researchers have targeted reducing the power of a DCT implementation through a reduction in the amount of switched capacitance. This reduction has been achieved through techniques such as the detection of zero-valued DCT coefficients and lookup table partitioning [5].
This paper presents a generic multiplication scheme for the low power VLSI implementation of the DCT on CMOS-based Digital Signal Processors. A basic form of the scheme, which operated on primary images with bi-level pixel value patterns, was first proposed in [6]. The scheme outlined in this paper can be used with a wide variety of images, including those with continuous and sharply varying pixel values. The scheme concurrently processes blocks of cosine coefficient and pixel values during the multiplication procedure, with the aim of reducing the total switched capacitance within the multiplier circuit. The cosine coefficients, within each block, are manipulated such that some are processed using shift operations only. The remaining coefficients are presented to the multiplier inputs as a sequence, ordered according to bit correlation between successive cosine coefficients. The paper describes the multiplication scheme and the power evaluation environment used, and presents results, with a number of standard benchmark examples, demonstrating up to 50% power savings.

* Based on "Low Power DCT implementation approach for VLSI DSP Processors" by S. Masupe and T. Arslan, which appeared in Proceedings of International Symposium on Circuits and Systems, Florida, 30 May-2 June 1999, pages I-149 to I-152. (C) 1999 IEEE.
Corresponding author, e-mail: Arslan@ee.ed.ac.uk

IMPLEMENTATION
The computational bottleneck for the implementation of the DCT is the multiplication of the cosine coefficient matrix [E] by the pixel matrix [D] in order to obtain the DCT coefficients [C], i.e.,

$$[C] = [E]\,[D] \qquad (1)$$

$$C_{ik} = \sum_{x=1}^{n} E_{ix}\, D_{xk}, \qquad i, k = 1, \ldots, n \qquad (2)$$

In some applications where the number of multipliers is limited, this implies new data at both multiplier inputs every time a multiplication is conducted. This in turn implies a large switching activity inside the multiplier circuit.
A number of experiments were carried out with some cosine matrix and various pixel matrices.
Careful analysis of the multiplication procedure revealed two distinct categories of cosine elements: (1) elements that are powers of 2, and (2) elements that are not powers of 2. For this reason, the algorithm processes the elements in (1) by performing a simple shift operation. It is well known that the switched capacitance of a shift is significantly less than that of a multiplication [2].
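The split between the two categories of cosine elements can be sketched as follows. This is a minimal illustration, assuming integer (fixed-point) coefficients; the function names `is_power_of_two` and `multiply_by_coefficient` are hypothetical and do not come from the paper.

```python
def is_power_of_two(value: int) -> bool:
    """True when value is a positive integer with exactly one bit set."""
    return value > 0 and (value & (value - 1)) == 0


def multiply_by_coefficient(pixel: int, coeff: int) -> tuple[int, str]:
    """Return (product, path) where path names the unit that would be used.

    Assumption: coefficients are integers, e.g. after fixed-point scaling.
    Coefficients whose magnitude is a power of 2 go to the shifter,
    all others go to the multiplier.
    """
    magnitude = abs(coeff)
    if is_power_of_two(magnitude):
        shift = magnitude.bit_length() - 1   # log2 of the magnitude
        product = pixel << shift             # shift replaces the multiply
        path = "shift"
    else:
        product = pixel * magnitude
        path = "multiplier"
    return (-product if coeff < 0 else product), path


if __name__ == "__main__":
    for c in (64, -32, 91):
        print(c, multiply_by_coefficient(35, c))
```

Only the second category reaches the multiplier, so only those coefficients take part in any later reordering.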
Examination of the multiplication procedure embodied in Eq. (2) reveals more scope for power optimisation. This can be demonstrated by expanding Eq. (2) with n = 8, as shown in Figure 1. As can be deduced from the figure, when computation is performed on the basis of evaluating coefficients successively, i.e., C11, ..., C81, then new pixel and coefficient values are processed at both inputs of the multiplier. If the multiplication procedure is performed on a column-after-column basis (i.e., Ei1 * D11, Ei2 * D21, and so on) then the value of Dxk will be constant (for 8 multiplications) at the corresponding multiplier input. This implies less switching activity at the multiplier inputs and memory buses.

For each column being processed, the multiplication procedure outlined above has the product form (m, constant), where m are the cosine matrix elements in the column being processed. The constant remains valid only for these values of m; if the column changes, a new constant is activated. The multiplication results are unchanged irrespective of the order in which the constant is multiplied by the sequence of cosine coefficients in the column. For this reason, the algorithm performs the multiplication of elements in (2) in a column-by-column fashion and orders the elements in the column being processed according to some criterion. In our case the criterion is minimum switching activity (Hamming distance) between consecutive coefficients multiplied by the constant. This ensures that similar elements are successively applied to the inputs of the multiplier, causing minimum switching activity within the internal circuit of the multiplier. Although our investigations were carried out using the example of ordering elements according to minimum Hamming distance, in practice any ordering algorithm can be used, and the amount of power saving is determined by the power of the algorithm used. The steps in Figure 2 outline the algorithm, which commences with the following initialisation steps:
(1) Process the power-of-2 entries Eix with a shift operation.
(2) Order the remaining coefficients according to some ordering algorithm.
(3) Save entries Eix in the (E) memory, see Figure 3, with elements of the same column being adjacent to each other, followed by the next column, and so on.
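The column-based schedule and the Hamming ordering can be illustrated with a small sketch that counts bit transitions at the two multiplier inputs. Several things here are assumptions rather than details from the paper: operands are 8-bit integers, the ordering is a greedy nearest-neighbour pass that starts at the first entry of each column, and input toggles are used only as a rough proxy for switching inside the multiplier. All function names are hypothetical.

```python
WIDTH = 8  # assumed operand width in bits


def hamming(a, b, width=WIDTH):
    """Number of differing bits between two width-bit two's-complement words."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def order_by_hamming(coeffs):
    """Greedy ordering (an assumption): start at the first entry, then keep
    picking the unused entry closest in Hamming distance to the last one.
    Returns original indices so each product reaches the right register."""
    remaining = list(range(len(coeffs)))
    order = [remaining.pop(0)]
    while remaining:
        last = coeffs[order[-1]]
        nxt = min(remaining, key=lambda j: hamming(last, coeffs[j]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


def input_toggles(pairs):
    """Bit transitions at both multiplier inputs over a sequence of operands."""
    return sum(hamming(e0, e1) + hamming(d0, d1)
               for (e0, d0), (e1, d1) in zip(pairs, pairs[1:]))


def traditional(E, D):
    """Evaluate each C[i][k] in turn: both inputs change at every multiply."""
    n = len(E)
    pairs = [(E[i][x], D[x][k])
             for k in range(n) for i in range(n) for x in range(n)]
    C = [[sum(E[i][x] * D[x][k] for x in range(n)) for k in range(n)]
         for i in range(n)]
    return pairs, C


def column_based(E, D, ordered=False):
    """Hold D[x][k] constant while a whole column of E is multiplied by it."""
    n = len(E)
    pairs = []
    C = [[0] * n for _ in range(n)]
    for k in range(n):
        for x in range(n):
            column = [E[i][x] for i in range(n)]
            idx = order_by_hamming(column) if ordered else list(range(n))
            for i in idx:
                pairs.append((column[i], D[x][k]))
                C[i][k] += column[i] * D[x][k]  # register R_i accumulates
    return pairs, C


if __name__ == "__main__":
    E = [[3, -5, 7, 1], [2, 9, -4, 6], [-8, 1, 5, 2], [4, 4, -1, 3]]
    D = [[35, 10, 0, 255], [32, 7, 1, 2], [30, 200, 3, 4], [25, 9, 8, 7]]
    for name, (pairs, _) in (("Traditional", traditional(E, D)),
                             ("Column-based", column_based(E, D)),
                             ("Hamming", column_based(E, D, ordered=True))):
        print(name, input_toggles(pairs))
```

All three schedules perform the same multiplications and yield the same [C]; only the order in which operand pairs reach the multiplier differs. The toggle counts depend on the data, so the sketch makes no claim about the size of any saving.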
To illustrate the above algorithm, consider an example with 4 × 4 matrices [E] and [D]. After the first iteration (i = 1, x = 1 and k = 1), the first column, Ei1, is ordered according to minimum Hamming distance to produce the ordered sequence (2.00, 0.77, 1.41, 1.85). The original locations of these elements are stored. Next, the above sequence is multiplied by the first entry in Dx1, 35. The entries in registers Ri will be (70, 64.75, 49.35, 26.95), where the first entry (70) is saved in R1, the second entry in R2, and so on. Similarly, at the end of iteration 2 (i = 2, x = 2 and k = 1), Ri will contain (134, 89.39, 4.23, -32.25). The last iteration (i = 4, x = 4 and k = 1) for this particular column results in Ri containing the first column of the DCT coefficient matrix [C], i.e., Ri = Ci1 = (244, 18.96, 0, 1.35)^T. This procedure is carried out until x = k = n, at which point Ri will contain Ci4.

As it stands, the algorithm can be implemented on traditional DSPs without any loss in throughput. However, for high throughput applications, a modified processor architecture is required. The architecture, a simplified version of which is shown in Figure 3, requires an internal register bank in order to store the partial products, (R), which eventually result in the DCT coefficients, [C]. In addition, a special memory unit is allocated for both the cosine and the pixel matrix elements, Eix and Dxk respectively. A shifter is included to cope with the additional shift operations. The multiplier and the adder units are included to perform the normal multiply-add DSP operations. Since the outputs of both the multiplier and the shifter have to share the same input of the register bank, a multiplexer is needed to select which output drives the register bank input. Another multiplexer is required to handle outputs from the multiplier/shifter and adder units, since some of the inputs to the register bank proceed directly to the bank without passing through the adder.

SIMULATION AND RESULTS
Figure 4 illustrates the framework utilised for the evaluation of the scheme with a number of 512 × 512 pixel example benchmark images. These, which include Lena, Man and checked images, are shown in Figure 5. The cosine coefficient matrix used was obtained using the MATLAB signal processing toolbox [9]. This was scaled such that entries can be represented by numbers between -128 and 127. The evaluation environment is based upon the MPEG standards, where images are processed in blocks of 8 × 8. For this reason our results are obtained with n = 8, i.e., an 8 × 8 cosine matrix is used with 512 × 512 pixel images being processed in blocks of 8 × 8. An 8 × 8-bit array multiplier was constructed, using the Cadence VLSI suite with ES2 0.7 µm CMOS technology, and processed down to layout level.

A C-program based test-fixture mapping system was developed to generate input simulation files for the Verilog-XL digital simulator [10]. This involved forming the appropriate image-pixel/cosine-coefficient pairs, in the order imposed by the multiplication algorithm, so that they can be applied to the inputs of the hardware multiplier. Three types of input simulation files are generated, representing the use of one of the following:
(1) Traditional cosine DCT multiplication (Traditional).
(2) Column-based multiplication algorithm without ordering (Column-based).
(3) Column-based multiplication algorithm with ordering according to minimum Hamming distance (Hamming).
In each simulation, the number of signal transitions (switching activity) at the output of each gate is monitored. Capacitive information for each gate is extracted from the layout of the multiplier circuit. Both of these are used to obtain a figure for the total switched capacitance of the multiplier.

Table I illustrates the results obtained with the different benchmark images. Clearly, power saving is achieved in all cases, with a maximum of 50% with the checked image using Hamming.

The paper presents an effective scheme for the low power VLSI implementation of the DCT within an MPEG based environment. The scheme reduces power through the utilisation of shift operations, where possible, together with sequencing

FIGURE 3 Simplified architecture of the processor.

FIGURE 4 Framework for the algorithm evaluation.

FIGURE 5 Some of the tested images.

a Ph.D. research student at the University of Edinburgh. His research interests include Low Power VLSI Design, DCT-Based Image Compression Techniques, HDL based