A Comparison of Bit Serial and Bit Parallel DCT Designs

INTRODUCTION
In recent times the Discrete Cosine Transform (DCT) has become the de facto standard for compressing video images prior to transmission over limited-bandwidth telecommunications channels. Indeed, the DCT has been incorporated into both the Joint Photographic Experts Group (JPEG) standard [1] and the Video Codec for Audiovisual Services Recommendation H.261 [2]. The transform is defined as a forward and inverse DCT pair.
The N-point forward and inverse DCT pair can be transformed into two (N/2)-point DCTs and two (N/2)-point IDCTs respectively (assuming N is an integral power of 2). If this procedure is continued until n and k both become zero, then the butterfly module of Figure 1 results (for N = 8). For an 8-point DCT there are 13 multiplications and 25 additions (subtractions).
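The splitting of an N-point DCT into two (N/2)-point DCTs can be checked numerically. The following is a minimal Python sketch, not the authors' implementation. It assumes the unnormalised DCT-II form X[k] = sum over n of x[n] cos((2n+1)k pi / 2N), which may differ from the paper's scaling, and the function names `dct_naive` and `dct_split` are illustrative only. It covers the forward transform and compares the recursive result with direct evaluation.

```python
import math

def dct_naive(x):
    """Direct evaluation of the (assumed) unnormalised DCT-II."""
    N = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]

def dct_split(x):
    """Same transform, computed by recursive splitting into two (N/2)-point DCTs.

    N must be an integral power of 2.
    """
    N = len(x)
    if N == 1:
        return [x[0]]
    half = N // 2
    # Sums feed the even-indexed outputs.
    u = [x[n] + x[N - 1 - n] for n in range(half)]
    # Differences, scaled by 2*cos((2n+1)pi/2N), feed the odd-indexed outputs.
    v = [(x[n] - x[N - 1 - n]) * 2 * math.cos((2 * n + 1) * math.pi / (2 * N))
         for n in range(half)]
    G = dct_split(u)
    H = dct_split(v)
    X = [0.0] * N
    for k in range(half):
        X[2 * k] = G[k]
    # H[k] = X[2k+1] + X[2k-1], with X[-1] = X[1], so the odd outputs unwind in order.
    X[1] = H[0] / 2
    for k in range(1, half):
        X[2 * k + 1] = H[k] - X[2 * k - 1]
    return X

sample = [12, 7, 3, 255, 0, 41, 9, 100]   # arbitrary 8-point test vector
direct = dct_naive(sample)
split = dct_split(sample)
print(max(abs(a - b) for a, b in zip(direct, split)))
```

The printed value is the largest disagreement between the two methods, which should be at the level of floating-point rounding.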
Implementation of the DCT was investigated for both bit parallel and bit serial designs. In each case the arrangement of the algorithm was investigated to determine how it could be implemented to meet the design criteria specified in Section 2.

VLSI DESIGN SPECIFICATION
The constraints imposed on the VLSI implementation of the refined Lee DCT algorithm stemmed primarily from the VLSI design tools at our disposal, together with the associated silicon foundry support. Our design was allocated a floorplan of up to 6.8 × 6.8 mm of a multi-project chip to be fabricated in 2.5 µm double metal CMOS using the AWA Microelectronics ASIC facility at Homebush, Sydney, Australia. The design is to operate in real time in the 10 to 15 MHz range, and process 288 × 352 pixel images at 30 frames per second. This requires 47520 sets of 8 data values to be processed per second ((288 × 352 × 30)/(8 × 8)), which equates to 2104 nsecs for a set of 8 data values.
The essential design criterion was to perform a 1D transform in real time. Additional (desirable) design criteria were: (i) easy adaptability to 2D transform computations, (ii) minimise the required silicon real estate, and (iii) simple timing and control circuitry.

BIT PARALLEL DESIGN
The first design considered was a pipelined bit parallel design, based on that of Arnould & Durge [8]. Their design incorporated a single parallel multiplier, together with a pipelined subtractor and parallel adder, but was based on a different DCT algorithm to Lee. The bit parallel design adopted in the present study (Figure 2) uses a parallel multiplier with a pipelined subtractor connected to its front end. This maximizes the data flow through the ALU, since all but one multiplication is preceded by a subtraction (Figure 1).
FIGURE 2 Bit Parallel Floor Plan
There is an adder in parallel with the subtractor/multiplier to accommodate the five additions at the end of the butterfly. The results of the adder and multiplier are placed back onto the data bus and stored in the buffers from where the original two data values were read. This requires 13 passes through the bit parallel ALU for multiplications, and a further 5 passes to evaluate the 5 final additions.
The bit parallel multiplier is based on the design of Weste and Eshraghian [9], and requires 8-bit coefficients. With 12 data bits and 8-bit multiplier coefficients, the settle time of the multiplier is 12 + 8 + 1 = 21 adder settle times. As there is a subtraction or addition preceding the multiplications, these settle times have to be included in the settle time of the ALU. The simplest 12-bit adder or subtractor would consist of 12 adders connected in series and have a settle time of 12 adders. However, if the 12-bit adder or subtractor is designed as a carry select adder, then the settle time can be reduced to 6 adder settle times. Thus the total settle time of the ALU is 21 + 6 = 27 adder settle times. SPICE analysis gave a worst case adder settle time of 5 nsecs, which amounts to 27 × 5 = 135 nsecs per ALU pass. Therefore, the total ALU processing time is a minimum of 135 nsecs for multiplications, and only 6 × 5 = 30 nsecs for additions. The data can then be read on the next clock cycle following these settle times. Performance of a bit parallel design is limited by the settle time of its slowest combinational element, which in the present design is the multiplication ALU.
Loading data into the multiplication ALU can be done on the first cycle of the ALU evaluation period, after which 135 nsecs are required to allow the ALU to settle. The ALU is then read on the next clock cycle. Using a clock rate of 15 MHz, a single clock cycle is 66.7 nsecs. Therefore, the 135 nsec settle time in the ALU plus one clock cycle to read the results requires three clock cycles.
For the final 5 additions one cycle is needed to load the adder and allow it to settle, then a second cycle is needed to read the results. The total processing time of a single transformation is then (13 × 3 × 66.7) + (5 × 2 × 66.7) = 3268.3 nsecs, which is outside the allowable 2104 nsecs.
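The settle-time and pass-count arithmetic for the single-ALU bit parallel design can be reproduced in a few lines. This is only a bookkeeping sketch in Python. The constant names are illustrative, and the figures (5 nsec adder settle time, 66.7 nsec clock period, 3 and 2 cycles per pass, 2104 nsec budget) are taken directly from the text rather than derived independently.

```python
ADDER_SETTLE_NS = 5.0    # worst-case adder settle time reported from SPICE
CLOCK_NS = 66.7          # one cycle at 15 MHz, rounded as in the text
BUDGET_NS = 2104         # real-time budget per set of 8 values, as stated in the text

multiplier_settle = 12 + 8 + 1            # data bits + coefficient bits + 1, in adder delays
alu_settle = multiplier_settle + 6        # plus the carry select adder/subtractor
alu_settle_ns = alu_settle * ADDER_SETTLE_NS

MULT_PASSES, MULT_CYCLES = 13, 3          # load and settle, then read
ADD_PASSES, ADD_CYCLES = 5, 2             # load/settle cycle, then read cycle

total_ns = (MULT_PASSES * MULT_CYCLES + ADD_PASSES * ADD_CYCLES) * CLOCK_NS

print(f"ALU settle time: {alu_settle_ns:.0f} nsecs")        # 135 nsecs
print(f"Single transform: {total_ns:.1f} nsecs")            # 3268.3 nsecs
print(f"Within budget: {total_ns <= BUDGET_NS}")            # False
```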
The reader may ask why the ALU is not divided into two sections, so that on the first cycle data could be written into the first section while results are concurrently read from the second section, and on the second cycle data is fed from the first section into the second section. On the third cycle the results in the second section are read and the first section is loaded again. This would require only 2 × 66.7 = 133.4 nsecs to process each pass through the ALU.
Considering that there would need to be an initial load cycle for the first data into the ALU, this arrangement would result in a processing time of 66.7 + (13 × 2 × 66.7) + (5 × 2 × 66.7) = 2468 nsecs. This design would then require a second set of data buffers so that data could be read from and written to the ALU concurrently. By continuing to refine and optimize this design it may be possible to achieve the required processing time of 2104 nsecs.
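The timing of the two-section ALU can be tallied in the same way. The Python sketch below uses only the cycle counts quoted in the text. The helper name `two_section_total_ns` and its parameters are hypothetical, introduced purely for illustration.

```python
CLOCK_NS = 66.7      # one cycle at 15 MHz, rounded as in the text
BUDGET_NS = 2104     # real-time budget quoted in the text

def two_section_total_ns(mult_passes=13, add_passes=5,
                         cycles_per_pass=2, fill_cycles=1):
    """Total time when each pass takes 2 cycles, plus one initial load cycle."""
    cycles = fill_cycles + (mult_passes + add_passes) * cycles_per_pass
    return cycles * CLOCK_NS

total_ns = two_section_total_ns()
max_cycles = int(BUDGET_NS // CLOCK_NS)   # whole clock cycles that fit in the budget

print(f"Two-section ALU: {total_ns:.1f} nsecs")               # 2467.9 nsecs
print(f"Cycles available within budget: {max_cycles}")        # 31
print(f"Excess over budget: {total_ns - BUDGET_NS:.1f} nsecs")
```

On these figures the arrangement needs 37 cycles where only 31 fit, so it still falls short of the 2104 nsec requirement.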
The only way we could imagine achieving this would be to add a second ALU. However, this modification would have resulted in the design exceeding the allowable silicon area.

BIT SERIAL DCT BUTTERFLY DESIGN
In investigating bit serial designs it was necessary to formulate the algorithm so that the data could be passed through the computational elements in bit serial form. For example, the original Lee algorithm requires that at each point of the butterfly the relevant data must have already been evaluated. Therefore, control over the processing must enable the data to be processed in the correct order. Investigations failed to present a method of implementing the original algorithm that would meet the throughput requirements and so efforts were then directed to refining the original algorithm into a form that would allow the throughput to meet these requirements.
Closer examination of the INMOS approach reveals that most of the matrix coefficients are duplicated in a number of positions within the matrix. By using this approach of expanding the transformation into eight equations and then removing the duplication, the Lee algorithm equations can be expanded to the point where the eight transformed values are a function of the eight input values, as indicated in Figure 3. The transform equations take one of three forms, containing either 1, 2 or 4 multiplications. This enables the computations to be reduced to 22 multiplications and 28 additions (subtractions). This represents a significant improvement over the INMOS design [5], which requires 64 multiplications and 64 additions.
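The duplication of coefficients within the 8 × 8 transform matrix is easy to see numerically. The Python sketch below assumes an unnormalised DCT-II matrix with entries cos((2n+1)k pi / 16), which may differ in scaling from the matrix used in [5]. It illustrates only the duplication, not the derivation of the 22-multiplication form.

```python
import math

N = 8
# Assumed unnormalised 8-point DCT-II matrix: row k, column n.
matrix = [[math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
          for k in range(N)]

# Collect the distinct coefficient magnitudes (signs ignored, rounded to absorb
# floating-point noise).
magnitudes = sorted({round(abs(c), 6) for row in matrix for c in row})

print(f"Matrix entries: {N * N}")                         # 64
print(f"Distinct magnitudes: {len(magnitudes)}")          # 8
print(magnitudes)
```

Of the 64 entries only eight magnitudes are distinct, and one of those is the trivial value 1 from the k = 0 row, which is the redundancy that the expansion into eight equations exploits.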
The following equations are derived from Figure 3.

BIT SERIAL DESIGN
There are a number of different ways in which the above equations could be implemented. The final design was adopted primarily because it had the highest throughput rate, the simplest timing and control circuitry and could be easily adapted to 2D transform calculations. This design requires 30 cycles of the 15 MHz clock, which results in 30 × 66.7 = 2001 nsecs to perform a single transform. Timing and control circuitry for sequencing the data through the chip is achieved by routing the two clock phases throughout the design rather than enabling different sections at different times as in the bit parallel design. Although this bit serial design is slightly larger in size than alternative bit serial designs investigated, it still fits within the 6.8 × 6.8 mm allocation.
Now since the multiplication coefficients are already known, and multiple data streams are processed in parallel, these coefficients can be incorporated on-chip, rather than resorting to a generic multiplier which could cater for more than one coefficient. This leads to a simpler design.
From Figure 4, we see that each generic bit serial multiplier stage requires an adder together with associated AND and delay elements. The coefficient enable AND-gate can be removed, as each multiplier unit which has the coefficient enabled is simply connected to the above data line.
The throughput of this design can be improved further by realising that several consecutive adders compute results during a single clock period. Accordingly, the delay elements between the adder cells and the data stream can be removed, as indicated in Figure 5.
Figure 6 shows a specific multiplier cell for the coefficient 10101₂. In stages where the coefficient bit is zero, the adder and AND cells can be removed and replaced with delay cells for the data transfer. Thus, 22 generic multipliers would require more circuitry than 22 specific multipliers, with some of these duplicated in order to enable parallel processing.
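The idea of a coefficient-specific multiplier can be modelled in software. The Python sketch below is a behavioural illustration only, not the cell arrangement of Figures 4 to 6. It assumes unsigned data streamed least significant bit first, and all function names are hypothetical. An adder stage is instantiated only where the hardwired coefficient has a 1 bit, while a 0 bit contributes nothing beyond delay.

```python
def to_bits(value, width):
    """LSB-first bit stream of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Reassemble an integer from an LSB-first bit stream."""
    return sum(bit << i for i, bit in enumerate(bits))

def delay(bits, cycles):
    """Delay an LSB-first stream by `cycles` clocks (equivalent to a left shift)."""
    return ([0] * cycles + bits)[:len(bits)]

def serial_add(a_bits, b_bits):
    """Bit-serial adder: one full adder plus a one-bit carry register."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out

def multiply_by_constant(data, coefficient="10101", width=24):
    """Multiply `data` by a hardwired binary coefficient (written MSB first)."""
    stream = to_bits(data, width)
    acc = [0] * width
    for weight, bit in enumerate(reversed(coefficient)):
        if bit == "1":
            # Adder stage exists only for 1 bits of the coefficient.
            acc = serial_add(acc, delay(stream, weight))
        # For a 0 bit no adder or AND cell is needed, only the delay path.
    return from_bits(acc)

print(multiply_by_constant(100))    # 2100, since 10101 in binary is 21
```

For the coefficient 10101₂ only three of the five stages need an adder in this model, which is the kind of saving over a generic multiplier that the text describes.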

RESULTS AND DISCUSSION
The first design considered for implementing the DCT chip was a bit parallel one. It was found that two ALUs are necessary in order to perform DCTs in real time [8]. With a single ALU the best achievable time for a single transform was 2468 nsecs, which exceeds the 2104 nsecs required to perform a transformation in real time. Therefore, a second ALU would be required to enable the calculations to be performed in parallel and to bring the processing within the real time constraint. Moreover, the limited silicon real estate at our disposal prohibited any further consideration of a parallel solution.
Several bit serial designs were then considered, each of which used a refined version of Lee's algorithm in order to improve computational efficiency. The final design was selected on the basis of having the fastest throughput, simplest timing and control circuitry and easy extension to 2D transforms, despite its slightly larger size.
The Lee algorithm, in both original and refined forms, was verified via C code implementations on a SPARCstation (by confirming that a DCT cascaded with an inverse DCT produced the original data set). The bit serial DCT chip cycle time to process eight data points comprises: (a) 8 cycles to read (write) data to (from) the chip, plus (b) 21 cycles to cycle the data through the ALU, plus (c) 1 more cycle to write the last data bit.
FIGURE 6 Specific Multiplier Cell for a coefficient of 10101₂. NOTE: If the most significant coefficient bit is 0, then the partial product is initialized to '0' and not to 'data in'.
The DCT chip uses an integer multiplication unit, and is thus subject to roundoff errors. The maximum expected error occurs when multiplying the largest result, coefficient x(0), by half the multiplier accuracy: (255 × 8) × (1/256) × (0.5) ≈ 3.984. Actual tests using TREK (see below) revealed that an error of 2.0 would be more typical in practice.
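The cycle budget and the roundoff bound quoted above are simple enough to check directly. The Python sketch below restates the figures from the text. The variable names are illustrative, and the 66.7 nsec period is the rounded 15 MHz clock period used throughout.

```python
CLOCK_NS = 66.7                      # one cycle at 15 MHz, rounded as in the text
BUDGET_NS = 2104                     # real-time budget quoted in the text

io_cycles = 8                        # read (write) data to (from) the chip
alu_cycles = 21                      # cycle the data through the ALU
final_cycles = 1                     # write the last data bit
total_cycles = io_cycles + alu_cycles + final_cycles
transform_ns = total_cycles * CLOCK_NS

# Worst-case roundoff: largest result (255 x 8) times half of the
# 8-bit multiplier resolution (1/256).
max_roundoff = (255 * 8) * (1 / 256) * 0.5

print(f"Cycles per transform: {total_cycles}")              # 30
print(f"Transform time: {transform_ns:.0f} nsecs")          # 2001 nsecs
print(f"Within budget: {transform_ns <= BUDGET_NS}")        # True
print(f"Maximum roundoff error: {max_roundoff}")            # 3.984375
```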
SPICE could only be used with individual modules of the DCT chip design, not with the entire design. In this manner, we were able to confirm that the following modules performed as expected: (i) multiplier cell, (ii) delay buffer, (iii) clock driver, and (iv) input and output buffers.
The only extensive testing which was possible using the VLSI design tools at our disposal involved TREK, the switch-level simulator. Data sets were run using the TREK simulator to verify that the design was correct. The tests covered: (i) a normal 1D transform, (ii) transforming the results of the 1D transform to simulate a 2D transform, (iii) interrupt control of the input and output control lines, and (iv) cascading two transforms in order to demonstrate the chip's cycle performance. All these tests were verified with a C program to ensure the correct results were obtained.

CONCLUSION
The final chip design had the following characteristics: (i) 15 MHz master clock which is supplied from the master processor, (ii) parallel data IO (requiring 8 clock pulses each), (iii) two IO buffers, for storage of both the (eight 12-bit) input and transformed data, (iv) single-pass (integer) ALU, (v) 8 clock pulses required to read (write) data to (from) the chip, (vi) 22 clock pulses required to transform the data, and (vii) occupies 5.4 mm of silicon (cf. 6.8 × 6.8 mm allocation).
Thus the time required to compute an 8-point 1D DCT is 2001 nsecs, which increases to 32 µsecs for a 2D transformation (an 8 × 8 point DCT). The design objectives were satisfied, primarily the ability to perform 1D DCTs in real time. For the purposes of comparison, Table 1 compares the performance of our bit serial DCT design with earlier bit parallel ones [10,11]. Allowing for restrictions in transistor count and clock rate, the bit serial design exhibits throughput comparable with the bit parallel chips.
Even though our final bit serial design was not committed to silicon, we were able to gain significant insights into the advantages of various bit parallel and bit serial solutions to this problem. Based on our experience, we conclude that bit serial is the preferred approach for this application. Indeed, this confirms previous work which holds that bit serial designs are admirably suited to applications that involve computationally complex tasks, exhibit a high degree of concurrency and have a large dynamic range, yet have a narrow signal processing bandwidth [12]. We concur from our present study that the advantages of bit serial designs are their efficiency (in terms of speed/area product) and simple interconnections (which has significant repercussions for realisation in VLSI form).
FIGURE 1 8-point Lee DCT Forward Transform
FIGURE 3 Expanded Lee Algorithm