This paper proposes a novel approach for the implementation of the 2D DWT using hybrid wave-pipelining (WP). A digital circuit may be operated at a higher frequency by using either pipelining or WP. Pipelining requires additional registers, which results in more area, power dissipation, and clock routing complexity. Wave-pipelining does not have any of these disadvantages but requires a complex trial-and-error procedure for tuning the clock period and the clock skew between input and output registers. A hybrid scheme is therefore proposed to get the benefits of both pipelining and WP. Two automation schemes are proposed for the implementation of the 2D DWT using hybrid WP on both Xilinx and Altera FPGAs. In the first scheme, a built-in self-test (BIST) approach is used to choose the clock skew and clock period for the I/O registers between the wave-pipelined blocks. In the second scheme, an on-chip soft-core processor is used to choose the clock skew and clock period. The results for the hybrid WP are compared with nonpipelined and pipelined approaches. From the implementation results, the hybrid WP scheme requires the same area as the nonpipelined scheme but is faster by a factor of 1.25–1.39. The pipelined scheme is faster than the hybrid scheme by a factor of 1.15–1.39, at the cost of an increase in the number of registers by a factor of 1.78–2.73, an increase in the number of LEs by a factor of 1.11–1.32, and increased clock routing complexity.

Field-programmable
gate arrays (FPGAs) have grown
enormously in their complexity and can encompass
all the major functional elements of a complete end product into a single
chip [

Joint Photographic Experts Group 2000 (JPEG2000) is a recently standardized image compression algorithm that provides significant enhancements over the existing JPEG standard. JPEG2000 differs from widely used compression standards in that it relies on the DWT and uses embedded bit-plane coding of the wavelet coefficients. The DWT has traditionally been implemented using convolution or FIR filter bank structures. These structures require both a large number of arithmetic computations and a large memory for storage, neither of which is desirable for high-speed/low-power image processing applications.

A new multiplier algorithm, denoted the Baugh-Wooley pipelined
constant coefficient multiplier (BW-PKCM), is proposed and used for the study
and comparison of the distributed arithmetic algorithm (DAA) and lifting schemes on FPGAs in [

The operating frequency of the 2D DWT may be increased if it is implemented using either pipelining or WP. Pipelining results in the highest operating frequency but has a number of disadvantages, such as increased area, power dissipation, and clock routing complexity. WP has been proposed as one of the techniques for overcoming these limitations: it increases the speed without adding to the clock routing complexity. The proposed hybrid scheme aims to combine the advantages of both pipelining and wave-pipelining.

The organization of the rest of the paper is as follows: In Section

The lifting-based
implementation of one-level 2D DWT may be computed using filter banks as shown
in Figure

Subband decomposition for one-level 2D DWT.

In the lifting scheme, the odd and
even input samples are processed by five lifting blocks

Details of

(i) In Figure

(ii) The delay in the individual stages is reduced further by using a constant coefficient multiplier (KCM).

The KCM uses a ROM to find the product of a constant and a variable. The variable is fed as the address to the ROM, which contains the products corresponding to all possible values of the variable. When the ROM is implemented using 4-input look-up tables (LUTs), a number of stages of LUTs and adders are required to find the product. The speed of the KCM can be increased by introducing pipelining registers at the outputs of the ROMs and adders.
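The ROM-and-adder structure described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: operands are unsigned, the variable is split into 4-bit slices to mirror 4-input LUT addressing, and the names `build_kcm_roms` and `kcm_multiply` and the constant 203 are hypothetical. It does not model the Baugh-Wooley handling of signed operands or any pipelining registers.

```python
def build_kcm_roms(constant, operand_bits, lut_inputs=4):
    """Precompute one small ROM per 4-bit slice of the variable operand.

    Every ROM holds constant * v for each possible slice value v.
    """
    slice_count = -(-operand_bits // lut_inputs)  # ceiling division
    table = [constant * v for v in range(1 << lut_inputs)]
    return [list(table) for _ in range(slice_count)]


def kcm_multiply(roms, x, lut_inputs=4):
    """Multiply x by the constant baked into roms using lookups and adds."""
    mask = (1 << lut_inputs) - 1
    product = 0
    for index, rom in enumerate(roms):
        shift = index * lut_inputs
        slice_value = (x >> shift) & mask      # address applied to this ROM
        product += rom[slice_value] << shift   # weighted partial product, then add
    return product


# Hypothetical 8-bit example: two ROMs and one addition replace a full multiplier.
roms = build_kcm_roms(constant=203, operand_bits=8)
print(kcm_multiply(roms, 100) == 203 * 100)
```

In hardware, each lookup and each addition in the loop corresponds to a LUT stage or an adder stage, which is where pipelining registers would be placed.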

The detailed diagram of the

In the overlapping scheme, the image block is formed such that the number of pixels overlapping between adjacent blocks along the vertical and horizontal directions is equal to the order of the filter. For example, for the 9/7 biorthogonal filter used for the 2D DWT, horizontally adjacent blocks should overlap by 4 pixels on the left and 4 on the right. Similarly, vertically adjacent blocks should overlap by 4 pixels on the top and 4 on the bottom. For the blocks on the image boundary, overlapping needs to be done only on the nonboundary edges.
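The block-forming rule above can be sketched as index arithmetic. This is a minimal illustration, assuming each block has a non-overlapping core of `core` pixels that is extended by the overlap on every nonboundary edge; the function names and the 16 x 16 example size are hypothetical and are not taken from the paper.

```python
def overlapped_ranges(length, core, overlap=4):
    """Return (start, stop) index pairs along one axis, stop exclusive.

    Each core segment is widened by `overlap` pixels on both sides,
    except where that would cross the image boundary.
    """
    ranges = []
    for start in range(0, length, core):
        stop = min(start + core, length)
        ranges.append((max(start - overlap, 0), min(stop + overlap, length)))
    return ranges


def overlapped_blocks(height, width, core, overlap=4):
    """Combine the row and column ranges into 2D block extents."""
    rows = overlapped_ranges(height, core, overlap)
    cols = overlapped_ranges(width, core, overlap)
    return [(r, c) for r in rows for c in cols]


# Hypothetical 16 x 16 image split into 8 x 8 cores with a 4-pixel overlap.
for row_range, col_range in overlapped_blocks(16, 16, core=8):
    print(row_range, col_range)
```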

The concept of
wave-pipelining has been described in a number of previous works [

A combinational logic circuit with input and output registers.

Temporal/spatial diagram of data flow through the combinational logic circuit.

In the conventional
system, the output register is clocked in the nonshaded region and the minimum
clock period,

For Xilinx FPGAs, the
physical design editor referred to as FPGA editor may be used for measuring and
altering the delays. Using this feature, the implementation of wave-pipelined
circuits on Xilinx FPGAs is considered in [

To maximize the operating speed of
the wave-pipelined circuit, the equalization of the path delays is considered
first. This cannot be completely automated as the commercially available
synthesis tools do not support the specification of interconnect delays.
However, the difference in path delays can be minimized by specifying the
physical location of logic cells (slices) or logic elements used for the
implementation, through either the user constraints file (UCF) or the logic lock
feature supported by the FPGA CAD tools [

Programmable clock generator.

The actual clock period
depends on the interconnect delay. The select input of the multiplexer is
varied with either a processor or a finite state machine (FSM) to achieve
different clock frequencies. The wave-pipelined circuit using the programmable
clock and skew generator can be operated at a higher frequency than can be
achieved using the commercially available synthesis tools which use
Since off-chip
communication between the FPGA and a
processor is bound to be slower than on-chip communication, in order to
minimize the time required for adjustment of the parameters of the
wave-pipelined circuit (clock frequency and skew), the BIST approach using
design for testability [

The automation can be carried out by adding two blocks to the basic wave-pipelined circuit: a finite state machine (FSM) and a self-test circuit. The FSM systematically varies the clock skew and clock period till the wave-pipelined circuit operates satisfactorily. The self-test circuit is used for testing the correctness of the operation.

The block diagram of a wave-pipelined circuit with BIST is given in
Figure

Self tuned wave-pipelined circuit.

The FSM block generates the control signal that chooses between the normal mode and the self-test mode; this signal is applied to the select input of the multiplexer. In the self-test mode, the FSM systematically varies the clock skews and clock periods. For each clock frequency and skew, the self-test circuit generates the test inputs, applies them, generates the signature, compares it with the expected result, and finally generates a flag indicating whether they match. The FSM continues testing till it finds a frequency at which the DUT works for at least three skew values. The operating skew is chosen to be the middle value so that the DUT works reliably even if the delays change due to environmental conditions.
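The search performed by the FSM can be sketched in software. This is a hedged model, not the FSM itself: `passes(period, skew)` is a hypothetical stand-in for the self-test flag, the clock period settings are assumed to be tried in the order supplied (for example, fastest first), and the three working skews are assumed to be consecutive settings, as stated later for the soft-core processor scheme.

```python
def tune(period_settings, skew_settings, passes, min_run=3):
    """Return (period, skew) for the first period setting that has at least
    `min_run` consecutive passing skews, or None if no setting qualifies.

    The skew returned is the middle of the longest passing run.
    """
    for period in period_settings:
        run, best = [], []
        for skew in skew_settings:
            if passes(period, skew):
                run.append(skew)
                if len(run) > len(best):
                    best = list(run)
            else:
                run = []  # a failing skew breaks the consecutive run
        if len(best) >= min_run:
            return period, best[len(best) // 2]
    return None


# Hypothetical self-test behaviour: setting 0 is too fast and passes for one
# skew only; setting 1 passes for skews 1, 2 and 3.
def fake_self_test(period, skew):
    working = {0: {2}, 1: {1, 2, 3}}
    return skew in working.get(period, set())


print(tune(period_settings=[0, 1], skew_settings=range(5), passes=fake_self_test))
```

Choosing the middle of the passing run leaves a margin on both sides, which is the reason given above for tolerating delay changes.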

For testing the correctness of the
circuit, N test vectors may be fed one after another and the N outputs obtained
should be compared with the expected outputs. In order to minimize the number
of comparisons, a unique signature is generated out of the N outputs and it is
compared with the signature corresponding to the expected outputs. The
signature generator consists of a pseudorandom binary sequence (PRBS) generator
with multiple data input [

The circuit diagram of the
programmable clock generator is shown in
Figure

The operating frequency
of the wave-pipelined circuit is expected to lie between those of the nonpipelined
and pipelined circuits. Hence, the minimum and maximum frequencies of the
clock generator should correspond to the maximum operating frequencies of the
nonpipelined and pipelined circuits, respectively. The approximate
values of the clock periods of these circuits for the implementation of the
In principle, the number of test
vectors required for an M input combinational logic circuit is

The block
diagram of a wave-pipelined circuit which is tuned using the SOC approach is
shown in Figure

SOC approach for wave-pipelined circuit.

During normal operation, the block RAM contains the array of data to be processed. In the test mode, the block RAM contains the test data. During testing, the processor writes the test vectors into the block RAM, systematically applies the select inputs for the clock generator and clock skew blocks, and reads back the results stored in the output block RAM for each combination of select inputs. It then checks these results against the expected results.

The
programmable clock and clock skew generator may be implemented in the custom
block using circuit given in Figure

A

The implementation of one-level 2D
DWT for image block of size

Area and speed performance of one-level
forward 2D DWT for

Lifting scheme | Slices used | Speed (MHz) | Number of registers |
---|---|---|---|
Nonpipelined | 836 | 54.45 | 611 |
Pipelined | 1110 | 87.54 | 1670 |
Hybrid WP-P | 836 [188]^{*} | 75.75 | 611 [85]^{*} |

^{*} denotes additional overhead for testing WP circuits.

From Table

The BIST approach requires a number of
overheads such as FSM, signature generator and test vector RAM [

The wave-pipelined 2D DWT is implemented along with the Nios II soft-core processor and is added as a custom block to the Nios II system using SOPC Builder. The program executed by the Nios II is written in C/C++, and the custom block is invoked as a function in that program. The program reads from and writes to the block RAM in the custom block. When it runs, it systematically varies the “select” inputs for the clock and clock skew blocks and reads back the content of the output block RAM, which it compares with the expected results. The clock and skew are adjusted till a match occurs for at least three consecutive clock skews. The operating clock and clock skew of the wave-pipelined circuit are then fixed at the middle value, and from then on the custom block works without any intervention from the Nios II processor. The Nios II processor interacts with the custom block again only when retuning is required.

For the hybrid wave-pipelined
circuit, the number of logic elements, number of registers, maximum operating
frequency, and power dissipated are computed and the results are given in
Table

Area and speed performance of one-level
forward 2D DWT for

Lifting scheme | Logic elements used | Speed (MHz) | Number of registers | Power at normalized frequency (mW/Hz) |
---|---|---|---|---|
Nonpipelined | 703 | 117.83 | 375 | - |
Pipelined | 782 | 203.92 | 671 | 158.97 |
Hybrid WP-P | 703 [30]^{*} | 147.5 | 375 [8]^{*} | 179.58 |
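The comparison factors quoted in the abstract and conclusion can be recomputed from the two tables. The snippet below is only an arithmetic check: the numbers are copied from the tables, the bracketed test overheads are left out, and the variable and function names are hypothetical.

```python
# (area units, speed in MHz, number of registers), copied from the two tables.
first_table = {
    "nonpipelined": (836, 54.45, 611),
    "pipelined": (1110, 87.54, 1670),
    "hybrid": (836, 75.75, 611),
}
second_table = {
    "nonpipelined": (703, 117.83, 375),
    "pipelined": (782, 203.92, 671),
    "hybrid": (703, 147.5, 375),
}


def comparison_factors(table):
    """Return the speed, register and area ratios discussed in the text."""
    _, speed_n, _ = table["nonpipelined"]
    area_p, speed_p, regs_p = table["pipelined"]
    area_h, speed_h, regs_h = table["hybrid"]
    return {
        "hybrid_vs_nonpipelined_speed": speed_h / speed_n,
        "pipelined_vs_hybrid_speed": speed_p / speed_h,
        "pipelined_vs_hybrid_registers": regs_p / regs_h,
        "pipelined_vs_hybrid_area": area_p / area_h,
    }


for name, table in (("first table", first_table), ("second table", second_table)):
    for key, value in comparison_factors(table).items():
        print(f"{name}: {key} = {value:.3f}")

# Extra power of the hybrid circuit relative to the pipelined one (second table).
extra_power = (179.58 - 158.97) / 158.97
print(f"hybrid vs pipelined power: {extra_power:.1%} more")
```

The ratios fall within the quoted ranges (1.25–1.39, 1.15–1.39, 1.78–2.73, and 1.11–1.32) once the last digit is truncated or rounded, and the power difference is close to 13%.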

From Table

In
order to assess the superiority of hybrid wave-pipelining with regard to power
dissipation, both the hybrid wave-pipelined and pipelined circuits are operated at
the same frequency (corresponding to the maximum
operating frequency of the hybrid wave-pipelined circuit), and the power
dissipated by the two approaches is also given in
Table

To verify the correctness and
efficacy of the schemes proposed for the computation of 2D DWT, Lena image of
size

LL1 component compared with input image.

The one-level 2D DWT is
computed using the above scheme for all the 36 image blocks and merged
suitably. The LL1 component of the image is shown in Figure

In this paper, two automation schemes are proposed for the implementation of the 9/7 biorthogonal filters using a hybrid WP-P constant coefficient multiplier with the Baugh-Wooley multiplication algorithm. The 9/7 biorthogonal filters are implemented on both Xilinx and Altera devices with the following three multipliers: BW-PKCM, BW-KCM, and hybrid WP-P BW-KCM. From the implementation results, it is verified that the hybrid WP-P BW-KCM is faster than the nonpipelined BW-KCM by a factor of 1.25–1.39. The scheme with BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.15–1.39, and this is achieved at the cost of an increase in the number of registers by a factor of 1.78–2.73 and an increase in the number of LEs by a factor of 1.11–1.32. When both are operated at the same frequency, the hybrid wave-pipelined 2D DWT dissipates 12.9% more power than the pipelined 2D DWT. One of the challenges in the design of FPGA-based wave-pipelined circuits is the nonavailability of accurate models for the interconnects and for the temperature dependence of their delays. In the absence of these models, the wave-pipelined circuits can only be operated at moderate speeds. The implementation of the automation schemes on ASICs can lead to better performance, firstly because area is not wasted by unwanted registers and secondly because models for interconnects are available in the literature. Work on the computation of the two-level 2D DWT using Xilinx and Altera FPGAs and on the automation schemes for the ASIC implementation of the one-level 2D DWT is in progress.