This paper proposes a novel approach for implementing the 2D DWT using hybrid wave-pipelining (WP). A digital circuit may be operated at a higher frequency by using either pipelining or WP. Pipelining requires additional registers, which increases area, power dissipation, and clock routing complexity. Wave-pipelining has none of these disadvantages but requires a complex trial-and-error procedure for tuning the clock period and the clock skew between the input and output registers. A hybrid scheme is therefore proposed to obtain the benefits of both pipelining and WP. Two automation schemes are proposed for the implementation of the 2D DWT using hybrid WP on both Xilinx and Altera FPGAs. In the first scheme, a built-in self-test (BIST) approach is used to choose the clock skew and clock period for the I/O registers between the wave-pipelined blocks. In the second scheme, an on-chip soft-core processor is used to choose the clock skew and clock period. The results for the hybrid WP are compared with the nonpipelined and pipelined approaches. The implementation results show that the hybrid WP scheme requires the same area as the nonpipelined scheme but is faster by a factor of 1.25–1.39. The pipelined scheme is faster than the hybrid scheme by a factor of 1.15–1.39, at the cost of an increase in the number of registers by a factor of 1.78–2.73, an increase in the number of LEs by a factor of 1.11–1.32, and greater clock routing complexity.
1. Introduction
Field-programmable gate arrays (FPGAs) have grown enormously in complexity and can integrate all the major functional elements of a complete end product into a single chip [1]. An FPGA-based system on chip can contain one or more processors, memories, dedicated components for accelerating critical tasks, and interfaces to various peripherals. Development tools for FPGAs, such as the Altera system-on-programmable-chip (SOPC) builder, enable the integration of intellectual property (IP) cores for common DSP functions and user-designed custom blocks with soft-core processors such as the Nios II. The availability of on-chip dedicated multipliers, soft-core/hard-core processors, and IP cores makes FPGAs an ideal platform for implementing area- and speed-intensive image processing applications such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT) [2].
Joint Photographic Experts Group 2000 (JPEG2000) is a recently standardized image compression algorithm that provides significant enhancements over the existing JPEG standard. JPEG2000 differs from widely used compression standards in that it relies on the DWT and uses embedded bit-plane coding of the wavelet coefficients. The DWT has traditionally been implemented using convolution or FIR filter bank structures. These structures require both a large number of arithmetic computations and a large memory for storage, which are undesirable for high-speed/low-power image processing applications.
A new multiplier algorithm, denoted the Baugh-Wooley pipelined constant coefficient multiplier (BW-PKCM), is proposed in [3] and used there for the study and comparison of the distributed arithmetic algorithm (DAA) and lifting schemes on FPGAs. The computation of the 2D DWT requires 2's complement multiplications. In the literature, the BW method [4] has been studied with carry-save, carry-ripple, and serial-parallel algorithms. These schemes are inefficient in speed, area, or both when one of the operands is fixed. For an N-bit number, the conventional 2's complement multiplier (C2CM) requires ⌈(N−1)/4⌉ arrays of 4-input LUTs, whereas the sign extension and BW methods require ⌈N/4⌉ arrays of 4-input LUTs. The size of each array is equal to the number of product bits. For the C2CM, the 2's complement block and control logic increase the number of LUT arrays, the area, and the multiplication time. However, for the sign extension and BW methods, the number of LUT arrays may be the same as that required for the first scheme. The lifting scheme with BW-PKCM requires 4% less area but has the same speed compared with the distributed arithmetic algorithm using the sign extension scheme. The implementation details are available in [3].
In the 2D DWT, the filter coefficients are constant. Hence BW-PKCM, which combines the pipelined KCM with the Baugh-Wooley multiplication algorithm, is used in this paper.
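The Baugh-Wooley arrangement of partial products can be illustrated in software. The following is a minimal sketch, assuming the commonly used form in which the partial products involving exactly one sign bit are complemented and two correction constants are added; the function name and the use of equal operand widths are illustrative assumptions and do not reproduce the hardware structure of [3].

```python
def baugh_wooley_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit 2's complement integers using only non-negative
    partial products, in the Baugh-Wooley style (illustrative sketch)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n bits")
    # Python's >> on negative integers is arithmetic, so this yields
    # the 2's complement bit pattern of each operand.
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]

    acc = 0
    # Magnitude bits times magnitude bits: ordinary AND partial products.
    for i in range(n - 1):
        for j in range(n - 1):
            acc += (a_bits[i] & b_bits[j]) << (i + j)
    # Sign bit times sign bit.
    acc += (a_bits[n - 1] & b_bits[n - 1]) << (2 * n - 2)
    # Terms with exactly one sign bit are complemented instead of subtracted.
    for i in range(n - 1):
        acc += (1 ^ (a_bits[i] & b_bits[n - 1])) << (i + n - 1)
        acc += (1 ^ (a_bits[n - 1] & b_bits[i])) << (i + n - 1)
    # Correction constants that compensate for the complemented terms.
    acc += (1 << n) + (1 << (2 * n - 1))

    acc &= (1 << (2 * n)) - 1          # keep the 2n product bits
    if acc >= 1 << (2 * n - 1):        # reinterpret as a signed value
        acc -= 1 << (2 * n)
    return acc
```

For operands of different widths (for example an 8-bit coefficient and an 11-bit sample), the narrower operand can simply be passed unchanged with n set to the larger width, since its sign extension is implied by the range check.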
The operating frequency of the 2D DWT may be increased if it is implemented using either pipelining or WP. Pipelining results in the highest operating frequency but has a number of disadvantages such as increased area, power dissipation, and clock routing complexity. WP has been proposed as one of the techniques for overcoming these limitations. WP increases the speed and reduces the clock routing complexity. The proposed hybrid scheme aims to combine the advantages of both pipelining and wave-pipelining.
The organization of the rest of the paper is as follows: In Section
2, the previous work on 2D DWT and the design of wave-pipelined (WP) lifting
blocks on FPGAs are described. In Section 3, the previous work related to WP
and the challenges involved in the design of WP circuits are described. In Section
4, automation schemes for WP circuits are presented. In
Section 5, BIST approaches for the implementation
of WP circuits are discussed and the implementation results are presented. In
Section 6, SOC approaches for the implementation of WP circuits are discussed
and the implementation results are presented. Section 7 summarizes the
conclusions.
2. Review of Previous Work on 2D DWT
The one-level 2D DWT may be computed using filter banks as shown in Figure 1. The input samples x(n) are passed through two stages of analysis filters. They are first processed by the low-pass (h[n]) and high-pass (g[n]) horizontal filters and are subsampled by two. Subsequently, the outputs (L1, H1) are processed by low-pass and high-pass vertical filters. The lifting scheme uses a polyphase structure for the analysis filter [5]. Its main advantage is that it requires far fewer computations than the convolution-based DWT; the computational complexity can be reduced by 50%. As a result, the lifting-based implementation provides an efficient way to compute wavelet transforms [6].
Figure 1: Subband decomposition for one-level 2D DWT.
In the lifting scheme, the odd and even input samples are processed by five lifting blocks (α, β, γ, δ, ξ) in cascade, where ξ comprises the two scaling blocks ξ1 and ξ2.
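The cascade of lifting blocks can be sketched for one dimension as follows. This is a floating-point illustration only: the coefficient values are the widely published constants of the 9/7 lifting factorization (they are not listed in this paper), the scaling convention (low-pass multiplied by ZETA, high-pass divided by it) and the symmetric boundary handling are assumptions, and the hardware described here uses 8-bit constants rather than floating point.

```python
# Commonly published 9/7 lifting constants (assumed, not taken from this paper).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398  # scaling factor used by the two scaling blocks


def forward_lifting_97(x):
    """One-level 1D forward transform of an even-length sequence.
    Returns (low_pass, high_pass) coefficient lists."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("input length must be even and at least 2")
    s = [float(v) for v in x[0::2]]  # even samples
    d = [float(v) for v in x[1::2]]  # odd samples
    n = len(s)
    for predict, update in ((ALPHA, BETA), (GAMMA, DELTA)):
        for i in range(n):  # alpha / gamma block: odd += c * (even neighbours)
            right = s[i + 1] if i + 1 < n else s[n - 1]  # symmetric extension
            d[i] += predict * (s[i] + right)
        for i in range(n):  # beta / delta block: even += c * (odd neighbours)
            left = d[i - 1] if i > 0 else d[0]           # symmetric extension
            s[i] += update * (left + d[i])
    return [v * ZETA for v in s], [v / ZETA for v in d]


def inverse_lifting_97(low, high):
    """Undo forward_lifting_97 by running the blocks in reverse order."""
    s = [v / ZETA for v in low]
    d = [v * ZETA for v in high]
    n = len(s)
    for predict, update in ((GAMMA, DELTA), (ALPHA, BETA)):
        for i in range(n):
            left = d[i - 1] if i > 0 else d[0]
            s[i] -= update * (left + d[i])
        for i in range(n):
            right = s[i + 1] if i + 1 < n else s[n - 1]
            d[i] -= predict * (s[i] + right)
    x = [0.0] * (2 * n)
    x[0::2] = s
    x[1::2] = d
    return x
```

Because every block only adds a function of the other polyphase branch, each step can be undone exactly, which is why the inverse is obtained simply by reversing the order and the signs.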
Details of the α and β blocks are shown in Figures 2 and 3. The γ and δ blocks are obtained by replacing the constants α, β with γ, δ.
Figure 2: α block.
Figure 3: β block.
The following modifications are proposed in the lifting scheme [3].
(i) In Figure 2, since the output from one block is fed as input to the next block, the maximum rate at which the input can be fed to the system depends on the sum of the delays in all four stages. The speed is increased in [5] by introducing pipelining at the points indicated by dotted lines in Figure 2. In this case, the input rate is determined by the largest delay among the four blocks.
(ii) The delay in the individual stages is reduced further by using a constant coefficient multiplier (KCM).
A KCM uses a ROM for finding the product of a constant and a variable. The variable is fed as the address to the ROM, which contains the products corresponding to all possible values of the variable. When the ROM is implemented using 4-input look-up tables (LUTs), a number of stages of LUTs and adders are required to find the product. The speed of the KCM can be increased by introducing pipelining registers at the outputs of the ROMs and adders.
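The ROM-based multiplication described above can be sketched in software as follows. The sketch assumes the variable is split into 4-bit groups, one 16-entry table per group, with the most significant group interpreted as signed so that 2's complement inputs are handled; the function names are hypothetical and the pipelining registers are not modelled.

```python
def build_kcm_tables(constant: int, width: int):
    """Build one 16-entry product table per 4-bit group of a `width`-bit
    2's complement variable (hypothetical helper for illustration)."""
    groups = (width + 3) // 4
    tables = []
    for k in range(groups):
        is_top = k == groups - 1
        table = []
        for address in range(16):
            # The top group carries the sign: addresses 8..15 mean -8..-1.
            value = address - 16 if (is_top and address >= 8) else address
            table.append(constant * value)
        tables.append(table)
    return tables


def kcm_multiply(tables, x: int) -> int:
    """Look up each 4-bit group of x and add the shifted partial products."""
    total_bits = 4 * len(tables)
    if not -(1 << (total_bits - 1)) <= x < (1 << (total_bits - 1)):
        raise ValueError("x does not fit in the table width")
    pattern = x & ((1 << total_bits) - 1)  # sign-extended 2's complement bits
    product = 0
    for k, table in enumerate(tables):
        group = (pattern >> (4 * k)) & 0xF
        product += table[group] << (4 * k)  # adder stage after each ROM
    return product
```

For an 11-bit sample this uses three tables, so two adder stages follow the table look-ups; these are the points at which registers would be inserted in a pipelined KCM.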
The detailed diagram of the α block implemented using BW-PKCM is shown in Figure 4. The same scheme can be adopted for the β, γ, δ, ξ1, and ξ2 blocks. The dotted lines indicate points where registers may be inserted for pipelining. For wave-pipelining, all the stages are directly connected without registers; registers are used only at the inputs and outputs. In hybrid wave-pipelining, registers are used between adjacent lifting blocks, while the stages within each lifting block are connected without registers.
Figure 4: α block using BW-PKCM.
2.1. Overlapping Scheme for Block 2D DWT
In the overlapping scheme, the image blocks are formed such that the number of pixels overlapping between adjacent blocks along the vertical and horizontal directions is equal to the order of the filter. For example, for the 9/7 biorthogonal filter used for the 2D DWT, a block should overlap its horizontal neighbours by 4 pixels on the left and 4 pixels on the right. Similarly, it should overlap its vertical neighbours by 4 pixels on the top and 4 pixels on the bottom. For blocks on the image boundary, overlapping is needed only on the nonboundary edges.
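One way to form such overlapped blocks along a single dimension is sketched below. The tiling rule is an assumption, since the paper does not spell it out: the image is cut into non-overlapping core tiles and each tile is extended by the overlap on every nonboundary edge. With a core of 24 samples and an overlap of 4, no block exceeds 32 samples and a 128-sample dimension yields 6 blocks, which agrees with the 36 blocks reported for the 128×128 image in Section 6.2.

```python
def overlapped_blocks(length: int, core: int, overlap: int):
    """Return (start, stop) index pairs (stop exclusive) of overlapped
    blocks covering `length` samples (assumed tiling rule)."""
    blocks = []
    for core_start in range(0, length, core):
        core_stop = min(core_start + core, length)
        start = max(core_start - overlap, 0)    # no overlap at the left boundary
        stop = min(core_stop + overlap, length)  # no overlap at the right boundary
        blocks.append((start, stop))
    return blocks


# Example: one dimension of a 128x128 image, 9/7 filter (overlap of 4).
row_blocks = overlapped_blocks(128, core=24, overlap=4)
```

The same function applied to rows and columns gives the 2D block grid as the Cartesian product of the two lists.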
3. Background on Wave-Pipelining
The concept of wave-pipelining has been described in a number of previous works [4, 7, 8]. To illustrate the concept, a graphical representation [7] of the data flow through combinational logic is used. Figure 5 shows a wave-pipelined combinational logic circuit surrounded by edge-triggered input and output registers [7]. Figure 6 gives the associated timing diagram [7]. In Figure 6, the shaded regions bounded by the maximum and minimum delays through the logic (Dmax and Dmin) depict the flow of data through the combinational logic, during which the logic values are changing. The nonshaded areas depict the durations in which the logic is stable.
Figure 5: A combinational logic circuit with input and output registers.
Figure 6: Temporal/spatial diagram of data flow through the combinational logic circuit.
In the conventional system, the output register is clocked in the nonshaded region and the minimum clock period, Tclk, is chosen to be greater than Dmax. In the wave-pipelined system, the clock period is chosen to be (Dmax − Dmin) plus clocking overheads such as set-up time and hold time. In Figure 5, δ denotes the clock skew between the input and output registers. To ensure correct operation, the skew should be adjusted such that the active clock edge occurs in the stable period. To maximize the frequency of operation of the wave-pipelined system, the difference (Dmax − Dmin) is minimized by equalizing the path delays. However, the stable period decreases as the logic depth increases. The wave-pipelined circuit can be made to work properly by adjusting the latching instant at the output register to lie in the stable period. For large logic depths, however, there may not be any stable period, so adjusting the latching instant by itself may not be adequate for storing the correct result at the output register. In such cases, the clock period has to be increased to lengthen the stable period. Equalization of path delays, adjustment of the clock period, and adjustment of the clock skew are the three tasks carried out for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and altered if required. These tasks are carried out manually in [9, 10].
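The clock period bounds above can be checked numerically with a short sketch. The delay values used in the example are the Dmax and Dmin figures quoted for the α block in Section 4.1.3; the 1.0 ns clocking overhead is a purely hypothetical placeholder, and the stable-window expression Tclk − (Dmax − Dmin) is the usual reading of the timing diagram rather than a formula stated in this paper.

```python
def conventional_min_period(d_max_ns: float, overhead_ns: float) -> float:
    """Conventional clocking: the period must cover the longest path."""
    return d_max_ns + overhead_ns


def wave_pipelined_min_period(d_max_ns: float, d_min_ns: float,
                              overhead_ns: float) -> float:
    """Wave-pipelining: the period only has to cover the delay spread."""
    if d_min_ns > d_max_ns:
        raise ValueError("d_min must not exceed d_max")
    return (d_max_ns - d_min_ns) + overhead_ns


def stable_window(t_clk_ns: float, d_max_ns: float, d_min_ns: float) -> float:
    """Length of the stable (nonshaded) interval in each clock period.
    A value <= 0 means there is no stable period at this clock rate."""
    return t_clk_ns - (d_max_ns - d_min_ns)


D_MAX, D_MIN = 15.302, 7.34   # alpha block delays quoted in Section 4.1.3
OVERHEAD = 1.0                # hypothetical set-up/hold allowance in ns

t_conv = conventional_min_period(D_MAX, OVERHEAD)
t_wave = wave_pipelined_min_period(D_MAX, D_MIN, OVERHEAD)
```

A nonpositive stable window corresponds to the case described above in which no skew setting can work and the clock period has to be increased.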
For Xilinx FPGAs, the physical design editor, referred to as the FPGA editor, may be used for measuring and altering the delays. Using this feature, the implementation of wave-pipelined circuits on Xilinx FPGAs is considered in [10]. A wave-pipelined circuit designed using the FPGA editor may be tested using simulation. However, simulation is inadequate for testing because the actual delays differ from the delays calculated by the FPGA editor: the editor considers only worst-case delays, and the actual delays may be significantly different due to fabrication variations. This difference becomes important as the logic depth of the circuit increases. Hence, the design has to be downloaded to the actual FPGA and its operation checked by feeding in test data [10]. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading, and testing in the actual device may be required until correct results are obtained. Designing a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in the next section.
4. Automation Schemes for Wave-Pipelined Circuits
To maximize the operating speed of the wave-pipelined circuit, the equalization of the path delays is considered first. This cannot be completely automated, as the commercially available synthesis tools do not support the specification of interconnect delays. However, the difference in path delays can be minimized by specifying the physical location of the logic cells (slices) or logic elements used for the implementation, through either the user constraints file (UCF) or the LogicLock feature supported by the FPGA CAD tools [11, 12]. The UCF approach is proposed for Xilinx FPGAs in [10]. The LogicLock feature is adopted for the Altera FPGAs in this paper. The adjustment of the clock skew and clock period can be automated by making them programmable: a programmable clock and clock skew generator may be implemented in the FPGA. Figure 7 gives the circuit diagram of a clock generation scheme, which consists of a delay block and an inverter.
Figure 7: Programmable clock generator.
The actual clock period depends on the interconnect delay. The select input of the multiplexer is varied by either a processor or a finite state machine (FSM) to achieve different clock frequencies. A wave-pipelined circuit using the programmable clock and skew generator can be operated at a higher frequency than can be achieved using the commercially available synthesis tools, which use Dmax for fixing the operating frequency. The automation may be carried out using either an off-chip processor or an on-chip processor. The off-chip processor is used when the FPGA serves as a coprocessor or hardware accelerator for a main processor or microcontroller. Since off-chip communication between the FPGA and a processor is bound to be slower than on-chip communication, the BIST approach, based on design-for-testability techniques [13, 14], is proposed for this case in order to minimize the time required for adjusting the parameters of the wave-pipelined circuit (clock frequency and skew). In the SOC approach, a processor is assumed to be available on-chip and is used for adjusting the parameters of the wave-pipelined circuit.
4.1. BIST Approach for Wave-Pipelined Circuit
The automation can be carried out by adding two blocks to the basic wave-pipelined circuit: a finite state machine (FSM) and a self-test circuit. The FSM systematically varies the clock skew and clock period until the wave-pipelined circuit operates satisfactorily. The self-test circuit is used for testing the correctness of the operation. The block diagram of a wave-pipelined circuit with BIST is given in Figure 8. It is obtained by adding the test vector RAM, a signature analyzer, programmable clock and clock skew generators, and the FSM block to the circuit of Figure 5. The signature analyzer consists of a pseudorandom binary sequence (PRBS) generator-based signature generator and a comparator.
Figure 8: Self-tuned wave-pipelined circuit.
4.1.1. FSM Block
The FSM block generates the control signal that chooses between the normal mode and the self-test mode; this signal is applied to the select input of the multiplexer. In the self-test mode, the FSM systematically varies the clock skews and clock periods. For each clock frequency and skew, the self-test circuit generates the test inputs, applies them, generates the signature, compares it with the expected result, and finally generates a flag indicating whether they match. The FSM continues testing until it finds a frequency at which the DUT works for at least three skew values. The operating skew is chosen to be the middle value so that the DUT works reliably even if the delays change due to environmental conditions.
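The search carried out by the FSM can be sketched in software as follows. This is a behavioural illustration only: the callback `passes` stands in for the self-test circuit, the clock periods are assumed to be tried from shortest to longest, and the passing skews are assumed to be consecutive settings, as in the SOC procedure of Section 6. All names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def tune_clock_and_skew(
    periods: Sequence[float],
    skews: Sequence[float],
    passes: Callable[[float, float], bool],
    min_run: int = 3,
) -> Optional[Tuple[float, float]]:
    """Return (period, skew) for the first period at which the circuit
    passes for at least `min_run` consecutive skews, or None."""
    for period in periods:            # assumed ordered fastest first
        run = []                      # current streak of passing skews
        for skew in skews:
            if passes(period, skew):
                run.append(skew)
            elif len(run) >= min_run:
                break                 # a long enough streak just ended
            else:
                run = []              # streak too short, start again
        if len(run) >= min_run:
            return period, run[len(run) // 2]  # middle of the passing window
    return None
```

Choosing the middle of the passing window leaves a margin on both sides, which is the stated reason for requiring at least three working skew values.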
4.1.2. Signature Generator
To test the correctness of the circuit, N test vectors may be fed one after another and the N outputs obtained compared with the expected outputs. To minimize the number of comparisons, a single signature is generated from the N outputs and compared with the signature corresponding to the expected outputs. The signature generator consists of a pseudorandom binary sequence (PRBS) generator with multiple data inputs [13]. Each successive value of the output register is XORed with the state of the PRBS generator to produce the next state.
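A signature generator of this kind behaves like a multiple-input signature register. The sketch below is a software model under stated assumptions: the register width of 16 bits, the feedback taps (0x1021) and the zero seed are illustrative choices, not the values used in the implementation described here.

```python
def misr_signature(outputs, width: int = 16, taps: int = 0x1021,
                   seed: int = 0) -> int:
    """Compress a sequence of output words into one signature.
    Each step advances an LFSR and XORs in the next output word."""
    mask = (1 << width) - 1
    state = seed & mask
    for word in outputs:
        msb = (state >> (width - 1)) & 1
        state = (state << 1) & mask   # shift the PRBS register
        if msb:
            state ^= taps             # apply the feedback polynomial
        state ^= word & mask          # fold in the circuit output
    return state


def signatures_match(outputs, expected_signature: int) -> bool:
    """Comparator: a single comparison replaces N output comparisons."""
    return misr_signature(outputs) == expected_signature
```

The expected signature would be computed once from known-good outputs; during tuning, only the final register value needs to be compared against it.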
4.1.3. Programmable Clock Generator
The circuit diagram of the programmable clock generator is shown in Figure 7. The programmable clock skew generator may be implemented using only delay blocks. The clock generator is implemented using only LUTs and interconnects (nets), an approach first proposed in [10], where the interconnects are chosen manually using the FPGA layout editor. The programmable feature is proposed for the first time in this paper: the interconnect delays are selected using the multiplexer. The number of possible interconnect delays is restricted to minimize the overhead of the additional LUTs required for introducing the delays and of the multiplexers. Hence, only a coarse variation in the delay values can be achieved. In Figure 7, inputs C0–C3 are the programmable select inputs, which determine the actual clock frequency.
The operating frequency of the wave-pipelined circuit is expected to lie between those of the nonpipelined and pipelined circuits. Hence, the minimum and maximum frequencies of the clock generator should correspond to the maximum operating frequencies of the nonpipelined and pipelined circuits, respectively. The approximate values of the clock periods of these circuits for the implementation of the β block on FPGA are 5.6 nanoseconds and 7.4 nanoseconds, respectively. The values of Dmax and Dmin for the α block are 15.302 nanoseconds and 7.34 nanoseconds, respectively. The programmable clock and skew generators are designed such that the clock period can be varied from 8.4 nanoseconds to 20.6 nanoseconds in steps of approximately 0.8 nanoseconds and the skew can be varied from 12.3 nanoseconds to 26.2 nanoseconds in steps of approximately 0.9 nanoseconds. The same exercise is carried out for the β, γ, and δ blocks using the synthesis report. A single clock generator is used for all four blocks, and a separate skew generator is used for each block. By varying the select lines through the FSM or processor, different clock periods and skew values are achieved.
4.1.4. Test Vector Generation
In principle, the number of test vectors required for an M-input combinational logic circuit is 2^M. If the value of M is small, exhaustive testing of the circuit may be carried out by generating the test inputs through an M-bit counter and checking the signature after the counter completes one full cycle. However, some of the inputs may contribute more to Dmax than others. For higher-order circuits, exhaustive testing would require a long testing time. In this case, a set of random vectors may be used for testing the wave-pipelined circuits.
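The choice between exhaustive and random test vectors can be sketched as follows. The threshold of 12 input bits, the sample count of 1024 and the fixed seed are arbitrary assumptions made for illustration; the paper does not state the values it uses.

```python
import random


def make_test_vectors(m: int, max_exhaustive_bits: int = 12,
                      sample_count: int = 1024, seed: int = 1):
    """Return test inputs for an m-input combinational circuit.
    Small circuits get all 2**m patterns in counter order; larger
    ones get a reproducible set of random m-bit patterns."""
    if m <= 0:
        raise ValueError("m must be positive")
    if m <= max_exhaustive_bits:
        return list(range(1 << m))        # M-bit counter, one full cycle
    rng = random.Random(seed)             # fixed seed: same vectors every run
    return [rng.getrandbits(m) for _ in range(sample_count)]
```

A fixed seed matters here because the expected signature is only valid for one particular vector sequence.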
4.2. SOC Approach for Wave-Pipelined Circuits
The block diagram of a wave-pipelined circuit tuned using the SOC approach is shown in Figure 9. It consists of a programmable clock generator, a clock skew generator, and block RAMs for storing the input and output vectors of the wave-pipelined circuit.
Figure 9: SOC approach for wave-pipelined circuit.
During normal operation, the block RAM contains the array of data to be processed. In the test mode, the block RAM contains the test data: the processor writes the test vectors into the block RAM, systematically applies the select inputs for the clock generator and clock skew blocks, and reads back the results stored in the output block RAM for each combination of select inputs. It then checks the results against the expected results.
4.2.1. Implementation of Programmable Clock and Clock Skew Generator on SOC
The programmable clock and clock skew generator may be implemented in the custom block using the circuit given in Figure 7. In this case, the LUTs are replaced by logic elements (LEs). It may be noted that an external clock may be multiplied by an arbitrary number using the Altera megafunction altclklock. However, the multiplication factor has to be specified at synthesis time, and hence the clock frequency cannot be dynamically altered as in the scheme given in Figure 7. The select inputs for the clock and skew blocks and the data inputs to the wave-pipelined circuit may be applied and varied through the on-chip processor.
5. Implementation Results of 2D DWT Using BIST Approach
A 128×128 image with 8 bits per pixel is used for testing the three schemes. The 2D DWT scheme is implemented using the lifting blocks with 9/7 biorthogonal filters and BW-KCM multipliers. The lifting multiplier constants (α, β, γ, δ, ξ) are assumed to be of 8 bits each and the input samples are assumed to be of 11 bits. For the 2D DWT, an image block of size 32×32 is assumed. The one-level 2D DWT is implemented on a Xilinx Spartan-II FPGA using the BIST approach. A personal computer (PC) is used for the realization of the FSM. The interface between the PC and the FPGA is the same as that described in [10].
5.1. Implementation Results on 2D DWT Using Spartan-II XC2S100PQ208-5
The implementation of the one-level 2D DWT for an image block of size 32×32 is carried out for the lifting scheme using the Spartan-II XC2S100PQ208-5. For the hybrid wave-pipelined circuit, the number of slices, the number of registers, and the maximum operating frequency are computed and the results are given in Table 1. The overheads required for the wave-pipelined circuits are also shown in Table 1.
Table 1: Area and speed performance of one-level forward 2D DWT for 32×32 subimages.
| Lifting scheme | Slices used | Speed (MHz) | Number of registers |
|---|---|---|---|
| Nonpipelined | 836 | 54.45 | 611 |
| Pipelined | 1110 | 87.54 | 1670 |
| Hybrid WP-P | 836 [188]* | 75.75 | 611 [85]* |
*Denotes additional overhead for testing WP circuits.
From Table 1, it may be concluded that for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than nonpipelined BW-KCM by a factor of 1.39 and requires the same area. The pipelined BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.15, and this is achieved with an increase in the number of registers by a factor of 2.73 and an increase in the number of slices by a factor of 1.32.
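The factors quoted from Table 1 are simple ratios of the table entries and can be recomputed with the short sketch below. The dictionary layout and helper name are illustrative; the numbers are those of Table 1, excluding the bracketed test overheads. The recomputed values may differ from the quoted ones in the last digit, depending on whether they are rounded or truncated.

```python
# Entries of Table 1 (Spartan-II), without the bracketed test overheads.
TABLE1 = {
    "nonpipelined": {"slices": 836, "speed_mhz": 54.45, "registers": 611},
    "pipelined": {"slices": 1110, "speed_mhz": 87.54, "registers": 1670},
    "hybrid": {"slices": 836, "speed_mhz": 75.75, "registers": 611},
}


def ratio(scheme_a: str, scheme_b: str, metric: str) -> float:
    """Ratio of one metric between two schemes of Table 1."""
    return TABLE1[scheme_a][metric] / TABLE1[scheme_b][metric]


hybrid_vs_nonpipelined_speed = ratio("hybrid", "nonpipelined", "speed_mhz")
pipelined_vs_hybrid_speed = ratio("pipelined", "hybrid", "speed_mhz")
pipelined_vs_hybrid_registers = ratio("pipelined", "hybrid", "registers")
pipelined_vs_hybrid_slices = ratio("pipelined", "hybrid", "slices")
```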
6. Implementation Results of 2D DWT Using SOC Approach
The BIST approach requires a number of overheads such as the FSM, signature generator, and test vector RAM [14]. Instead of using a dedicated circuit such as BIST, a processor may be used to carry out the tuning and retuning tasks [15]. The tasks performed in software use the on-chip processor. The hardware block may use wave-pipelining, and it may be retuned by the on-chip processor periodically. Hence, the retuning task may be time-shared with the other tasks performed by the processor.
The wave-pipelined 2D DWT is implemented along with the Nios II soft-core processor and is added as a custom block to the Nios II using the SOPC builder. The program to be executed by the Nios II is written in C/C++, and the custom block is invoked as a function in this program. The program reads from and writes to the block RAM in the custom block. When it is run, it systematically varies the select inputs for the clock and clock skew blocks and reads back the content of the output block RAM, which it compares with the expected results. The clock and skew are adjusted until a match occurs for at least three consecutive clock skews. The operating clock and clock skew of the wave-pipelined circuit are then fixed at the middle value, and from then on the custom block works without any intervention from the Nios II processor. The Nios II processor interacts with the custom block again only when retuning is required.
6.1. Implementation Results on 2D DWT Using Cyclone-II EP2C35F672C6
For the hybrid wave-pipelined circuit, the number of logic elements, the number of registers, the maximum operating frequency, and the power dissipated are computed and the results are given in Table 2. The overheads required for the wave-pipelined circuits are also shown in Table 2. To reduce the hardware complexity, the same horizontal filter is used instead of a separate vertical filter for computing LL1.
Table 2: Area and speed performance of one-level forward 2D DWT for 32×32 subimages.
| Lifting scheme | LEs used | Speed (MHz) | Number of registers | Power at normalized frequency (mW/Hz) |
|---|---|---|---|---|
| Nonpipelined | 703 | 117.83 | 375 | — |
| Pipelined | 782 | 203.92 | 671 | 158.97 |
| Hybrid WP-P | 703 [30]* | 147.5 | 375 [8]* | 179.58 |
*Denotes additional overhead for testing WP circuits.
From Table 2, it may be concluded that for the lifting scheme, the method using the hybrid WP-P BW-KCM is faster than nonpipelined BW-KCM by a factor of 1.25. The scheme with the Baugh-Wooley pipelined constant coefficient multiplier (BW-PKCM) is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.38, and this is achieved with an increase in the number of registers by a factor of 1.78 and an increase in the number of LEs by a factor of 1.11.
To assess the merit of hybrid wave-pipelining with regard to power dissipation, both the hybrid wave-pipelined and pipelined circuits are operated at the same frequency (the maximum operating frequency of the hybrid wave-pipelined circuit), and the power dissipated by the two approaches is also given in Table 2. Excluding the overheads, the pipelined circuit dissipates 1.5% more power than the hybrid wave-pipelined 2D DWT. If the overheads required for the hybrid wave-pipelined 2D DWT are also considered, the hybrid wave-pipelined 2D DWT dissipates 12.9% more power than the pipelined 2D DWT.
6.2. Validation of the Scheme for 2D DWT
To verify the correctness and efficacy of the schemes proposed for the computation of the 2D DWT, the Lena image of size 128×128, divided into blocks (subimages) of 32×32 pixels, is used. The Lena image shown in Figure 10 is obtained by reducing the image size by a factor of 4 along both dimensions. An overlap of 4 pixels is used between adjacent blocks. In total, 36 image blocks are used for the 128×128 image. The computation is carried out in hardware on the FPGA. The block RAMs are configured suitably for storing the input image, the outputs of the horizontal filter, and the outputs of the vertical filters.
Figure 10: LL1 component compared with input image.
The one-level 2D DWT is computed using the above scheme for all 36 image blocks and the results are merged suitably. The LL1 component of the image is shown in Figure 10. From this figure, it may be concluded that the LL1 component obtained through the FPGA implementation matches the original image well. The PSNR value computed for the image obtained using the hybrid WP-P BW-KCM is 28.22.
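For reference, a PSNR figure of this kind is conventionally computed from the mean squared error between two images. The sketch below uses the standard definition with a peak value of 255 for 8-bit pixels; the paper does not state which reference image or peak value was used, so both are assumptions here, and the function name is illustrative.

```python
import math


def psnr(reference, test, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two images given as
    equally sized lists of rows of pixel values."""
    if len(reference) != len(test) or any(
        len(r) != len(t) for r, t in zip(reference, test)
    ):
        raise ValueError("images must have the same dimensions")
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            squared_error += (r - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    mse = squared_error / count
    if mse == 0:
        return math.inf               # identical images
    return 10 * math.log10(peak * peak / mse)
```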
7. Conclusion
In this paper, two automation schemes are proposed for the implementation of the 9/7 biorthogonal filters using the hybrid WP-P constant coefficient multiplier with the Baugh-Wooley multiplication algorithm. The 9/7 biorthogonal filters are implemented on both Xilinx and Altera devices with the following three multipliers: BW-PKCM, BW-KCM, and hybrid WP-P BW-KCM. The implementation results verify that hybrid WP-P BW-KCM is faster than nonpipelined BW-KCM by a factor of 1.25–1.39. The scheme with BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.15–1.39, and this is achieved at the cost of an increase in the number of registers by a factor of 1.78–2.73 and an increase in the number of LEs by a factor of 1.11–1.32. When the tuning overheads are included, the hybrid wave-pipelined 2D DWT dissipates 12.9% more power than the pipelined 2D DWT. One of the challenges in the design of FPGA-based wave-pipelined circuits is the nonavailability of accurate models for the interconnects and for the temperature dependence of their delays. In the absence of these models, the wave-pipelined circuits can only be operated at moderate speeds. Implementing the automation schemes on ASICs can lead to better performance, firstly because area is not wasted on unwanted registers and secondly because models for the interconnects are available in the literature. Work on the computation of the two-level 2D DWT using Xilinx and Altera FPGAs and on the automation schemes for the ASIC implementation of the one-level 2D DWT is in progress.
References
[1] G. Martin and H. Chang, "System-on-chip design," in Proceedings of the 4th International Conference on ASIC (ASICON '01), pp. 12–17, Shanghai, China, October 2001, doi:10.1109/ICASIC.2001.982487.
[2] B. A. Draper, J. R. Beveridge, A. P. W. Bohm, C. Ross, and M. Chawathe, "Accelerated image processing on FPGAs," vol. 12, no. 12, pp. 1543–1551, 2003, doi:10.1109/TIP.2003.819226.
[3] G. Lakshminarayanan, B. Venkataramani, J. S. Kumar, A. K. Yousuf, and G. Sriram, "Design and FPGA implementation of image block encoders with 2D-DWT," in Proceedings of IEEE Conference on Convergent Technologies for Asia-Pacific Region (TENCON '03), vol. 3, pp. 1015–1019, Bangalore, India, October 2003, doi:10.1109/TENCON.2003.1273400.
[4] K. K. Parhi, John Wiley & Sons, New York, NY, USA, 1999.
[5] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps," vol. 4, no. 3, pp. 247–269, 1998, doi:10.1007/BF02476026.
[6] T. Acharya and P.-S. Tsai, John Wiley & Sons, New York, NY, USA, 2005.
[7] C. T. Gray, W. Liu, and R. K. Cavin III, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1994.
[8] W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, "Wave-pipelining: a tutorial and research survey," vol. 6, no. 3, pp. 464–474, 1998, doi:10.1109/92.711317.
[9] E. I. Boemo, S. Lopez-Buedo, and J. M. Meneses, "Wave pipelines via look-up tables," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '96), vol. 4, pp. 185–188, Atlanta, Ga, USA, May 1996, doi:10.1109/ISCAS.1996.541931.
[10] G. Lakshminarayanan and B. Venkataramani, "Optimization techniques for FPGA-based wave-pipelined DSP blocks," vol. 13, no. 7, pp. 783–793, 2005, doi:10.1109/TVLSI.2005.850086.
[11] Altera documentation library, Altera Corporation, San Jose, Calif, USA, 2003.
[12] Xilinx documentation library, Xilinx Corporation, San Jose, Calif, USA.
[13] M. J. S. Smith, Pearson Education Asia, Singapore, 2003.
[14] G. Seetharaman, B. Venkataramani, and G. Lakshminarayanan, "Design and FPGA implementation of self tuned wave-pipelined filters," vol. 52, no. 4, pp. 281–286, 2006.
[15] G. Seetharaman and B. Venkataramani, "SOC implementation of wave-pipelined circuits," in Proceedings of IEEE International Conference on Field-Programmable Technology (ICFPT '07), pp. 9–16, Kitakyushu, Japan, December 2007, doi:10.1109/FPT.2007.4439226.