This work presents a flexible VLSI architecture to compute the
As technology evolves, hardware keeps shrinking while storage capacity grows. High-end video applications have become part of daily life: watching movies, video conferencing, recording and storing video with high-definition cameras, and so forth. A single device, such as a high-end mobile phone or smartphone, can now support all of these multimedia applications, which once seemed out of reach. As a consequence, highly efficient video coders are of paramount importance. However, high efficiency comes at the expense of computational complexity. As pointed out in [
HEVC, the new and yet-to-be-released video coding standard, targets highly efficient video coding. One of the tools employed to improve coding efficiency is the DCT with different transform sizes. As an example, the 16-point DCT of HEVC is shown in [
The rest of this paper is organized as follows. Section
According to [
In [ bit reverse the order of the rows of gray coding to the row indices.
Therefore, (
The Givens rotation matrices can be described as follows:
Matrices for the different DCT sizes are derived using the factorization presented in Section
Equation (
Equation (
Equation (
Equation (
The complete hardware architecture of the DCT is shown in Figure
Top level hardware architecture of DCT.
When the last row is processed by the DCT, the result is written to the transpose memory. At the same time, the first column is read from the transpose memory to be processed by the DCT block. Because the last row has not yet been written, the last element of the first column is not valid, so the “Data0” multiplexer is used for forwarding. In this way, the first output of the last transformed row of a
The DCT block is the main block of the complete architecture. It takes the input data, the corresponding control signals, and the corresponding addresses. The internal architecture of the DCT block is shown in Figure
DCT block internal structure.
The DCT block has 4 pipeline stages. The data first passes through the Hadamard block, which is designed with a fully parallel architecture. The Hadamard block takes 32 data values at its inputs and passes them to butterfly_32; the first 16 are also input to butterfly_16, the first 8 to butterfly_8, and the first 4 to butterfly_4. Multiplexers are placed at the inputs of each butterfly size so that the Hadamard block produces the correct result. The select signals of the multiplexers are driven by the control unit. The Hadamard block has 32 outputs. To obtain a Walsh transform from a Hadamard one, the bit_reversal and gray_code blocks are placed after the Hadamard block.
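The reordering from Hadamard to Walsh can be illustrated in software. The following is a minimal sketch, not the hardware implementation: it builds a natural-ordered Hadamard matrix and reorders its rows with a Gray code followed by a bit reversal of the row index, which is the standard way to obtain the sequency-ordered Walsh matrix. The function names are hypothetical, and the order in which the hardware chains its bit_reversal and gray_code blocks may differ from this sketch.

```python
def hadamard(n):
    """Natural-ordered (Sylvester) Hadamard matrix; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def gray(i):
    """Binary-reflected Gray code of the index i."""
    return i ^ (i >> 1)


def walsh(n):
    """Sequency-ordered Walsh matrix obtained by permuting Hadamard rows."""
    bits = n.bit_length() - 1
    h = hadamard(n)
    return [h[bit_reverse(gray(k), bits)] for k in range(n)]


def sign_changes(row):
    """Number of sign changes along a row (its sequency)."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)
```

In the reordered matrix, row k has exactly k sign changes, which is the defining property of the sequency ordering and a convenient check for sizes 4, 8, 16, and 32.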
In the bit_reversal block, the data at input port number
The gray code block works on the same principle as the bit reversal block, but follows the Gray code rule: the output port is determined by applying the Gray code to the input address. For example, for DCT_32 if
The architecture of mem.block is shown in Figure
Memory block and permutation block.
The permutation block is implemented using (
The lifting scheme is implemented using (
Folded lifting scheme.
The lifting scheme is designed for 15 Givens rotations. The basic lifting structure, shown in Figure
Basic lifting structure.
The lifting structure takes two values at its inputs. For each Givens rotation, the
In the case of DCT_32, the Hadamard block produces 32 results in parallel. Its outputs are fed into the bit reversal block, the gray code block, and then the following bit reversal block, whose output is fed into the memory block. The memory block passes the upper 16 inputs directly to the permutation block, while the lower 16 inputs are stored in registers. The permutation block forwards the upper 16 inputs to the lifting scheme, whose multiplexer select lines are set to “0”; the lifting scheme outputs its results, which are stored in the following memory block. In the next clock cycle, the lower 16 values stored in the first memory block are passed to the lifting scheme through the permutation block, with the multiplexer select lines set to “1”. The 16 results are calculated and passed to the memory block, which at the same time forwards the previously stored values along with the newly arrived ones.
In the case of DCT_4, DCT_8, and DCT_16, the lower 16 values from the first memory block are invalid and never used, so only the valid upper 16 inputs are fed into the lifting scheme via the permutation network. The select line of the lifting scheme multiplexers is always set to “0” in case of
The hardware architecture of the one square root block is shown in Figure
The transpose buffer is built from registers. It is designed to support the maximum DCT size, that is, DCT_32, so the buffer is of size
The DCT transforms the rows of the block and the results are stored in the buffer. So the select signal for the input MUXes block is set to “1”. After
During the direct cycles, the rows from the input memory are transformed and the results are written to the transpose buffer. When the last row from the input memory is transformed, the first column is read from the transpose memory in the next clock cycle. At this point, the last element of the first column is not yet valid, because the last transformed row has not been written to the memory. The data0 multiplexer is therefore used for forwarding: the first of the 32 transformed data values is selected from the data0 memory. The select line of the data0 multiplexer is set to “1” for just one clock cycle during transformation of
The control unit governs the activity of all blocks in each clock cycle and is responsible for the correct sequence of operations. It is built from 4 memories, each holding the control signals for one DCT size, and 4 counters, each generating the addresses for its corresponding memory. In response to the addresses, the memories output the control signals. The memory outputs are multiplexed, and the select line of the multiplexer decides which one is driven out. The hardware architecture of the control unit is shown in Figure
Control unit.
The computation of
The WHT is implemented with a fully parallel architecture using the maximum number of resources, while the lifting scheme is implemented with a folded architecture. Hence, 80 2-input/2-output butterflies are used to implement the WHT for
The factorization of the matrices is applied to the H.265 DCT. The lifting coefficients are approximated with the following condition:
Number of adders for lifting coefficients.
Coefficient | Adders | Coefficient | Adders |
---|---|---|---|
51 | 2 | −98 | 2 |
101 | 3 | −569 | 2 |
311 | 3 | −200 | 3 |
25 | 2 | −50 | 2 |
152 | 2 | −297 | 3 |
64 | 0 | −121 | 2 |
183 | 3 | −325 | 3 |
25 | 2 | −50 | 2 |
19 | 2 | −38 | 2 |
63 | 1 | −124 | 1 |
178 | 3 | −345 | 4 |
115 | 3 | −219 | 3 |
71 | 2 | −132 | 1 |
169 | 3 | −305 | 3 |
99 | 3 | −172 | 3 |
According to [
The DCT block contains
Approximated values of the lifting structure coefficients and the number of bits required for quantization of the results.
| Givens rotations | Bits | Coefficient | Coefficient |
|---|---|---|---|
| | 8 | 51 | −98 |
| | 10 | 101 | −569 |
| | 10 | 311 | −200 |
| | 9 | 25 | −50 |
| | 10 | 152 | −297 |
| | 8 | 64 | −121 |
| | 9 | 183 | −325 |
| | 10 | 25 | −50 |
| | 8 | 19 | −38 |
| | 9 | 63 | −124 |
| | 10 | 178 | −345 |
| | 9 | 115 | −219 |
| | 8 | 71 | −132 |
| | 9 | 169 | −305 |
| | 8 | 99 | −172 |
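To make the role of the quantized coefficients concrete, the following minimal sketch applies three integer lifting steps to a pair of values. It rests on assumptions that are not stated in the surviving text: the first numeric column of the table above is read as the number of fractional bits, and the two remaining columns as the multipliers of the outer and inner lifting steps. Under that reading, the first row (8, 51, −98) matches a rotation by π/8, since 51/256 ≈ tan(π/16) and 98/256 ≈ sin(π/8). The function names and the rounding rule are hypothetical.

```python
import math


def lift_rotate(x, y, p, u, bits):
    """Approximate a Givens rotation with three integer lifting steps.

    p and u are integer multipliers scaled by 2**bits (assumed reading of
    the table). Each step adds a rounded, scaled copy of one value to the other.
    """
    half = 1 << (bits - 1)           # rounding offset
    x += (p * y + half) >> bits      # first lifting step
    y += (u * x + half) >> bits      # second lifting step
    x += (p * y + half) >> bits      # third lifting step
    return x, y


def lift_rotate_inverse(x, y, p, u, bits):
    """Undo lift_rotate exactly by subtracting the same terms in reverse order."""
    half = 1 << (bits - 1)
    x -= (p * y + half) >> bits
    y -= (u * x + half) >> bits
    x -= (p * y + half) >> bits
    return x, y


# Compare against the floating-point rotation [[c, s], [-s, c]] for theta = pi/8.
theta = math.pi / 8
c, s = math.cos(theta), math.sin(theta)
xr, yr = lift_rotate(1000, 500, 51, -98, 8)
print(xr, yr, c * 1000 + s * 500, -s * 1000 + c * 500)
```

A useful property of this structure is that the integer version stays exactly invertible despite rounding, because each step only adds a function of the other value.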
The first stage of lifting steps (
Figure
PSNR calculation and experiment setup.
Table
PSNR (dB) of different sequences for different DCT sizes.
Sequences | DCT size | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
BQ terrace_ | 47.1 | 44.4 | 44.1 | 48.8 | 45.9 | 44.1 | 48.9 | 45.4 | 45.3 | 48.8 | 45.7 | 44.1 |
BQ square_ | 47.4 | 45.1 | 45.2 | 48.1 | 45.2 | 45.2 | 48.1 | 45.7 | 44.1 | 48.2 | 45.2 | 44.7 |
BQ mall_ | 46.9 | 44.1 | 44.5 | 47.9 | 44.9 | 45.5 | 48.7 | 44.6 | 44.9 | 48.4 | 45.7 | 45.3 |
Basketball drive_ | 47.8 | 44.5 | 44.1 | 48.2 | 44.8 | 45.7 | 48.5 | 45.1 | 45.7 | 48.3 | 45.1 | 45.3 |
Basketball drill_ | 47.0 | 45.4 | 45.6 | 48.6 | 45.0 | 44.8 | 48.2 | 45.9 | 45.0 | 48.7 | 45.4 | 45.5 |
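For reference, PSNR values such as those above follow from the mean squared error between a reference signal and a test signal. The sketch below uses the standard definition and assumes 8-bit samples (peak value 255) passed as flat sequences. Which signals are compared in the experiment setup is not recoverable from the surviving text, so the function and its parameter names are illustrative only.

```python
import math


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(peak * peak / mse)
```

With this definition, an error of one grey level on every 8-bit sample gives roughly 48.13 dB, which is the same order as the values reported in the table.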
In Tables
Proposed architecture.
DCT size | Multiplications | Additions | Shifts |
---|---|---|---|
4 | 0 | 17 | 5 |
8 | 0 | 74 | 39 |
16 | 0 | 232 | 132 |
32 | 0 | 548 | 249 |
Comparison for
Architecture | Multiplications | Additions | Shifts |
---|---|---|---|
Original | 1024 | 992 | 0 |
Proposed | 0 | 548 | 249 |
Comparison for
Architecture | Multiplications | Additions | Shifts |
---|---|---|---|
[ | 0 | 242 | 58 |
Proposed | 0 | 232 | 132 |
The netlist is written in VHDL. Synopsys Design Vision is used for synthesis. The code is synthesized on a 90 nm standard cell library at a clock frequency of 150 MHz. Table
Synthesis results.
Parameter | Value |
---|---|
Technology | 90 nm |
Frequency | 150 MHz |
Area | 0.42 mm² |
Power | 884.1 |
Memory | 36 kbits |
The time required by the proposed architecture to completely process an
In this work, a dynamic