N Point DCT VLSI Architecture for Emerging HEVC Standard

Ashfaq Ahmed,¹ Muhammad Usman Shahid,¹ Ata ur Rehman,¹ Andrey Norkin

¹ Department of Electronics & Telecommunication, Politecnico di Torino, 10129 Torino, Italy (polito.it)

Research Article. VLSI Design (ISSN 1563-5171, 1065-514X), Hindawi Publishing Corporation, 2012, Article ID 752024, doi:10.1155/2012/752024.

History: 21 12 2011, 13 04 2012, 06 05 2012. Published: 24 7 2012.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This work presents a flexible VLSI architecture to compute the N-point DCT. Since HEVC supports different block sizes for the computation of the DCT, from 4×4 up to 32×32, a flexible architecture that supports all of them helps reduce the area overhead of hardware implementations. The proposed hardware is partially folded to save area while keeping enough throughput for large video frame sizes. The architecture relies on the decomposition of the DCT matrices into sparse submatrices in order to reduce the number of multiplications. Finally, multiplications are completely eliminated using the lifting scheme. The proposed architecture sustains real-time processing of 1080P HD video when running at 150 MHz.

1. Introduction

As technology evolves, hardware keeps shrinking while storage capacity increases. High-end video applications have become part of daily life, for example, watching movies, video conferencing, and recording and saving videos with high-definition cameras. A single device, such as a high-end mobile phone or smartphone, can now support all the multimedia applications that seemed out of reach a few years ago. As a consequence, new highly efficient video coders are of paramount importance. However, high efficiency comes at the expense of computational complexity. As pointed out in [1, 2], several blocks of video codecs, including the transform stage [3], motion estimation, and entropy coding [4], are responsible for this high complexity. As an example, the discrete cosine transform (DCT), which is used in several standards for image and video compression, is a computation-intensive operation. In particular, a direct implementation requires a large number of additions and multiplications.

HEVC, the new and yet-to-be-released video coding standard, targets high-efficiency video coding. One of the tools employed to improve coding efficiency is the DCT with different transform sizes. As an example, the 16-point DCT of HEVC is shown in [5]. In video compression, the DCT is widely used because it compacts the image energy at the low frequencies, making it easy to discard the high-frequency components. To meet the requirement of real-time processing, hardware implementations of the 2-D DCT/inverse DCT (IDCT) are adopted, for example, [6]. The 2-D DCT/IDCT can be implemented with the 1-D DCT/IDCT and a transpose memory in a row-column decomposition manner. In a direct implementation of the DCT, floating-point multiplications have to be handled, which causes precision problems in hardware. Hence, we propose a Walsh-Hadamard transform-based DCT implementation [7]. Then, inspired by the DCT factorizations proposed in [8, 9], we factorize the remaining rotations into simpler steps through the lifting scheme [10]. The resulting lifting scheme-based architecture, inspired by [11–13], is simplified by exploiting the techniques proposed in [9, 14] to achieve a multiplierless implementation. Other techniques can be employed to achieve multiplierless solutions, such as the ones proposed in [8, 15–18], but they are not discussed in this work. The proposed multisize DCT architecture supports all the block sizes of HEVC and targets the real-time processing of 1080P HD video sequences.

The rest of this paper is organized as follows. Section 2 reviews the 2-D DCT. Section 3 shows the matrix decompositions for the different DCT sizes. Section 4 presents the proposed hardware architecture. The VLSI implementation and the simulation results are given in Section 5. Finally, Section 6 concludes this paper.

2. Review of 2D Transform

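According to [19], the DCT in the so-called II-form for an N-point block of samples $x=\{x_0,x_1,\ldots,x_{N-1}\}$ is obtained as

$$X_k=\sqrt{\frac{2}{N}}\,\epsilon_k\sum_{n=0}^{N-1}x(n)\cdot\cos\left[\frac{\pi(2n+1)k}{2N}\right], \tag{1}$$

where

$$\epsilon_k=\begin{cases}\dfrac{1}{\sqrt{2}}, & \text{if } k=0,\\ 1, & \text{otherwise},\end{cases} \tag{2}$$

and $k=0,1,2,\ldots,N-1$. The result expressed in (1) can be rewritten as the product of a matrix $C_N^{II}$ and $x$ as follows:

$$X=C_N^{II}\cdot x^{t}, \tag{3}$$

where $(\cdot)^{t}$ is the transposition operator. The DCT matrix can be expressed in terms of a reduced number of angles by exploiting the symmetry properties of the trigonometric functions. For 16 points, the DCT matrix can be represented as

$$C_{16}^{II}=\frac{1}{2\sqrt{2}}\cdot\hat{C}_{16}^{II}, \tag{4}$$

where $\hat{C}_{16}^{II}$ is shown in (20), with

$$c_{p,q}=\cos\left(\frac{p\cdot\pi}{2^{q}}\right),\qquad s_{p,q}=\sin\left(\frac{p\cdot\pi}{2^{q}}\right), \tag{5}$$

where $4\le 2^{q}\le 2N$ and $p$ is an odd integer such that $p<2^{q-2}$.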
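As a reference point for the factorizations that follow, (1) to (3) can be evaluated directly. The sketch below is a plain floating-point model, not the proposed hardware; the function names `dct_ii` and `dct_matrix` are illustrative choices.

```python
import math

def dct_ii(x):
    """N-point DCT-II of the sequence x, evaluated term by term as in (1) and (2)."""
    n_points = len(x)
    out = []
    for k in range(n_points):
        eps = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        acc = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
                  for n in range(n_points))
        out.append(math.sqrt(2.0 / n_points) * eps * acc)
    return out

def dct_matrix(n_points):
    """Matrix form of (3): row k holds the k-th cosine basis function."""
    return [[math.sqrt(2.0 / n_points)
             * (1.0 / math.sqrt(2.0) if k == 0 else 1.0)
             * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
             for n in range(n_points)]
            for k in range(n_points)]
```

With this normalization the rows of `dct_matrix(N)` are orthonormal, so a constant input maps to a single nonzero (DC) coefficient.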

In [20], it is shown that every even-odd transform can be represented in terms of any other even-odd transform through a conversion matrix. In particular, in [7] it is shown that the DCT can be expressed in terms of the Walsh-Hadamard transform (WHT) [21], and that the conversion matrix has a block diagonal structure. In [22], it is shown that the WHT can be implemented with a fast algorithm based on a butterfly structure. Among the possible WHT matrix representations, the Walsh-ordered one is used here, derived from the corresponding Hadamard-ordered matrix. The N-order Hadamard matrix can be expressed as

$$H_N=\begin{bmatrix}H_{N/2} & H_{N/2}\\ H_{N/2} & -H_{N/2}\end{bmatrix}, \tag{6}$$

where $H_1=1$. The corresponding Walsh matrix $W_N$ is obtained by applying a two-step procedure [23] as follows:

(1) bit reverse the order of the rows of $H_N$,

(2) apply Gray coding to the row indices.
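The two steps above can be sketched in a few lines. This is a minimal model that assumes the steps compose as "row k of the Walsh matrix is row bitrev(gray(k)) of the Hadamard matrix"; the helper names are illustrative and are not taken from [23].

```python
def hadamard(n):
    """Hadamard-ordered matrix built with the recursion in (6); n is a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    y = 0
    for _ in range(bits):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y

def gray(x):
    """Binary-reflected Gray code of x."""
    return x ^ (x >> 1)

def walsh(n):
    """Walsh-ordered matrix: row k is Hadamard row bit_reverse(gray(k))."""
    bits = n.bit_length() - 1
    h = hadamard(n)
    return [h[bit_reverse(gray(k), bits)] for k in range(n)]

def sign_changes(row):
    """Number of sign changes along a row (its sequency)."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)
```

Under this reading, the rows of `walsh(n)` come out sorted by increasing number of sign changes, which is the usual meaning of Walsh (sequency) ordering.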

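Therefore, the DCT matrix can be written as

$$C_N^{II}=\frac{1}{\sqrt{N}}\cdot B_N\cdot T_N\cdot B_N\cdot W_N, \tag{7}$$

where $B_N$ is the N-point bit reversal matrix, and

$$T_N=\begin{bmatrix}T_{N/2} & 0\\ 0 & U_{N/2}\end{bmatrix} \tag{8}$$

is a block diagonal matrix with a recursive structure, $T_2=I_2$ is the 2×2 identity matrix, and

$$U_2=\begin{bmatrix}c_{1,3} & s_{1,3}\\ -s_{1,3} & c_{1,3}\end{bmatrix}. \tag{9}$$

It is worth noting that $U_{N/2}$ can be factorized in terms of Givens rotations, where the Givens rotation matrix for a rotation angle $\theta$ is

$$\begin{bmatrix}\cos\theta & \sin\theta\\ -\sin\theta & \cos\theta\end{bmatrix}. \tag{10}$$

In particular, $U_{N/2}$ can be decomposed into the product of permutation matrices $P_{N/2}$ and Givens rotation matrices $V_{N/2,q}$, with $3\le q\le m$ and $m=\log_2(N)+1$, as

$$U_{N/2}=P_{N/2}\cdot V_{N/2,m}\cdot\ldots\cdot V_{N/2,q}\cdot\ldots\cdot V_{N/2,3}\cdot P_{N/2}. \tag{11}$$

The permutation matrix $P_{N/2}$ is obtained by applying the permutation

$$\Phi_{N/2}=\begin{pmatrix}0 & 1 & \cdots & \frac{N}{2}-1\\ \phi_{N/2}(0) & \phi_{N/2}(1) & \cdots & \phi_{N/2}\left(\frac{N}{2}-1\right)\end{pmatrix} \tag{12}$$

to the rows or to the columns of $I_{N/2}$, the $N/2\times N/2$ identity matrix. It is worth observing that $\phi_{N/2}(x)$ can be defined recursively as

$$\phi_{N/2}(x)=\begin{cases}2\,\phi_{N/4}(x), & 0\le x\le \frac{N}{4}-1,\\ 2\,\phi_{N/4}\left(x-\frac{N}{4}\right)+1, & \frac{N}{4}\le x\le \frac{N}{2}-1,\end{cases} \tag{13}$$

where $\phi_2(0)=0$ and $\phi_2(1)=1$.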
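The recursion in (13) is easy to check against the permutations listed later in (27), (36), and (46). The sketch below assumes that the second branch is evaluated at x − N/4 and covers the range up to N/2 − 1, which is the reading that reproduces those three permutations; `phi` and `permutation_matrix` are illustrative names.

```python
def phi(half_n, x):
    """phi_{N/2}(x) following (13); half_n = N/2 is a power of two, at least 2."""
    if half_n == 2:
        return x                      # phi_2(0) = 0, phi_2(1) = 1
    quarter = half_n // 2             # N/4
    if x < quarter:
        return 2 * phi(quarter, x)
    return 2 * phi(quarter, x - quarter) + 1

def permutation_matrix(half_n):
    """P_{N/2}: row i has its single 1 in column phi(i)."""
    return [[1 if col == phi(half_n, row) else 0 for col in range(half_n)]
            for row in range(half_n)]
```

Under this reading the permutation is a bit reversal of the index, so `permutation_matrix` is symmetric and applying it to rows or to columns gives the same matrix.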

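The Givens rotation matrices can be described as follows: $V_{N/2,m}$ contains $N/4$ Givens rotations arranged in $N/4$ concentric squares, with the rotation angle increasing from the outer square to the inner one. The definition of each $c_{p,q}$ is given in (5). In the outermost square $p=1$, and the value of $p$ increases from the outer square to the inner one. Since $p$ multiplies the numerator in (5), the angle of the Givens rotations increases from the outer to the inner squares. The general structure of $V_{N/2,m}$ is shown in (14). In the next sections, the expression shown in (14) will be detailed for different values of $N$:

$$V_{N/2,m}=\begin{pmatrix}
c_{1,m} & 0 & \cdots & \cdots & \cdots & 0 & s_{1,m}\\
\vdots & \ddots & & & & \cdot^{\cdot^{\cdot}} & \vdots\\
0 & \cdots & c_{p,m} & \cdots & s_{p,m} & \cdots & 0\\
\vdots & & \vdots & & \vdots & & \vdots\\
0 & \cdots & -s_{p,m} & \cdots & c_{p,m} & \cdots & 0\\
\vdots & \cdot^{\cdot^{\cdot}} & & & & \ddots & \vdots\\
-s_{1,m} & 0 & \cdots & \cdots & \cdots & 0 & c_{1,m}
\end{pmatrix}. \tag{14}$$

The remaining matrices $V_{N/2,q}$, for $3\le q<m$, are

$$V_{N/2,q}=\begin{pmatrix}V_{N/2^{r},q} & 0 & 0\\ 0 & \ddots & 0\\ 0 & 0 & V_{N/2^{r},q}\end{pmatrix}, \tag{15}$$

where $r=m-q+1$ and $V_{2,3}=U_2$. Finally, each Givens rotation can be factorized into lifting steps by means of the lifting scheme [10], as suggested in [9], as follows:

$$\begin{pmatrix}\cos\theta & \sin\theta\\ -\sin\theta & \cos\theta\end{pmatrix}
=\begin{pmatrix}1 & \dfrac{1-\cos\theta}{\sin\theta}\\ 0 & 1\end{pmatrix}\cdot
\begin{pmatrix}1 & 0\\ -\sin\theta & 1\end{pmatrix}\cdot
\begin{pmatrix}1 & \dfrac{1-\cos\theta}{\sin\theta}\\ 0 & 1\end{pmatrix}. \tag{16}$$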
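The three-step lifting factorization in (16) can be confirmed numerically by multiplying the three triangular factors and comparing the product with the rotation matrix in (10). The following is a small floating-point check with illustrative function names.

```python
import math

def matmul2(a, b):
    """Product of two 2x2 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def givens(theta):
    """Givens rotation matrix as written in (10)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def lifting_product(theta):
    """Product of the three lifting factors on the right-hand side of (16)."""
    p = (1.0 - math.cos(theta)) / math.sin(theta)
    u = -math.sin(theta)
    upper = [[1.0, p], [0.0, 1.0]]
    lower = [[1.0, 0.0], [u, 1.0]]
    return matmul2(matmul2(upper, lower), upper)

def max_abs_diff(a, b):
    """Largest absolute entrywise difference between two 2x2 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))
```

Each factor changes only one of the two values, which is what allows the integer, shift-and-add realization used later in the architecture.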

3. Matrix Decompositions for Different N

The matrices for the different DCT sizes are derived using the factorization presented in Section 2. In the following subsections, the factorizations for N ranging from 4 to 32 are shown explicitly.

3.1. 4×4 DCT

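Equation (6) can be given as

$$H_4=\begin{pmatrix}H_2 & H_2\\ H_2 & -H_2\end{pmatrix}. \tag{17}$$

Equation (7) can be given as

$$C_4^{II}=\frac{1}{2}\cdot B_4\cdot T_4\cdot B_4\cdot W_4, \tag{18}$$

where

$$T_4=\begin{pmatrix}I_2 & 0\\ 0 & U_2\end{pmatrix}, \tag{19}$$

$U_2$ is shown in (9), and $I_2$ is the 2×2 identity matrix. $W_4$ is the Walsh-Hadamard matrix, which is calculated using (17).

The matrix $\hat{C}_{16}^{II}$ referenced in (4), with $h=1/\sqrt{2}$, is

$$\hat{C}_{16}^{II}=\left[\begin{array}{*{16}{c}}
h & h & h & h & h & h & h & h & h & h & h & h & h & h & h & h\\
c_{1,5} & c_{3,5} & c_{5,5} & c_{7,5} & s_{7,5} & s_{5,5} & s_{3,5} & s_{1,5} & -s_{1,5} & -s_{3,5} & -s_{5,5} & -s_{7,5} & -c_{7,5} & -c_{5,5} & -c_{3,5} & -c_{1,5}\\
c_{1,4} & c_{3,4} & s_{3,4} & s_{1,4} & -s_{1,4} & -s_{3,4} & -c_{3,4} & -c_{1,4} & -c_{1,4} & -c_{3,4} & -s_{3,4} & -s_{1,4} & s_{1,4} & s_{3,4} & c_{3,4} & c_{1,4}\\
c_{3,5} & s_{7,5} & s_{1,5} & -s_{5,5} & -c_{5,5} & -c_{1,5} & -c_{7,5} & -s_{3,5} & s_{3,5} & c_{7,5} & c_{1,5} & c_{5,5} & s_{5,5} & -s_{1,5} & -s_{7,5} & -c_{3,5}\\
c_{1,3} & s_{1,3} & -s_{1,3} & -c_{1,3} & -c_{1,3} & -s_{1,3} & s_{1,3} & c_{1,3} & c_{1,3} & s_{1,3} & -s_{1,3} & -c_{1,3} & -c_{1,3} & -s_{1,3} & s_{1,3} & c_{1,3}\\
c_{5,5} & s_{1,5} & -c_{7,5} & -c_{3,5} & -s_{3,5} & s_{7,5} & c_{1,5} & s_{5,5} & -s_{5,5} & -c_{1,5} & -s_{7,5} & s_{3,5} & c_{3,5} & c_{7,5} & -s_{1,5} & -c_{5,5}\\
c_{3,4} & -s_{1,4} & -c_{1,4} & -s_{3,4} & s_{3,4} & c_{1,4} & s_{1,4} & -c_{3,4} & -c_{3,4} & s_{1,4} & c_{1,4} & s_{3,4} & -s_{3,4} & -c_{1,4} & -s_{1,4} & c_{3,4}\\
c_{7,5} & -s_{5,5} & -c_{3,5} & s_{1,5} & c_{1,5} & s_{3,5} & -c_{5,5} & -s_{7,5} & s_{7,5} & c_{5,5} & -s_{3,5} & -c_{1,5} & -s_{1,5} & c_{3,5} & s_{5,5} & -c_{7,5}\\
h & -h & -h & h & h & -h & -h & h & h & -h & -h & h & h & -h & -h & h\\
s_{7,5} & -c_{5,5} & -s_{3,5} & c_{1,5} & -s_{1,5} & -c_{3,5} & s_{5,5} & c_{7,5} & -c_{7,5} & -s_{5,5} & c_{3,5} & s_{1,5} & -c_{1,5} & s_{3,5} & c_{5,5} & -s_{7,5}\\
s_{3,4} & -c_{1,4} & s_{1,4} & c_{3,4} & -c_{3,4} & -s_{1,4} & c_{1,4} & -s_{3,4} & -s_{3,4} & c_{1,4} & -s_{1,4} & -c_{3,4} & c_{3,4} & s_{1,4} & -c_{1,4} & s_{3,4}\\
s_{5,5} & -c_{1,5} & s_{7,5} & s_{3,5} & -c_{3,5} & c_{7,5} & s_{1,5} & -c_{5,5} & c_{5,5} & -s_{1,5} & -c_{7,5} & c_{3,5} & -s_{3,5} & -s_{7,5} & c_{1,5} & -s_{5,5}\\
s_{1,3} & -c_{1,3} & c_{1,3} & -s_{1,3} & -s_{1,3} & c_{1,3} & -c_{1,3} & s_{1,3} & s_{1,3} & -c_{1,3} & c_{1,3} & -s_{1,3} & -s_{1,3} & c_{1,3} & -c_{1,3} & s_{1,3}\\
s_{3,5} & -c_{7,5} & c_{1,5} & -c_{5,5} & s_{5,5} & s_{1,5} & -s_{7,5} & c_{3,5} & -c_{3,5} & s_{7,5} & -s_{1,5} & -s_{5,5} & c_{5,5} & -c_{1,5} & c_{7,5} & -s_{3,5}\\
s_{1,4} & -s_{3,4} & c_{3,4} & -c_{1,4} & c_{1,4} & -c_{3,4} & s_{3,4} & -s_{1,4} & -s_{1,4} & s_{3,4} & -c_{3,4} & c_{1,4} & -c_{1,4} & c_{3,4} & -s_{3,4} & s_{1,4}\\
s_{1,5} & -s_{3,5} & s_{5,5} & -s_{7,5} & c_{7,5} & -c_{5,5} & c_{3,5} & -c_{1,5} & c_{1,5} & -c_{3,5} & c_{5,5} & -c_{7,5} & s_{7,5} & -s_{5,5} & s_{3,5} & -s_{1,5}
\end{array}\right]. \tag{20}$$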
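The 4-point factorization in (18) can be checked numerically against the direct definition in (1). The sketch below assumes that $W_4$ is the sequency-ordered Walsh matrix and that $B_4$ swaps indices 1 and 2; both matrices are written out by hand here, and all names are illustrative.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Walsh-ordered (sequency-ordered) 4-point WHT matrix (assumed ordering).
W4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]

# 4-point bit reversal matrix: swaps indices 1 (binary 01) and 2 (binary 10).
B4 = [[1, 0, 0, 0],
      [0, 0, 1, 0],
      [0, 1, 0, 0],
      [0, 0, 0, 1]]

C13 = math.cos(math.pi / 8)   # c_{1,3} from (5)
S13 = math.sin(math.pi / 8)   # s_{1,3} from (5)

# T4 = diag(I2, U2) as in (19), with U2 taken from (9).
T4 = [[1, 0, 0, 0],
      [0, 1, 0, 0],
      [0, 0, C13, S13],
      [0, 0, -S13, C13]]

def factorized_dct4():
    """Right-hand side of (18): (1/2) * B4 * T4 * B4 * W4."""
    product = matmul(B4, matmul(T4, matmul(B4, W4)))
    return [[0.5 * v for v in row] for row in product]

def reference_dct4():
    """4-point DCT-II matrix taken directly from (1) and (2)."""
    return [[math.sqrt(0.5)
             * (1.0 / math.sqrt(2.0) if k == 0 else 1.0)
             * math.cos(math.pi * (2 * n + 1) * k / 8.0)
             for n in range(4)]
            for k in range(4)]

def max_abs_difference(a, b):
    """Largest absolute entrywise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

With these conventions the two matrices agree to floating-point precision, which gives some confidence that the Walsh ordering and bit reversal assumed here match the ones intended in (18).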

3.2. 8×8 DCT

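Equation (6) can be given as

$$H_8=\begin{pmatrix}H_4 & H_4\\ H_4 & -H_4\end{pmatrix}. \tag{21}$$

Equation (7) can be given as

$$C_8^{II}=\frac{1}{2\sqrt{2}}\cdot B_8\cdot T_8\cdot B_8\cdot W_8, \tag{22}$$

where

$$T_8=\begin{pmatrix}I_2 & 0 & 0\\ 0 & U_2 & 0\\ 0 & 0 & U_4\end{pmatrix}, \tag{23}$$

$U_2$ is shown in (9), and $I_2$ is the 2×2 identity matrix. $W_8$ is the Walsh-Hadamard matrix, which is calculated using (21). According to (11), $m=\log_2(8)+1=4$ and $3\le q\le 4$. So $U_4$ can be written as

$$U_4=P_4\cdot V_{4,4}\cdot V_{4,3}\cdot P_4, \tag{24}$$

where, according to (14),

$$V_{4,4}=\begin{pmatrix}c_{1,4} & 0 & 0 & s_{1,4}\\ 0 & c_{3,4} & s_{3,4} & 0\\ 0 & -s_{3,4} & c_{3,4} & 0\\ -s_{1,4} & 0 & 0 & c_{1,4}\end{pmatrix} \tag{25}$$

and, according to (15), $r=m-q+1=4-3+1=2$ and $V_{N/2^{2},q}=V_{8/4,3}=V_{2,3}=U_2$, so

$$V_{4,3}=\begin{pmatrix}U_2 & 0\\ 0 & U_2\end{pmatrix}. \tag{26}$$

Using (12) and (13), the permutation is computed as

$$\Phi_4=\begin{pmatrix}0 & 1 & 2 & 3\\ 0 & 2 & 1 & 3\end{pmatrix}. \tag{27}$$

Finally, the permutation matrix $P_4$ is obtained by applying the permutation shown in (27) to the columns of the 4×4 identity matrix:

$$P_4=\begin{bmatrix}1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\end{bmatrix}. \tag{28}$$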

3.3. 16×16 DCT

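Equation (6) can be given as

$$H_{16}=\begin{pmatrix}H_8 & H_8\\ H_8 & -H_8\end{pmatrix}. \tag{29}$$

Equation (7) can be given as

$$C_{16}^{II}=\frac{1}{4}\cdot B_{16}\cdot T_{16}\cdot B_{16}\cdot W_{16}, \tag{30}$$

where

$$T_{16}=\begin{pmatrix}I_2 & 0 & 0 & 0\\ 0 & U_2 & 0 & 0\\ 0 & 0 & U_4 & 0\\ 0 & 0 & 0 & U_8\end{pmatrix}, \tag{31}$$

and $U_2$ and $U_4$ are shown in (9) and (24). To calculate $U_8$, according to (11), $m=\log_2(16)+1=5$ and $3\le q\le 5$. So $U_8$ can be written as

$$U_8=P_8\cdot V_{8,5}\cdot V_{8,4}\cdot V_{8,3}\cdot P_8, \tag{32}$$

where, according to (14),

$$V_{8,5}=\begin{pmatrix}
c_{1,5} & 0 & 0 & 0 & 0 & 0 & 0 & s_{1,5}\\
0 & c_{3,5} & 0 & 0 & 0 & 0 & s_{3,5} & 0\\
0 & 0 & c_{5,5} & 0 & 0 & s_{5,5} & 0 & 0\\
0 & 0 & 0 & c_{7,5} & s_{7,5} & 0 & 0 & 0\\
0 & 0 & 0 & -s_{7,5} & c_{7,5} & 0 & 0 & 0\\
0 & 0 & -s_{5,5} & 0 & 0 & c_{5,5} & 0 & 0\\
0 & -s_{3,5} & 0 & 0 & 0 & 0 & c_{3,5} & 0\\
-s_{1,5} & 0 & 0 & 0 & 0 & 0 & 0 & c_{1,5}
\end{pmatrix}. \tag{33}$$

Similarly, to calculate $V_{8,4}$, according to (15), $r=m-q+1=5-4+1=2$ and $V_{N/2^{2},q}=V_{16/4,4}=V_{4,4}$, so

$$V_{8,4}=\begin{pmatrix}V_{4,4} & 0\\ 0 & V_{4,4}\end{pmatrix}, \tag{34}$$

where $V_{4,4}$ is given in (25). For $V_{8,3}$, $r=m-q+1=5-3+1=3$ and $V_{N/2^{3},q}=V_{16/8,3}=V_{2,3}=U_2$, so

$$V_{8,3}=\begin{pmatrix}U_2 & 0 & 0 & 0\\ 0 & U_2 & 0 & 0\\ 0 & 0 & U_2 & 0\\ 0 & 0 & 0 & U_2\end{pmatrix}. \tag{35}$$

Using (12) and (13), the permutation is computed as

$$\Phi_8=\begin{pmatrix}0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\ 0 & 4 & 2 & 6 & 1 & 5 & 3 & 7\end{pmatrix}. \tag{36}$$

Finally, the permutation matrix $P_8$ is obtained by applying the permutation shown in (36) to the columns of the 8×8 identity matrix:

$$P_8=\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}. \tag{37}$$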

3.4. 32×32 DCT

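Equation (6) can be given as

$$H_{32}=\begin{pmatrix}H_{16} & H_{16}\\ H_{16} & -H_{16}\end{pmatrix}. \tag{38}$$

Equation (7) can be given as

$$C_{32}^{II}=\frac{1}{4\sqrt{2}}\cdot B_{32}\cdot T_{32}\cdot B_{32}\cdot W_{32}, \tag{39}$$

where

$$T_{32}=\begin{pmatrix}I_2 & 0 & 0 & 0 & 0\\ 0 & U_2 & 0 & 0 & 0\\ 0 & 0 & U_4 & 0 & 0\\ 0 & 0 & 0 & U_8 & 0\\ 0 & 0 & 0 & 0 & U_{16}\end{pmatrix}, \tag{40}$$

and $U_2$, $U_4$, and $U_8$ are shown in (9), (24), and (32). To calculate $U_{16}$, according to (11), $m=\log_2(32)+1=6$ and $3\le q\le 6$. So $U_{16}$ can be written as

$$U_{16}=P_{16}\cdot V_{16,6}\cdot V_{16,5}\cdot V_{16,4}\cdot V_{16,3}\cdot P_{16}, \tag{41}$$

where, according to (14),

$$V_{16,6}=\begin{pmatrix}
c_{1,6} & 0 & 0 & 0 & 0 & 0 & 0 & s_{1,6}\\
0 & c_{3,6} & 0 & 0 & 0 & 0 & s_{3,6} & 0\\
0 & 0 & \ddots & 0 & 0 & \cdot^{\cdot^{\cdot}} & 0 & 0\\
0 & 0 & 0 & c_{15,6} & s_{15,6} & 0 & 0 & 0\\
0 & 0 & 0 & -s_{15,6} & c_{15,6} & 0 & 0 & 0\\
0 & 0 & \cdot^{\cdot^{\cdot}} & 0 & 0 & \ddots & 0 & 0\\
0 & -s_{3,6} & 0 & 0 & 0 & 0 & c_{3,6} & 0\\
-s_{1,6} & 0 & 0 & 0 & 0 & 0 & 0 & c_{1,6}
\end{pmatrix}. \tag{42}$$

Similarly, to calculate $V_{16,5}$, according to (15), $r=m-q+1=6-5+1=2$ and $V_{N/2^{2},q}=V_{32/4,5}=V_{8,5}$, so

$$V_{16,5}=\begin{pmatrix}V_{8,5} & 0\\ 0 & V_{8,5}\end{pmatrix}, \tag{43}$$

where $V_{8,5}$ is given in (33). For $V_{16,4}$, $r=m-q+1=6-4+1=3$ and $V_{N/2^{3},q}=V_{32/8,4}=V_{4,4}$, so

$$V_{16,4}=\begin{pmatrix}V_{4,4} & 0 & 0 & 0\\ 0 & V_{4,4} & 0 & 0\\ 0 & 0 & V_{4,4} & 0\\ 0 & 0 & 0 & V_{4,4}\end{pmatrix}, \tag{44}$$

where $V_{4,4}$ is given in (25). For $V_{16,3}$, $r=m-q+1=6-3+1=4$ and $V_{N/2^{4},q}=V_{32/16,3}=V_{2,3}=U_2$, so

$$V_{16,3}=\begin{pmatrix}U_2 & 0 & 0 & 0\\ 0 & U_2 & 0 & 0\\ 0 & 0 & \ddots & 0\\ 0 & 0 & 0 & U_2\end{pmatrix}. \tag{45}$$

Using (12) and (13), the permutation is computed as

$$\Phi_{16}=\left(\begin{array}{*{16}{c}}0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15\\ 0 & 8 & 4 & 12 & 2 & 10 & 6 & 14 & 1 & 9 & 5 & 13 & 3 & 11 & 7 & 15\end{array}\right). \tag{46}$$

Finally, the permutation matrix $P_{16}$ is obtained by applying the permutation shown in (46) to the columns of the 16×16 identity matrix:

$$P_{16}=\left[\begin{array}{*{16}{c}}
1&0&0&0&0&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&1&0&0&0&0&0&0&0\\
0&0&0&0&1&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&1&0&0&0\\
0&0&1&0&0&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&1&0&0&0&0&0\\
0&0&0&0&0&0&1&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&0&0&1&0\\
0&1&0&0&0&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&1&0&0&0&0&0&0\\
0&0&0&0&0&1&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&0&1&0&0\\
0&0&0&1&0&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&1&0&0&0&0\\
0&0&0&0&0&0&0&1&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&0&0&0&1
\end{array}\right]. \tag{47}$$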

4. Proposed Architecture

The complete hardware architecture of the DCT is shown in Figure 1. Each frame is loaded into the input frame memory and divided into N×N blocks. The control unit reads the rows of each block from the input memory and, at the same time, passes a “1” to the input multiplexers. The address and the other control signals are also passed to the DCT block. Once the DCT of a row has been computed, the transformed row is written to the transpose memory at its corresponding address. In this way, for the first N clock cycles, the rows from the input memory are fed to the DCT and the results are written to the corresponding addresses in the transpose memory. After these N clock cycles, the control unit passes a “0” to the input multiplexers for the next N clock cycles. Each column from the transpose memory is then fed to the DCT block, and the outputs of the DCT block are written back to the transpose memory, at the same locations from which they were read. When all the columns have been read and processed by the DCT, the control unit starts reading the next N×N block from the input memory and, at the same time, each row from the transpose memory is written to the output transformed memory. In this way, all the N×N blocks are read, processed, and written to the output transformed memory.
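The row-column schedule described above can be modeled in software as two passes of a 1-D DCT separated by a transpose. The sketch below is a floating-point behavioral model only; it ignores the fixed-point word lengths, the pipelining, and the folding of the actual hardware, and its function names are illustrative.

```python
import math

def dct_1d(x):
    """Orthonormal 1-D DCT-II of a list of samples."""
    n_points = len(x)
    return [math.sqrt(2.0 / n_points)
            * (1.0 / math.sqrt(2.0) if k == 0 else 1.0)
            * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
                  for n in range(n_points))
            for k in range(n_points)]

def transpose(block):
    """Swap rows and columns, as reading the transpose memory column-wise does."""
    return [list(col) for col in zip(*block)]

def dct_2d(block):
    """2-D DCT of an N x N block by row-column decomposition."""
    rows_done = [dct_1d(row) for row in block]        # first N cycles: rows
    columns = transpose(rows_done)                    # columns read from the buffer
    cols_done = [dct_1d(col) for col in columns]      # next N cycles: columns
    return transpose(cols_done)                       # results back in place
```

For a constant 4×4 block of value 8, the only nonzero output of `dct_2d` is the DC coefficient, equal to N times the sample value.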

Figure 1

Top level hardware architecture of DCT.

When the last row is processed by the DCT, it is written to the transpose memory. At the same time, the first column is read from the transpose memory in order to be processed by the DCT block. Because the last row has not yet been written, the last element of the first column is not valid. The “Data0” multiplexer is therefore used for forwarding: the first output of the last transformed row of an N×N block is forwarded to the input of the DCT block and is also written to the transpose memory.

4.1. DCT Block

The DCT block is the main block of the complete architecture. It takes the input data, the corresponding control signals, and the corresponding addresses. The internal architecture of the DCT block is shown in Figure 2.

Figure 2

DCT block internal structure.

The DCT block has 4 pipeline stages. The data is first passed through the Hadamard block, which is designed with a fully parallel architecture. The Hadamard block takes 32 input values and passes them to butterfly_32, while the first 16 are also input to butterfly_16, the first 8 to butterfly_8, and the first 4 to butterfly_4. Multiplexers are placed at the inputs of each butterfly size so that the Hadamard block produces the correct result for the selected DCT size. The select signals for the multiplexers are driven by the control unit. The Hadamard block has 32 outputs. To obtain a Walsh transform from the Hadamard one, the bit_reversal and gray_code blocks are placed after the Hadamard block.

In the bit_reversal block, the data at input port number X is moved to output port number Y, where Y is obtained by writing X in binary and reversing the order of the bits. For example, in the case of DCT_16, the Hadamard block produces 16 valid outputs, so the 16 inputs to the bit_reversal block are shuffled according to the bit reversal rule. X=0 corresponds to X=“0000”, whose bit reversal is Y=“0000”, so input port 0 is connected to output port 0. Similarly, for X=“0001”, the bit reversal is Y=“1000”, which means that input port 1 is connected to output port 8. In this way, all the inputs are connected to the outputs according to the bit reversal rule. As the architecture supports four different DCT sizes, the bit reversal rule is different for each size. For example, for DCT_4, X=“01” is connected to Y=“10”, that is, output port 2, while for DCT_16 input port 1 is connected to output port 8. Multiplexers are therefore placed in the bit reversal block in order to support all the DCT sizes.

The gray code block works on the same principle as the bit reversal block, but follows the Gray code rule: the output port is determined by applying Gray coding to the input port number. For example, for DCT_32, if X=“01101”, then Y=“01011”, so input port number 13 is connected to output port number 11. The Gray code calculation does not depend on the DCT size. For example, for DCT_16, if X=“1101”, then Y=“1011”, which means that input port 13 is connected to output port number 11, the same as for DCT_32.
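The port wiring of the two blocks can be reproduced with a few lines of integer arithmetic. This is a sketch of the mapping rules only (the multiplexers that select between DCT sizes are not modeled), and the function names are illustrative.

```python
def bit_reversal_port(x, n_points):
    """Output port for input port x in the bit_reversal block of an n_points DCT."""
    bits = n_points.bit_length() - 1      # log2(n_points) address bits
    y = 0
    for _ in range(bits):
        y = (y << 1) | (x & 1)            # shift the lowest bit of x into y
        x >>= 1
    return y

def gray_code_port(x):
    """Output port for input port x in the gray_code block (size independent)."""
    return x ^ (x >> 1)
```

The bit reversal depends on the number of address bits, which is why that block needs size-dependent multiplexing, whereas `gray_code_port` gives the same result for a given port number regardless of the DCT size.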

The architecture of the memory block is shown in Figure 3. The memory block connects the first 16 inputs directly to the output ports, while the last 16 outputs are multiplexed between the latched inputs and the direct inputs. The last 16 outputs are used in the case of DCT_32, while the last 16 inputs are bypassed in the case of DCT_4, DCT_8, and DCT_16.

Figure 3

Memory block and permutation block.

The permutation block is implemented using (27), (36), and (46). The block takes 32 inputs and sends 16 of them to the outputs according to the permutation rule. The first 16 inputs of the permutation block are passed to the outputs in the first clock cycle, while the last 16 inputs are passed to the outputs in the second clock cycle. For DCT sizes below 32, the selection line of the multiplexers is always set to “0”, while for DCT_32 the selection line is “0” in the first clock cycle and “1” in the next clock cycle.

The lifting scheme is implemented using (19), (23), (31), and (40). The lifting block is implemented with a folded architecture: the block operates in a fully parallel way for DCT sizes of 4, 8, and 16, while for DCT_32 the block is reused. Each row of DCT_32 takes 2 clock cycles to complete. During the first clock cycle, the upper 16 inputs are processed by the lifting scheme and are stored in the memory block. In the next clock cycle, the lower 16 inputs are processed by the lifting block, and the result, along with the previously stored values, is forwarded to the next block. The lifting block is shown in Figure 4.

Figure 4

Folded lifting scheme.

The lifting scheme is designed for 15 Givens rotations. The basic lifting structure, shown in Figure 5, is implemented using (16), where

$$\frac{a}{2^{m}}\approx\frac{1-\cos\theta}{\sin\theta},\qquad \frac{b}{2^{n}}\approx-\sin\theta, \tag{48}$$

as suggested in [9].

Figure 5

Basic lifting structure.

Each lifting structure takes two values at its inputs. For each Givens rotation, a, b, m, and n are integers chosen so that the approximated DCT matches the actual DCT. As a and b are integers, the multiplications are implemented using adders and shift operations. The result of each lifting structure is quantized to 16-bit resolution to obtain a reasonable PSNR value, so the final outputs of the lifting block are 16 bits wide. The results of some lifting structures are bypassed using the multiplexers at their outputs. The select line of these multiplexers always remains “0” for DCT_4, DCT_8, and DCT_16, while for DCT_32 the select line is “0” in the first clock cycle and “1” in the next clock cycle.
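A software sketch of one multiplierless lifting structure is given below for the π/8 rotation, using a = 51, b = −98, and m = n = 8 from Table 2. Several points are assumptions made for illustration: the shift-and-add decompositions of 51 and 98 shown here are one possible two-adder choice and not necessarily the ones obtained with RAG-n in [14], the right shift simply truncates, and the 16-bit quantization of the hardware is not modeled.

```python
import math

def lifting_coefficients(theta, m):
    """Integer a and b from (48), assuming the same scaling exponent m = n."""
    a = round((1.0 - math.cos(theta)) / math.sin(theta) * (1 << m))
    b = round(-math.sin(theta) * (1 << m))
    return a, b

def times_51(x):
    """51*x with two adders: 3x = x + 2x, then 51x = 3x + 48x."""
    t = x + (x << 1)
    return t + (t << 4)

def times_98(x):
    """98*x with two subtractors: 7x = 8x - x, 49x = 56x - 7x, then one shift."""
    t = (x << 3) - x
    u = (t << 3) - t
    return u << 1

def rotate_pi_over_8(x0, x1):
    """Three integer lifting steps approximating the Givens rotation by pi/8."""
    x0 = x0 + (times_51(x1) >> 8)     # first step, a / 2^m
    x1 = x1 - (times_98(x0) >> 8)     # second step, b / 2^n with b = -98
    x0 = x0 + (times_51(x1) >> 8)     # third step, a / 2^m again
    return x0, x1
```

For inputs of the order of a thousand, the integer result stays within a few units of the floating-point rotation, which is consistent with the scaling by 2^8.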

In the case of DCT_32, the Hadamard block produces 32 results in parallel. The outputs of the Hadamard block are fed into the bit reversal block, the gray code block, and the following bit reversal block, and the output of the latter is fed into the memory block. The memory block passes the upper 16 inputs directly to the permutation block, while the lower 16 inputs are stored in the registers. The permutation block forwards the upper 16 inputs to the lifting scheme, where the select lines of the multiplexers are set to “0”; the lifting scheme outputs the results, which are stored in the following memory block. In the next clock cycle, the lower 16 values stored in the first memory block are passed to the lifting scheme through the permutation block, and the selection lines of the multiplexers in the lifting scheme are set to “1”. The 16 results are calculated and passed to the memory block. At the same time, the memory block forwards the previously stored values along with the newly arrived ones.

In the case of DCT_4, DCT_8, and DCT_16, the lower 16 values from the first memory block are invalid and never used, so only the valid upper 16 inputs are fed into the lifting scheme via the permutation network. The selection line of the lifting scheme multiplexers is always set to “0” for DCT sizes below 32.

DCT sizes below 32 take one clock cycle to calculate one row or one column, while DCT_32 takes 2 clock cycles. The outputs of the second memory block are passed to the third bit reversal block through a fully parallel permutation network. The outputs of the bit reversal are then divided by the square root of N, where N is the DCT size. The scaling factors are approximated as

$$\frac{1}{\sqrt{N}}\approx\begin{cases}\dfrac{1}{2}, & N=4,\\[2mm] \dfrac{8+4-1}{32}, & N=8,\\[2mm] \dfrac{1}{4}, & N=16,\\[2mm] \dfrac{8+4-1}{64}, & N=32.\end{cases} \tag{49}$$
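The scaling in (49) maps directly onto shifts and two additions. The sketch below reads the first case as N = 4 (since 1/√4 = 1/2 and 4 is the smallest supported size) and uses a truncating right shift; both points, as well as the function name, are assumptions of this illustration.

```python
import math

def scale_by_inv_sqrt_n(x, n_points):
    """Approximate x / sqrt(n_points) with shifts and adds, following (49)."""
    if n_points == 4:
        return x >> 1                               # 1/2
    if n_points == 8:
        return ((x << 3) + (x << 2) - x) >> 5       # (8 + 4 - 1)/32 = 11/32
    if n_points == 16:
        return x >> 2                               # 1/4
    if n_points == 32:
        return ((x << 3) + (x << 2) - x) >> 6       # (8 + 4 - 1)/64 = 11/64
    raise ValueError("unsupported DCT size")
```

The factor 11/32 = 0.34375 approximates 1/√8 ≈ 0.35355, and 11/64 approximates 1/√32 with the same relative error, at the cost of two adders per output.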

The hardware architecture of one square root block is shown in Figure 6. The input is divided by √N for each supported size, and the calculated values are fed into the output multiplexer, which selects the valid result depending on the DCT size. Finally, the outputs of the square root block are quantized from 16 bits to 13 bits using the Q block.

Figure 6

1/√N block.

4.2. Transpose Buffer

The transpose buffer is designed using registers. The buffer is sized to support the maximum DCT size, that is, DCT_32, so its size is N×N×B, where N=32 and B=13 is the width of each data word. A total of 32×32×13 = 13312 bits (13 kbits) of memory is therefore used to implement the transpose buffer. The inputs of the buffer are the clock, the reset, the transpose signal, the row number, the column number, the read enable signal, and the write enable signal. During the direct cycle, all the rows from the input frame memory are transformed by the DCT block, and the results are stored in the corresponding rows of the transpose buffer. When all the rows of a block in the input frame memory have been transformed and written to the transpose buffer, the columns of the buffer are read and transformed by the DCT block, and the results are written back to the buffer in a transposed way, that is, on each column. When all the columns of the transpose buffer have been read, transformed, and written back to the buffer, the next rows of the input frame memory are read, and at the same time the rows of the transpose buffer are written to the output frame memory. In this way, the complete frame is transformed and written to the output memory.

4.3. Input MUXes Block

While the DCT transforms the rows of the block and the results are stored in the buffer, the select signal of the input MUXes block is set to “1”. After N clock cycles, where N is the DCT size, the select signal of the input MUXes is set to “0”, so that the columns from the transpose buffer are fed into the DCT block. The input MUXes block thus switches the inputs of the DCT block between the direct and the transposed cycles.

4.4. Data0 Multiplexer

During the direct cycles, the rows from the input memory are transformed and the results are written to the transpose buffer. When the last row from the input memory is transformed, the first column is read from the transpose buffer in the next clock cycle. At this point, the last element of the first column is not valid, as the last transformed row has not yet been written to the memory. The data0 multiplexer is therefore used for forwarding: the first of the 32 transformed data values is selected from the data0 memory. The select line of the data0 multiplexer is set to “1” for just one clock cycle during the transformation of an N×N block, that is, when the first column is read from the transpose buffer while the last row is being transformed by the DCT block; otherwise, the select signal is always set to “0”.

4.5. Control Unit

The control unit controls the activities of all the blocks in each clock cycle and is responsible for the correct sequence of operations. It is designed using 4 memories, each of which contains the control signals for one DCT size. There are 4 counters in the unit, and each counter produces the addresses for its corresponding memory. In response to the addresses, the memories output the control signals. The outputs of the memories are multiplexed, and the selection line of the multiplexer decides which memory output is used. The hardware architecture of the control unit is shown in Figure 7. MEM_CU_4 is 8×128 bits = 1 kbit, MEM_CU_8 is 16×128 bits = 2 kbits, MEM_CU_16 is 32×128 bits = 4 kbits, and MEM_CU_32 is 128×128 bits = 16 kbits. Each memory contains N control words for the direct cycle and N control words for the transpose cycle, where N is the DCT size, except MEM_CU_32, which contains 2(N+N) control words because each row or column takes two clock cycles to complete in the case of DCT_32. In each clock cycle, the control unit therefore generates a 128-bit-wide control word for all the functional blocks of the complete DCT.
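The control memory sizes follow from the number of control words per block and the 128-bit word width. The short calculation below reproduces them; the constant and function names are illustrative.

```python
CONTROL_WORD_BITS = 128

def control_words(n_points):
    """Control words per N x N block: N direct plus N transpose cycles,
    doubled for DCT_32 because each row or column takes two clock cycles."""
    cycles_per_line = 2 if n_points == 32 else 1
    return cycles_per_line * (n_points + n_points)

def control_memory_bits(n_points):
    """Total size of the control memory for one DCT size, in bits."""
    return control_words(n_points) * CONTROL_WORD_BITS
```

This gives 8, 16, 32, and 128 words for N = 4, 8, 16, and 32, that is, 1024, 2048, 4096, and 16384 bits.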

Figure 7

Control unit.

5. Results

The computation of the N-point DCT by means of the WHT factorization requires the following:

(1) (N/2)·log2(N) 2-input/2-output butterflies for the N-point WHT,

(2) 1+(N/2)·[log2(N)−2] 2-input/2-output lifting-based (3-lifting-step) structures for the Givens rotations.
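The two counts listed above can be evaluated for each supported size with a short calculation; the function names are illustrative.

```python
def butterfly_count(n_points):
    """2-input/2-output butterflies of the N-point WHT: (N/2) * log2(N)."""
    log2_n = n_points.bit_length() - 1
    return (n_points // 2) * log2_n

def lifting_structure_count(n_points):
    """Lifting-based structures for the Givens rotations: 1 + (N/2) * (log2(N) - 2)."""
    log2_n = n_points.bit_length() - 1
    return 1 + (n_points // 2) * (log2_n - 2)
```

For N = 32 this yields 80 butterflies and 49 lifting structures. The second count also agrees with adding up the rotations in the blocks of $T_{32}$: 1 in $U_2$, 4 in $U_4$, 12 in $U_8$, and 32 in $U_{16}$.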

The WHT is implemented with a fully parallel architecture using the maximum number of resources, while the lifting scheme is implemented with a folded architecture. Hence, 80 2-input/2-output butterflies are used to implement the WHT for N=32, which requires 160 adders. Implementing the rotations for N=32, which involve 15 distinct Givens rotation angles, requires 49 2-input/2-output lifting structures; with the folded architecture the data path is reused and the number of lifting structures is reduced to 32.

The factorization of the matrices is applied to the H.265 DCT. The lifting coefficients are approximated under the following condition:

$$\left\lfloor\frac{1}{\sqrt{N}}\cdot B_N\cdot T_N\cdot B_N\cdot W_N\cdot 2^{7}\right\rfloor=Q, \tag{50}$$

where Q is the N-point DCT obtained from the MATLAB function dctmtx, scaled by $2^{7}$. Table 2 shows the approximated values, calculated from the conditions in (48), of all the coefficients, together with the number of bits required to normalize the results.

Table 1

Number of adders for lifting coefficients.

a	Adders	b	Adders
51	2	−98	2
101	3	−569	2
311	3	−200	3
25	2	−50	2
152	2	−297	3
64	0	−121	2
183	3	−325	3
25	2	−50	2
19	2	−38	2
63	1	−124	1
178	3	−345	4
115	3	−219	3
71	2	−132	1
169	3	−305	3
99	3	−172	3

According to [14], the multiplications by $a_{1,3}=51$ and $b_{1,3}=-98$ can be implemented with a minimum number of additions by resorting to the n-dimensional reduced adder graph (RAG-n) technique. The total number of adders required for all the coefficients is shown in Table 1.

The DCT block contains 2×(N/2)×log2(N)=160 adders to implement the Hadamard block. Using Tables 1 and 2, the number of adders used to implement the lifting scheme can be calculated. For each rotation, a first group of adders implements the lifting structures themselves, and a second group implements the multiplications by the coefficients a and b. The π/8 rotation requires 3×8=24 adders for the lifting structures and a further 6×8=48 adders for the steps involving $a_{1,3}$ and $b_{1,3}$. The π/16 rotation requires 3×4=12 adders for the lifting structures and a further 8×4=32 adders for the steps involving $a_{1,4}$ and $b_{1,4}$. The 3π/16 rotation requires 3×4=12 adders for the lifting structures and a further 9×4=36 adders for the steps involving $a_{3,4}$ and $b_{3,4}$. The π/32 rotation requires 3×2=6 adders for the lifting structures and a further 6×2=12 adders for the steps involving $a_{1,5}$ and $b_{1,5}$. The 3π/32 rotation requires 3×2=6 adders for the lifting structures and a further 7×2=14 adders for the steps involving $a_{3,5}$ and $b_{3,5}$.

Table 2

Approximated values of the lifting structure coefficients and the number of bits required for quantization of the results.

Givens rotations	m , n	a	b
π / 8	8	51	−98
π / 16	10	101	−569
3 π / 16	10	311	−200
π / 32	9	25	−50
3 π / 32	10	152	−297
5 π / 32	8	64	−121
7 π / 32	9	183	−325
π / 64	10	25	−50
3 π / 64	8	19	−38
5 π / 64	9	63	−124
7 π / 64	10	178	−345
9 π / 64	9	115	−219
11 π / 64	8	71	−132
13 π / 64	9	169	−305
15 π / 64	8	99	−172

The 5π/32 rotation requires 3×2=6 adders for the lifting structures and a further 2×2=4 adders for the steps involving $a_{5,5}$ and $b_{5,5}$. The 7π/32 rotation requires 3×2=6 adders for the lifting structures and a further 9×2=18 adders for the steps involving $a_{7,5}$ and $b_{7,5}$. The π/64 rotation requires 3 adders for the lifting structure and a further 6 adders for the steps involving $a_{1,6}$ and $b_{1,6}$. The 3π/64 rotation requires 3 adders for the lifting structure and a further 6 adders for the steps involving $a_{3,6}$ and $b_{3,6}$. The 5π/64 rotation requires 3 adders for the lifting structure and a further 3 adders for the steps involving $a_{5,6}$ and $b_{5,6}$. The 7π/64 rotation requires 3 adders for the lifting structure and a further 10 adders for the steps involving $a_{7,6}$ and $b_{7,6}$. The 9π/64 rotation requires 3 adders for the lifting structure and a further 9 adders for the steps involving $a_{9,6}$ and $b_{9,6}$. The 11π/64 rotation requires 3 adders for the lifting structure and a further 5 adders for the steps involving $a_{11,6}$ and $b_{11,6}$. The 13π/64 rotation requires 3 adders for the lifting structure and a further 9 adders for the steps involving $a_{13,6}$ and $b_{13,6}$. The 15π/64 rotation requires 3 adders for the lifting structure and a further 9 adders for the steps involving $a_{15,6}$ and $b_{15,6}$.
Each square root block contains 2 adders, and there are 32 square root blocks, so 64 adders perform the 1/√N scaling in parallel. The total number of adders required for the Hadamard block, the lifting scheme, and the square root blocks is 160+72+44+48+18+20+10+24+9+9+6+13+12+15+12+12+64=548.

Figure 8 shows the experimental setup used to calculate the PSNR. The original frames are transformed using the proposed DCT, and the transformed data are quantized to 13 bits. The quantized coefficients are then passed through the inverse quantization block and the inverse DCT. The PSNR is then calculated between the original frames and the reconstructed frames. The inverse transform is computed using (51):

$$x = \left(C_N^{II}\right)^{-1} \cdot X_t. \tag{51}$$

Figure 8

PSNR calculation and experiment setup.

Table 3 shows the PSNR values for different sequences with different DCT sizes. The PSNR is calculated as shown in (52) and (53):

$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{MSE}\right) = 20 \cdot \log_{10}\left(\frac{MAX_I}{\sqrt{MSE}}\right), \tag{52}$$

where $MAX_I$ is the maximum possible pixel value of the image, and the mean square error (MSE) is defined as

$$MSE = \frac{1}{m \cdot n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[I(i,j) - K(i,j)\right]^2, \tag{53}$$

where $I(i,j)$ and $K(i,j)$ are the input frame and the reconstructed frame, after inverse quantization and IDCT, respectively. Table 3 shows that the proposed DCT preserves quality well: the PSNR of the Y frames lies between 46.9 and 48.9 dB, and that of the U and V frames between 44.1 and 45.9 dB.
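As an illustration of (52) and (53), the following sketch computes the MSE and the PSNR for two small frames using the standard definition 10·log10(MAX²/MSE). The function names, the default peak value of 255 (8-bit samples), and the 2×2 test frames are assumptions made for the example only.

```python
import math

def mse(original, reconstructed):
    """Mean square error between two equally sized 2-D frames, as in (53)."""
    m, n = len(original), len(original[0])
    total = sum((original[i][j] - reconstructed[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)

def psnr(original, reconstructed, max_value=255):
    """PSNR in dB, as in (52); max_value=255 assumes 8-bit samples."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / err)

# Hypothetical 2x2 frames differing by one grey level in two positions.
frame_in = [[100, 100], [100, 100]]
frame_out = [[101, 99], [100, 100]]
print(mse(frame_in, frame_out), psnr(frame_in, frame_out))
```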

Table 3

PSNR (dB) of different sequences for different DCT sizes.

Sequences	DCT size
	N = 4			N = 8			N = 16			N = 32
	Y	U	V	Y	U	V	Y	U	V	Y	U	V
BQ terrace_1920×1080_60	47.1	44.4	44.1	48.8	45.9	44.1	48.9	45.4	45.3	48.8	45.7	44.1
BQ square_416×240_60	47.4	45.1	45.2	48.1	45.2	45.2	48.1	45.7	44.1	48.2	45.2	44.7
BQ mall_832×480_60	46.9	44.1	44.5	47.9	44.9	45.5	48.7	44.6	44.9	48.4	45.7	45.3
Basketball drive_1920×1080_50	47.8	44.5	44.1	48.2	44.8	45.7	48.5	45.1	45.7	48.3	45.1	45.3
Basketball drill_832×480_50	47.0	45.4	45.6	48.6	45.0	44.8	48.2	45.9	45.0	48.7	45.4	45.5

Tables 4, 5, and 6 show the number of multiplications, additions, and shifts required to calculate the different DCT sizes. The proposed architecture uses no multipliers: all multiplications are implemented with shifts and additions. The 32-point DCT computed with the proposed architecture requires fewer additions than the original DCT implementation (548 versus 992, Table 5), and the 16-point DCT requires fewer additions than the design in [5] (232 versus 242, Table 6), although it uses more shifts (132 versus 58).

Table 4

Proposed architecture.

N	A	S
4	17	5
8	74	39
16	232	132
32	548	249

N: DCT size; M: number of multiplications; A: number of additions; S: number of shifts.

Table 5

Comparison for N=32.

	M	A	S
Original	1024	992	0
Proposed	0	548	249

Table 6

Comparison for N=16.

	M	A	S
[5]	0	242	58
Proposed	0	232	132

N: DCT size; M: number of multiplications; A: number of additions; S: number of shifts.
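The "Original" row of Table 5 matches the cost of evaluating an N-point transform as a plain matrix-vector product, which needs N² multiplications and N(N−1) additions. The sketch below reproduces those counts and the relative reduction in additions; the helper names are illustrative, and reading "Original" as the direct matrix-vector form is an assumption.

```python
def direct_dct_ops(n):
    """(multiplications, additions) of an N-point matrix-vector product."""
    return n * n, n * (n - 1)

def reduction_percent(reference, proposed):
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference

# Additions taken from Tables 5 and 6.
mults_32, adds_32 = direct_dct_ops(32)
saving_vs_original = reduction_percent(adds_32, 548)  # N = 32, Table 5
saving_vs_ref5 = reduction_percent(242, 232)          # N = 16, Table 6

print(mults_32, adds_32)
print(round(saving_vs_original, 1), round(saving_vs_ref5, 1))
```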

The design is described in VHDL and synthesized with Synopsys Design Vision on a 90 nm standard cell library at a clock frequency of 150 MHz. Table 7 shows the synthesis results.

Table 7

Synthesis results.

Parameter	Value
Technology	90 nm
Frequency	150 MHz
Area	0.42 mm²
Power	884.1 μW
Memory	36 kbits

The time required by the proposed architecture to completely process an N×N macroblock is

$$T_{MB} = \frac{2}{f_{clk}} \cdot (\delta \cdot N + 4), \tag{54}$$

where δ=2 if N=32 and δ=1 otherwise. Thus, the total time to process one W×H pixel frame is

$$T = \frac{W \cdot H}{N^2} \cdot T_{MB} \cdot K = \frac{W \cdot H}{N^2} \cdot K \cdot \frac{2\delta \cdot N + 8}{f_{clk}}, \tag{55}$$

where K accounts for the chroma subsampling, for example, K=3 for 4:4:4, K=2 for 4:2:2, and K=1.5 for 4:2:0. From (55), taking W=1920, H=1080, K=1.5, and $f_{clk}$=150 MHz, we obtain T=2.8 ms and T=20.7 ms for N=32 and N=4, respectively. As a consequence, in the worst case (N=4) the proposed architecture sustains up to 48 frames per second.
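The figures above follow directly from (54) and (55). The sketch below evaluates both equations for a 1920×1080 frame with 4:2:0 subsampling at 150 MHz; the function and variable names are illustrative.

```python
def macroblock_time(n, f_clk):
    """Time to process one N x N block, equation (54)."""
    delta = 2 if n == 32 else 1
    return 2.0 * (delta * n + 4) / f_clk

def frame_time(width, height, n, k, f_clk):
    """Time to process one W x H frame, equation (55)."""
    blocks = width * height / (n * n)
    return blocks * k * macroblock_time(n, f_clk)

F_CLK = 150e6   # clock frequency in Hz
K_420 = 1.5     # 4:2:0 chroma subsampling

t_32 = frame_time(1920, 1080, 32, K_420, F_CLK)
t_4 = frame_time(1920, 1080, 4, K_420, F_CLK)
frames_per_second_worst_case = 1.0 / t_4

print(t_32 * 1e3, t_4 * 1e3, frames_per_second_worst_case)
```

The worst case is N = 4 because the fixed overhead of 4 in (54) is paid once per block, and small blocks mean many more blocks per frame.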

6. Conclusion

In this work, a dynamic N-point DCT for HEVC is proposed. A partially folded architecture is adopted to maintain speed and to save area. The DCT supports 4, 8, 16, and 32 points. The simulation results show a PSNR between 46.9 and 48.9 dB for the Y frames. Multiplications are removed from the architecture by introducing the lifting scheme and approximating the coefficients.

References

[1] Wiegand, Sullivan G. J., Bjøntegaard, and Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003. doi:10.1109/TCSVT.2003.815165

[2] Ugur, Andersson, Fuldseth, Bjøntegaard, Endresen L. P., Lainema, Hallapuro, Ridge, Rusanovskyy, Zhang, Norkin, Priddle, Rusert, Samuelsson, and Sjöberg, "High performance, low complexity video coding and the emerging HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 12, pp. 1688-1697, 2010. doi:10.1109/TCSVT.2010.2092613

[3] Song, Seo, and Kim, "A unified transform unit for H.264," in Proceedings of the International SoC Design Conference (ISOCC '08), pp. 130-133, November 2008. doi:10.1109/SOCDC.2008.4815701

[4] Saponara, Martina, Casula, Fanucci, and Masera, "Motion estimation and CABAC VLSI co-processors for real-time high-quality H.264/AVC video coding," Microprocessors and Microsystems, vol. 34, no. 7-8, pp. 316-328, 2010. doi:10.1016/j.micpro.2010.06.003

[5] Haggag M. N., El-Sharkawy, and Fahmy, "Efficient fast multiplication-free integer transformation for the 2-D DCT H.265 standard," in Proceedings of the 17th IEEE International Conference on Image Processing (ICIP '10), pp. 3769-3772, September 2010. doi:10.1109/ICIP.2010.5653484

[6] Huang Y. M., J. L., and Hsu C. T., "A refined fast 2-D discrete cosine transform algorithm with regular butterfly structure," IEEE Transactions on Consumer Electronics, vol. 44, no. 2, pp. 376-383, 1998.

[7] Hein and Ahmed, "On a real-time Walsh-Hadamard cosine transform image processor," IEEE Transactions on Electromagnetic Compatibility, vol. EMC-20, pp. 453-457, 1978.

[8] Liang and Tran T. D., "Fast multiplierless approximations of the DCT with the lifting scheme," IEEE Transactions on Signal Processing, vol. 49, no. 12, pp. 3032-3044, 2001. doi:10.1109/78.969511

[9] Chen Y. J., Oraintara, and Nguyen, "Video compression using integer DCT," in International Conference on Image Processing (ICIP'00), pp. 844-845, September 2000.

[10] Daubechies and Sweldens, "Factoring wavelet transforms into lifting steps," Journal of Fourier Analysis and Applications, vol. 4, no. 3, pp. 247-268, 1998.

[11] Martina, Masera, Piccinini, and Zamboni, "A VLSI architecture for IWT (Integer Wavelet Transform)," in Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, pp. 1174-1177, August 2000.

[12] Martina, Masera, Piccinini, and Zamboni, "Novel JPEG 2000 compliant DWT and IWT VLSI implementations," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 35, no. 2, pp. 137-153, 2003. doi:10.1023/A:1023696430633

[13] Martina and Masera, "Folded multiplierless lifting-based wavelet pipeline," IET Electronics Letters, vol. 43, no. 5, pp. 27-28, 2007. doi:10.1049/el:20073181

[14] Dempster A. G. and Macleod M. D., "Use of minimum-adder multiplier blocks in FIR digital filters," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 42, no. 9, pp. 569-577, 1995. doi:10.1109/82.466647

[15] Martina and Masera, "Low-complexity, efficient 9/7 wavelet filters VLSI implementation," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 11, pp. 1289-1293, 2006. doi:10.1109/TCSII.2006.883092

[16] Martina and Masera, "Multiplierless, folded 9/7–5/3 wavelet VLSI architecture," IEEE Transactions on Circuits and Systems II, vol. 54, no. 9, pp. 770-774, 2007. doi:10.1109/TCSII.2007.900354

[17] Joshi, Reznik Y. A., and Karczewicz, "Efficient large size transforms for high-performance video coding," in 33rd Proceedings of the Applications of Digital Image Processing, Proceedings of the SPIE, San Diego, Calif, USA, August 2010. doi:10.1117/12.862250

[18] Martina, Masera, and Piccinini, "Scalable low-complexity B-spline discrete wavelet transform architecture," IET Circuits, Devices and Systems, vol. 4, no. 2, pp. 159-167, 2010. doi:10.1049/iet-cds.2009.0185

[19] Britanak, Yip P. C., and Rao K. R., Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations, Elsevier, 2007.

[20] Jones H. W., Hein D. N., and Knauer S. C., "The Ratio CO/CO₂ of oxidation on a burning carbon surface," pp. 87-98, November 1978.

[21] Ahmed and Rao K. R., Orthogonal Transforms for Digital Signal Processing, Springer, 1975.

[22] Manz J. W., "A sequency-ordered fast Walsh transform," IEEE Transactions on Audio Electroacoustics, vol. AU-20, pp. 204-205, 1972.

[23] Claire A. T. S. and Sabido-David, "Unified matrix treatment of the fast Walsh-Hadamard transform," IEEE Transactions on Computers, vol. 25, no. 11, pp. 1142-1146, 1976.