Low Cost Design of a Hybrid Architecture of Integer Inverse DCT for H.264, VC-1, AVS, and HEVC
Muhammad Martuza

The paper presents a unified hybrid architecture to compute the 8 × 8 integer inverse discrete cosine transform (IDCT) of multiple modern video codecs: AVS, H.264/AVC, VC-1, and HEVC (under development). Based on the symmetric structure of the matrices and the similarity in matrix operations, we develop a generalized "decompose and share" algorithm to compute the 8 × 8 IDCT. The algorithm is then applied to the four video standards. The hardware-sharing approach ensures maximum circuit reuse during the computation. The architecture is designed with only adders and shifters to reduce the hardware cost significantly. The design is implemented on an FPGA and later synthesized in 0.18 µm CMOS technology. The results meet the requirements of advanced video coding applications.


Introduction
In recent years, different video applications have adopted different video standards, such as H.264/AVC [1], VC-1 [2], and AVS [3]. To improve coding efficiency further, the Joint Collaborative Team on Video Coding (JCT-VC) is drafting a next-generation video coding standard, known tentatively as High Efficiency Video Coding (HEVC or H.265) [4]. The target bit rate is half of that of H.264/AVC. In addition, several other techniques are proposed in the draft to reduce the complexity of the encoder, such as improved intrapicture coding and simpler VLC coefficients [5]. As a result of these new features, experts predict that HEVC will dominate the future multimedia market.
To meet the present and future demands of different multimedia applications, it is becoming necessary to develop a unified video decoder that supports all popular video standards on a single platform. In recent years, there has been growing interest in developing multistandard inverse transform architectures for advanced multimedia applications. However, most of them do not support AVS, the video codec developed by the Chinese government that became the core technology of China Mobile Multimedia Broadcasting (CMMB) [6]. None of the existing works supports HEVC; though it is not finalized yet, considering the future prospects of HEVC [7], it is important to start exploring possible hardware implementations of the transform unit discussed in the draft.
In this paper, we present a new generalized algorithm and its hardware implementation for an 8 × 8 IDCT architecture. The scheme is based on matrix decomposition with sparse matrices and offset computations. These sparse matrices are derived so that they can be reused as many times as possible when computing the different inverse matrices. All multipliers in the design are replaced by adders and shifters. In the scheme, we first split the 8 × 8 transformation matrix into two smaller 4 × 4 matrices by applying permutation techniques. We then operate on these two matrices concurrently to compute the output. This enables parallel operation and yields high throughput, which eventually helps meet the coding requirements of high-resolution video.
The proposed generalized algorithm is first applied to compute the 8 × 8 integer IDCT of AVS. We then identify the submatrices of AVS and reuse them to compute the IDCT of VC-1. We follow the same principle to compute the other two IDCTs, those of H.264 and HEVC. For HEVC, we have used the draft matrix discussed in the recent meeting [7]; since it is not yet finalized, we have developed the generalized architecture in such a way that it can be easily adjusted to accommodate any changes in the final HEVC format.

Previous Works
In recent years, some multistandard inverse transform architectures have been proposed for video applications. Lee's work in [8] presents an 8 × 8 multistandard IDCT architecture based on delta coefficient matrices, which can support VC-1, MPEG4, and H.264. It can process up to 21.9 fps for full HD video. Kim's work in [9] describes a design that follows an approach similar to [8] to unify the IDCT and inverse quantization (IQ) operations for those three codecs. However, the design cannot support the full HD video format. Qi's work in [10] shows an efficient integrated architecture designed for multistandard inverse transforms of MPEG-2/4, H.264, and VC-1, using factor share (FS) and adder share (AS) strategies to save circuit resources. The work achieves a 100 MHz working frequency for full HD video resolution, but does not support AVS. In another interesting design [11], the authors devise a common architecture that shares adders and multipliers to perform the transform and quantization of H.264, MPEG-4, and VC-1. The common shortcoming of the designs in [8-11] is that none of them supports either the Chinese standard, AVS, or HEVC.
In our previous work [12], we developed a resource-shared design using delta coefficient matrices, which can compute the 8 × 8 IDCT of VC-1, JPEG, MPEG4, H.264/AVC, and AVS. However, due to complex data scheduling and the integration of JPEG (which is an image codec), its decoding capability is limited. The design supports both HD formats, but fails to comply with super resolution (WQXGA). Liu [13] introduces another design to support multiple standards, but its throughput is low (110.8 MHz) and it cannot decode HD and WQXGA video. Fan's works in [14, 15] are based on another efficient matrix decomposition algorithm to compute multiple transforms; however, they are limited to H.264 and VC-1 only. There are similar works in [16-18], which are also limited to these two codecs (H.264 and VC-1).
In this paper, we present a generalized low-cost algorithm and its single-chip implementation to compute the IDCT of all four modern video standards (AVS, H.264, VC-1, and HEVC). The design meets the requirements of high-performance video coding, as it can process HD video at 145 fps, full HD video at 64 fps, and WQXGA video at 32 fps. The proposed scheme can be applied to both forward and inverse transformation; however, here we only show the implementation for the inverse process (targeted at decoders).

Proposed Generalized Algorithm for 8 × 8 IDCT
In a video compression system, the transform coding usually employs an 8-point type-II DCT. Since the forward DCT uses the same basis coefficients and its matrix is the transpose of the IDCT matrix, the proposed IDCT scheme is easily applicable to it without any added cost or complexity. The 8-point 1D forward and inverse DCT coefficient matrices are expressed in general form as F and I, respectively, in (1), where a, b, c, ..., g denote seven different transform coefficients. In this paper, we denote the 8 × 8 IDCT transform matrices for AVS, VC-1, H.264/AVC, and HEVC by the letters A, V, H, and HV, respectively. The seven coefficients (a, b, c, ..., g) differ from one transform to another, but are all integers (as shown in Table 1).
In the following sections, we show how (6) can be applied to the different IDCT matrices. Another new feature of the proposed scheme is that we take advantage of the similarity in matrix operations to further optimize the implementation. First, we apply (6) to efficiently implement the transformation matrix of AVS. Based on it and the generalized structure, we develop the matrix of VC-1 so that we can share as many units (from AVS) as possible. Next, we develop the IDCT matrix of H.264 on the same principle (decompose and share from AVS and VC-1). At this stage, we achieve the maximum sharing: as shown later (in Section 3.4), the implementation of H.264 does not cost any extra hardware. Finally, we develop the IDCT of HEVC by further decomposing and reusing the units already implemented (with a minimal addition of extra units).
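The permutation-based split into two 4 × 4 matrices outlined in the Introduction relies on the symmetry of F: its even rows are symmetric and its odd rows are antisymmetric. The Python sketch below illustrates this under stated assumptions. The placement of a, ..., g inside the matrix and the coefficient values are taken from the commonly published 8 × 8 integer transforms of the four standards (for HEVC, the final-standard values, which may differ from the draft matrix used here), not from (1) or Table 1. The function names are hypothetical, and the butterfly shown is one common even/odd factorization, not necessarily the exact form of (6).

```python
# Commonly published 8x8 integer transform coefficients. Their mapping to the
# letters a..g is an assumption of this sketch, not taken from Table 1.
COEFFS = {
    "AVS":   (8, 10, 9, 6, 2, 10, 4),
    "VC-1":  (12, 16, 15, 9, 4, 16, 6),
    "H.264": (8, 12, 10, 6, 3, 8, 4),
    "HEVC":  (64, 89, 75, 50, 18, 83, 36),  # final-standard values
}


def forward_matrix(a, b, c, d, e, f, g):
    """General 8x8 forward matrix F; the inverse matrix I is its transpose."""
    return [
        [a,  a,  a,  a,  a,  a,  a,  a],
        [b,  c,  d,  e, -e, -d, -c, -b],
        [f,  g, -g, -f, -f, -g,  g,  f],
        [c, -e, -b, -d,  d,  b,  e, -c],
        [a, -a, -a,  a,  a, -a, -a,  a],
        [d, -b,  e,  c, -c, -e,  b, -d],
        [g, -f,  f, -g, -g,  f, -f,  g],
        [e, -d,  c, -b,  b, -c,  d, -e],
    ]


def idct_direct(F, X):
    """x = I . X with I = F^T, computed as a full 8x8 product."""
    return [sum(F[k][n] * X[k] for k in range(8)) for n in range(8)]


def idct_split(F, X):
    """Same result from two independent 4x4 products and a butterfly."""
    even, odd = X[0::2], X[1::2]          # permutation: even / odd inputs
    u = [sum(F[2 * j][n] * even[j] for j in range(4)) for n in range(4)]
    v = [sum(F[2 * j + 1][n] * odd[j] for j in range(4)) for n in range(4)]
    out = [0] * 8
    for n in range(4):
        out[n] = u[n] + v[n]              # first half of the output
        out[7 - n] = u[n] - v[n]          # mirrored second half
    return out


X = [1, -2, 3, 4, -5, 6, 7, -8]           # arbitrary test column
for name, coeffs in COEFFS.items():
    F = forward_matrix(*coeffs)
    assert idct_direct(F, X) == idct_split(F, X), name
```

The two 4 × 4 products (u and v) do not depend on each other, which is what allows them to be evaluated concurrently in hardware.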

Matrix Decomposition for AVS.
Let us now construct A (from (1) and Table 1) and apply (6) to compute the 4 × 4 submatrices A_00 and A_11. We then right shift A_00 by three bits (that is, form A_00/8) and decompose it as in (8), in terms of the matrices A_1 and A_2. Like P_0, A_1 costs only 4 additions. For A_2, we implement (5/4) · x as (1 + 1/4) · x; that is, we right shift x (arbitrary data) by two bits and add the result to x. The cost of A_2 is therefore 6 additions and 6 shift operations, and the total computational cost of (8) is 10 additions and 6 shift operations. In a similar way, we decompose A_11 as in (10), in terms of the matrices A_3 and A_4. For both A_3 and A_4, the coefficient (3/2) can be shared, and the cost is 12 additions and 4 shift operations for A_3 and 8 additions and 4 shift operations for A_4. From (8)-(10), the final expression of the 8 × 8 IDCT for AVS is given in (12). Thus, the total computational cost to implement A is 38 additions and 26 shift operations. In the next section, we apply (6) to VC-1 and decompose the matrix so that we can reuse the units already developed for AVS (from (12)).
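The replacement of multipliers by adders and shifters used for the AVS factors can be sketched in a few lines of Python. This is a minimal illustration, assuming plain integer data; the function names are hypothetical. Note that an integer right shift discards the low-order bits, so the results are exact only when the input is a multiple of the divisor.

```python
def times_5_4(x):
    """(5/4) * x computed as x + (x >> 2): one shift and one addition."""
    return x + (x >> 2)


def times_3_2(x):
    """(3/2) * x computed as x + (x >> 1): one shift and one addition."""
    return x + (x >> 1)


# Exact for inputs that are multiples of 4 (respectively 2).
assert times_5_4(8) == 10
assert times_5_4(-8) == -10
assert times_3_2(6) == 9

# For other inputs the shifted term is truncated (floor), e.g. 7 >> 2 == 1.
assert times_5_4(7) == 8   # the exact product would be 8.75
```

Each factor costs one addition and one shift per data item, which is how the per-matrix addition and shift counts quoted in this section arise.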

Matrix Decomposition for VC-1.
We follow the same principles, as discussed for (8) and (10), to decompose the 8 × 8 IDCT matrix for VC-1 as in (13). Considering the symmetric property and the similar coefficient distribution patterns of A_00/8 (in (8)) and V_00/8, we decompose V_00/8 as in (15); using (16), (15) can be re-expressed as (17). It can now be seen how the implementation of the AVS matrix (from (12)) can be reused in (17). This matrix decomposition enables hardware sharing and results in significant savings in implementation resources. From (17), the total cost of V_3 and V_00/8 is 8 additions and 6 shift operations.
Next, based on the computational similarities between A_11/4 (in (10)) and V_11/8, we devise the decomposition of V_11/8 given in (18). Substituting (19) into (18), V_11/8 is expressed as in (20). Note that A_4v in (19) is structurally similar to A_4 in (10) except for a change in the diagonal coefficients. So we only need to implement this change; the rest is shared from the architecture of A_4. We do so by adding 4 multiplexers at the outputs of the four left-diagonal elements of the A_4 matrix. Then, according to (19), we reuse A_3 to compute V_4. As the new matrix A_3v can be derived from A_3 by rearranging the rows and changing the polarity of some input bits, we share it from the design of A_3 by adding only 4 multiplexers. Finally, the expressions for V_00 and V_11 from (17) and (20) are substituted into (13) to get the final expression of the IDCT for VC-1, given in (21). It is seen from (21) that the only new unit required to implement V is V_3; the rest is shared from the implementation of AVS (from (12)). So, the total computational cost for VC-1 is 12 additions and 10 shift operations.

Matrix Decomposition for H.264/AVC.
Following a procedure similar to that illustrated in the two previous sections, we simplify the 8 × 8 transformation matrix for H.264/AVC into the submatrices H_00 and H_11. To ensure maximum unit sharing, we decompose H_00/8 as in (24). In (24), A_1 is directly reused from (12). To share A_2h from the architecture of A_2, we simply add two multiplexer units.
So there is no additional cost in terms of adders and shifters to compute H_00/8. Similarly, we decompose H_11/8 in terms of A_4v and A_3h. Here A_4v is directly reused from (21), and we share A_3h from the architecture of A_3. This sharing does not even need any additional multiplexers, because they were already added when sharing A_3v from A_3 in Section 3.3. The final expression of the 8 × 8 IDCT for H.264 (with all shared units) is summarized in (28). It is interesting to note that all terms in (28) are implemented from the terms of (12) and (21); thus, in the proposed scheme, there is no additional cost to implement the IDCT for H.264, which results in significant hardware savings.

Matrix Decomposition for HEVC.
In this section, we develop the transformation matrix for HEVC based on the principles described before. The 8 × 8 matrix is decomposed as in (29), with HV_00/4 given in (30). The computational cost of HV_1 is 4 additions and 4 shift operations. Here the product (3/4) · x is computed as (x − x/4). So the cost of HV_00 in (30) is 4 additions and 8 shift operations. Similarly, we decompose HV_11/4 using the matrices defined in (32). Combining (29)-(32), we obtain the proposed 8 × 8 IDCT for HEVC given in (34). In (34), only the new matrices, HV_1 and HV_2, need to be implemented; the rest is shared from (12). So the total computational cost to implement HV in the proposed design is 24 additions and 28 shift operations. It is important to note that we have carefully decomposed HV so that, if there is any change in the final standard, one only needs to update (30) and (32) with the new parameters without disturbing the rest of the design. In summary, the proposed unified design costs 74 additions and 64 shift operations to perform the inverse transformation of the four video standards.
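The overall figure can be cross-checked by adding up the per-standard costs quoted in Sections 3.2 to 3.5. The short Python tally below does this; the dictionary layout and the function name are hypothetical, while the numbers are the ones stated in the text (H.264 is entered as zero because it reuses existing units).

```python
# (additions, shifts) required for each standard, as quoted in the text.
NEW_COST = {
    "AVS":   (38, 26),
    "VC-1":  (12, 10),
    "H.264": (0, 0),    # fully shared from the AVS and VC-1 units
    "HEVC":  (24, 28),
}


def total_cost(costs):
    """Sum additions and shifts over all standards."""
    adds = sum(a for a, _ in costs.values())
    shifts = sum(s for _, s in costs.values())
    return adds, shifts


print(total_cost(NEW_COST))  # (74, 64)
```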

Hardware Implementation of the Shared Architecture
In implementing the multistandard architecture on a single platform, we share the entire hardware unit of each 4 × 4 matrix, instead of sharing individual adders, shifters, or other factors (as done in [10]). This ensures maximum reduction of hardware cost in our design. The overall block diagram of the proposed scheme is shown in Figure 1. We can see from Figure 1 that the P_0 block splits the 8-point decomposition process into two independent 4-point processes; since these two processes work concurrently, the design throughput is greatly increased. The blocks B_1, B_2, and B_3 perform different (shared) operations, as shown in Table 2. Figure 2(a) shows the design of the serial-to-parallel converter (S2P) block. It performs a left shift and then stores the inputs one by one into eight registers over 8 clock cycles; at the 9th cycle, all stored input samples are sent to the next block, P_0. The S2P block thus acts as a temporary memory buffer, as it stores the rows of the input matrix in its eight registers. As a result, the proposed design does not require an additional memory architecture. The wrapper architecture (P_C) is shown in Figure 2(b). In this multicodec system, only one IDCT and its associated computational units are activated at a time by the control unit and the select pin (Sel); the rest are disabled. The other blocks are shown in Figure 3. At different stages of the design, several multiplexers are used to ensure proper computation of the IDCT in operation. Finally, the P_C block combines two different sets of data and generates one output. In Figure 3, In_0, In_1, ..., In_3 represent the inputs coming from the previous block and Out_0, Out_1, ..., Out_3 represent the outputs going to the next block. As an example, in Figure 3(c), for the shared design of V_3/HV_1, the inputs come from the A_2/A_2h subblock and the outputs go to the P_C block. The state diagram of the control unit is shown in Figure 4. Here, "r" is the reset and "c" is a 3-bit internal counter run by the system clock. There are one reset state and four active states. The states of the control signals are also shown in the diagram; for example, in state 1 (S1), S2P is storing the input vector while the output wrapper (P_C block) enables R_00 from MUX1 and R_10 from MUX2. Table 2 shows the units that are active depending on the status of the select pin. For example, the select signal is "00" when the user wants to perform the IDCT of the AVS codec. In that case, B_1, B_2, and B_3 function as 2 · A_2, A_3, and A_4, respectively (the rest are inactive, as found in (12)).
Table 3 (recoverable entries, adders/shifters): design in [13]: 76/-; JPEG + MPEG-2/4 + H.264 + VC-1 + AVS in [12]: 58/31; proposed (H.264 + VC-1 + AVS + HEVC): 74/64.
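The buffering behaviour of the S2P block described in this section (eight store cycles, then a parallel hand-over on the ninth) can be modelled behaviourally as follows. This is only a sketch under assumptions: the class and method names are hypothetical, the left-shift amount is not given in the text and is exposed as a parameter, and the model allows a new sample to be accepted on the hand-over cycle, which the text does not specify.

```python
class SerialToParallel:
    """Behavioural model: left-shift each sample, collect eight, then emit."""

    def __init__(self, shift=0, width=8):
        self.shift = shift    # hypothetical left-shift amount (not in the text)
        self.width = width    # number of registers
        self.regs = []

    def clock(self, sample=None):
        """Advance one cycle; return the stored vector when full, else None."""
        out = None
        if len(self.regs) == self.width:   # 9th cycle: hand over all samples
            out, self.regs = self.regs, []
        if sample is not None:             # store the (shifted) serial input
            self.regs.append(sample << self.shift)
        return out


s2p = SerialToParallel(shift=1)
outputs = [s2p.clock(v) for v in range(1, 9)]   # cycles 1..8: store only
assert all(o is None for o in outputs)
print(s2p.clock())   # cycle 9: [2, 4, 6, 8, 10, 12, 14, 16]
```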

Performance Analysis and Comparisons
The proposed design is implemented in Verilog and its operation is verified on a Xilinx Virtex-4 LX60 FPGA. The total number of LUTs needed for the proposed architecture is 2,242. The design is later synthesized using 0.18 µm CMOS technology. The architecture costs 39.3 K gates and 12.15 K standard cells, with a maximum operating frequency of 200.8 MHz. The estimated power consumption is 29.9 mW with a 3 V supply.
In order to demonstrate the sharing efficiency, we compare the adder count of our design with that of the 8-point standalone IDCT implementations of three standards: AVS, VC-1, and H.264/AVC (as presented in [12]). The results are shown in Figure 5. As of today, there is no implementation of the 8 × 8 IDCT of HEVC; thus, we have implemented it separately for the sake of a better comparison. We can see from Figure 5 that a total of 104 adders is required to implement these four transforms without sharing. The proposed shared design computes all of them with about 29% fewer adders (74 instead of 104). Moreover, the savings achieved for the individual standards due to sharing are also marked on the figure.
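The saving follows directly from the two adder counts. The small Python check below uses the 104 standalone adders quoted above and the 74 adders of the shared design; the function name is hypothetical.

```python
def saving_percent(standalone, shared):
    """Relative reduction in adder count, in percent."""
    return 100.0 * (standalone - shared) / standalone


print(round(saving_percent(104, 74), 2))  # 28.85, i.e. just under 29%
```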
It is important to note that, although the proposed design costs 38 adders to implement AVS, it does not cost any additional adder units to implement H.264. Hence, AVS and H.264 combined cost only 38 adders (compared to 48 for standalone implementations). The cost of implementing shift operations is considered insignificant in this computation. In Table 3, we compare the cost of the proposed scheme with existing designs in the literature. None of the other designs in this table supports HEVC (which is computationally expensive due to its large matrix parameters, as shown in Table 1). Although the designs in [10, 12] cost fewer adders, it is shown later that the proposed scheme outperforms them in decoding capacity. Considering that the proposed architecture computes the IDCT for four video codecs, it consumes the least number of adders compared to the others.
Note for the tables: "-" indicates no information; "x" indicates not supported by the hardware.
In Table 4, we summarize the performance in terms of gate count, maximum working frequency, and supported standards, in comparison with other designs. Only the design in [19] has a frequency close to ours, but it supports only H.264. Similarly, the designs in [11, 14] support only two codecs and accordingly cost less hardware than ours. Among the other designs, those in [8-10, 12, 13] are comparable to ours, as they support at least three codecs. While working at maximum capacity, the proposed design can process 200.8 million pixels/sec.
To better assess the comparable designs (those supporting at least three codecs), in Table 5 we compare the decoding capability (using 4:2:0 luma-chroma sampling) of the proposed approach with that of [8-10, 12]. In our work, the maximum achievable frame rate for 1080p video is 200.8 × 10^6 / (1920 × 1080 + 2 × 960 × 540) = 64.56 ≈ 64 fps, which is the highest among all the designs in Table 5. Considering the current trend toward super-resolution monitors, in this table we also compare the decoding capabilities for the Wide Quad eXtended Graphics Array (WQXGA, with a resolution of 2560 × 1600 pixels). Thus, it can be seen that the proposed design not only decodes AVS, H.264/AVC, VC-1, and HEVC videos, but also maintains a relatively high operating frequency to meet the requirements of real-time transmission (the target frame rates for HD, full HD, and WQXGA video are 120, 60, and 30 fps, respectively). From the performance analysis, the scheme is found to be competitive, as it can transmit the highest number of frames per second and, hence, takes the least time to transmit one frame at a given resolution.
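The frame-rate formula used above generalizes to the other resolutions. The Python sketch below assumes one sample processed per clock at 200.8 MHz and 4:2:0 sampling (one full-size luma plane plus two quarter-size chroma planes); the function name is hypothetical, and taking "HD" to mean 1280 × 720 is an assumption.

```python
def max_fps(sample_rate, width, height):
    """Frames per second for 4:2:0 video at a given sample throughput."""
    samples = width * height + 2 * (width // 2) * (height // 2)
    return sample_rate / samples


RATE = 200.8e6  # samples per second at the maximum operating frequency
RESOLUTIONS = {
    "HD": (1280, 720),        # assumed meaning of "HD"
    "Full HD": (1920, 1080),
    "WQXGA": (2560, 1600),
}
for name, (w, h) in RESOLUTIONS.items():
    print(name, int(max_fps(RATE, w, h)))
# HD 145
# Full HD 64
# WQXGA 32
```

All three values exceed the respective real-time targets of 120, 60, and 30 fps mentioned above.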

Conclusion
In this paper, we present a generalized algorithm and a hardware-shared architecture that use the symmetric property of the integer matrices and matrix decomposition to compute the 8-point 1-D IDCT for four modern video codecs: H.264/AVC, VC-1, AVS, and HEVC (in draft stage). The architecture is designed in such a way that it can accommodate any change in the final release of HEVC. We first apply the generalized scheme to the AVS transform unit, and then gradually build the remaining transform units on top of one another to maximize the sharing. The performance analysis shows that the proposed design satisfies the requirements of all four codecs and achieves the highest decoding capability. Overall, the architecture is suitable for low-cost implementation in modern multicodec systems.

Figure 1: Block diagram of the proposed architecture.

Figure 5: Cost of the proposed scheme: standalone versus cost-shared.

Table 3: Comparison of the cost of adders and shifters.

Table 5: Comparison of decoding capability (minimum support of three codecs).