Multitransform techniques are widely used in modern video coding and achieve better compression efficiency than the conventional single-transform technique. However, every transform needs a corresponding hardware implementation, which results in a high hardware cost for multiple transforms. A novel method that includes five-step operation sharing synthesis and architecture-unification techniques is proposed to systematically share the hardware and reduce the cost of multitransform coding. In order to demonstrate the effectiveness of the method, it is used to design a unified architecture for all six transforms involved in the H.264 video codec: 2D 4 × 4 forward and inverse integer transforms, 2D 4 × 4 and 2 × 2 Hadamard transforms, and 1D 8 × 8 forward and inverse integer transforms. Firstly, the six H.264 transform architectures are designed at a low cost using the proposed five-step operation sharing synthesis technique. Secondly, the proposed architecture-unification technique further unifies these six transform architectures into a low cost hardware-unified architecture. The unified architecture requires only 28 adders, 16 subtractors, 40 shifters, and a proposed mux-based routing network, and the gate count is only 16308. The unified architecture processes 8 pixels per clock cycle at up to 275 MHz, which is equivalent to 707 Full HD 1080p frames per second.
Video coding standards commonly use transform coding techniques. Discrete cosine transforms (DCTs) are widely used in image and video compression standards, such as JPEG [
H.264 requires three transforms for the luma component (8 × 8 integer, 4 × 4 integer, and 4 × 4 Hadamard transforms) and two transforms for the chroma components (4 × 4 integer and 2 × 2 Hadamard transforms). Every transform of H.264 needs a corresponding hardware implementation, which results in a high hardware cost for all of the transforms involved. In recent years, several transform architectures that reduce the hardware cost of the H.264 encoder/decoder have been proposed. In general, for low cost multiple transforms of video codecs, hardware sharing is the most suitable technique [
Although these studies demonstrate combined architectures for multiple transforms, no single architecture has been designed for the whole set of forward and inverse transforms of the H.264 encoder and decoder. Therefore, this study designs a unified architecture for the complete transform functionality of an H.264 codec, while still maintaining the low cost and high speed characteristics. In addition, although these studies reduce the hardware cost by sharing hardware among identical operations, no systematic hardware sharing method has been proposed in the literature. This paper proposes a novel method that includes five-step operation sharing synthesis (FOSS) and architecture-unification techniques to systematically share the hardware and reduce the cost of multitransform coding.
In order to design the low cost unified architecture, a hardware sharing method is proposed, as shown in Figure
The proposed hardware sharing method.
This paper first describes a five-step operation sharing synthesis (FOSS) technique and demonstrates the effectiveness of this technique by building low cost 1D
The basic idea is to systematically synthesize an architecture that shares all of the same operations in a matrix multiplication to reduce the cost of hardware.
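The idea of sharing identical operations can be illustrated with a minimal Python sketch. It assumes the standard H.264 4 × 4 forward core transform matrix and the well-known butterfly factorization of that matrix; it is not the FOSS procedure itself, whose five steps are not reproduced here. The function names `transform_direct` and `transform_shared` are hypothetical.

```python
# Standard H.264 4 x 4 forward core transform matrix (assumed here for illustration).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def transform_direct(x):
    """Plain matrix-vector product: 16 multiplications and 12 additions."""
    return [sum(c * v for c, v in zip(row, x)) for row in CF]

def transform_shared(x):
    """Same result using shared sums/differences and shifts only."""
    x0, x1, x2, x3 = x
    # Stage 1: each sum and difference is computed once and reused twice.
    s0 = x0 + x3
    s1 = x1 + x2
    d0 = x0 - x3
    d1 = x1 - x2
    # Stage 2: multiplication by 2 is replaced by a one-bit left shift.
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]

if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(transform_direct(sample))
    print(transform_shared(sample))
```

In this sketch the shared version needs 8 additions/subtractions and 2 shifts and no multiplier, because every intermediate value that appears in more than one output row is computed only once.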
Taking 1D
Therefore, a 2D transform can be calculated using two steps. The first step is a 1D transform
The matrix multiplication in (
The computational complexity for a 1D
A 1D
The 2D inverse integer transform is taken as a second example to demonstrate the proposed FOSS technique more clearly. The H.264
Secondly, the FOSS technique is used for (
A 2D
The inputs for the FOSS architecture are
Similarly, the FOSS architectures for a 2D
When all six FOSS architectures of the H.264 transforms have been obtained, the proposed architecture-unification procedure described below is used to construct a shared architecture for the FOSS architectures.
The unified architecture must have the largest stage count of all of the FOSS architectures, because every stage in each FOSS architecture must correspond to a stage in the unified architecture (e.g., Figure
Shared stages in the unified architecture: (a) a four-stage FOSS architecture, (b) a three-stage FOSS architecture, and (c) a unified architecture for FOSS architectures (a) and (b), in which the first three stages are shared.
The stage counts for all FOSS architectures are not identical. In order to unify FOSS architectures with different stage counts, every FOSS architecture shares operation nodes from stage one of the unified architecture. For example, Figure
In the unified architecture, the count for the shared nodes in stage
Shared node counts for the unified architecture: (a) a FOSS architecture with 4 nodes in stage 1 and 3 nodes in stage 2, (b) a FOSS architecture with 3 nodes in stage 1 and 4 nodes in stage 2, and (c) a unified architecture with 4 nodes in stage 1 and stage 2.
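The stage-count and node-count decisions can be sketched in Python as follows. This is a minimal illustration that assumes, as the two figure captions suggest, that the unified architecture takes the largest stage count and, for each stage, the largest node count among the FOSS architectures, with all architectures aligned from stage one. The function name `unified_shape` and the node counts in the second usage example are hypothetical.

```python
def unified_shape(architectures):
    """Return per-stage node counts of a unified architecture.

    architectures: list of lists; each inner list holds the node count of
    every stage of one FOSS architecture, starting at stage one.
    """
    # The unified architecture needs as many stages as the deepest input.
    stage_count = max(len(arch) for arch in architectures)
    shape = []
    for stage in range(stage_count):
        # Only architectures that actually have this stage contribute.
        shape.append(max(arch[stage] for arch in architectures if stage < len(arch)))
    return shape

if __name__ == "__main__":
    # Node counts from the caption: (a) 4 and 3 nodes, (b) 3 and 4 nodes.
    print(unified_shape([[4, 3], [3, 4]]))
    # Hypothetical four-stage and three-stage architectures.
    print(unified_shape([[2, 2, 2, 2], [3, 3, 3]]))
```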
If
Figure
Sharing of a node in multiple FOSS architectures: (a) nodes of the same stage of different FOSS architectures, (b) determining which nodes of the same stage to share a node in a unified architecture, and (c) shared nodes of the same stage in a unified architecture.
In order to perform the operations of the corresponding nodes for all FOSS architectures, it is necessary to determine the operation for each shared node in the unified architecture.
If the operations for the nodes of the FOSS architectures that share a node are all additions or all subtractions, the operation for this shared node of the unified architecture is addition or subtraction, respectively. If some operations for the nodes of the FOSS architectures that share a node are additions and some operations are subtractions, the operations for this shared node are both subtraction and addition. As shown in Figure
The operation decision for a node in the unified architecture.
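The operation decision rule stated above can be written as a short Python sketch. The function name `decide_operation` and the string labels `"add"`, `"sub"`, and `"add/sub"` are hypothetical and are used only to illustrate the rule.

```python
def decide_operation(ops):
    """Decide the operation of a shared node.

    ops: the operations ("add" or "sub") of the FOSS nodes that share it.
    """
    kinds = set(ops)
    if not kinds or not kinds <= {"add", "sub"}:
        raise ValueError("ops must be a non-empty collection of 'add'/'sub'")
    if kinds == {"add"}:
        return "add"      # all sharing nodes add: a plain adder suffices
    if kinds == {"sub"}:
        return "sub"      # all sharing nodes subtract: a plain subtractor suffices
    return "add/sub"      # mixed: the shared node must support both operations

if __name__ == "__main__":
    print(decide_operation(["add", "add", "add"]))
    print(decide_operation(["sub", "sub"]))
    print(decide_operation(["add", "sub", "add"]))
```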
In order to determine the bit-width for a shared node of the unified architecture, the bit-widths of the nodes that share it must be determined first. In order to determine the bit-width for a node of a FOSS architecture, both the input and output bit-widths for the FOSS architecture must be checked in the video standard specification, as shown in Table
The input and output bit-widths for each transform of H.264.
Transform  Row transform in (bits)  Row transform out (bits)  Column transform in (bits)  Column transform out (bits)
4 × 4 forward transform  9  12  12  15
4 × 4 inverse transform  16  16  16  16
8 × 8 forward transform  9  12, 13  13  16
8 × 8 inverse transform  16  16  16  16
4 × 4 forward Hadamard  16  16  16  16
4 × 4 inverse Hadamard  16  16  16  16
2 × 2 forward Hadamard  16  16  16  16
2 × 2 inverse Hadamard  16  16  16  16
The bit-width decision for a node in the unified architecture.
To share a node of the unified architecture for the operations of multiple transforms, additional multiplexers are required to route the corresponding input path for each transform mode. If
An 8-bit 6-to-1 multiplexer.
(a) A 1-bit 2-to-1 multiplexer. (b) A 1-bit 6-to-1 multiplexer.
In order to reduce the hardware cost and to alleviate routing congestion in a VLSI physical design for a mux-based routing network for the unified architecture, the number of input paths for a shared node must be minimized. The multiplexer in Figure
A low cost multiplexer design for the mux-based routing network.
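As a rough illustration of why the number of input paths matters, the following Python sketch models the routing of one operand of one shared node. It is not the multiplexer design of the figure; it only shows the general observation that transform modes reading the same upstream signal can share a single multiplexer input. The mode names, signal names, and the functions `mux_inputs` and `route` are all hypothetical.

```python
# Hypothetical routing for ONE operand of ONE shared node:
# transform mode -> name of the upstream signal it reads.
# Mode and signal names are illustrative and not taken from the paper.
ROUTING = {
    "fwd8": "x0",
    "inv8": "x0",
    "fwd4": "n3",
    "inv4": "n3",
    "had4": "x1",
    "had2": "x0",
}

def mux_inputs(routing):
    """Return the distinct signals the operand multiplexer must select from."""
    distinct = []
    for source in routing.values():
        if source not in distinct:
            distinct.append(source)
    return distinct

def route(routing, mode, signals):
    """Select the operand value for the given transform mode."""
    return signals[routing[mode]]

if __name__ == "__main__":
    print(mux_inputs(ROUTING))
    print(route(ROUTING, "fwd4", {"x0": 10, "x1": 20, "n3": 30}))
```

With this hypothetical routing, six transform modes need only a 3-to-1 selection instead of a 6-to-1 selection for this operand.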
The multiplexers required for a shared node result not only in extra hardware cost, but also in routing congestion for the unified architecture. The number of input paths for a 1-bit 6-to-1 multiplexer is 11, as shown in Figure
The architecture-unification technique consists of count decisions for the unified architecture stages and the shared nodes, node sharing for the unified architecture for all FOSS architectures, the operation and the bit-width decisions for a shared node, and the design of a low cost multiplexer for the mux-based routing network. When this process is used for all of the H.264 transform architectures designed using the FOSS technique, Figure
The unified architecture for all H.264 transforms.
The computational complexities of the original transforms, the FOSS architectures, and the unified architecture are analyzed. The computational complexities of the original transforms and the FOSS architectures are compared in Tables
The computational complexity of the original transforms for H.264.
Transform  Adder  Multiplier
1D 8 × 8 forward  56  64
1D 8 × 8 inverse  56  64
2D 4 × 4 forward  240  256
2D 4 × 4 inverse  240  256
2D 4 × 4 Hadamard  240  256
2D 2 × 2 Hadamard  12  16
Total  844  912
The computational complexity of the FOSS architectures for H.264 transforms.
Transform  Sub/adder  Adder  Subtractor  Shifter
1D 8 × 8 forward  0  20  16  10
1D 8 × 8 inverse  0  24  16  18
2D 4 × 4 forward  8  16  16  16
2D 4 × 4 inverse  8  16  16  16
2D 4 × 4 Hadamard  8  16  16  0
2D 2 × 2 Hadamard  0  4  4  0
Total  24  96  84  60
The computational complexity of the unified architecture for H.264 transforms.
Architecture  Adder  Subtractor  Shifter
Unified architecture  28  16  40
In order to implement the unified architecture, a front-end cell-based design flow is employed for the logic design, simulation, and verification of the VLSI implementation. The proposed FOSS architectures and the unified architecture are first realized in Verilog RTL code, and the ModelSim EDA tool is then used for RTL functional simulation. Synopsys Design Compiler is used for logic synthesis and the standard cell library used is the UMC 0.18
Because transform matrix multiplication uses different hardware architectures, the gate counts for the original transforms are not known. Table
A comparison of the gate count, frequency, latency, and throughput for H.264 transform FOSS architectures.
Transform  Gate count  Freq. (MHz)  Latency (cycles)  Throughput (pixels/cycle)
1D 8 × 8 forward  2853  313  1  8
1D 8 × 8 inverse  5096  292  1  8
2D 4 × 4 forward  4299  306  1  8
2D 4 × 4 inverse  5821  285  1  8
2D 4 × 4 Hadamard  5437  302  1  8
2D 2 × 2 Hadamard  1032  385  1  4
Total  25438  N/A  N/A  N/A
The gate count, frequency, latency, and throughput for the unified architecture.
Architecture  Gate count  Freq. (MHz)  Latency (cycles)  Throughput (pixels/cycle)
Unified architecture  16308  275  1  Listed in Table
The throughput for the unified architecture for H.264 transforms.
Transform  Freq. (MHz)  Throughput (pixels/cycle)
1D 8 × 8 forward  275  8
1D 8 × 8 inverse  275  8
2D 4 × 4 forward  275  8
2D 4 × 4 inverse  275  8
2D 4 × 4 Hadamard  275  8
2D 2 × 2 Hadamard  275  4
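The relation between the clock frequency, the per-cycle throughput, and the frame rate can be checked with a short Python sketch. It assumes a 1920 × 1080 frame with 4:2:0 chroma sampling (1.5 samples per pixel), which is not stated explicitly in the text; under that assumption the figures of 275 MHz and 8 pixels per cycle correspond to about 707 frames per second. The function names are hypothetical.

```python
FREQ_HZ = 275_000_000      # clock frequency of the unified architecture
PIXELS_PER_CYCLE = 8       # throughput of the 8-pixel transform modes
WIDTH, HEIGHT = 1920, 1080 # Full HD frame size

def clock_period_ns(freq_hz):
    """Clock period in nanoseconds (reciprocal of the frequency)."""
    return 1e9 / freq_hz

def frames_per_second(freq_hz, pixels_per_cycle, width, height):
    """Frame rate, ASSUMING 4:2:0 sampling: luma plus two quarter-size chroma planes."""
    samples_per_frame = width * height * 3 // 2
    return freq_hz * pixels_per_cycle / samples_per_frame

if __name__ == "__main__":
    print(round(clock_period_ns(FREQ_HZ), 3), "ns")
    print(int(frames_per_second(FREQ_HZ, PIXELS_PER_CYCLE, WIDTH, HEIGHT)), "frames/s")
```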
Since there is no register in either the FOSS architectures or the unified architecture, the critical path is the longest of all paths from the input to the output of the architecture. The maximum frequency is the reciprocal of the critical path delay. Because of the delay due to the mux-based routing network, the critical path for the unified architecture is slightly longer than that for any FOSS architecture. In other words, the frequency of the unified architecture is lower than that of any FOSS architecture. The frequency range of the FOSS architectures is from 285 MHz to 385 MHz, while the unified architecture still operates at up to 275 MHz, as shown in Tables
In this paper, a systematic hardware sharing method that allows a unified architecture for H.264 transforms is presented. A FOSS architecture design technique is presented to reduce the hardware cost for each H.264 transform. The basic idea is to systematically synthesize an architecture that shares all of the same operations in a matrix multiplication and thereby reduces the hardware cost. In total, 844 adders and 912 multipliers are required for all six H.264 transform matrix multiplications. However, only 24 sub/adders, 96 adders, 84 subtractors, and 60 shifters are required for all six FOSS architectures, which reduce the cost by sharing the same operations and by replacing all of the multipliers with adders and shifters. Once all six FOSS architectures of the H.264 transforms are determined, the proposed architecture-unification design flow unifies all of the low cost transform FOSS architectures into a single architecture to eliminate the redundant hardware. The unified architecture requires only 28 adders, 16 subtractors, 40 shifters, and the proposed mux-based routing network. The gate count for the unified architecture is 16308, which is 36% less than the total gate count for all six FOSS architectures. The frequency range of the FOSS architectures is from 285 MHz to 385 MHz, while the unified architecture still operates at up to 275 MHz. Since there is no register in either the FOSS architectures or the unified architecture, the latencies are all one clock cycle. Throughput for the 2D
The authors declare that there is no conflict of interests regarding the publication of this paper.