A Systematic Hardware Sharing Method for Unified Architecture Design of H.264 Transforms

1 Department of Electronic Engineering, National Formosa University, No. 64, Wenhua Road, Huwei Township, Yunlin County 632, Taiwan
2 Department of Computer Science and Information Engineering, National Taichung University of Science and Technology, No. 129, Section 3, Sanmin Road, North District, Taichung City 404, Taiwan
3 Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan


Introduction
Video coding standards commonly use transform coding techniques. Discrete cosine transforms (DCTs) are widely used in image and video compression standards, such as JPEG [1], MPEG-1/2 [2, 3], and MPEG-4 [4]. Unlike the DCTs used in previous standards, H.264 [5] uses integer transform matrices for coding, so there is no mismatch between the forward and inverse transforms [6, 7] and the complexity is significantly lower than that of a DCT. H.264 also provides a specific transform for each prediction mode, and blocks of sizes from 16 × 16 down to 4 × 4 pixels can be used for motion prediction. The prediction modes are organized in a tree-structured manner, which allows flexible combinations of different motion compensation block sizes inside a 16 × 16-pixel macroblock. H.264 therefore achieves better compression, but it also requires an enormous number of computations.
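As background, the 4 × 4 forward core transform of H.264 can be sketched in Python/NumPy. The matrix C below is the standard integer approximation of the DCT (with the scaling part folded into quantization); because every entry is an integer, encoder and decoder compute bit-exact coefficients, which is why there is no forward/inverse mismatch. The function name is ours, for illustration only.

```python
import numpy as np

# H.264 4x4 forward core transform matrix: an integer approximation of the
# 4-point DCT (the scaling part is folded into the quantization step).
C = np.array([
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
])

def forward_core_transform(X):
    """2D 4x4 forward core transform: Y = C X C^T, all in integer arithmetic."""
    return C @ X @ C.T

X = np.arange(16).reshape(4, 4)
Y = forward_core_transform(X)
assert Y.dtype.kind == "i"   # exact integer coefficients, no rounding drift
assert Y[0, 0] == X.sum()    # the first row of C is all ones, so Y[0,0] is the block sum
```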
Although previous studies demonstrate combined architectures for multiple transforms, a single architecture has not been designed for the whole set of forward and inverse transforms of the H.264 encoder and decoder. This study therefore designs a unified architecture for the complete transform functionality of an H.264 codec, while maintaining its low cost and high speed characteristics. Sharing the hardware for identical operations reduces the hardware cost, and earlier designs use hardware sharing to this end; however, no systematic hardware sharing method has been proposed in the literature. This paper proposes a novel method, comprising a five-step operation sharing synthesis (FOSS) technique and an architecture-unification technique, to systematically share the hardware and reduce the cost of multitransform coding.
In order to design the low cost unified architecture, a hardware sharing method is proposed, as shown in Figure 1. Firstly, six transform architectures are designed for low cost, using the proposed five-step operation sharing synthesis (FOSS) technique. Secondly, these low cost FOSS architectures are merged into a single architecture, using the proposed architecture-unification technique. The details of these techniques are described in the later sections. Section 2 describes the FOSS architecture design that reduces the hardware cost for each H.264 transform. Section 3 demonstrates the unification of all the low cost transform FOSS architectures into a single architecture, to eliminate redundant hardware. The complexity and performance of the unified architecture are analyzed in Section 4, and Section 5 concludes the paper.

The Five-Step Operation Sharing Synthesis Technique
This paper firstly describes a five-step operation sharing synthesis (FOSS) technique and demonstrates its effectiveness by building low cost 1D 8 × 8 inverse integer transform and 2D 4 × 4 transform architectures. The procedure for the FOSS technique is as follows.
Step 1. Put rows with the same coefficients into the same group.
Step 2. Determine the same operations in each group.
Step 3. Determine the same operations between groups.
Step 4. Replace multiplication by addition and shift.
Step 5. Map each shared item to an operation node.

The basic idea is to systematically synthesize an architecture that shares all of the same operations in a matrix multiplication, to reduce the hardware cost. Consider the 2D 8 × 8 inverse integer transform X = Tᵀ Y T, where T is the 8 × 8 inverse integer transform matrix, Y is an 8 × 8 input block, Tᵀ is the transpose matrix of T, and X is the 8 × 8 matrix output of the 2D inverse integer transform. Since the 2D transform is separable, the computation can be converted, using the column-row decomposition method, into a 1D row transform followed by a 1D column transform. Therefore, a 2D transform can be calculated in two steps: the first step is the 1D transform Z = Tᵀ Y and the second step is the other 1D transform X = Z T. The first row of Y, y0 ∼ y7, multiplied by the inverse transform matrix yields the first row of Z, z0 ∼ z7, as shown in (4). The matrix multiplication in (4) requires 64 multiplications and 56 additions. This paper proposes a novel operation sharing synthesis technique to reduce the hardware cost, and the effectiveness of this technique is demonstrated by applying it to a 1D 8 × 8 inverse integer transform.
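The column-row decomposition can be verified with a short NumPy sketch; T and Y below are random integer stand-ins, not the actual H.264 matrices, since the point is only the separability of the 2D transform.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.integers(-2, 3, size=(8, 8))   # stand-in 8x8 integer transform matrix
Y = rng.integers(0, 256, size=(8, 8))  # stand-in 8x8 input block

# Direct 2D transform: X = T^T Y T
X_direct = T.T @ Y @ T

# Column-row decomposition: first 1D pass Z = T^T Y, second 1D pass X = Z T
Z = T.T @ Y
X_two_pass = Z @ T

# Both orderings produce identical integer results
assert np.array_equal(X_direct, X_two_pass)
```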
Step 1. Put rows with the same coefficients into the same group, so that the same operations are easier to determine in the next step:

Step 2. Determine the same operations in each group and mark the shared operations using "( )", as shown in the following:

Step 3. Determine the same operations between groups and mark the shared operations using "( )", as shown in the following:

Step 4. Replace multiplication by addition and shift. If the coefficient is a power of two, a shift replaces the multiplication.
For the other coefficients, addition, subtraction, and shift operations are all needed to replace multiplication. The shared operations are again indicated using "( )", as shown in the following: The computational complexity of a 1D 8 × 8 inverse transform is thus reduced from the 64 multiplications and 56 additions in (4) to 24 additions, 16 subtractions, and 18 shift operations, which can be directly mapped to a low cost hardware architecture.
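Step 4 can be sketched in Python. The coefficients 8, 12, and 7 below are illustrative stand-ins chosen by us (the paper's actual coefficients are in the equation displays): multiplying by a power of two is a single shift, while other coefficients decompose into shifts plus additions or subtractions.

```python
def mul_by_8(x):
    # 8 = 2^3 is a power of two: one shift, no adder needed
    return x << 3

def mul_by_12(x):
    # 12 = 8 + 4: two shifts and one addition replace the multiplier
    return (x << 3) + (x << 2)

def mul_by_7(x):
    # 7 = 8 - 1: one shift and one subtraction replace the multiplier
    return (x << 3) - x

# The shift/add forms agree with true multiplication over a signed range
for x in range(-16, 17):
    assert mul_by_8(x) == 8 * x
    assert mul_by_12(x) == 12 * x
    assert mul_by_7(x) == 7 * x
```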
Step 5. Map each shared item to an operation node, where y0 ∼ y7 and z0 ∼ z7 are the inputs and outputs of the architecture, respectively, and a0 ∼ a11, b0 ∼ b11, and c0 ∼ c7 represent the shared nodes. Only four stages are required for the architecture, from input to output: nodes a0 ∼ a11, b0 ∼ b11, c0 ∼ c7, and z0 ∼ z7 are in the 1st, 2nd, 3rd, and 4th stages, respectively. The operations for each stage of the 1D 8 × 8 inverse integer transform are summarized as follows.

Stage 1. Consider

Stage 2. Consider

Stage 3. Consider

Stage 4. Consider

Using these operations for the four stages, the FOSS architecture for a 1D 8 × 8 inverse integer transform is obtained, as shown in Figure 2. In Figure 2, a node with a "+" sign represents an addition, a node with a "−" sign represents a subtraction, and an arrow represents the data flow.

Although 2D transforms are separable, the column-row decomposition method is not used in this study. Instead, a direct 2D transform method is used to eliminate the transpose register array, in order to reduce the latency. Firstly, T in (9) is replaced by (10), to produce the 16 × 16 transform matrix in (11). Secondly, the FOSS technique is applied to (11) to implement a 2D 4 × 4 inverse integer transform, as shown in Figure 3.
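The direct 2D formulation can be checked with a short NumPy sketch; T and Y are random integer stand-ins, not the actual H.264 matrices. For X = Tᵀ Y T, row-major vectorization gives vec(X) = (Tᵀ ⊗ Tᵀ) vec(Y), a single 16 × 16 matrix applied in one pass, with no transpose buffer between two 1D passes.

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.integers(-2, 3, size=(4, 4))   # stand-in 4x4 integer transform matrix
Y = rng.integers(0, 256, size=(4, 4))  # stand-in 4x4 input block

# Direct 2D operator: for X = T^T Y T and row-major vec(),
# vec(X) = (T^T kron T^T) vec(Y), a 16x16 transform matrix.
M = np.kron(T.T, T.T)
x_vec = M @ Y.flatten()

# One 16x16 multiply reproduces the separable two-pass result exactly
assert np.array_equal(x_vec.reshape(4, 4), T.T @ Y @ T)
```

The trade-off in hardware is latency versus area: the 16 × 16 operator removes the transpose register array between passes at the cost of a larger combinational network.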

Architecture-Unification Technique
When all six FOSS architectures of H.264 transforms have been obtained, the proposed architecture-unification procedure described in the following is then used to construct a shared architecture for the FOSS architectures.

3.1. Count Decisions for the Unified Architecture Stages and the Shared Nodes.

The unified architecture must have the largest stage count of all of the FOSS architectures, because every stage in each FOSS architecture must correspond to a stage in the unified architecture (e.g., Figure 4). Figures 4(a) and 4(b) show FOSS architectures with four and three stages, respectively. If the two FOSS architectures are to be unified, the stage count for the unified architecture must be the larger of the two stage counts: Figure 4(a) has four stages and Figure 4(b) has three stages, so the unified architecture must have four stages, as shown in Figure 4(c).
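The two count decisions (stage count and per-stage shared-node count) can be sketched as follows; the per-stage node counts here are hypothetical, chosen only for illustration.

```python
# Each FOSS architecture is described by its per-stage operation-node counts
# (hypothetical counts, for illustration only).
foss_a = [12, 12, 8, 8]   # a four-stage FOSS architecture
foss_b = [8, 8, 4]        # a three-stage FOSS architecture

def unified_shape(*archs):
    """Unified stage count = largest stage count among the FOSS architectures;
    shared-node count in each stage = maximum node count in that stage."""
    stages = max(len(a) for a in archs)
    padded = [list(a) + [0] * (stages - len(a)) for a in archs]
    return [max(col) for col in zip(*padded)]

# The unified architecture needs four stages, sized by the per-stage maxima
assert unified_shape(foss_a, foss_b) == [12, 12, 8, 8]
```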
The stage counts for the FOSS architectures are not all identical. In order to unify FOSS architectures with different stage counts, every FOSS architecture shares operation nodes starting from stage one of the unified architecture. For example, Figure 4(a) shows a four-stage FOSS architecture, whose shared stages run from stage one to stage four, and Figure 4(b) shows a three-stage FOSS architecture, whose shared stages run from stage one to stage three, as shown in Figure 4(c).
In the unified architecture, the count of the shared nodes in stage s must be the maximum count of the operation nodes in stage s among all of the FOSS architectures, as shown in Figure 5.

3.2. Node Sharing in the Unified Architecture for All of the FOSS Architectures.

If N transforms are to be unified, one of the two inputs of a node in the unified architecture can have up to N different input paths. In order to reduce the multiplexer overhead for input path switching, the number of input paths must be minimized. To achieve this goal, the nodes of the unified architecture are shared stage by stage, and every node of a stage is compared across all architectures, using the following procedure.
Step 1. If the nodes have two identical input paths and the same operation, go to Step 8.

Step 2. If the nodes have two identical input paths but different operations, go to Step 8.

Step 3. If the nodes have only one identical input path and the same operation, go to Step 8.

Step 4. If the nodes have only one identical input path but different operations, go to Step 8.

Step 5. If the nodes have one or two identical input paths in opposite positions and their operations are both addition, go to Step 8.

Step 6. If the nodes have no identical input path but have the same operation, go to Step 8.

Step 7. If the nodes have two different input paths and different operations, go to Step 8.
Step 8. The nodes share a node of the unified architecture.

Figure 6 shows an example of how a node is shared for four FOSS architectures. Figure 6(a) shows the input paths and the operation nodes of one stage for the four FOSS architectures. The first transform, T1, shown in Figure 6(a) is used to label each node from top to bottom (n0 ∼ n3), as shown in Figure 6(b). The nodes of the four FOSS architectures (T1 ∼ T4) share a node using this procedure, and the nodes that share a node use the same label, as shown in Figure 6(b); this label is also the index of the shared node in the unified architecture, as shown in Figure 6(c). Note that if any input of a node in Figure 6(c) has multiple input paths for different transform modes, a multiplexer is required to select the corresponding input path.
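Since Steps 1 to 7 all branch to Step 8, every pair of correspondent nodes ends up sharing; the case analysis determines only how much multiplexing and add/sub flexibility the shared node needs. A minimal Python sketch under that reading (the node encoding and helper name are ours, not from the paper):

```python
def share_cost(node_a, node_b):
    """Classify how two correspondent nodes share a unified-architecture node.

    A node is encoded as (input_paths, operation), e.g. (("p0", "p1"), "+").
    Per the procedure, every case reaches Step 8 (the nodes do share); the
    returned pair gives (inputs needing a multiplexer, whether an add/sub
    node is required because the operations differ).
    """
    paths_a, op_a = node_a
    paths_b, op_b = node_b
    shared_paths = len(set(paths_a) & set(paths_b))
    muxed_inputs = 2 - shared_paths      # non-identical inputs need a mux
    needs_addsub = op_a != op_b          # mixed +/- needs an add/sub node
    return muxed_inputs, needs_addsub

# Step 1 case: identical inputs, same operation -> no mux, plain adder
assert share_cost((("p0", "p1"), "+"), (("p0", "p1"), "+")) == (0, False)
# Step 7 case: different inputs, different operations -> two muxes, add/sub
assert share_cost((("p0", "p1"), "+"), (("p2", "p3"), "-")) == (2, True)
```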

3.3. Operation and Bit-Width Decisions for a Shared Node.

In order to perform the operations of the corresponding nodes of all FOSS architectures, the operation for each shared node in the unified architecture must be determined.
If the operations of the nodes of the FOSS architectures that share a node are all additions or all subtractions, the operation of this shared node in the unified architecture is addition or subtraction, respectively. If some of the nodes that share a node perform addition and others perform subtraction, the shared node must support both addition and subtraction. As shown in Figure 7, the n0 nodes are adders in all of the FOSS architectures, so node n0 is an adder in the unified architecture; similarly, n2 is a subtractor. Because the inputs of the corresponding operation nodes in different FOSS architectures are seldom shifted by the same number of bits, it is not efficient to share a shift operation in the unified architecture.
In order to determine the bit-width for a shared node of the unified architecture, the bit-widths of the nodes that share it must be determined first. In order to determine the bit-width for a node of a FOSS architecture, both the input and output bit-widths of the FOSS architecture must be checked against the video standard specification, as shown in Table 1. The input bit-width for a node in the current stage is then determined from the output dynamic range that results from the additions, subtractions, and shifts of the nodes in the previous stage. Note that the dynamic range accumulates stage by stage, from input to output, in a FOSS architecture. For the unified architecture, the bit-width of a shared node is the largest bit-width among all of the nodes that share it. As shown in Figure 8, the largest input and output bit-widths of the n0 nodes in all of the FOSS architectures are both 16 bits, so the input and output bit-widths of node n0 in the unified architecture are both 16 bits.
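The dynamic-range bookkeeping can be sketched as follows. This is a worst-case signed model; the 15-, 14-, and 13-bit example widths are chosen by us purely to reproduce a 16-bit shared-node result like the one in Figure 8.

```python
def out_width(op, in_a, in_b=None, shift=0):
    """Worst-case output bit-width of one node (two's-complement model)."""
    if op in ("+", "-"):
        # Adding or subtracting two values can grow the magnitude by one bit
        return max(in_a, in_b) + 1
    if op == "<<":
        # A left shift by s bits adds s bits to the dynamic range
        return in_a + shift
    raise ValueError(f"unknown operation: {op}")

# Hypothetical widths of three nodes that share one unified-architecture node
widths = [
    out_width("+", 15, 15),          # 16 bits
    out_width("-", 14, 14),          # 15 bits
    out_width("<<", 13, shift=2),    # 15 bits
]
# Shared-node width = the largest width among the sharing nodes
assert max(widths) == 16
```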

3.4. The Design of a Low Cost Multiplexer for the Mux-Based Routing Network.

To share a node of the unified architecture among the operations of multiple transforms, additional multiplexers are required to route the corresponding input path for each transform mode. If N transforms share a node, each input of the node has a maximum of N different input paths, and two additional N-to-1 multiplexers are deployed in front of the node. If an input path is k bits wide, that input requires a k-bit N-to-1 multiplexer. As shown in Figure 9, one input of the shared node is 8 bits wide, so an 8-bit 6-to-1 multiplexer is required to route one of the 6 input paths to the input of the shared node. If a 1-bit 2-to-1 multiplexer requires one hardware unit, as shown in Figure 10(a), a 1-bit 6-to-1 multiplexer requires 5 hardware units, as shown in Figure 10(b). Because an 8-bit 6-to-1 multiplexer requires eight 1-bit 6-to-1 multiplexers, it requires 40 (8 × 5) hardware units.
In order to reduce the hardware cost and alleviate routing congestion in the VLSI physical design of the mux-based routing network for the unified architecture, the number of input paths for a shared node must be minimized. The multiplexer in Figure 9 is therefore redesigned as a low cost, low routing congestion multiplexer, as shown in Figure 11. In Figure 9, there are two input paths, p00 and p01, for one input of a shared node, so an 8-bit 2-to-1 multiplexer is used to select input path p00 or p01. In addition, a 1-bit 6-to-1 multiplexer is used to drive the select line of the 8-bit 2-to-1 multiplexer with 0 or 1. The 8-bit 2-to-1 multiplexer requires 8 hardware units and the 1-bit 6-to-1 multiplexer requires 5 hardware units, giving a total of 13 (8 + 5) hardware units. The cost of the multiplexer in Figure 11 is therefore 32.5% of that of the multiplexer in Figure 9, a 67.5% reduction.
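The hardware-unit arithmetic can be checked directly. This sketch uses the paper's unit model: one hardware unit per 1-bit 2-to-1 multiplexer, so a 1-bit n-to-1 multiplexer costs n − 1 units.

```python
def mux_units(bits, n_to_1):
    """Hardware units for a bits-wide n-to-1 mux built from 1-bit 2-to-1 muxes.
    A 1-bit n-to-1 mux requires n-1 two-to-one units; a k-bit mux needs k copies."""
    return bits * (n_to_1 - 1)

naive = mux_units(8, 6)                          # 8-bit 6-to-1 mux: 8 * 5 = 40 units
redesigned = mux_units(8, 2) + mux_units(1, 6)   # 8-bit 2-to-1 + 1-bit 6-to-1: 8 + 5 = 13

assert (naive, redesigned) == (40, 13)
assert round(1 - redesigned / naive, 3) == 0.675   # 67.5% cost reduction
```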
The multiplexers required for a shared node result not only in extra hardware cost, but also in routing congestion for the unified architecture. The number of input paths for a 1-bit 6-to-1 multiplexer is 11, as shown in Figure 10(b), and that for an 8-bit 6-to-1 multiplexer is 88, as shown in Figure 9, which can cause clustering of the routing wires in particular areas. The routing congestion can be mitigated using the proposed multiplexer design, in which the number of input paths is only 33, as shown in Figure 11. In addition, a good floor plan that uniformly distributes the multiplexers around the chip can disperse the routing wires. Note that additional latency is incurred by the multiplexers of the routing network, which serialize the operations for the different transform modes.

3.5. The Architecture-Unification Technique for All of the H.264 Transform FOSS Architectures.

The architecture-unification technique consists of the count decisions for the unified architecture stages and the shared nodes, the node sharing of the unified architecture for all FOSS architectures, the operation and bit-width decisions for a shared node, and the design of a low cost multiplexer for the mux-based routing network. When this process is applied to all of the H.264 transform architectures designed using the FOSS technique, the unified architecture shown in Figure 12 is obtained for the 2D 2 × 2 Hadamard transform, 2D 4 × 4 forward transform, 2D 4 × 4 inverse transform, 1D 8 × 8 forward transform, and 1D 8 × 8 inverse transform FOSS architectures.

Complexity and Performance Analysis
4.1. The Computational Complexity of the Original Transforms, the FOSS Architectures, and the Unified Architecture. The computational complexities of the original transforms, the FOSS architectures, and the unified architecture are analyzed. The computational complexities of the original transforms and the FOSS architectures are compared in Tables 2 and 3. As shown in Table 2, a total of 844 adders and 912 multipliers are needed for all six original H.264 transforms. However, only 24 adder/subtractors, 96 adders, 84 subtractors, and 60 shifters are needed for all six FOSS architectures, which reduce the cost by sharing the same operations and by replacing the multipliers with adders and shifters, as shown in Table 3. The computational complexities of the FOSS architectures and the unified architecture are compared in Tables 3 and 4; Table 4 shows the computational complexity of the unified architecture. Because transform matrix multiplication can use different hardware architectures, the gate counts for the original transforms are not known. Table 5 shows the gate counts for the FOSS architectures and Table 6 shows those for the unified architecture, whose gate count is 36% less than the total gate count of all six FOSS architectures. Because the six FOSS architectures share the unified architecture, using the proposed architecture-unification technique, the number of gates saved for an individual transform is not known; the total number of gates saved is therefore the only way to measure the reduction in hardware cost that results from the unification technique. Since there is no register in either the FOSS architectures or the unified architecture, the critical path is the longest of all of the paths from the input to the output of the architecture, and the frequency is the reciprocal of the critical path delay. Because of the delay of the mux-based routing network, the critical path for the unified architecture is slightly longer than that for any FOSS architecture; in other words, the frequency of the unified architecture is lower than that of any FOSS architecture. The frequency range of the FOSS architectures is from 285 MHz to 385 MHz, while the unified architecture still operates at up to 275 MHz, as shown in Tables 5 and 6. Since there is no register in either the FOSS architectures or the unified architecture, the latencies are all one clock cycle. The throughput for the 2D 2 × 2 Hadamard transform FOSS architecture is 4 pixels/cycle, while those of the other transform FOSS architectures are all 8 pixels/cycle.
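Peak throughput follows directly from the clock frequency and the pixels processed per cycle; the sketch below uses the frequencies and per-cycle rates reported above.

```python
def throughput_mpixels(freq_mhz, pixels_per_cycle):
    """Peak throughput in megapixels per second (one result per clock cycle)."""
    return freq_mhz * pixels_per_cycle

# Unified architecture at the reported 275 MHz, 8 pixels/cycle
assert throughput_mpixels(275, 8) == 2200   # 2200 Mpixels/s
# The 2D 2x2 Hadamard path processes 4 pixels/cycle
assert throughput_mpixels(275, 4) == 1100   # 1100 Mpixels/s
```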

Conclusion

Figure 1: The proposed hardware sharing method.


Figure 4: Shared stages in the unified architecture: (a) a four-stage FOSS architecture, (b) a three-stage FOSS architecture, and (c) a unified architecture for FOSS architectures (a) and (b), in which the first three stages are shared.

Figure 6: Sharing of a node in multiple FOSS architectures: (a) nodes of the same stage of different FOSS architectures, (b) determining which nodes of the same stage share a node in the unified architecture, and (c) shared nodes of the same stage in the unified architecture.

Figure 7: The operation decision for a node in the unified architecture.

Figure 8: The bit-width decision for a node in the unified architecture.

Table 1: The input and output bit-widths for each transform of H.264.

Table 2: The computational complexity of the original transforms for H.264.
4.2. The Hardware Cost and Performance of the FOSS Architectures and the Unified Architecture. In order to implement the unified architecture, a front-end cell-based design flow is employed for the logic design, simulation, and verification of the VLSI implementation. The proposed FOSS architectures and the unified architecture are first realized in Verilog RTL.

Table 3: The computational complexity of the FOSS architectures for the H.264 transforms.

Table 4: The computational complexity of the unified architecture for the H.264 transforms.

Table 5: A comparison of the gate count, frequency, latency, and throughput for the H.264 transform FOSS architectures.

Table 6: The gate count, frequency, latency, and throughput for the unified architecture.

Table 7: The throughput of the unified architecture for the H.264 transforms.
In this paper, a systematic hardware sharing method that allows a unified architecture for the H.264 transforms is presented. A FOSS architecture design technique is presented to reduce the hardware cost of each H.264 transform. The basic idea is to systematically synthesize an architecture that shares all of the same operations in a matrix multiplication, to reduce the hardware cost. In total, 844 adders and 912 multipliers are required for all six H.264 transform matrix multiplications. However, only 24 adder/subtractors, 96 adders, 84 subtractors, and 60 shifters are required for all six FOSS architectures, which reduces the cost by sharing the same operations and by replacing all of the multipliers with adders and shifters. When all six FOSS architectures of the H.264 transforms are determined, an architecture-unification design flow is proposed that unifies all of the low cost transform FOSS architectures into a single architecture, to eliminate the redundant hardware. The unified architecture requires only 28 adders, 16 subtractors, 40 shifters, and the proposed mux-based routing network. The gate count for the unified architecture is 16308, which is 36% less than the total gate count of all six FOSS architectures. The frequency range of the FOSS architectures is from 285 MHz to 385 MHz, while the unified architecture still operates at up to 275 MHz. Since there is no register in either the FOSS architectures or the unified architecture, the latencies are all one clock cycle. The throughput for the 2D 2 × 2 Hadamard transform is 4 pixels/cycle, while those of the other transforms are all 8 pixels/cycle. In addition, the proposed hardware sharing method can also be used to construct unified architectures for multitransform coding in other international video coding standards, such as VC-1, MPEG-1/2/4, and even the next generation High Efficiency Video Coding (HEVC), allowing a reduction in hardware cost.