An Effective Transform Unit Size Decision Method for High Efficiency Video Coding

High efficiency video coding (HEVC) is the latest video coding standard. HEVC can achieve higher compression performance than previous standards, such as MPEG-4, H.263, and H.264/AVC. However, HEVC requires enormous computational complexity in the encoding process due to its quadtree structure. To reduce the computational burden of the HEVC encoder, an early transform unit (TU) decision algorithm (ETDA) prunes the residual quadtree (RQT) at an early stage based on the number of nonzero DCT coefficients (called NNZ-ETDA) to accelerate the encoding process. However, the NNZ-ETDA cannot effectively reduce the computational load for sequences with active motion or rich texture. Therefore, to further improve the performance of NNZ-ETDA, we propose an adaptive RQT-depth decision for NNZ-ETDA (called ARD-NNZ-ETDA) that exploits the high temporal-spatial correlation that exists in natural video sequences. Simulation results show that the proposed method achieves a time improving ratio (TIR) of about 61.26% ∼ 81.48% compared to the HEVC test model 8.1 (HM 8.1) with insignificant loss of image quality. Compared with the NNZ-ETDA, the proposed method further achieves an average TIR of about 8.29% ∼ 17.92%.


Introduction
With the rapid development of electronic technology, panels with 4K × 2K (or 8K × 4K) resolution will become the main specification of large digital TVs in the future. However, the current state-of-the-art video coding standard, H.264/AVC, has difficulty supporting high definition (HD) and ultrahigh definition (UHD) video applications [1]. Therefore, the ITU-T video coding experts group (VCEG) and the ISO/IEC moving picture experts group (MPEG), through their joint collaborative team on video coding (JCT-VC), started developing the new high efficiency video coding (HEVC) standard in 2010 to satisfy the UHD requirement [2], and the first version of HEVC was approved as ITU-T H.265 and ISO/IEC 23008-2 by JCT-VC in January 2013 [3][4][5].
HEVC can achieve an average bit rate reduction of 50% in comparison with the H.264/AVC high profile while maintaining the same subjective video quality [6]. This is because HEVC adopts new coding structures including the coding unit (CU), prediction unit (PU), and transform unit (TU). The CU is the basic unit of region splitting used for inter-/intraprediction, and it can be recursively subdivided into four equally sized blocks. The CU is split by a coding quadtree with 4 depth levels, in which the CU size ranges from the largest CU (LCU: 64 × 64) to the smallest CU (SCU: 8 × 8) pixels. The PU is the basic unit that carries the information related to the prediction processes, and the TU can be split by a residual quadtree (RQT) with at most 3 depth levels, with sizes that vary from 32 × 32 to 4 × 4 pixels. The relationship among the CU, PU, and TU is shown in Figure 1.
In general, intra CUs have two types of PUs, the 2N × 2N partition and the N × N partition, whereas inter CUs have four types of PUs: 2N × 2N, 2N × N, N × 2N, and N × N. Therefore, the HEVC encoder enables 7 different partition modes, namely SKIP, inter 2N × 2N, inter 2N × N, inter N × 2N, inter N × N, intra 2N × 2N, and intra N × N, for inter slices as shown in Figure 2. The rate distortion (RD) cost under all partition modes and all CU sizes has to be calculated by processing the PUs and TUs to select the optimal CU size and partition mode. For an LCU, all the PUs and available TUs listed in Table 1 have to be exhaustively searched by the rate-distortion optimization (RDO) process, which dramatically increases the computational complexity compared with H.264/AVC. This "try all and select the best" method results in high computational complexity and limits the use of HEVC encoders in real-time applications.
To reduce the computational burden of the HEVC encoder, many fast encoding methods have been proposed to reduce the number of CUs and PUs to be tested [7][8][9][10][11][12]. Most existing fast HEVC encoding approaches check for all-zero blocks (AZB), motion homogeneity, RD cost, or coding tree pruning to skip motion estimation (ME) on unnecessary CU sizes. In HEVC, every possible CU size is tested in order to estimate the coding performance of each CU, and the coding performance is determined by the CU size together with the corresponding PUs and TUs. Among these coding units, the TU is the basic unit used for the transform and quantization processes. The TU process recursively partitions a given prediction block from the PU into transform blocks that are represented by the residual quadtree (RQT). Therefore, in addition to the above-mentioned fast CU size decision methods, the early TU decision algorithm (ETDA) is another method used to reduce the encoding complexity of HEVC [13,14]. Recently, a new ETDA was proposed that prunes the RQT at an early stage based on the number of nonzero DCT coefficients (called NNZ-ETDA) to accelerate the encoding process of the TU module [15]. However, the NNZ-ETDA cannot effectively reduce the computational load for sequences with active motion or rich texture. Therefore, to further improve the performance of NNZ-ETDA, we propose an adaptive RQT-depth decision for NNZ-ETDA (called ARD-NNZ-ETDA) that exploits the high temporal-spatial correlation that exists in natural video sequences. First, we analyse and calculate the temporal and spatial correlation values from the temporal (colocated) and spatial (left, upper, and left upper) neighbouring blocks of the current TU, respectively. Then the dynamic depth level range of the RQT for the current TU is predicted by using the correlation weights and maximum depth levels of its neighbouring blocks. Finally, we combine the proposed adaptive RQT depth and the NNZ-ETDA to further reduce the computational load of the TU module.
The rest of the paper is organized as follows. Section 2 introduces the RQT structure of HEVC. Section 3 briefly reviews the NNZ-ETDA method. Section 4 describes the proposed effective TU size decision method. The experimental results are shown in Section 5. Finally, Section 6 presents the conclusions of this study.

RQT Structure of HEVC
where X is the residual block, Y is the transformed coefficient matrix, and C is a core transformation matrix [16]. Equation (2) shows a general form of an N × N block transformed by the integer DCT in HEVC, where all of the above matrices are of size N × N. The symbol ⊗ stands for a scalar multiplication. PF1 and PF2 are postscaling factors defined as follows: PF1: $y = (x + (1 \ll (M + B - 10))) \gg (M + B - 9)$, where $y$ and $x$ are the intermediate sample values in Y and X, respectively, $M = \log_2 N$, and $B$ indicates the internal bit depth of the video signal [17]. All computations in (2) can be implemented using only addition and shift operations.
The main reason is that the transformed coefficients can be represented with 16 bits to avoid the computational burden in hardware. After the integer DCT, the transformed coefficients are quantized by the operations defined as follows: where Q is the quantized coefficient matrix, QP is a quantization parameter ranging from 0 to 51, and ≪ and ≫ are the binary left-shift and right-shift operators, respectively. The constant offset for quantization is 171 for intracoding and 85 for intercoding. The quantization matrix (QM) is defined as follows: where % is the mod operation. The relationship between Qbits and QP can be derived as follows: where ⌊·⌋ denotes rounding down to the nearest integer.
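As a rough illustration of the shift-based transform and quantization steps described above, the following Python sketch processes a single 4 × 4 residual block. It is a minimal sketch under stated assumptions, not the formulation used in this paper: the 4 × 4 core matrix and the scale table follow the values commonly associated with the HM reference software, while the stage order, the second-stage shift of M + 6, the `transform_shift` term, and all function names are assumptions introduced for illustration. Plain multiplications are used for readability rather than an add-and-shift decomposition.

```python
# Sketch only: 4x4 forward core transform and scalar quantization in the
# style of HEVC. Stage order, the transform_shift term, and all helper names
# are illustrative assumptions, not definitions taken from the paper.

# 4x4 integer DCT core matrix (assumed to match the HEVC core transform).
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]

# Quantization scale indexed by QP % 6 (assumed, HM-style values).
QUANT_SCALES = [26214, 23302, 20560, 18396, 16384, 14564]


def round_shift(value, shift):
    """Right shift with rounding: (value + 2^(shift - 1)) >> shift."""
    return (value + (1 << (shift - 1))) >> shift


def forward_transform_4x4(x, bit_depth=8):
    """Two-stage separable transform with a rounding shift after each stage."""
    n, m = 4, 2                   # N = 4 and M = log2(N)
    shift1 = m + bit_depth - 9    # first-stage shift, M + B - 9
    shift2 = m + 6                # second-stage shift (assumed to be M + 6)
    # Stage 1 (rows): tmp = X * C^T
    tmp = [[round_shift(sum(x[r][k] * C4[c][k] for k in range(n)), shift1)
            for c in range(n)] for r in range(n)]
    # Stage 2 (columns): Y = C * tmp
    return [[round_shift(sum(C4[r][k] * tmp[k][c] for k in range(n)), shift2)
             for c in range(n)] for r in range(n)]


def quantize(coeff, qp, log2_size=2, bit_depth=8, intra=True):
    """Quantize one coefficient with offset 171 (intra) or 85 (inter)."""
    transform_shift = 15 - bit_depth - log2_size   # assumed normalization
    qbits = 14 + qp // 6 + transform_shift
    offset = (171 if intra else 85) << (qbits - 9)
    level = (abs(coeff) * QUANT_SCALES[qp % 6] + offset) >> qbits
    return level if coeff >= 0 else -level


residual = [[10] * 4 for _ in range(4)]   # flat residual block
coeffs = forward_transform_4x4(residual)
levels = [[quantize(c, qp=22) for c in row] for row in coeffs]
```

For a flat residual block, only the DC coefficient survives the transform, so a single nonzero level remains after quantization. This is the kind of block for which counting nonzero coefficients is informative.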

Best RQT Structure of HEVC.
Similar to an LCU, a TU is also recursively partitioned into smaller TUs using a quadtree structure. The residual samples corresponding to a CU are subdivided into smaller units using an RQT, and the leaf nodes of the RQT are referred to as TUs. In order to achieve the optimal coding performance, the full RQT needs to be pruned to obtain the best RQT structure of TUs by comparing the RD cost between the upper layer TU and the lower layer sub-TUs from bottom to top. This minimization of the RD cost is the well-known RDO process. The RD cost function is defined as $J = D + \lambda R$, where $D$ is the distortion, $R$ is the bit rate, and $\lambda$ is the Lagrangian multiplier depending on QP. When the prediction errors are passed to the transform and quantization steps, the bit rate can be estimated through entropy coding, and the distortion can be estimated after the inverse quantization and inverse transform steps. The RD cost is the main indicator for achieving the optimal performance. For example, Figure 3 illustrates the optimal decision for a CU size of 32 × 32. The leaf nodes of the RQT represent the final mode decision, that is, the chosen size of each TU. Although the coding efficiency in HEVC can be improved by using varying transform block sizes (from 32 × 32 to 4 × 4), the computational complexity increases dramatically with the transform kernel size and the transform coding structure [3]. This is because the RQT selects the best TU partition mode by checking the RD cost of all possible TUs at all RQT depths. In the example shown in Figure 3, the RD cost evaluation is performed a number of times within each RQT structure: once for the TU size of 32 × 32, four times for the TU size of 16 × 16, and 16 times for the TU size of 8 × 8.
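The bottom-up comparison described above can be sketched in Python as follows. This is an illustrative sketch, not the HM implementation: the `TUNode` class, its fields, and the function names are hypothetical, and the distortion and bit values are assumed to have been measured already for every node of the full tree.

```python
# Sketch only: bottom-up RQT pruning with the Lagrangian cost J = D + lambda * R.
# TUNode and the function names are hypothetical, for illustration.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TUNode:
    distortion: float                  # D of coding this node as one TU
    bits: float                        # R of coding this node as one TU
    children: List["TUNode"] = field(default_factory=list)
    split: bool = False                # decision filled in by prune_rqt


def rd_cost(node: TUNode, lam: float) -> float:
    """Lagrangian RD cost J = D + lambda * R of a single TU."""
    return node.distortion + lam * node.bits


def prune_rqt(node: TUNode, lam: float) -> float:
    """Return the best RD cost of the subtree and record the split decision."""
    own_cost = rd_cost(node, lam)
    if not node.children:
        node.split = False
        return own_cost
    # Best achievable cost when the node is split into its four sub-TUs.
    split_cost = sum(prune_rqt(child, lam) for child in node.children)
    node.split = split_cost < own_cost
    return min(own_cost, split_cost)


def count_evaluations(node: TUNode) -> int:
    """Number of RD cost evaluations needed for the full (unpruned) tree."""
    return 1 + sum(count_evaluations(child) for child in node.children)
```

For a 32 × 32 root with two further levels, `count_evaluations` gives 1 + 4 + 16 = 21 evaluations, which matches the count quoted for Figure 3 and is the workload that early termination tries to avoid.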

Brief Review of NNZ-ETDA
In order to further speed up the encoding process of HEVC, Choi and Jang proposed an early TU decision method for fast video encoding that prunes the TUs at an early stage based on the number of nonzero DCT coefficients (NNZ-ETDA) [15]. They find that if a reasonable cost is found at an early stage, the HEVC encoder is able to skip the remaining RQT processes, that is, the subtree computations. For example, if the TU size of the root node (i.e., TU = 32 × 32 in Figure 3) is determined to be the best TU size after the RQT process, the computations related to the subtree RQT processes for finding the best TU size (i.e., TU = 16 × 16 and 8 × 8 in Figure 3) would be unnecessary. On the other hand, they also find that the determined TU size and the number of nonzero DCT coefficients (NNZ) are strongly correlated. Based on these two observations, they proposed an efficient way to stop the RD cost evaluation below the root node without sacrificing the coding efficiency.
To evaluate and analyze the performance of this ETDA, we first implement the NNZ-ETDA in the HEVC test model 8.1 (HM 8.1) [18]. Tables 2 and 3 show the conditional probabilities for the traffic and people on street sequences shown in Figure 4, respectively. In these tables, $P(\mathrm{TU}_N \mid \mathrm{NZ}_N = k)$ denotes the conditional probability of selecting the root node ($\mathrm{TU}_N$) given the NNZ ($\mathrm{NZ}_N = k$), where $N$ is the TU size of the root node. An NNZ value is selected as a threshold to stop further RD cost evaluation by balancing the computational complexity reduction against the compression efficiency loss. From Tables 2 and 3, we find that the correlation between the determined TU size and NNZ is very low when the threshold of NNZ is set to 3. Moreover, from the experimental results, we further find that the NNZ-ETDA cannot effectively reduce the computational load for sequences with fast motion or rich texture.
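The early-termination test reviewed above can be sketched as a simple check on the quantized coefficients of the root TU. This is a minimal sketch under assumptions: the comparison direction (stopping when NNZ is at most the threshold) and the function names are illustrative and are not taken from [15]; only the threshold value of 3 comes from the text.

```python
# Sketch only: NNZ-based early termination of the RQT search.
# The "NNZ <= threshold" rule and the function names are assumptions.

def count_nonzero(levels):
    """Number of nonzero quantized coefficients in a 2D block."""
    return sum(1 for row in levels for value in row if value != 0)


def stop_at_root(root_levels, threshold=3):
    """True if the RD evaluations of the sub-TUs below the root may be skipped."""
    return count_nonzero(root_levels) <= threshold


# A nearly empty block stops early; a block with many coefficients does not.
flat_block = [[5, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
busy_block = [[9, -4, 3, 1], [2, 0, -1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
```

Blocks from regions with active motion or rich texture tend to resemble `busy_block`, so a rule of this kind rarely triggers for them. This is consistent with the limitation noted above.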

Proposed Fast TU Decision Method
HM usually adopts a maximum TU size of 32 × 32 and at most 3 depth levels, with a depth range from 1 to 4. From the statistical analysis in [3], we find that the depth level distribution differs between sequences. For sequences that contain large areas with high activity, the probability of selecting depth level "1" is very low. For sequences containing large areas with low activity, more than 70% of blocks select depth level "1" as the optimal level. These results show that small depth levels tend to be selected for TUs in homogeneous regions, and large depth levels are selected for TUs with active motion or rich texture. The depth range should therefore be determined adaptively based on the residual block property.
Natural video sequences have strong spatial and temporal correlations, especially in homogeneous regions. The optimal RQT depth level of a TU is the same as or very close to the depth level of its spatially adjacent blocks due to the high correlation between adjacent blocks. Therefore, we first analyze and calculate the temporal and spatial correlation values of the maximum RQT depth from the temporal (colocated: Col) and spatial (left: L, upper: U, and left upper: L-U) neighboring blocks of the current TU as shown in Figure 5. Table 4 shows the probability of the same maximum RQT depth among the temporal (Col) and spatial (L, U, and L-U) neighboring blocks of the current TU.
To speed up RQT pruning, we make use of spatial and temporal correlations to analyse region properties and skip unnecessary TU sizes. Specifically, the optimal RQT depth level of a block is predicted using the spatial neighbouring blocks and the colocated block in the previously coded frame as follows: $\mathrm{depth}_{\mathrm{pred}} = \sum_{i=0}^{N-1} \omega_i \alpha_i$, where $N$ is the number of blocks, equal to 4, $\alpha_i$ is the depth level value of the $i$th block, and $\omega_i$ is the weight determined based on the correlation between the current block and its neighbouring blocks. The four weights are normalized so that $\sum_{i=0}^{N-1} \omega_i = 1$. Table 4 shows the statistical analysis of the same maximum RQT depth under different sequences. From Table 4, we can observe that the colocated block in the previously coded frame and the spatial neighbouring blocks have almost the same correlation with the maximum RQT depth of the current TU after normalization. Therefore, the weights for the four neighbouring blocks are set to $\omega_i = 0.25$. According to the predicted value of the optimal RQT depth, each block is divided into five types as follows.
(1) If $\mathrm{depth}_{\mathrm{pred}} \le 1$, the optimal RQT depth is set to "1" and the TU size is 32 × 32. The dynamic depth range (DR) of the current TU is classified as Type 0.
(2) If $1 < \mathrm{depth}_{\mathrm{pred}} \le 1.5$, the optimal RQT depth is set to "2" and the TU size is 32 × 32 ∼ 16 × 16. The dynamic DR of the current TU is classified as Type 1.
(3) If $1.5 < \mathrm{depth}_{\mathrm{pred}} \le 2.5$, the optimal RQT depth is set to "3" and the TU size is 32 × 32 ∼ 8 × 8. The dynamic DR of the current TU is classified as Type 2.
(5) If $\mathrm{depth}_{\mathrm{pred}} > 3.5$, the optimal RQT depth is set to "4" and the TU size is 8 × 8 ∼ 4 × 4. The dynamic DR of the current TU is classified as Type 4.
Based on the above analysis, the candidate depth levels that will be tested using RDO for each TU are summarized in Table 5. The dynamic depth level range of the RQT of the current TU can thus be predicted by using the correlation weights and maximum depth levels of its neighboring blocks. The flowchart of the proposed fast TU decision method is shown in Figure 6. Finally, we combine the proposed adaptive RQT depth and the NNZ-ETDA to further reduce the computational load of the TU module. In other words, the adaptive RQT depth sets the candidate depth levels before the NNZ-ETDA is performed. The flowchart of the proposed adaptive RQT-depth decision for NNZ-ETDA (ARD-NNZ-ETDA) implemented in HEVC HM 8.1 is shown in Figure 7. The proposed ARD-NNZ-ETDA first utilizes an adaptive RQT-depth decision algorithm to limit the depth range of the RQT and then employs the NNZ-ETDA to exclude unnecessary RDO calculations for TUs in the RQT.
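The depth prediction and type classification described in this section can be sketched as follows. The sketch assumes equal weights of 0.25 for the colocated, left, upper, and left-upper blocks as stated above. The Type 3 interval (2.5 to 3.5) is inferred from the neighbouring thresholds and is an assumption, and the function names are hypothetical. The candidate depth levels per type are given in Table 5 and are deliberately not reproduced here.

```python
# Sketch only: predict the RQT depth from four neighbouring blocks and map
# the prediction to a dynamic depth range type. Function names are
# hypothetical; the Type 3 interval is an assumption inferred from the
# surrounding thresholds.

def predict_depth(neighbour_depths, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the maximum RQT depths of Col, L, U, and L-U blocks."""
    if len(neighbour_depths) != len(weights):
        raise ValueError("one weight is required per neighbouring block")
    return sum(w * d for w, d in zip(weights, neighbour_depths))


def classify_type(depth_pred):
    """Return the dynamic depth range type (0 to 4) for a predicted depth."""
    if depth_pred <= 1:
        return 0
    if depth_pred <= 1.5:
        return 1
    if depth_pred <= 2.5:
        return 2
    if depth_pred <= 3.5:
        return 3   # assumed interval for the type between Type 2 and Type 4
    return 4


# Example: neighbours with maximum depths 1, 2, 2, and 3.
example_pred = predict_depth((1, 2, 2, 3))
example_type = classify_type(example_pred)
```

In the combined scheme, a classification of this kind would run first to restrict the candidate depths, and the NNZ-based check would then be applied only within that restricted range.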

Experiment Results
For the performance evaluation, we compare the total execution time of the proposed method with those of the HEVC HM 8.1 [18,19] and the NNZ-ETDA (threshold = 3) in order to confirm the reduction in computational complexity. The coding performance is evaluated in terms of ΔBitrate, ΔPSNR, and ΔTime. The system hardware is an Intel(R) Core(TM) i5-3350P CPU at 3.10 GHz with 8.0 GB of memory, running the Windows XP 64-bit operating system. Additional details of the encoding environment are described in Table 6.
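One common way to compute such measures is sketched below. The sign conventions are assumptions rather than the definitions used in this paper: ΔTime is taken as the percentage of reference time saved, so that a positive value means the proposed method is faster, ΔBitrate as the percentage bit rate change relative to the reference, and ΔPSNR as a plain difference in dB. The function names are hypothetical.

```python
# Sketch only: relative performance measures against a reference encoder.
# Sign conventions and function names are assumptions for illustration.

def delta_time(time_ref, time_proposed):
    """Percentage of the reference encoding time that is saved."""
    return (time_ref - time_proposed) / time_ref * 100.0


def delta_bitrate(bitrate_ref, bitrate_proposed):
    """Percentage bit rate change relative to the reference."""
    return (bitrate_proposed - bitrate_ref) / bitrate_ref * 100.0


def delta_psnr(psnr_ref, psnr_proposed):
    """PSNR difference in dB (negative means a quality loss)."""
    return psnr_proposed - psnr_ref
```

Under this convention, an encoder that needs 50 s where the reference needs 200 s has a ΔTime of 75%.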
Table 7 shows the TU processing time (QP = 22) of the proposed method and the HEVC HM 8.1. As shown in Table 7, the TU processing time was effectively improved, with ΔTime = 61.26% on average compared with that of HM 8.1 and with insignificant loss of image quality. On the other hand, Table 8 shows the TU processing time of the proposed method and the NNZ-ETDA under the same scenario and QP value. Our method further achieves a time improving ratio (ΔTime) of 17.92% compared with the NNZ-ETDA. It is clear that the proposed method efficiently reduces the computational complexity of the TU module with insignificant loss of image quality. In addition, the average time improving ratio (TIR) using different QP values is shown in Figure 8. Figures 8(a) and 8(b) show how the TIR of the proposed method varies compared with HM 8.1 and NNZ-ETDA, respectively. From Figure 8(a), we can observe that the encoding time improvement increases as the value of QP increases. However, from Figure 8(b), we can find that the encoding time improvement decreases as the value of QP increases. As shown in Figure 9, the proposed method can improve the TU encoding time under different QPs. However, the encoding time is close to that of NNZ-ETDA when the value of QP increases. This is because the quantization error becomes so large that it results in lower temporal-spatial correlation.

Conclusions and Further Works
We propose an adaptive RQT depth for NNZ-ETDA by exploiting the high temporal-spatial correlation that exists in natural video sequences. An adaptive RQT depth is applied to the NNZ-ETDA to further reduce the computational load of the TU module. The proposed method can achieve a time improving ratio of about 61.26% ∼ 81.48% compared to the HM 8.1 encoder with insignificant loss of image quality. When compared with the NNZ-ETDA, the proposed method can further achieve a time improving ratio of about 8.29% ∼ 17.92%. In addition, the proposed method can also be considered in the design of a fast HEVC encoder.
To support 3D video format applications and efficient rate control at low bit rates, HEVC extensions for the efficient compression of stereo and multiview video are being developed by JCT-3V [20]. However, the higher encoding complexity of 3D-HEVC is a key problem in real-time applications [21,22]. This creates an opportunity to develop fast 3D video encoding algorithms for practical implementations. In future work, we will further exploit the high temporal-spatial correlation existing in 3D video to develop an effective 3D-HEVC encoding method.

Figure 5: Temporal and spatial correlations of TUs.

Figure 8: Average time improving ratio using different QP values: (a) proposed versus HM 8.1 and (b) proposed versus NNZ-ETDA.

Figure 9: The TU processing time with different QP: (a) traffic and (b) people on street.

Table 1: PU and TU sizes in HEVC.

Table 3: $P(\mathrm{TU}_N \mid \mathrm{NZ}_N = k)$ of determined TU sizes according to NNZ in people on street.

Table 5: Candidate depth levels for each RQT type.

Table 6: Test conditions and software reference configurations.

Table 7: Results of comparison between HM 8.1 and proposed method.

Table 8: Results of comparison between NNZ-ETDA and proposed method.