FPGA Implementation of Optimal 3D-Integer DCT Structure for Video Compression
The Scientific World Journal

A novel optimal structure for implementing the 3D-integer discrete cosine transform (DCT) is presented by analyzing various integer approximation methods. Integer sets with reduced mean squared error (MSE) and high coding efficiency are considered for implementation in FPGA. The proposed method shows that the fewest resources are utilized by the integer set with the shortest bit lengths. The optimal 3D-integer DCT structure is determined by analyzing the MSE, power dissipation, coding efficiency, and hardware complexity of different integer sets. The experimental results reveal that the direct method of computing the 3D-integer DCT using the integer set [10, 9, 6, 2, 3, 1, 1] performs better than the other integer sets in terms of resource utilization and power dissipation.


Introduction
Nowadays most video compression algorithms rely on reducing spatial and temporal redundancy through motion compensation and prediction. However, these algorithms are complex, and no symmetry exists between the encoding and decoding blocks, which makes their implementation more complex. 3D-DCT based video coding [1] is considered an alternative to the existing standard video compression algorithms. It eliminates some of the problems, such as the blocking effect, caused by the motion estimation algorithm, which is lossy and time-consuming [2]. For a video sequence that involves a fast-moving object, motion estimation may not yield the correct motion vector, since a full search cannot be done in a given video stream.
A few research efforts have been made to enhance the 3D-DCT based video codec [3][4][5] and make it comparable to the standard video compression algorithms. If an implementable structure exists for the 3D-integer DCT, it will further accelerate the encoding process. Zaharia et al. [6] developed a lossy compression scheme that applies the 3D-DCT to compress 3D integral images and showed that it outperforms the JPEG standard. Even though recent compression standards based on the discrete wavelet transform outperform the JPEG standard, the DCT is preferred because fast computation structures exist for it. This reflects the need for new hardware for the 3D-integer DCT. However, no attempt has been made to implement the 3D-integer DCT algorithm. It is essential to assess the suitability of 3D-DCT based video coders for real-time applications by analyzing the hardware complexity.
Most standard video compression algorithms such as MPEG and H.26X adopt the DCT as part of their standard. This has led to the development of many fast 1D- and 2D-DCT algorithms. The fundamental aim behind the development of new DCT algorithms is to reduce the number of multiplications and additions. Computing the DCT directly for an input sequence of length $N$ requires $N^2$ multiplications and $N(N-1)$ additions. The fast DCT algorithm stated in [7] reduces the computational complexity to $(N/2)\log_2 N$ multiplications and $N\log_2 N$ additions. A few algorithms and implementation structures exist for computing the real valued 1D-DCT and 2D-DCT [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23]. Among them, the algorithm presented by Prado and Duhamel [16] is given significant importance, because the study reveals that if an optimal algorithm is obtained for the 1D-DCT, then its extension to the corresponding 2D-DCT and 3D-DCT algorithms will also be optimal. However, implementing the real valued transform is more complex, since a floating point multiplier is unavoidable even though it consumes more resources. Cham et al. [24] have presented a simplified algorithm that first converts the floating point values to fixed point and then performs the DCT. However, exact energy transformation does not happen in this case because of the floating to fixed point conversion. The errors occurring during the computation of the 1D-DCT are propagated to the third dimension.
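To make the operation counts concrete, the short sketch below tabulates them for a few block lengths. It assumes the direct counts are N^2 multiplications and N(N - 1) additions and that the fast-algorithm counts read (N/2) log2 N and N log2 N, as quoted for [7]; the function names are hypothetical and N is assumed to be a power of two.

```python
def direct_dct_ops(n):
    """Return (multiplications, additions) for a direct N-point DCT."""
    return n * n, n * (n - 1)


def fast_dct_ops(n):
    """Return (multiplications, additions) as quoted for the fast algorithm.

    Assumes n is a power of two, so log2(n) is an exact integer.
    """
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return (n // 2) * stages, n * stages


for n in (8, 16, 32):
    print(n, direct_dct_ops(n), fast_dct_ops(n))
```

For the 8-point case used throughout this paper, the direct form needs 64 multiplications against 12 for the fast form under these assumed counts.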
Currently, DCTs with integer coefficients are of great interest because the design is simpler and can be implemented more efficiently. An improvement over the traditional real and fixed point implementations was proposed by Edirisuriya et al. [25]. In that paper the DCT was computed using integer values, so there is no need to design a floating point multiplier, which consumes more resources and time. The survey clearly shows the usage of the integer DCT in 3D-DCT based video and image compression algorithms. However, efforts to design hardware for the 3D-integer DCT are rare in the literature. A few approximation methods are available for deriving the equivalent integer DCT from the real valued DCT. They are classified as the indirect or C-matrix transform method proposed by Kwak et al. [26] and the direct method by Pei and Ding [27]. In this paper the two approximation methods (direct and indirect) are considered for analysis, and the optimal integer set for computing the 3D-integer DCT is determined based on MSE and coding efficiency.
Finally, based on power dissipation and resource utilization, the optimal structure for the 3D-integer DCT is determined.

3D-Discrete Cosine Transforms
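The discrete cosine transform (DCT) is a member of a family of sinusoidal unitary transforms. It has found applications in digital signal processing, particularly in image and video compression. The family of discrete trigonometric transforms consists of 8 versions of the DCT. Each transform is identified as even or odd and as type I, II, III, or IV. All present image and video processing applications involve only the even types of the DCT. In particular, DCT-II has received much attention in video compression applications because of its high energy packing ability and because fast computation structures exist for it. Throughout the text, DCT-II is therefore referred to simply as DCT. Equation (1) defines the one-dimensional DCT and inverse DCT for a finite duration signal $x(n)$ of length $N$ as

$$X(k) = \alpha(k) \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1)k\pi}{2N}\right], \qquad x(n) = \sum_{k=0}^{N-1} \alpha(k)\, X(k) \cos\left[\frac{(2n+1)k\pi}{2N}\right], \tag{1}$$

where $k, n = 0, 1, \ldots, N-1$ and

$$\alpha(k) = \begin{cases} \sqrt{1/N}, & k = 0, \\ \sqrt{2/N}, & k = 1, 2, \ldots, N-1. \end{cases}$$

Image and video frames are usually two-dimensional in nature. Because of its orthogonality and separability properties, the DCT can be extended to two dimensions. The 2D-DCT for a block of pixels of size $N \times N$ whose intensity values range between 0 and 255 is defined as

$$F(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right],$$

where $u, v, x, y = 0, 1, \ldots, N-1$.

The equation for computing the 2D-DCT is extended along the temporal domain to get the required expression for computing the 3D-DCT. It is defined in (5) and (7):

$$F(u,v,w) = \alpha(u)\,\alpha(v)\,\alpha(w) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} \sum_{t=0}^{N-1} f(x,y,t) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right] \cos\left[\frac{(2t+1)w\pi}{2N}\right], \tag{5}$$

where $\alpha(\cdot)$ is the normalization factor defined above, and $F(u,v,w)$ and $f(x,y,t)$ represent the frequency domain and time domain intensity values, respectively. Correspondingly, the expression for the inverse 3D-DCT is

$$f(x,y,t) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \sum_{w=0}^{N-1} \alpha(u)\,\alpha(v)\,\alpha(w)\, F(u,v,w) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right] \cos\left[\frac{(2t+1)w\pi}{2N}\right]. \tag{7}$$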
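The separability property means the 3D-DCT can be computed as three passes of the 1D-DCT, one per axis. The following is a minimal sketch of that idea using the orthonormal DCT-II; the cube layout `cube[t][y][x]`, the helper names, and the small 4 × 4 × 4 test cube are assumptions made for illustration, not part of the original paper.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix: row k, column m."""
    rows = []
    for k in range(n):
        alpha = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([alpha * math.cos((2 * m + 1) * k * math.pi / (2 * n))
                     for m in range(n)])
    return rows


def transform_axis(cube, mat, axis):
    """Apply mat along one axis of cube[t][y][x] (axis 0 = t, 1 = y, 2 = x)."""
    n = len(mat)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                row = mat[(i, j, k)[axis]]
                total = 0.0
                for m in range(n):
                    src = (m, j, k) if axis == 0 else (i, m, k) if axis == 1 else (i, j, m)
                    total += row[m] * cube[src[0]][src[1]][src[2]]
                out[i][j][k] = total
    return out


def dct3(cube):
    """Forward 3D-DCT: 1D-DCT along x, then y, then t."""
    c = dct_matrix(len(cube))
    for axis in (2, 1, 0):
        cube = transform_axis(cube, c, axis)
    return cube


def idct3(coeffs):
    """Inverse 3D-DCT: the transposed matrix undoes each pass."""
    ct = [list(col) for col in zip(*dct_matrix(len(coeffs)))]
    for axis in (0, 1, 2):
        coeffs = transform_axis(coeffs, ct, axis)
    return coeffs


n = 4
cube = [[[float((3 * t + 5 * y + 7 * x) % 256) for x in range(n)]
         for y in range(n)] for t in range(n)]
coeffs = dct3(cube)
restored = idct3(coeffs)
max_err = max(abs(restored[t][y][x] - cube[t][y][x])
              for t in range(n) for y in range(n) for x in range(n))
print(max_err < 1e-9)
```

Because the transform matrix is orthonormal, applying its transpose along each axis recovers the input up to floating point rounding.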

Integer Approximation of 3D-DCT Using Indirect Method
In the indirect method, integer values are obtained using other orthogonal transforms such as the Walsh-Hadamard transform (WHT). The DCT can be implemented using the WHT through a conversion matrix, as shown in (8):

$$\hat{c} = T\,\hat{w}, \tag{8}$$

where $\hat{c}$ represents the discrete cosine transform vector and $T$ is the conversion matrix that converts the Walsh domain vector $\hat{w}$ into the DCT domain. In the indirect method there are 11 different elements in total in the conversion matrix. Substituting a variable for each nonzero element of the matrix results in 11 variables, as represented in (9), where $\hat{T}_8$ is the approximated conversion matrix. A search was made for suitable integer values that preserve the signs of the elements of $\hat{T}_8$. The values also have to satisfy the algebraic equations (10) to (12): equations (10) and (11) are the orthogonality conditions, which ensure that the rows of $\hat{T}_8$ are orthogonal to each other, and (12) is the normality condition. To make the elements of $\hat{T}_8$ resemble those of the real valued transform, constraints are set on the 11 variables. The magnitudes of the elements in $T_8$ are compared to obtain the inequalities (13) to (16).
All the integer solutions satisfying (10) to (12) under the constraints given by (13) to (16) guarantee that the approximated conversion matrix $\hat{T}_8$ is orthonormal and close to the original conversion matrix $T_8$. The generalized signal flow graph of integer approximation using the indirect method is given in Figure 1, in which the blue lines represent addition and the dotted red lines represent subtraction. Additional information regarding integer approximation can be found in the work of Britanak et al. [28].

Integer Approximation Using Direct Method
In the direct method, equivalent integer values are obtained directly and replace the rational numbers in the DCT matrix. The approximated integer cosine transform matrix is given by the product of a diagonal matrix, with normalization factors on its main diagonal, and an integer matrix. There are 7 different elements in total in the DCT matrix. The same variable is used to represent elements of the matrix that have the same magnitude. Substituting a variable for each nonzero element in the matrix results in 7 variables, as shown in (19). A set of inequalities is formed so that the orthogonality and normality properties of the DCT matrix are preserved in the integer domain. Solving (20) under the set of constraints described in (21) yields the candidate integer sets. Sets with low mean squared error (MSE) and high transform coding efficiency are preferred to get the optimal solution for the 3D-integer DCT. Fast computation structures are obtained by the recursive sparse matrix factorization method. The generalized signal flow graph of integer approximation using the direct integer DCT is given in Figure 2, where the parameters are integers or dyadic rationals.
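As an illustration of the orthogonality requirement, the sketch below builds an order-8 integer matrix from seven values and checks that its rows are mutually orthogonal. The variable names a to g and the row layout are assumptions: they follow the commonly tabulated order-8 integer cosine transform, since the matrix of (19) is not reproduced here, and the mapping of the set [10, 9, 6, 2, 3, 1, 1] onto (a, b, c, d, e, f, g) is likewise assumed.

```python
def ict_matrix(a, b, c, d, e, f, g):
    """Order-8 integer matrix with 7 distinct magnitudes (assumed layout)."""
    return [
        [g,  g,  g,  g,  g,  g,  g,  g],
        [a,  b,  c,  d, -d, -c, -b, -a],
        [e,  f, -f, -e, -e, -f,  f,  e],
        [b, -d, -a, -c,  c,  a,  d, -b],
        [g, -g, -g,  g,  g, -g, -g,  g],
        [c, -a,  d,  b, -b, -d,  a, -c],
        [f, -e,  e, -f, -f,  e, -e,  f],
        [d, -c,  b, -a,  a, -b,  c, -d],
    ]


def gram(mat):
    """Inner products between every pair of rows."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in mat] for r1 in mat]


M = ict_matrix(10, 9, 6, 2, 3, 1, 1)
G = gram(M)
rows_orthogonal = all(G[i][j] == 0
                      for i in range(8) for j in range(8) if i != j)
squared_norms = [G[i][i] for i in range(8)]
print(rows_orthogonal, squared_norms)
```

With this layout the odd rows are orthogonal only when ab = ac + bd + cd, which holds for the assumed mapping (90 = 60 + 18 + 12). The squared row norms are the quantities that the diagonal normalization matrix would have to compensate.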

Criteria for Evaluation of Approximated Integer DCT
In order to evaluate the approximation error between the integer DCT and the original transform matrix, and to measure the difference in data compression performance, some theoretical criteria are needed. For this purpose, the input signal is frequently modeled as a first-order stationary Markov process (Markov-1) with zero mean, unit variance, and adjacent interelement correlation coefficient $\rho$ chosen between zero and one. The input signal $x$ is then defined by a covariance matrix $R_x$, whose elements are given by

$$[R_x]_{ij} = \rho^{|i-j|}, \qquad i, j = 0, 1, \ldots, N-1.$$

The matrix $R_x$ is symmetric and Toeplitz. The covariance matrix of the transformed vector $y$, where $y = Cx$ and $C$ is the transform matrix, is obtained from (25):

$$R_y = C\, R_x\, C^{T}. \tag{25}$$

Mean Squared Error
For the evaluation of approximation error between the approximated and original transform matrix, the parameter mean squared error (MSE) was used. It is defined as follows.
Let us assume that $C$ is the original transform matrix and $\hat{C}$ is its approximation. Then, for a given input vector $x$ of length $N$, the error vector is

$$e = Cx - \hat{C}x = (C - \hat{C})\,x. \tag{26}$$

From (26), the MSE between the original and approximated transform can be defined by

$$\mathrm{MSE} = \frac{1}{N}\,\mathrm{trace}\left[(C - \hat{C})\, R_x\, (C - \hat{C})^{T}\right],$$

where $R_x$ is the covariance matrix of the input signal $x$. Thus, to maintain the compatibility between the original and approximated transform, the MSE should be minimized.
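The MSE criterion can be evaluated numerically as in the sketch below. It assumes the trace form MSE = (1/N) trace[(C - Ĉ) R_x (C - Ĉ)^T], a Markov-1 covariance with ρ = 0.95, and an approximation Ĉ obtained by scaling each row of an assumed order-8 integer matrix to unit length. The integer layout and all helper names are hypothetical, so the printed value is not guaranteed to match the paper's tables.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C (row k, column m)."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos((2 * m + 1) * k * math.pi / (2 * n))
             for m in range(n)] for k in range(n)]


def ict_matrix(a, b, c, d, e, f, g):
    """Assumed order-8 integer matrix layout (not taken from the paper)."""
    return [
        [g,  g,  g,  g,  g,  g,  g,  g],
        [a,  b,  c,  d, -d, -c, -b, -a],
        [e,  f, -f, -e, -e, -f,  f,  e],
        [b, -d, -a, -c,  c,  a,  d, -b],
        [g, -g, -g,  g,  g, -g, -g,  g],
        [c, -a,  d,  b, -b, -d,  a, -c],
        [f, -e,  e, -f, -f,  e, -e,  f],
        [d, -c,  b, -a,  a, -b,  c, -d],
    ]


def normalize_rows(mat):
    """Scale each integer row to unit length (diagonal normalization)."""
    return [[v / math.sqrt(sum(x * x for x in row)) for v in row]
            for row in mat]


def markov_cov(n, rho):
    """Markov-1 covariance matrix with elements rho**|i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def mse(c, c_hat, r_x):
    """(1/N) * trace(D R_x D^T) with D = C - C_hat."""
    n = len(c)
    d = [[c[i][j] - c_hat[i][j] for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                total += d[i][j] * r_x[j][k] * d[i][k]
    return total / n


C = dct_matrix(8)
C_HAT = normalize_rows(ict_matrix(10, 9, 6, 2, 3, 1, 1))
R_X = markov_cov(8, 0.95)
print(mse(C, C_HAT, R_X))
```

A smaller value indicates an integer set whose normalized basis vectors lie closer to those of the real valued DCT.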

Transform Efficiency
Equation (28) defines the transform efficiency:

$$\eta = \frac{\sum_{i=0}^{N-1} |r_{ii}|}{\sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |r_{ij}|} \times 100, \tag{28}$$

where $r_{ij}$ are elements of $R_y$. The transform efficiency measures the decorrelation ability of the transform. The optimal KLT converts a signal into completely uncorrelated coefficients and has transform efficiency $\eta = 100$ for all values of $\rho$, while the DCT has transform efficiency $\eta = 93.9911$ for the correlation coefficient $\rho = 0.95$.
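The transform efficiency can be checked numerically with the sketch below. It assumes the efficiency is the ratio of the summed diagonal magnitudes of the transformed covariance matrix to the sum of all its element magnitudes, times 100, with a Markov-1 input at correlation 0.95; the helper names are hypothetical. If that assumed definition matches the paper's, the value for the 8-point DCT should land close to the 93.9911 quoted in the text.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix (row k, column m)."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos((2 * m + 1) * k * math.pi / (2 * n))
             for m in range(n)] for k in range(n)]


def markov_cov(n, rho):
    """Markov-1 covariance matrix with elements rho**|i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transform_efficiency(c, r_x):
    """100 * sum|r_ii| / sum|r_ij| for R_y = C R_x C^T."""
    c_t = [list(col) for col in zip(*c)]
    r_y = matmul(matmul(c, r_x), c_t)
    diagonal = sum(abs(r_y[i][i]) for i in range(len(r_y)))
    everything = sum(abs(v) for row in r_y for v in row)
    return 100.0 * diagonal / everything


N = 8
R_X = markov_cov(N, 0.95)
IDENTITY = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
eta_dct = transform_efficiency(dct_matrix(N), R_X)
eta_identity = transform_efficiency(IDENTITY, R_X)
print(eta_dct, eta_identity)
```

The identity transform is included only as a baseline: it leaves the input correlation untouched, so its efficiency is far lower than that of the DCT.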

Structure for Computing 3D-Integer Discrete Cosine Transform
In order to reduce the hardware complexity, optimal integer sets from the direct and indirect integer approximations are chosen based on the number of multiplications and additions. The structure with the minimum complexity is used to compute the 3D-integer DCT by taking the 1D-integer DCT along the row, column, and temporal directions. The block diagram of the proposed 3D-integer DCT is shown in Figure 3. To compute the 3D-integer DCT for a cube of dimension, say, 8 × 8 × 8, the 1D-integer DCT is first performed row-wise and the computed values are stored column-wise in the first buffer. The process is repeated for all the rows of the cube, from frame 1 to frame 8. For clear visualization, rows are marked with the same color in Figure 3. Here the buffer size and cube size are identical. The structure for computing the DCT may come from either the direct method or the indirect method. Similarly, the 1D-integer DCT is computed row-wise for the values stored in the first buffer and the results are stored column-wise in the second buffer; this results in the 2D-integer DCT. One more 1D-integer DCT is then performed along the temporal direction on the values stored in the second buffer, which gives the 3D-integer DCT value as shown in Figure 3.
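The row, column, and temporal passes described above can be sketched in software as follows. This is an illustration only, not the Verilog design: the integer matrix layout, the per-frame buffers `buf_a` and `buf_b`, and the function names are assumptions, and normalization factors are ignored so that all arithmetic stays in integers.

```python
def ict_matrix(a, b, c, d, e, f, g):
    """Assumed order-8 integer matrix layout (not taken from the paper)."""
    return [
        [g,  g,  g,  g,  g,  g,  g,  g],
        [a,  b,  c,  d, -d, -c, -b, -a],
        [e,  f, -f, -e, -e, -f,  f,  e],
        [b, -d, -a, -c,  c,  a,  d, -b],
        [g, -g, -g,  g,  g, -g, -g,  g],
        [c, -a,  d,  b, -b, -d,  a, -c],
        [f, -e,  e, -f, -f,  e, -e,  f],
        [d, -c,  b, -a,  a, -b,  c, -d],
    ]


def mat_vec(mat, vec):
    """One 1D integer transform: matrix times vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def integer_dct3(cube, mat):
    """3D integer transform of cube[t][y][x] via row, column, temporal passes."""
    n = len(mat)
    second_buffer = []
    for frame in cube:
        buf_a = [[0] * n for _ in range(n)]
        for y, row in enumerate(frame):            # pass 1: row-wise 1D transform
            for u, val in enumerate(mat_vec(mat, row)):
                buf_a[u][y] = val                  # stored column-wise
        buf_b = [[0] * n for _ in range(n)]
        for u, row in enumerate(buf_a):            # pass 2: gives the 2D transform
            for v, val in enumerate(mat_vec(mat, row)):
                buf_b[v][u] = val                  # stored column-wise
        second_buffer.append(buf_b)
    out = [[[0] * n for _ in range(n)] for _ in range(n)]
    for v in range(n):
        for u in range(n):                         # pass 3: temporal direction
            temporal = [second_buffer[t][v][u] for t in range(n)]
            for w, val in enumerate(mat_vec(mat, temporal)):
                out[w][v][u] = val
    return out


M = ict_matrix(10, 9, 6, 2, 3, 1, 1)
cube = [[[(t * 64 + y * 8 + x) % 256 for x in range(8)]
         for y in range(8)] for t in range(8)]
result = integer_dct3(cube, M)
print(result[0][0][0])
```

Storing each pass column-wise acts as a transpose, so the same row-oriented 1D unit can be reused for the next direction.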

Determination of Optimal Integer Set for Computing 3D-Integer Discrete Cosine Transform
In order to determine the optimal integer set, the performance of the proposed 3D-integer DCT (3D-IDCT) is compared against the existing real valued transforms with respect to MSE and transform coding efficiency. Different possible integer solutions exist for both the direct and indirect methods of computing the 1D-IDCT, and each is used to compute the 3D-IDCT. The MSE and transform coding efficiency of the corresponding integer sets, along with the computational complexity, are listed in Tables 1 and 2. The integer solutions whose MSE and coding efficiency are very close to those of the real valued transform are considered for FPGA implementation. It is also observed that, although the integer sets with higher bit solutions (5, 6, 7, and 8) yield low MSE and high coding efficiency, they are not preferred for implementation. This is because, when computing the 3D-integer DCT, the registers (the two buffers) holding intermediate values become larger for higher bit solutions, which directly increases the computational complexity; that is, a multiplier with a longer bit length is required. Further, it is noted that for integer sets having zero or one as one of the elements, a variation in the number of multiplications and additions is observed.
Here the number of multiplications and additions is estimated based on the structure shown in Figure 3. With reference to the results in Tables 1 and 2, the integer set [10, 9, 6, 2, 3, 1, 1] is chosen as the optimal integer set, since this integer set yields relatively low MSE and high coding efficiency when compared to the real valued transform.
Further, it was observed that if the optimal integer set is used to encode the video sequence instead of the real valued 3D-DCT, there is not much deviation in the PSNR value. However, the deviation is proportional to the MSE of the corresponding integer set. For the optimal integer set, the maximum degradation in PSNR was found to be 0.01 dB.
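The PSNR figures mentioned above can be reproduced for any pair of frames with a short helper such as the one below. The paper does not spell out its PSNR formula, so this sketch assumes the standard definition for 8-bit video, 10 log10(255^2 / MSE); the function name and the toy 4 × 4 frames are hypothetical.

```python
import math


def psnr(reference, decoded, peak=255):
    """PSNR in dB between two equally sized frames given as lists of rows."""
    diffs = [(r - d) ** 2
             for ref_row, dec_row in zip(reference, decoded)
             for r, d in zip(ref_row, dec_row)]
    mean_sq_err = sum(diffs) / len(diffs)
    if mean_sq_err == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mean_sq_err)


reference = [[100 + 4 * y + x for x in range(4)] for y in range(4)]
decoded = [[p + 1 for p in row] for row in reference]  # every pixel off by one
print(psnr(reference, decoded))
```

The degradation attributed to an integer set would then be the difference between the PSNR obtained with the real valued 3D-DCT and the PSNR obtained with the integer transform on the same sequence.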

FPGA Implementation of 3D-IDCT
The hardware design for computing the 3D-integer DCT for an 8 × 8 × 8 block of data using the integer set [10, 9, 6, 2, 3, 1, 1] was coded in the Verilog Hardware Description Language. The functional behavior of the design was tested in the Xilinx ISE simulator with a sample data set. Simulations were also performed in MATLAB on the same data set to verify correctness. The design was mapped onto an Artix-7 FPGA board. The Artix-7 is built on 28-nanometer (nm) process technology and is designed for low power products used in portable communication devices. The maximum DC value of the 3D-DCT was found to be 4000. If normalization factors are neglected, a maximum of 17 bits is required in the integer domain to hold the 3D-integer DCT value.
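One way to arrive at a 17-bit figure is to bound the unnormalized DC coefficient, as in the sketch below. It assumes 8-bit unsigned pixels and a DC row of all ones in the integer matrix, so that the DC term is simply the sum of the 512 pixels in the cube; this is an illustrative derivation and may not be the one the authors used, and it does not account for a sign bit or for the AC coefficients.

```python
BLOCK_EDGE = 8      # cube is 8 x 8 x 8
MAX_PIXEL = 255     # assumed 8-bit unsigned intensity

pixels_per_cube = BLOCK_EDGE ** 3
max_dc = pixels_per_cube * MAX_PIXEL   # every pixel at full intensity
dc_bits = max_dc.bit_length()          # bits needed for the magnitude

print(pixels_per_cube, max_dc, dc_bits)
```

Under these assumptions the largest DC magnitude is 130560, which fits just below 2^17 = 131072.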
As the values of the elements in the integer set increase, the bit length of the processing elements also increases. To show that the fewest resources are utilized by the integer set with the shortest bit lengths, synthesis was also performed for the integer set [13, 12, 5, 12, 0, 0, 12, 4, 3, 3, 4] and compared with the optimal integer set. The device utilization summary in Table 3 shows that more resources are utilized by the integer set [13, 12, 5, 12, 0, 0, 12, 4, 3, 3, 4]. This is because computing the 3D-integer DCT with this integer set requires 25 bits, whereas the optimal integer set requires only 17 bits. When the bit length of the integer set increases, the bit length of the computational units (multiplication and addition) also increases, which leads to higher resource utilization. The Xilinx Power Estimator (XPE) tool was used to estimate the power consumption of the design. The distribution of on-chip power and the total power of the design are shown in Figure 4.
The total on-chip power reflects the heat dissipated from the chip. If the device operates with a 100 MHz clock and a total on-chip power of 0.201 W, the junction temperature is 25.4 °C, which is well below the thermal margin of the target FPGA device. A comparison has also been made between the existing fixed point 2D-DCT algorithm based on the Loffler method [22] and the proposed 3D-integer DCT algorithm in terms of device utilization. Twelve instances of the fixed point 2D-DCT Loffler structure are needed to compute a fixed point 3D-DCT; the resource utilization is calculated on that basis and is given in Table 4. It is clearly seen from Table 4 that the proposed 3D-integer DCT algorithm with the optimal integer set [10, 9, 6, 2, 3, 1, 1] outperforms the fixed point 3D-DCT algorithm based on the Loffler method [22].

Conclusion
In this paper, various integer sets from different approximation methods for converting real valued transforms to integer valued transforms are analyzed in terms of MSE and coding efficiency. Based on that analysis, the optimal integer set is chosen for computing the 3D-integer DCT. When the optimal integer set was adopted to encode the video sequence, the deviation in PSNR with respect to the real valued DCT was found to be 0.01 dB. A new hardware structure for computing the 3D-integer DCT is also proposed and implemented on an FPGA board. The synthesis results reveal that the fewest resources are utilized by the integer set with the shortest bit lengths. A variation in resource utilization is also observed depending on the number of additions and multiplications. The experimental results reveal that the direct method of computing the 3D-integer DCT using the integer set [10, 9, 6, 2, 3, 1, 1] performs better than the other integer sets in terms of resource utilization and power dissipation.