Asynchronous Realization of Algebraic Integer-Based 2D DCT Using Achronix Speedster SPD60 FPGA

Transformation and quantization play a critical role in video codecs. Recently proposed algebraic-integer-(AI-) based discrete cosine transform (DCT) algorithms are analyzed in the presence of quantization, using the High Efficiency Video Coding (HEVC) standard. AI DCT is implemented and tested on asynchronous quasi delay-insensitive logic, using Achronix SPD60 field programmable gate array (FPGA), which leads to lower complexity, higher speed of operation, and insensitivity to process-voltage-temperature variations. Performance of AI DCT with HEVC is measured in terms of the accuracy of the transform coefficients and the overall rate-distortion (R-D) characteristics, using HM 7.1 reference software. Results indicate a 31% improvement over the integer DCT in the number of transform coefficients having error within 1%. The performance of the 65nm asynchronous hardware in terms of speed of operation is investigated and compared with the 65 nm synchronous Xilinx FPGA. Considering word lengths of 5 and 6 bits, a speed increase of 230% and 199% is observed, respectively. These results indicate that AI DCT can be potentially utilized in HEVC for applications demanding high accuracy as well as high throughput. However, novel quantization schemes are required to allow the accuracy improvements obtained.


Introduction
High dynamic range (HDR) video and image transmission over digital communication channels is undergoing exponential growth [1].With the increasing demand for high-definition programming, there exists a strong need for efficient digital video coding (DVC) that provides high data compression ratios which in turn leads to better utilization of network resources [2].The H.264/AVC standard [3] does not provide the required compression ratios for emerging capture and display technologies such as ultra high definition (UHD) [4], multiview [5], and autostereoscopy [6].To address such emerging needs, the Joint Collaborative Team on Video Coding (JCT-VC) has developed the successor for H.264/AVC, called High Efficiency Video Coding (HEVC) [4].The HEVC standard aims at achieving a 50% reduction in data rate compared with its predecessors while maintaining low complexity computation.Video compression systems operating at high frequencies and resolutions require hardware capable of significant throughput with tolerable area and power requirements.Real-time video compression circuits having high numerical accuracy are needed for nextgeneration video [1], coding systems [2,3,7], and retina displays [8].
The two-dimensional (2D) 8 × 8 discrete cosine transform (DCT) is a fundamental operation in real-time video systems, which is adopted in compression standards, such as JPEG, MPEG-1, MPEG-2, H.261, H.263, H.264, and most recently H.265/HEVC [3,4,9].The DCT is the de facto standard in transform coding, with extensive applications in modern DVC standards, due to its superior energy compaction on par with the optimum Karhunen-Loève transforms, achieved at reasonably low computational complexity [9,10].The circuit realization of the 2D 8 × 8 DCT affects noise, distortion, circuit area, and power consumption of such compression systems.The 2D DCT implementation is essentially dependent on the one-dimensional (1D) DCT.The 8-point 1D DCT requires multiplications by numbers in the form [] = cos(/16),  = 0, 1, . . ., 7.These constants impose implementation difficulties in terms of their machine representation, because they are irrational values.Fixed-point arithmetic DCT implementations usually employ rounding off to approximate such quantities, which introduces errors.Besides the numerical representation issues, error propagation, noise injection, noise coupling, and noise amplification are significant when considering fixed-point realizations.
Algebraic integer (AI) encoding [11] has been proposed for addressing this problem.Algebraic integers are defined as the roots of monic polynomials having integer coefficients.AI encoding maps irrational numbers to array of integers, which can be exactly manipulated in integer arithmetic hardware.Depending on the numbers to be encoded, AI mapping can be exact, which is the case for 8-point DCT multipliers [12].It was previously shown in [11,13] that AI number theoretic bases can lead to the exact computation of the 2D DCT.The resulting AI-based computation is required to be mapped back into usual fixed-point representation at a given precision using the final reconstruction step (FRS) [11], where an error is unavoidably introduced.However, this introduced error is controlled and not propagable.The AI-based DCT algorithms utilize a number of theoretical techniques, such as multiencoding [14], Booth encoding [15], and expansion factor technique [16].AI-based architectures for the accurate computation of DCT coefficients lead to low chip area, low critical path delay, and low dynamic power consumption.
This work is based on the 8 × 8 2D AI-based DCT digital hardware architecture in [16], which is an improvement of architectures proposed in [17,18].The architecture is based on the low-complexity Arai DCT algorithm [19], which formed the building block of each 1D DCT, using AI number representation.The 8-point Arai algorithm only needs 5 multiplications to generate the 8 output coefficients.The entire signal flow graph is free of quantization errors, unlike conventional fixed-point implementations.Error propagation is eliminated throughout intermediate computation, resulting in zero error correlation among the final DCT coefficients.Errors in the architecture are confined to the FRS, which as mentioned before maps the resulting doubly AI-encoded DCT coefficients into fixed-point representations [11].This allows the selection of individual levels of precision, for each of the 64 DCT coefficients, which makes the architecture capable of arbitrarily high accuracy.The architecture is multiplier-free because it operates over the error-free AI-encoded number representation.
We study the quantization effects of AI-based DCT architecture [16] in the domain of DVC, using the HEVC standard.The HEVC employs a scaled DCT approximation proposed by Chen et al. [20], for real-time DCT implementation, in order to reduce complexity.By replacing Chen's fast DCT algorithm in the HEVC standard with the AI-based Arai DCT, the potential improvements to video image quality in terms of signal-to-noise ratio (SNR) are investigated.Simulations with HM 7.1 reference software [21] are used for this purpose.Unlike its predecessor H.264/AVC, which only supported block sizes of 4 × 4 and 8 × 8, HEVC supports DCT transforms of block sizes of 16 × 16, 32 × 32, and 64 × 64.The higher-order DCTs are utilized to improve energy compaction, and to reduce quantization energy when compressive images contain large homogeneous areas.This study is limited to block sizes 4 × 4 and 8 × 8.The 4 × 4 DCT was implemented using the 8 × 8 DCT, by zero padding the input.All statistical analysis was conducted using standard video sequences available in online databases [22].
In addition to the evaluation of the quantization noise performance, the applicability to high-throughput applications is investigated, by implementation on high-throughput asynchronous quasi delay-insensitive (QDI) logic [23].The architecture is suitable for asynchronous QDI logic, because its feedforward structure can be pipelined, using the asynchronous QDI methods, such that the maximum throughput and speed of operation are achieved.In addition to higher throughput, asynchronous QDI logic has the advantages of (i) robustness to changes in circuit delays due to processvoltage-temperature (PVT) variations and (ii) simplified application-specific integrated circuit (ASIC) design effort in clock tree routing, due to absence of a global clock [23].We use a 65 nm CMOS field programmable gate array (FPGA) device SPD60, from Achronix Semiconductor Corporation [24], for asynchronous QDI logic realization.Performance comparison in terms of speed of operation is conducted using implementations on conventional 65 nm synchronous Xilinx Virtex-5 FPGA devices.
This paper unfolds as follows.In the next section the existing transform as well as the novel AI-based DCT transform is reviewed.The methodology used for investigating the performance of the novel AI-based DCT in the HEVC standard and the results obtained are given in Section 3. The asynchronous QDI logic is introduced in Section 4, followed by the results obtained by hardware implementation in Section 5.

HEVC Integer Cosine
Transform.The integer cosine transform (ICT) required in the HEVC specification is a scaled approximation of Chen's fast algorithm for computing the 1D DCT [20].Since the computational architecture is based on a regular butterfly structure, the ICT has lower circuit complexity at the physical layer.However, the ICT algorithm is not optimal in terms of speed or accuracy [4,20].Despite such suboptimality, the ICT algorithm has the advantage of being extensible to larger transform sizes such as 32 × 32 and 64 × 64, leading to its adoption in HEVC.An overview of the Chen's fast DCT algorithm is given  1.
In order to compute the 2D DCT, the 1D transform is applied twice: first in a row-wise manner, then column-wisely, with a transposition operation in between.Hence, the errors that are introduced in the rounding operation after the row and column transform calculations propagate through the 2D DCT computation and are present as additive noise at the final quantizer stage.The additive noise injected within the 2D transformation algorithm can effect a visible impact on picture fidelity especially for low levels of compression.

AI Encoding and Decoding.
Encoding real numbers to AI representation involves a mapping function  enc (; z) = a, where  is a real number, z is a fixed array of algebraic integers, and a is an array of integers, which represents  in the AI basis.Any real number can be represented in the AI basis with arbitrary precision, and there are some real numbers that can be represented without error [25].
In the proposed architecture, AI encoding requires no operation, since input data are assumed to be integer.In fact, an integer  satisfies the following rule  enc (; z) = [  0 0 0 ].Decoding operation is done in the FRS block, which implements the following operation: for an 8 × 8 input A [26], where C is the usual DCT matrix [10].This shows that the 2D DCT consists of (i) the column-wise application of the 1D DCT to the input A, (ii) a transposition operation, and then (iii) the row-wise application of the 1D DCT.
Here the basic building block of the 2D AI DCT architecture is taken as the 1D AI DCT block [16,18] with 8 inputs and 22 AI integer outputs, as shown in Figure 1.The block diagram of the architecture which consists of five subcircuits [16] is shown in Figure 2. The five subcircuits are reviewed as (i) an input decimator circuit, (ii) an 8-point AI-encoded 1D DCT block for column-wise computation, (iii) an AI-based transposition buffer, (iv) four parallel instantiations of the 8point AI-encoded 1D DCT block for row-wise computation, and (v) the FRS circuit for decoding AI-encoded 2D DCT coefficients.The last transposition is obtained via wired crossconnections.
The implementation does not cover the input decimator circuit, which can be realized using the high-speed serializer/deserializer (SerDes) intellectual property cores.The reader is directed to [16], where the architecture is described in detail.

AI-Based Arai DCT in HEVC
Let A be a 4 ⋅ 2  × 4 ⋅ 2  input data block, where ,  ∈ {0, 1}.One may calculate the 2D DCT of A by zero padding it and submitting it to the 8 × 8 DCT algorithm.Zero padding A furnishes the 8 × 8 matrix B defined according to the following block diagonal matrix:  (, )th element in matrices A  and B  , respectively.Then the following relationship holds true: where  = √ 2 (2−) ⋅ 2 (2−) .This expression can be directly obtained from the 2D DCT definition.Above relationship guarantees that the data blocks of size 4 × 4, 4 × 8, and 8 × 4 can be computed using the 8 × 8 DCT.For the 4 × 4 block size, we have  = 2; for the remaining cases  = √ 2.

Evaluation Setup.
Using sample video sequences, the performance of the proposed AI DCT algorithm was compared with the ICT scheme [4].Firstly, the accuracy of the existing transform unit was analyzed, followed by the effect of the new AI transform.Analysis was performed for the overall encoding process, by obtaining rate-distortion (R-D) curves associated with the standard test sequences.
The configuration used for this evaluation, including the quadtree picture-partitioning settings, is given in Table 3.
The configuration omits the use of larger-sized transforms and restricts the transform operation to 8 × 8 and 4 × 4 transforms.Nonsquare quadtree transform (NSQT) coding tool was also disabled to enable the aforementioned criterion [4,27].

Accuracy of the Transform Block.
The errors in the calculated transform coefficients, with respect to coefficients calculated using floating point MATLAB dct2 function, were obtained both for the original implementation and the proposed implementation.Success rates were obtained as the percentage of coefficients having an error ratio less than a selected threshold value.The results are displayed in Figure 3. Notice that for errors within 1%, the proposed algorithm effects a 31% improvement when compared with the scheme described in [4,20] .

Performance of the Overall Encoding Process.
Effect of the proposed transform on the overall performance of the encoding process was analyzed by obtaining R-D curves for standard video sequences.The curves were obtained by varying the quantization point (QP) from 0 to 8 and obtaining the peak signal-to-noise ratio (PSNR) of the reconstructed sequence, with reference to the original sequence, along with the bits/frame of the encoded video.(416 × 240), BlowingBubbles (416 × 240), and BQMall (832 × 480) using group of pictures (GOP) IPPP. . .sequence using intracoding.Figure 4 provides a comparison between the original video, encoded video using existing algorithm, and encoded video with proposed algorithm using a frame obtained using the test video sequence Basketballpass at QP = 32.The R-D curves (Figures 5-7) indicate that the difference in the rate points of the current and proposed algorithm is almost negligible.This is due to the accuracy degradation in the quantization process.The quantization noise injected at the quantizer dominates the total error diminishing the accuracy gains obtained by the superior transform block.This clearly shows that in order for high-performance HEVC video systems to benefit from the improved accuracy of AIbased architectures, the quantizer block of the codec must be modified or improved from its default design.

Asynchronous QDI FPGA Implementation
Achronix Semiconductor FPGAs utilize asynchronous QDI logic [23] which enables rapid asynchronous implementation of designs without requiring ASICs.The clockless nature of the asynchronous implementation leads to reduced design complexity, lower energy consumption, and higher speed of operation [23].It is also robust since the asynchronous operation is independent of changes in delay resulting from PVT variations [23].A synchronous I/O frame surrounds the clockless logic fabric of the device.Inputs and outputs of the design are made synchronous to a user-defined clock that determines the speed of operation.
The speed of operation of Achronix devices is maximized by means of asynchronous fine-grained pipeline stages termed "picoPIPE, " which are based on a 3-wire handshake scheme, in contrast to a single-wire connection with a global clock signal in synchronous FPGA devices.The correct operation is ensured by a local handshake protocol between adjacent pipeline stages.Adding a large number of pipeline stages to the datapath of the user design leads to highthroughput architectures which have speed of operation in the best case equal to the reciprocal of handshake time delay between two pipeline stages.
However, the existence of unavoidable feedback loops and "reconvergent paths" [28] causes lower speed of operation.These two phenomena are the most frequently occurring  types of critical paths in Achronix FPGAs, which are very different from causes of critical paths of synchronous FPGAs.The critical paths of synchronous FPGAs are determined by delayed paths in the circuit constrained by user-defined specifications such as clocks, input, and output delays.A netlist that contains "reconvergent paths"-nodes where a path with fewer pipeline stages (shorter path) reconverges with a path with more pipeline stages (longer path)-leads to slow speed of operation for asynchronous logic.In that case, data in the shorter path arrive earlier to the node and have to wait for data in the longer path to propagate forward.The performance limitation for a critical reconvergent path is the forward delay of data in the longer path and backward delay of the acknowledge signal in the shorter path [28].Achronix tools documentation [28] details methods for eliminating reconvergent paths.In the next section, we describe and employ some of them.The AI DCT architecture is most suitable for implementation on this device because reconvergent paths form the only speed limitation.The Achronix SPD60 FPGA device based on asynchronous QDI logic [23] was employed for implementation and testing of the architecture on-chip.

FPGA Implementation and Test
The architecture described was implemented using MAT-LAB/Simulink and Xilinx System Generator (XSG) blocks.
The FRS was designed using VHDL and imported to Simulink implementation through the XSG tools.The resulting design was tested using MATLAB/Simulink for correct operation and could be used for direct implementation in Xilinx devices.We generated behavioral hardware description language (HDL) using the XSG tool.VHDL code was synthesized for asynchronous QDI logic by submitting it to the tool flow for implementation in Achronix SPD60 FPGAs.The design was downloaded to the SDP60 device after bit-file generation, and tested successfully on chip using the Achronix SPC60 plugin card, using a serial connection established via an HDL wrapper.Subsequently, the design was submitted to the FPGA tool flows of both Achronix and Xilinx to investigate their maximum speed of operation, power consumption, and FPGA resource utilization.Note that the HDL of the architecture without the wrapper for serial communication is used in the synthesis engine.Resource utilization, maximum speed of operation, and power consumption results of the compilations are given in Table 4 for different fixed-point input widths  in the hardware design.The architecture having  = 8 was the most desirable design with the acceptable levels of tolerances.However, architectures with smaller values of  were also implemented and tested on chip for completeness.For comparison with the asynchronous implementation in terms of speed of operation and resource utilization, the design was compiled to a larger Xilinx synchronous FPGA device-Virtex-5 xc5vlx330-with the same technology node as the Achronix device.Designs were not physically implemented and tested on the Xilinx device but only sent through the Xilinx tool flow in order to obtain the synthesis results for purposes of comparison.The speed of operation results were obtained from the static timing analysis of the compiled designs, considering the default options of the Xilinx tools and the same clock constraint of 125 MHz.
The critical paths for every Achronix compilation reported were also reconvergent paths.Such reconvergent paths can be eliminated in order to further increase the speed of operation as described in Achronix documentation [28].To eliminate a reconvergent critical path, delay was adjusted by the introduction of extra delay elements to the path with lower number of pipeline stages.The Achronix ACE design tools facilitate these operations either by the addition of a constraint to the place and route tool or by changing HDL description by instantiating a macro "ACX SLACKMATCH" at the shorter path.The place-and-route tool for Achronix does delay balancing automatically up to a certain extent.However, larger speeds require changing the input HDL and the constraints as described above and a recompilation.Notice that the speed of operation results given in Table 4 was obtained without optimization.Nevertheless, the asynchronous designs show faster (by more than 230%) operation speeds compared to the synchronous counterpart.
The resource utilization for various designs on Achronix FPGA is given in Table 4 for completeness, except for the 8bit architecture, which could not be accommodated in the SPD60 device.The FPGA resource utilization for Achronix device is given in terms of reconfigurable logic blocks (RLBs), sequential delay flip-flops SEQ-DFFs, and 4-input look-up tables (LUT4s).Xilinx FPGA resource utilization is given in terms of slices, slice flip-flops (FFs), and slice look-up tables (LUTs).Although there is no direct comparison to asynchronous FPGA designs because of the vastly different logic fabrics, we provide corresponding estimates for 65 nm Achronix SDP60 realizations at the same range of  values.The configurable logic blocks (CLBs) in Xilinx Virtex-5 devices are equivalent to RLBs in Achronix FPGAs.Further, Xilinx LUTs can be substantially compared with Achronix 4-input LUTs.The increase in CLB, RLB, FF, and LUT consumptions, be it synchronous Xilinx or asynchronous Achronix FPGAs, both at 65 nm CMOS technology node, varied as shown in Table 4.The Achronix device was capable of operating significantly faster than the Xilinx FPGA device.
Both estimated dynamic and quiescent power consumptions for the 65 nm Xilinx Virtex-5 xc5vlx330 device are provided in Table 4.However, the power consumption estimation of the Achronix device was not possible with the current version of the Achronix design automation tools.

Conclusion
The performance of a recently proposed AI-based DCT architecture in the presence of quantization for HEVC was investigated along with its applicability to high-throughput applications using asynchronous QDI implementation.
Software implementation of the proposed algorithm was integrated into HM 7.1 and its performance in terms of accuracy was compared with the currently employed 8 × 8 DCT transform based on Chen's fast DCT algorithm.Results indicate significant improvement in the accuracy of transform coefficients, specifically a 31% improvement is observed for an error percentage within 1%.R-D curves obtained for quantization values of 0 ≤ QP ≤ 32 indicate that the overall improvement in terms of R-D performance is negligible.These results suggest the potential usage of AI-based DCT algorithms in applications that require high accuracy typically in high resolution front ends.Moreover, they also indicate the necessity of advanced transform techniques that are capable of utilizing the gains obtained through improved transform techniques.
Achronix FPGA devices are used to implement the hardware architecture on asynchronous QDI logic without using the time-consuming and expensive flow of ASIC implementation.Comparison results in terms of speed of operation through static timing analysis between the synchronous and asynchronous FPGA show that the asynchronous FPGA could outperform the Xilinx FPGA device by showing a maximum of 230% increase in operation speed.
As a future work, the effect of novel quantization techniques such as frequency-dependent quantization [29] on the performance should be investigated.Also AI-based algorithms for large block size transforms and their inverse transformations should be studied.Work should also be done on integrating the Arai scaling factors into the quantizer.

Figure 4 :
Figure 4: Frame from Basketballpass test video: (a) original video, (b) encoded video using existing HM encoder, and (c) encoded video using proposed DCT algorithm.

Table 1 :
[20]iplicands required in the 8-point Chen's fast DCT algorithm.The fast algorithm proposed by Chen et al.[20]for computing the 8-point DCT requires 26 additions and 12 multiplications.The seven different multiplicands employed in this algorithm are given in Table1.The exact DCT coefficients can be obtained by scaling the output obtained from the butterfly structure given in[20]by 4. In the suggested reference C++ software implementation[20], Chen's algorithm was implemented in 16-bit arithmetic, using short variables to represent intermediate results.Suitable approximations for the irrational constants in the DCT are scaled by a factor of √ 2, leading to Table 2.2.8-Point Chen's Fast DCT Algorithm.

Table 4 :
Resource utilization, speed of operations, and power consumption of the DCT design on synchronous FPGA and asynchronous FPGA implementations for various input fixed-point bus widths.