Hardware Efficient Architecture with Variable Block Size for Motion Estimation



Introduction
Digital video processing has been applied to a large number of consumer electronics products such as digital video recorders (DVR), personal digital assistants (PDA), digital cameras, and set top boxes. Motion estimation (ME), which plays the most important role in video compression, evaluates the movement of blocks in the current frame. It aims to remove the temporal redundancies that exist in video sequences, which results in substantial bit rate reductions. The block matching algorithm (BMA) is widely adopted for ME because it fits well with rectangular video frames and block based transforms and provides a reasonably effective temporal model.
In BMA, the previous frame $(t-1)$ is taken as the reference frame and frame $t$ is called the current frame. A macroblock (MB) of size $N \times N$ from the current frame looks for its best match in the region of the reference frame where a match is most probable, called the search region. The search region usually spans $[-p, +p]$ in the $x$ as well as the $y$ direction, which results in the evaluation of $(2p+1)^2$ candidate macroblocks. The difference between the coordinates of the current macroblock in the current frame and the best matching candidate macroblock in the reference frame is called the displacement vector or motion vector (MV). A popular cost function for identifying the best match in hardware implementations is the sum of absolute differences (SAD), which is described by
$$\mathrm{SAD}(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| f_{t}(i, j) - f_{t-1}(i + u, j + v) \right|, \quad (1)$$
where $f_{t}(i, j)$ is the pixel at position $(i, j)$ of frame $t$ and $(u, v)$ is the candidate displacement.
Existing video coding standards offer variable block size motion estimation (VBSME) to improve the quality of encoding. Variable block size (VBS) motion compensated prediction (MCP) provides a significant rate distortion performance gain over conventional fixed block size MCP, but it involves massive computation and adds an extra burden to any ME architecture in the form of additional hardware complexity, extra computation time, or a combination of both. In the H.264 compression standard a typical macroblock has a dimension of 16 × 16 pixels, which can be segmented down to the smallest block size of 4 × 4 (base block) as shown in Figure 1. This division is represented as the macroblock mode in Figure 1.
To generate SAD values for all 41 possible combinations of a 16 × 16 macroblock, 256 pixels are processed for the current macroblock as well as for each candidate macroblock. There are several overlapping candidate macroblocks, depending on the size of the search area memory. Before SAD computation, reading the pixels of the macroblocks from the different memories is the most significant task. For this purpose, raster scan [4], meander scan [5], z scan [3], or spiral scan patterns are used. Based on the pixel reading mechanism, the architecture performs absolute difference and accumulation of differences, and finally a comparator identifies which block size is best suited for a particular macroblock among the various candidate macroblocks.
In this paper, Section 2 surveys existing VBSME architectures and their scanning patterns. An architecture based on the z pattern is presented in Section 3. Section 4 describes simulation and synthesis results and a comparison with existing architectures, followed by the conclusion.
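The full search block matching with the SAD cost described in the introduction can be sketched in software as follows. This is only an illustrative reference model, not the hardware architecture of the paper; the function names `sad` and `full_search`, the list-of-lists frame layout, and the assumption that the search region is passed in already cropped to (N + 2p) × (N + 2p) pixels are all choices made for this example.

```python
def sad(cur, ref, top, left, n):
    """Sum of absolute differences between the n x n block `cur` and the
    n x n block of `ref` whose top-left corner is at (top, left)."""
    return sum(
        abs(cur[i][j] - ref[top + i][left + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(cur, ref, p):
    """Exhaustive block matching over displacements (u, v) in [-p, +p].

    Assumption for this sketch: `ref` is the (n + 2p) x (n + 2p) search
    region centred on the co-located block, so displacement (u, v) maps
    to the offset (p + u, p + v) inside `ref`.
    Returns (best_sad, (u, v)); on ties the first candidate visited wins.
    """
    n = len(cur)
    best = None
    for u in range(-p, p + 1):
        for v in range(-p, p + 1):
            cost = sad(cur, ref, p + u, p + v, n)
            if best is None or cost < best[0]:
                best = (cost, (u, v))
    return best
```

The two nested displacement loops visit all (2p + 1)^2 candidates, which is the workload that the architectures discussed in the following sections parallelize across processing elements.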

Macroblock Scanning Pattern and VBSME Architectures
Researchers have done a great deal of work in the field of variable size block matching. VBSME with 41 possible combinations of variable size is highly time consuming and quite complex from a hardware implementation perspective because of the huge amount of computation. In this section existing architectures for VBSME are discussed. Full search VBSME architectures [2][3][4][5][6][7][8][9] are able to perform a full motion search on macroblocks of various sizes. A VBSME unit initially reads the current macroblock from the current frame and the candidate macroblocks from the reference frame, and its processing is divided into 3 stages. The first stage computes the absolute difference between corresponding elements of the current macroblock data and the reference macroblock data. The second stage calculates the intermediate results used to generate the 41 different SAD values. These data are partially stored in a buffer and also forwarded to the third stage, which generates all the SAD values needed for the generation of MVs. Architectures with different scanning patterns give a variety of performance results for motion vector (MV) generation, showing a tradeoff between the macroblocks processed per second and the resources required for computation. To generate SAD values for all possible combinations of macroblocks, the architectures presented in [2,4,6,7] read all pixels of the 16 × 16 macroblock using the traditional raster scan pattern shown in Figure 2. On the other hand, the architectures presented in [5,9] use meander scan and the architecture presented in [3] uses the z scan pattern, as shown in Figure 3. Based on the pixel reading mechanism, the architecture performs absolute difference and accumulation of differences, and finally a comparator identifies which block size is best suited for a particular macroblock among the various candidate macroblocks.
A 16 × 16 macroblock can be segmented into 16 small blocks of size 4 × 4 as indicated in Figure 4, where the small blocks are labelled b0 to b15. In the horizontal raster scan pattern of Figure 2(a), the first row of blocks b0, b1, b2, and b3 is read first, while in the vertical raster scan pattern of Figure 2(b) the first column of blocks b0, b4, b8, and b12 is read first. However, both types of scan, horizontal and vertical, give the same results in terms of resource utilization as well as the number of clock cycles required for reading pixels. In VBSME architectures 1, 4, or 16 pixels are read simultaneously and processed in processing elements (PEs) to generate the SAD combinations. For parallel processing of pixels, architectures use multiple PEs, which can number 4, 16, 64, or even 256. Most architectures use a 16 × 16 search range, which is extended to 32 × 32 in a few of them. The VBSME architecture presented in [2] is based on 16 PEs. The current macroblock data is arranged in a raster scan sequence and the search region data is arranged in a dual raster scan sequence. 16 SAD values are computed, each for a block of size 4 × 4. The stored SAD values are then reused to compute the SAD values for the other block sizes. This is done by shuffling and combining the computed subblock SAD values appropriately to derive the SAD for each of the larger block sizes. This avoids the need to compute each of these from scratch and allows up to 41 SAD values to be processed in a single processor. The architectures presented in [2][3][4] read a single pixel at a time and can process only one pixel of the current macroblock and the candidate macroblock per PE in a single clock cycle; hence they consume 282, 271, and 262 clock cycles, respectively, to generate the 41 SAD combinations. The architecture presented in [4] uses 18 × 1 multiplexers as well as latches and eliminates the intermediate buffer required by the architecture presented in [2]. In the architecture explained in [3] the PEs are arranged in a 4 × 4 array, and it uses a single pixel z scan for reading pixels from the reference and current frames. The pixel values are fed through shift registers to the 16 PEs of the 4 × 4 array. The concept is replicated several times to compute multiple candidate macroblocks in the given search window. By using the scanning pattern of [4] and reading 4 pixels at a time, the number of clock cycles required to generate the 41 combinations is reduced to 70, which is approximately 4 times lower, as indicated in [7]. The same author has also presented an extended version of the architecture for 16-pixel processing, in which the number of clock cycles required to generate the same 41 combinations is reduced to 20, which is lower by a factor of 16.
Architecture proposed in [5] ...
As shown in Table 1, in the very first cycle submacroblock b0 is read from both the reference and the current memory and fed to processing element PE0. At the same time all other PEs receive the same submacroblock from the current memory but a one-column-shifted submacroblock from the reference memory. With the proposed scanning pattern sixteen pixels are scanned together and their SAD value is available in the next clock cycle. A buffer is needed to store the SAD value of this smallest 4 × 4 submacroblock.

Proposed Architecture
The processing element used in Figure 5 is shown in detail in Figure 7. The architecture is divided into multiple stages, namely, absolute difference calculation (ADC), addition of absolute differences, and generation of the 41 SAD combinations. To compute the absolute difference, the multiplexer based ADC presented in [10] is used and, for addition of operands, the adder presented in [11] is used.
Once all SAD values are available in the $(2p+1)$ PEs, comparators identify the best possible combination for the $(2p+1)$ reference memory blocks (RMBs), which is stored and compared with that of the next row of RMBs. After evaluation of all $(2p+1)^2$ RMBs, the best match macroblock is identified, followed by motion vector computation. Then the next macroblock from the current frame is evaluated. The latency between two consecutive macroblocks of the current frame depends on the time required to read the search area. With the 128-bit data bus, 16 pixels are read from the reference frame concurrently, which takes 48 clock cycles for the very first macroblock and 64 clock cycles for the rest of the macroblocks if a single search area memory is used. In this work three search area memories are incorporated and used in round robin fashion. When $p = 8$ is chosen, 50% of the search areas of two consecutive macroblocks overlap; hence, while one memory is being filled, pixels are also filled into the next memory. Because of this arrangement, while the motion vector of any macroblock is being computed, the search area memory is being prepared for the next macroblock; hence there is no latency between successive macroblocks.
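As a quick check of the figures quoted above, the bus width, the 64-cycle search area read, and the number of candidates for a search parameter of 8 follow from simple arithmetic. The sketch assumes 8-bit pixels and that the complete 32 × 32 search area is loaded; the symbol $p$ for the search parameter is a notational assumption.

```latex
\[
\frac{128\ \text{bits/cycle}}{8\ \text{bits/pixel}} = 16\ \text{pixels/cycle},
\qquad
\frac{32 \times 32\ \text{pixels}}{16\ \text{pixels/cycle}} = 64\ \text{cycles},
\qquad
(2p+1)^2 \Big|_{p=8} = 17^2 = 289\ \text{candidates}.
\]
```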

Synthesis Results of Proposed VBSME Architecture
The proposed VBSME hardware architecture is implemented and tested in terms of various evaluation metrics. The architecture has been implemented in VHDL and synthesized for the Xilinx Spartan3 and Virtex5 FPGA families with devices XC3s400 and XC5vlx50, respectively. The current memory size is chosen as 16 × 16 pixels because of the 16 × 16 macroblock size, while the reference memory size is 32 × 32 pixels, taking the search window parameter $p$ as 8. Table 3 shows macrostatistics for the proposed implementation. The architecture is optimized for adder subtractors and other resources and hence demonstrates a very low gate count of only 22k. The synthesis delay of the design is only 2.543 ns, offering a maximum frequency of 393.16 MHz. At maximum frequency it can process 179 HD (1920 × 1080) frames in one second. The post place and route delay is 9.72 ns, which is considered the worst case delay, at which 47 HD (1920 × 1080) frames can be processed per second at a frequency of 102 MHz.
Table 4 shows the comparison between the existing VLSI implementations of VBSME and the proposed implementation. A similar comparison between the existing FPGA implementations of VBSME and the proposed implementation is shown in Table 5. Most architectures are implemented with variable block sizes from 16 × 16 to 4 × 4, except the one presented in [14], which is limited to block sizes between 16 × 16 and 8 × 8. The architectures presented in [7,16] are demonstrated for a search range of 16 × 16. The architecture proposed in this design works on 16-pixel scanning, which results in higher throughput compared not only to 1-pixel scan and 4-pixel scan architectures but also to existing 16-pixel scan architectures. In comparison with the 16-pixel raster scan architecture of Warrington et al. [7], the proposed architecture can process 3 times more HD frames even in the worst case and offers a 7 times lower gate count, while compared to the 16-pixel meander scan architecture of Wei et al. [5] it can process more than 2 times as many HD frames with 16 times fewer processing elements. The gate count of the architecture of López et al. [6] is comparable with that of the proposed architecture, but it offers a frame rate of only 60 fps at CIF resolution, which is very low. The gate count of [15] is lower than that of the proposed design, but its frame processing rate is not given, so it is not adequate for comparison. The architecture presented by Olivares [12] can process 21.42 HD (1920 × 1080) frames per second with 256 PEs; still, this frame rate is not sufficient for real time implementation. From the comparison among FPGA implementations of VBSME architectures we can also observe that the number of LUTs used by the proposed design is higher, but at the same time the design offers a higher frame processing rate. From the overall comparison with the various 16-pixel scan architectures we can conclude that the proposed architecture outperforms them in terms of throughput.
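The frame rates quoted above can be roughly cross-checked from the clock frequency. The sketch below assumes 272 clock cycles per macroblock (the figure given in the conclusion) and 1920 × 1080 / 256 = 8100 macroblocks per HD frame; both the helper name `frames_per_second` and the decision not to pad the frame height to a multiple of 16 are assumptions of this example, not details confirmed by the paper.

```python
def frames_per_second(freq_hz, width=1920, height=1080,
                      mb_size=16, cycles_per_mb=272):
    """Estimate frames per second for a given clock frequency.

    Assumes every macroblock costs `cycles_per_mb` clock cycles and that
    there is no idle time between macroblocks.
    """
    macroblocks_per_frame = (width * height) / (mb_size * mb_size)
    return freq_hz / (macroblocks_per_frame * cycles_per_mb)


best_case = frames_per_second(393.16e6)   # synthesis frequency
worst_case = frames_per_second(102e6)     # post place and route frequency
```

Under these assumptions the estimate comes out at roughly 178 frames per second at 393.16 MHz and roughly 46 frames per second at 102 MHz, close to the reported 179 and 47.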
For a more advanced comparison of architectures, in addition to the frame processing rate, hardware efficiency [5] is used, which is defined as the ratio of the data throughput rate TP to the hardware cost in terms of resource utilization or gate count. TP is defined as the number of macroblocks processed by the architecture per second. Equation (2) indicates hardware efficiency, and its unit is macroblocks per second per gate. To evaluate the efficiency of the architecture in terms of power, power efficiency can be defined as the ratio of TP to the power, as shown in (3). The unit of power efficiency is macroblocks per second per mW. The higher the hardware efficiency and the power efficiency, the more efficient the architecture.
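Written out from the definitions in this paragraph, the two efficiency measures take the following form. The symbol names $\eta_{H}$ and $\eta_{P}$ are assumptions chosen for this sketch, and the tags simply follow the equation numbers cited in the text.

```latex
\begin{align}
\eta_{H} &= \frac{\mathrm{TP}\ [\text{macroblocks/s}]}{\text{gate count}}
  && \text{(macroblocks per second per gate)} \tag{2} \\
\eta_{P} &= \frac{\mathrm{TP}\ [\text{macroblocks/s}]}{\text{power}\ [\text{mW}]}
  && \text{(macroblocks per second per mW)} \tag{3}
\end{align}
```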
As per (2) and (3), hardware and power efficiency are computed for the existing and proposed VBSME implementations and shown in Table 6. The hardware efficiency of the proposed architecture is more than 5 times higher than that of existing architectures in the worst case and more than 19 times higher in the best case. In terms of power efficiency, the proposed implementation produces results similar to the implementation presented by Fatemi et al. [13]. Apart from that, the power efficiency of the proposed architecture is better than that of the other architectures in the best case. Compared with some of the architectures the proposed design uses somewhat more gates, but its throughput is higher than that of all existing architectures. The overall comparison indicates that the proposed VBSME architecture is hardware efficient and power efficient.

Conclusion
In this paper, an architecture for full search variable block size motion estimation is described. The architecture computes all 41 combinations of variable block size motion vectors, considering 289 candidate macroblocks in a search area of 32 × 32. The architecture described in this paper uses 16
Figure 6 shows the location of the RMBs for the various processing units and Table 1 shows the data scheduling for the proposed architecture with 17 PEs.

3.1. Pixel Reading Pattern
In this section a VBSME architecture is presented with the aim of generating the 41 SAD combinations of a variable size macroblock in an optimal number of clock cycles with reduced resource utilization. Instead of the conventional raster scan pattern, the proposed architecture uses a z scan pattern, ... macroblock and the corresponding candidate macroblock from the reference memory, called the reference memory block (RMB). For a window size of $p$ there are $(2p+1)^2$ candidate RMBs that need to be processed. By choosing the number of PEs as $(2p+1)$, the architecture can calculate the SAD of the current macroblock and $(2p+1)$ RMBs together, and by repeating the process $(2p+1)$ times the SAD values for all candidate macroblocks become available.
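To make the idea of a z scan over the sixteen 4 × 4 submacroblocks concrete, the sketch below generates a generic Z (Morton) order for the row-major labels b0 to b15. It is an assumption that this textbook ordering matches the schedule of Table 1; the function name `z_scan_order` is hypothetical, and the code only illustrates how a z pattern differs from a raster scan.

```python
def z_scan_order(size=4):
    """Return row-major block labels in Z (Morton) order.

    `size` is the number of blocks per side and must be a power of two.
    Bits of the visit index are de-interleaved: even bits give the
    column, odd bits give the row.
    """
    def deinterleave(k):
        row = col = bit = 0
        while k:
            col |= (k & 1) << bit
            row |= ((k >> 1) & 1) << bit
            k >>= 2
            bit += 1
        return row, col

    order = []
    for k in range(size * size):
        row, col = deinterleave(k)
        order.append(row * size + col)
    return order
```

With this ordering each group of four consecutive reads completes one 8 × 8 quadrant, so partial sums for larger blocks can be formed as soon as a quadrant is finished, whereas a raster scan completes the first 8 × 8 block only after reading six blocks.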

Table 1: Pixel data scheduling for VBSME architecture.
16 reference macroblock pixels and 16 current macroblock pixels are fed to the ADC unit and the result is forwarded to the adder block. The adder block sums up all the difference values and stores the sum in the respective intermediate buffer, labelled b0 to b15. A 1 × 16 demultiplexer selects the respective buffer, and the 4 × 8, 8 × 4, 8 × 16, 16 × 8, and 16 × 16 combinations are then computed using multilevel addition. The sums for block sizes smaller than 16 × 16 are kept on the respective data buses for further computation, and finally the 41 combinations for VBSME are ready. At the end of 16 clock cycles, according to the schedule of Table 1, all 4 × 4 submacroblocks have been read and their individual SAD values are available as shown in Table 2. On the very next clock, that is, the 17th, the remaining 25 combinations are computed. Thus all 41 SAD values are available in all PEs in a total of 17 clock cycles. Immediately afterwards the RMBs are shifted to the next row and computation of the $(2p+1)$ combinations of that row is started.
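The multilevel addition described above can be modelled in a few lines. The sketch assumes the sixteen 4 × 4 SADs are labelled row-major (b0 to b3 in the first row, as in Figure 4) and names partitions as width × height, which is the usual H.264 convention; the function name `vbs_sads` and the dictionary layout are choices made for this example, not the hardware data path itself.

```python
def vbs_sads(b):
    """Combine sixteen 4x4 SADs into the SADs of all 41 partitions.

    `b` holds the 4x4 SADs in row-major order (b[0]..b[15]).
    Partition names are width x height (an assumption of this sketch).
    """
    assert len(b) == 16
    g = [b[r * 4:(r + 1) * 4] for r in range(4)]  # g[row][col]
    out = {"4x4": list(b)}
    # Two horizontally adjacent 4x4 blocks form one 8x4 block.
    out["8x4"] = [g[r][c] + g[r][c + 1] for r in range(4) for c in (0, 2)]
    # Two vertically adjacent 4x4 blocks form one 4x8 block.
    out["4x8"] = [g[r][c] + g[r + 1][c] for r in (0, 2) for c in range(4)]
    # Four 4x4 blocks form one 8x8 quadrant.
    out["8x8"] = [
        g[r][c] + g[r][c + 1] + g[r + 1][c] + g[r + 1][c + 1]
        for r in (0, 2) for c in (0, 2)
    ]
    q = out["8x8"]  # [top-left, top-right, bottom-left, bottom-right]
    out["16x8"] = [q[0] + q[1], q[2] + q[3]]   # top half, bottom half
    out["8x16"] = [q[0] + q[2], q[1] + q[3]]   # left half, right half
    out["16x16"] = [sum(q)]
    return out
```

The result holds 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 values, of which 25 are derived purely by addition from the sixteen base SADs, matching the count of combinations computed on the 17th clock.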

Table 2: SAD output schedule for VBSME architecture.
-pixel z scan pattern to access the pixels of the current macroblock and 17 candidate macroblocks and can compute all 41 combinations of a 16 × 16 macroblock in only 16 clock cycles. The process is repeated 17 times using 17 processing elements; hence in 272 clock cycles all the combinations of all candidate macroblocks are available, based on which the best match and the motion vector are computed. The device utilization of the proposed implementation is only 22k gates, and it can process 179 HD (1920 × 1080) resolution frames per second in the best case and 47 HD resolution frames per second in the worst case. Implementation results show that the proposed VBSME architecture outperforms existing 1-pixel scan, 4-pixel scan, and 16-pixel scan architectures in area utilization because of its 16-pixel z scanning pattern. The VBSME architecture demonstrates up to 19 times better hardware efficiency in comparison with other VBSME implementations. The power efficiency of the proposed VBSME architecture is either better than or comparable with that of existing implementations. The architecture can be configured with more PEs to meet the need for an extended

Table 4: Comparison among VLSI implementations of VBSME architectures.