This paper presents a low-power, high-speed architecture for motion estimation based on the Candidate Block and Pixel Subsampling (CBPS) algorithm. A coarse-to-fine search approach is employed to find the motion vector so that the local minima problem is eliminated. Pixel subsampling is performed in the selected candidate blocks, which significantly reduces the computational cost with little quality degradation. The architecture is a fully pipelined parallel design with 9 processing elements. Two methods are used to reduce power consumption: parallel and pipelined implementation, and parallel access to memory. To process 30 CIF frames per second, the architecture requires a clock frequency of 4.5 MHz.
The coding of video sequences has been the focus of a great deal of research in recent years. Video phone, video conferencing, CD-ROM archiving, and HDTV are some of the present-day applications. Whatever the application, data compression techniques must be used before transmission because of the large amount of image data to be transmitted. Video compression can be achieved by reducing spatial and temporal redundancies within video streams. Motion estimation and compensation (MEC) is the key technique for exploiting temporal redundancy. Since MEC operations take up to 80% of the computational burden of a complete video compression system, MEC is the most important component in real-time video applications. Many VLSI-implementable algorithms aim at either high-performance or low-power design. Most of the architectures targeting the above-mentioned applications do not seem to be suitable for mobile and low-power applications. Owing to the rapid advances in VLSI technology, parallelism, pipelinability, concurrency, modularity, and regularity have become a new set of criteria in designing hardware for digital video processing.
Systolic arrays are good candidates for such design. High-speed systolic array architectures able to process the large number of calculations needed for FSBMA have been widely proposed. Artieri and Jutand [
A motion estimation chip for block-based MPEG-4 video applications, using predictive diamond search, was presented by Abbas et al. [
The CBPS algorithm is a proper blend of FSBMA and FBMME approaches. With this, the subjective quality is on par with that of FSBMA, whereas the computational complexity, and hence the area and power consumption, are lower than those of any fast BMME reported so far. In our earlier work [
The rest of the paper is organized as follows. Section
Even though many fast motion vector estimation techniques have been proposed, as reviewed before, the spatial and temporal correlations of motion vectors have not yet been fully exploited to reduce the search time while maintaining a reasonable rate-distortion tradeoff. The use of an AR model to characterize spatiotemporal correlations of the motion field could provide an elegant theoretical result. However, its derivation is computationally complex, which limits its practical value. The goal of this research is to develop a fast motion vector estimation algorithm that exploits the spatiotemporal correlations of motion vectors in a computationally simple way and yet works effectively in the sense of producing small residual errors.
The following framework is adopted in our discussion. Each image frame is divided into nonoverlapping square macro blocks of
A simple way to incorporate the temporal information with the spatial information is to include the motion vector of the block at the same location from the previous frame
Overlapped Candidate blocks in different orientations.
Typical
patterns with
For
Distribution lattice of overlapped search points.
Pixel distance of 1 refers to the first-order neighborhood sites as per the Markov model. Markovianity emphasizes the spatial interactions of adjacent sites, and hence this is employed as a basis of fine search in this work. Fine search is performed with those unselected overlapped blocks which are first-order neighbor sites for the block with minimum block difference (MBD) among all the
An edge is defined as a line passing through the sampling grid in any of the 0°, 45°, 90°, and 135° directions. The directional coverage is measured as the percentage of edges on which at least one of the selected pixels lies.
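The directional coverage measure defined above can be illustrated with a minimal sketch. The function name, the example patterns, and the choice to count every diagonal of the grid (including the single-pixel corner diagonals) as an edge are assumptions made for illustration only; they are not taken from the original work.

```python
def directional_coverage(pixels, n):
    """Fraction of edges of an n x n grid that contain at least one selected pixel.

    An edge is any row, column, or diagonal line of the grid
    (the 0, 90, and the two diagonal directions).
    pixels is an iterable of (row, col) positions.
    """
    rows = {r for r, c in pixels}        # horizontal edges hit
    cols = {c for r, c in pixels}        # vertical edges hit
    diag = {r - c for r, c in pixels}    # diagonals in one direction
    anti = {r + c for r, c in pixels}    # diagonals in the other direction
    # Assumption: all 2n - 1 diagonals per direction count as edges.
    total = n + n + (2 * n - 1) + (2 * n - 1)
    covered = len(rows) + len(cols) + len(diag) + len(anti)
    return covered / total


# A 4-queen placement: no two pixels share a row, column, or diagonal.
FOUR_QUEEN = [(0, 1), (1, 3), (2, 0), (3, 2)]
# A regular 2:1 subsampling pattern with the same number of pixels.
REGULAR = [(0, 0), (0, 2), (2, 0), (2, 2)]

print(directional_coverage(FOUR_QUEEN, 4))  # 16 of 22 edges covered
print(directional_coverage(REGULAR, 4))     # 10 of 22 edges covered
```

Under these assumptions, the queen placement covers every row and column of the 4 x 4 grid, whereas the regular pattern with the same pixel count leaves half of them uncovered.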
To fully represent the spatial information
of an
To reduce the computational burden, SAD is performed on 4-queen lattices within
4-Queen pixel lattice.
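As a rough illustration of SAD evaluated on a 4-queen lattice, the sketch below tiles a block with 4 x 4 cells and samples four pixels per cell. The particular queen placement, the 16 x 16 block size, the tiling scheme, and all function names are assumptions for illustration; the exact lattice used in this work is the one shown in the figure.

```python
# One valid 4-queen placement on a 4 x 4 cell, as (row, col) offsets (assumed).
FOUR_QUEEN = ((0, 1), (1, 3), (2, 0), (3, 2))


def lattice_offsets(block_size=16, tile=4, pattern=FOUR_QUEEN):
    """Pixel positions obtained by repeating the pattern over every tile of the block."""
    return [(r0 + dr, c0 + dc)
            for r0 in range(0, block_size, tile)
            for c0 in range(0, block_size, tile)
            for dr, dc in pattern]


def subsampled_sad(ref_block, cand_block, offsets):
    """Sum of absolute differences restricted to the sampled pixel positions."""
    return sum(abs(ref_block[r][c] - cand_block[r][c]) for r, c in offsets)


offsets = lattice_offsets()
ref = [[0] * 16 for _ in range(16)]
cand = [[1] * 16 for _ in range(16)]
# Only one pixel in four is visited, so 64 differences are accumulated
# instead of the 256 needed for a full 16 x 16 SAD.
print(len(offsets), subsampled_sad(ref, cand, offsets))
```

With this tiling, the number of absolute-difference operations per candidate block drops by a factor of four relative to a full SAD.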
The above pattern is chosen based on the spatial homogeneity
as shown in Figure
Spatial homogeneity in the subsampling lattice.
Directional coverage in the subsampling lattice.
For any pixel value
For
example, let
The proposed RBSAD criterion can be described as follows:
A problem commonly encountered in fast search methods is the chance of mistaking a local minimum for the global minimum. This is mainly because a reduced number of search points is spread over a small region of the search area. In the proposed technique, therefore, the search points are spread over the entire search area. This is in line with the probabilistic global search procedure known as "Multistart." The inherent drawback of Multistart is that the same minimum may be found several times. To limit this, the search points are chosen based on a Markov model.
Suppose that the maximum motion in the vertical and
horizontal directions is
The complexity of this error surface has a significant impact on the performance of the algorithm. Almost all conventional fast algorithms explicitly or implicitly assume that the SAD increases monotonically as the checking point moves away from the global minimum, that is, that the error surface is unimodal over the search window. Unfortunately, this assumption is usually not true for many reasons, such as the luminance change between frames. As a consequence, the search is easily trapped at a local minimum. Although the error surface defined above exhibits uncertainties at large spatial scales, we can reasonably assume that it is monotonic in a small neighborhood around the global minimum. In the presence of local minima, one simple but perhaps the most efficient and reliable strategy is to put at least one checking point as close as possible to the global minimum point (representing the true motion vector). This is equivalent to reducing the distance from the true motion vector to the closest checking point as much as possible. If this distance is small enough, a local search is very likely to find the global minimum.
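The argument above can be made concrete with a toy one-dimensional error surface. The values in `error` and the helper `local_descent` are hypothetical and serve only to show how the starting point determines whether a local search reaches the global minimum.

```python
def local_descent(error, start):
    """Greedy 1-D descent: move to the lower neighbour until none is lower."""
    pos = start
    while True:
        neighbours = [q for q in (pos - 1, pos + 1) if 0 <= q < len(error)]
        best = min(neighbours, key=lambda q: error[q])
        if error[best] >= error[pos]:
            return pos
        pos = best


# Hypothetical error surface: local minimum at index 2, global minimum at index 8.
error = [9, 7, 5, 6, 8, 6, 4, 2, 1, 3, 5]

print(local_descent(error, 0))  # starts far from the global minimum, stops at index 2
print(local_descent(error, 5))  # starts inside the monotonic region, reaches index 8
```

A checking point placed inside the region where the surface is monotonic leads the local search to the global minimum, whereas a distant starting point is trapped at the local minimum.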
The candidate block and pixel subsampling (CBPS) algorithm is based on a coarse-to-fine search approach. It is a proper blend of the full search block matching algorithm and the fast search block matching approach.
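A minimal sketch of a generic coarse-to-fine block search is given below. It is a simplification under stated assumptions, not the CBPS candidate pattern itself: the coarse stage is assumed to visit a regular grid with a step of 2 over the whole search range, the fine stage is assumed to visit the four first-order neighbours of the best coarse point, and a full SAD is used for clarity. All names are hypothetical.

```python
def sad(ref_block, window, top, left):
    """Full SAD between ref_block and the window region whose corner is (top, left)."""
    n = len(ref_block)
    return sum(abs(ref_block[i][j] - window[top + i][left + j])
               for i in range(n) for j in range(n))


def coarse_to_fine(ref_block, window, p, step=2):
    """Return the motion vector (dy, dx) with the lowest cost among the visited points.

    window has size (n + 2p) x (n + 2p) and is centred on the reference block position.
    """
    def cost(mv):
        dy, dx = mv
        return sad(ref_block, window, p + dy, p + dx)

    # Coarse stage (assumed pattern): points spread over the entire search range.
    coarse = [(dy, dx)
              for dy in range(-p, p + 1, step)
              for dx in range(-p, p + 1, step)]
    best = min(coarse, key=cost)

    # Fine stage: first-order neighbours (pixel distance 1) of the best coarse point.
    dy, dx = best
    fine = [(dy + a, dx + b)
            for a, b in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if abs(dy + a) <= p and abs(dx + b) <= p]
    return min([best] + fine, key=cost)


# Usage: a window with a single bright pixel, reference block displaced by (2, -4).
p, n = 4, 8
window = [[0] * (n + 2 * p) for _ in range(n + 2 * p)]
window[9][5] = 255
ref = [row[0:8] for row in window[6:14]]
print(coarse_to_fine(ref, window, p))
```

The fine stage can only correct the coarse result when the error surface is monotonic in the neighbourhood of the true motion vector, which is the assumption discussed earlier.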
Here, we have constructed a new pattern of
candidate blocks, as shown in Figure
Selection of candidate blocks.
New 4-queen pattern.
CBPS algorithm can be expressed with the following equations:
In digital CMOS systems, there are three major sources of power consumption:
The proposed architecture consists of two main parts, namely, the pipelined pixel selection unit (PPSU) and the SAD&MV computation unit (SMCU), with pipelining registers between them and a control unit to control their operation, as shown in Figure
Fully pipelined parallel architecture for CBPSA.
A search window of
Pipelined pixel selection unit (PPSU).
The internal structure of the search window memory consists of a horizontal address decoder and a vertical address decoder, as shown in Figure
Internal structure of search window memory.
Each processing element consists of one subtractor and one accumulator. A processing element takes two inputs: one from the reference block queue and the other from one of the 17 columns constituting the candidate block in the search window memory. The accumulator adds or subtracts the subtractor output according to its sign bit to accumulate the absolute value needed in the SSAD computation as in (
Block diagram of the SAD and MV Computation Unit (SMCU).
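The behaviour of a single processing element, as described above, can be modelled in software as follows. This is a behavioural sketch only, not the hardware description used in this work; the class name, method names, and the helper `accumulate_sad` are hypothetical.

```python
class ProcessingElement:
    """Behavioural model of one PE: a subtractor followed by an accumulator."""

    def __init__(self):
        self.acc = 0

    def reset(self):
        self.acc = 0

    def step(self, ref_pixel, cand_pixel):
        diff = ref_pixel - cand_pixel  # subtractor output
        if diff < 0:                   # sign bit set: subtract to add the magnitude
            self.acc -= diff
        else:                          # sign bit clear: add directly
            self.acc += diff
        return self.acc


def accumulate_sad(ref_pixels, cand_pixels):
    """Feed two pixel streams through one PE and return the accumulated value."""
    pe = ProcessingElement()
    for r, c in zip(ref_pixels, cand_pixels):
        pe.step(r, c)
    return pe.acc


print(accumulate_sad([10, 2, 5], [3, 9, 5]))  # 7 + 7 + 0 = 14
```

Selecting add or subtract from the sign bit avoids a separate absolute-value stage, since subtracting a negative difference adds its magnitude to the accumulator.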
The structure of the final coarse comparator, as shown in Figure
Block diagram of the final coarse comparator.
The proposed architecture has been prototyped with Xilinx Virtex-II Pro XUPV2 as the target device, simulated using Xilinx 8.1
Comparison of performance parameters of the proposed architecture with other architectures.
Parameter | [ | [ | [ | [ | [ | [ | [ | Proposed |
---|---|---|---|---|---|---|---|---|
Process (μm) | 1.2 | 0.8 | 0.8 | 0.25 | 0.18 | 0.18 | 0.18 | 0.18 |
Supply voltage (V) | — | — | — | 3.3 | 1.8 | 1.6 | 1.6 | 1.6 |
Core size (mm²) | 64 | 23 | — | 16.07 | 2.1 | 0.795 | 0.287 | 0.826 |
Clock rate (MHz) | 36 | 23 | 150 | 36.5 | 67 | 100 | 65 | 4.5 |
Power (mW) | 1200 | — | — | — | 452 | 312.07 | 234.4 | 147.6 |
Number of transistors required | 249,000 | 85,736 | 320,000 | — | 61,603 | — | 39,488 | 64,736 |
Performance of proposed architecture with frequency scaling.
Video application | Frame resolution | Frequency (MHz) |
---|---|---|
CIF | 352 x 288 | 4.5 |
QCIF | 176 x 144 | 1.1 |
NTSC | 480 x 483 | 20.74 |
PAL | 576 x 576 | 24.88 |
CCIR | 720 x 480 | 25.92 |
HDTV | 1280 x 720 | 165.9 |
HDTV2 | 1920 x 1080 | 188 |
Effect of voltage scaling for the proposed architecture (for 0.18
Architecture | [ | Proposed |
---|---|---|
Power (mW) | 452 | 168.7 |
This paper presented a fully pipelined parallel implementation of the novel coarse-to-fine search CBPS algorithm for motion estimation, which reduces the computational complexity to a large extent without affecting the subjective quality. The architecture gives very good speed performance and very low power dissipation compared with other architectures for block motion estimation. We have tested its performance with different CIF sequences for videophone applications, choosing the Xilinx Virtex-II Pro FPGA as the target device. We have also performed an ASIC design simulation with Microwind and DSCH version 3.0. Thus, with algorithmic as well as architectural optimization, strong performance in terms of speed and power dissipation is achieved. However, to accommodate the different block and subblock sizes prescribed in H.264/AVC, or to provide flexible search ranges to suit any complex application, certain modifications need to be incorporated in the present architecture. For any block size, some of the channels of the MUXes can be disabled. For flexible ranges, more MUXes can be appended to the existing MUX banks in the PPSU. Further work involves the application of leakage power reduction techniques to the proposed architecture.
The authors wish to express their sincere gratitude to all the anonymous reviewers whose valuable suggestions helped to improve the paper.