Hardware Design Considerations for Edge-Accelerated Stereo Correspondence Algorithms

Stereo correspondence is a popular algorithm for the extraction of depth information from a pair of rectified 2D images. Hence, it has been used in many computer vision applications that require knowledge about depth. However, stereo correspondence is a computationally intensive algorithm and requires high-end hardware resources in order to achieve real-time processing speed in embedded computer vision systems. This paper presents an overview of the use of edge information as a means to accelerate hardware implementations of stereo correspondence algorithms. The presented approach restricts the stereo correspondence algorithm only to the edges of the input images rather than to all image points, thus resulting in a considerable reduction of the search space. The paper highlights the benefits of the edge-directed approach by applying it to two stereo correspondence algorithms: an SAD-based fixed-support algorithm and a more complex adaptive support weight algorithm. Furthermore, we present design considerations about the implementation of these algorithms on reconfigurable hardware and also discuss issues related to the memory structures needed, the amount of parallelism that can be exploited, the organization of the processing blocks, and so forth. The two architectures (fixed-support based versus adaptive-support weight based) are compared in terms of processing speed, disparity map accuracy, and hardware overheads, when both are implemented on a Virtex-5 FPGA platform.


Introduction
Depth extraction from stereoscopic images is a vital step in several emerging embedded applications, such as robot navigation, obstacle detection for autonomous vehicles, and space and avionics [1]. Depth information is extracted by solving the challenging problem of stereo correspondence, which aims at finding corresponding points (or conjugate pairs) between the input stereo images (usually referred to as reference image $I_r$ and target image $I_t$). Given that the input images are rectified, the correspondence of a pixel at coordinate (x, y) can only be found at the same vertical coordinate y, and within a maximum horizontal bound, called disparity range D ($d_m$ to $d_M$) [2]. The disparity is computed as the absolute difference between the coordinates of the corresponding pixels in $I_r$ and $I_t$. The disparities of all corresponding pixels form a disparity map, which, once estimated, can be used to extract the depth of the scene by using triangulation [1].
In general, stereo correspondence algorithms mostly follow four steps: (1) matching cost computation, (2) cost (support) aggregation, (3) disparity computation/optimization, and (4) disparity map refinement [2]. Moreover, they are classified into two broad categories: global and local [2]. Global algorithms can produce very accurate results but are slower and computationally more demanding compared to local algorithms, due to their iterative nature and high memory needs [3]. On the other hand, local algorithms are faster and less computationally expensive, hence suitable for the majority of embedded stereo vision applications.
Local algorithms determine the disparity of pixel p in $I_r$ by finding the pixel of $I_t$ that yields the lowest score of a similarity measure (see [3] for a review) computed on a support (correlation) window centered on p and on each of the $d_M - d_m$ candidates defined by the disparity range. Early local algorithms have followed a simple approach that uses a fixed (typically square) support window during the cost aggregation step. This approach, however, is prone to errors as it blindly aggregates pixels belonging to different disparities, resulting in incorrect disparity estimation at depth discontinuities and regions with low texture [3]. Recent local algorithms improve this basic approach by using multiple-window and adaptive-support weight (ADSW) methods [4]. The latter methods represent the state of the art in local stereo correspondence algorithms, as they can generate disparity maps that approach the accuracy of global algorithms [3]. These methods operate by assigning different weights to the pixels in a support window based on the spatial and color distances to the center pixel [4], or on information extracted from image segmentation [5,6]. In this way, they aggregate only those neighboring pixels that are at the same disparity. Despite their improvements in accuracy, ADSW algorithms are generally slower than other local algorithms. A survey of the different approaches can be found in [2,3].
The real-time and low-power constraints of most embedded stereo vision applications complicate the realization of stereo correspondence algorithms. Software implementations usually struggle to meet these constraints, and as a result, some form of hardware acceleration or even a complete custom hardware implementation is preferred [3]. Hardware acceleration of stereo correspondence algorithms has been done extensively using Digital Signal Processors (DSPs) and Graphics Processing Units (GPUs) [3]. These systems, however, involve architectures that are generally not suitable for embedded applications. DSPs do not provide enough computational power to achieve parallel processing, while GPUs consume excessive power. The strong embedded requirements of such applications imply the use of dedicated hardware architectures such as ASICs and FPGAs, which can provide the necessary computational power and the energy efficiency.
The majority of dedicated hardware architectures implement fixed-support algorithms, providing adequate frame rates. However, numerous embedded vision applications require not only real-time processing speed, but also reliable depth computation. As such, there have been a few attempts recently to implement more complex algorithms, such as ADSW algorithms, in hardware. These algorithms are computationally more expensive, and thus the frame rate suffers, especially when high-resolution images are used. The key to a successful realization of an embedded stereo vision system is therefore the careful design of the core architecture, so as to provide a good tradeoff between the processing speed and the quality of the resulting disparity map.
This paper presents an approach to the acceleration of hardware implementations of stereo correspondence algorithms for embedded stereo vision systems, by using edge information. The presented approach relies on the extraction of feature points (edges) that are used to restrict the stereo correspondence algorithm only to specific features of the input images (rather than to all image points), thus resulting in a considerable reduction of the search space. The paper presents and analyzes design issues and considerations associated with the edge-directed approach when it is applied to different hardware implementations of stereo correspondence algorithms, including both fixed-support and more complex ADSW algorithms.
The paper extends our initial work on a hardware implementation of an edge-directed fixed-support stereo correspondence algorithm presented in [7]. In particular, the extended work generalizes the idea of using edge information as part of the stereo correspondence process and presents design principles and considerations that arise when integrating edge information on the architecture of an ADSW algorithm as well. The paper also explores the effects of different edge detection algorithms on the ADSW-based architecture, showing the impact of each algorithm on the quality of the disparity maps produced by that architecture. Furthermore, the paper compares the two architectures (fixed-support based versus ADSW-based) in terms of processing speed, disparity map accuracy, and hardware overheads, when both are implemented on a Virtex-5 FPGA platform. The experimental results indicate that the presented edge-directed approach for accelerating hardware implementations of stereo correspondence algorithms exhibits high potential for applications with hard real-time constraints, as both architectures achieve remarkable speedups relative to existing hardware architectures that implement equally complex algorithms. The architecture based on the fixed-support algorithm can meet the real-time requirements of such applications, even for high-resolution images. Moreover, the combination of the edge-directed approach with the ADSW algorithm enables the realization of accurate disparity computation systems that can also satisfy the real-time requirements of many embedded vision applications such as pedestrian and obstacle detection.
The rest of this paper is organized as follows. Section 2 discusses related work and Section 3 presents the edge-based stereo correspondence process. Section 4 presents the proposed edge-directed disparity map computation architectures. Section 5 presents the experimental framework and results, and Section 6 concludes the paper, discussing future work.

Related Work
There exists a large number of implementations in the literature that solve the stereo correspondence problem. Several existing works feature algorithms running on general purpose processors, as well as clusters and multiprocessor systems [8][9][10]. However, most of these implementations require high-end computational equipment in order to compute the disparity map in real time, especially as the image resolution increases. As a result, the real-time processing of those implementations is limited to small-sized images (smaller than VGA). Alternatively, specialized hardware, such as Intel MMX [11], Graphics Processing Units (GPUs) [12,13], and Digital Signal Processors [14][15][16], has been used to accelerate the disparity map computation. [11] attempts to improve the accuracy of the simple SAD correlation technique by using a multiple-window approach that decreases errors at object boundaries. This, however, requires a lot of computational resources. [12,13] present GPU-based implementations that achieve high frame rates and address the computational needs of stereo vision algorithms; however, they are not currently suitable for embedded and mobile applications due to their power demands. Approaches implemented on high-end fixed-point DSPs [14][15][16] consume lower power; however, they do not provide the parallelism of FPGAs and ASICs, and thus are less suitable for stereo vision applications with hard real-time requirements. There also exist attempts to implement stereo vision algorithms on the Cell processor [17,18], but these are subject to restrictions imposed by the Cell platform, mainly due to the limited memory of the Synergistic Processing Elements (SPEs).
The last decade has seen the emergence of dedicated hardware architectures, as a means to address the aforementioned constraints found in software and specialized hardware approaches. Dedicated hardware architectures, implemented on ASICs and FPGAs, can exploit the parallelism inherent in the stereo correspondence process and optimize memory access patterns to provide fast computation. Most dedicated hardware implementations target FPGA platforms, taking advantage of the reconfigurability offered by these platforms in order to exploit the intrinsic parallelism of the stereo correspondence algorithm. A stereo depth measurement system on an FPGA is introduced in [19]. It generates disparities on 512 × 480 images at 30 fps, implementing a window-based correlation search technique. Another system presented in [20] yields 20 fps on 640 × 480 images. However, the memory access pattern utilized does not provide scalability and performance optimization, limiting the overall performance of the system. In [21], a survey of some FPGA implementations [22][23][24] is presented, and the survey highlights the common use of the SAD similarity measure, a relatively simple technique suitable for hardware implementations. Furthermore, [22][23][24] make extensive use of parallelism and pipelining in order to achieve real-time performance. [24] also raises the impact of the image size on the performance of the disparity map computation; it claims 5063 fps using very small (64 × 64) images. The frame rate, however, decreases sharply as the image size increases.
Other dedicated hardware architectures suitable for real-time disparity map computation are given in [25][26][27][28][29][30][31]. These works attempt to implement high-performance stereo correspondence algorithms that fail to achieve real-time performance in software. The FPGA-based architectures presented in [25,26] implement local fixed-support algorithms using the SAD similarity measure and compute intermediate-sized disparity maps at a rate of 768 and 600 fps, respectively. The system in [27] is based on a phase-based computational model, as an alternative to feature correspondence and correlation techniques. That work exploits the parallel computing resources of FPGA devices to produce dense disparity maps for high-resolution images at 52 fps. The supported disparity range, however, is not practical for embedded stereo vision systems, which traditionally utilize much larger disparity ranges, especially when high-resolution images are used. Another implementation of a computationally complex disparity algorithm that is based on locally weighted phase correlation is presented in [28] and utilizes 4 FPGAs to produce dense disparity maps of size 256 × 360 at the rate of 30 fps. A more recent FPGA implementation of a real-time stereo vision system is presented in [29]. That system generates dense disparity maps based on the Census transform. Lastly, the hardware implementation presented in [31] performs a modified version of the Census transform on both the intensity and the gradient images, in combination with the SAD correlation metric (SAD-IGMCT algorithm), achieving 60 fps on 750 × 400 images.
The majority of the aforementioned implementations adopt local fixed-support algorithms to achieve real-time performance, as these algorithms benefit greatly from the parallel and straightforward structures available in dedicated hardware implementations. ADSW algorithms, which traditionally achieve better disparity map accuracy, have rarely been implemented on dedicated hardware architectures. As mentioned above, these algorithms are computationally more expensive than fixed-support algorithms, as they require complex and hardware-unfriendly operations. To the best of our knowledge, only the work in [30] implements a local ADSW stereo correspondence algorithm. That work proposes an accurate, hardware-friendly disparity estimation algorithm called mini-census ADSW, and its corresponding real-time VLSI architecture, which achieves 42 fps on 352 × 288 images. The work in [30] makes a good effort to reduce the computational complexity of the ADSW algorithm and achieve real-time performance. However, it achieves real-time performance only for relatively small-sized images.
This work investigates the integration of edge-directed information into hardware architectures of stereo correspondence algorithms, as a means to restrict the stereo correspondence process only to the specific features (edges) of the input images (rather than to all image points), thus reducing the search space considerably. The proposed approach targets real-time embedded vision applications and can benefit both fixed-support and ADSW algorithms. When applied to fixed-support algorithms, it leads to very efficient implementations that can effectively address the hard real-time constraints of such applications even when using high-resolution images and large disparity ranges. When applied to ADSW algorithms, the proposed approach provides a good tradeoff between accuracy of the disparity maps and processing speed; the reduction of the search space leads to real-time implementations for image sizes which are larger than the ones used in [30]. A comparison between the proposed work and other existing disparity map computation systems is given in Table 7.

Edge-Based Stereo Correspondence Process Overview
3.1. Summary of the Algorithm. In this section we present the overall flow of the edge-based stereo correspondence process. The description is generic and applies to both fixed-support and adaptive-support weight (ADSW) algorithms.
In general, there are three major tasks associated with an edge-based stereo correspondence algorithm: edge detection, stereo correspondence, and interpolation. Figure 1 illustrates the algorithm's steps towards computing a disparity value for each pixel p in $I_r$. The algorithm starts by determining whether a pixel p corresponds to an edge or not. For each pixel p corresponding to an edge, the algorithm extracts an m × m correlation (or support) window $W_r$ centered on p in $I_r$, and an m × m support window $W_t$ centered on q in $I_t$ (the coordinate of q is (x + d, y), where d lies in the range $d_m$ to $d_M$). The algorithm then computes a pointwise score for any pixel $p_i \in W_r$ corresponding to $q_i \in W_t$ and aggregates the scores spatially over the support windows.
Fixed-support and ADSW stereo correspondence algorithms differ in the way they perform the pointwise score computation and aggregation steps. The fixed-support algorithm computes a pointwise score for any pixel $p_i \in W_r$ corresponding to $q_i \in W_t$ as the absolute difference (AD) of $p_i$ and $q_i$. After the computation of the AD values, the final aggregated cost is computed by summing all pointwise scores as

$$C(p_c, q_c) = \sum_{p_i \in W_r,\; q_i \in W_t} \lvert p_i - q_i \rvert. \qquad (1)$$

Equation (1) holds only for the case of a fixed-support algorithm. In the case of an ADSW algorithm, each pointwise score is computed by multiplying the absolute difference of $p_i$ and $q_i$ by a weight coefficient $w_r(p_i, p_c)$ and a weight coefficient $w_t(q_i, q_c)$. These weight coefficients are assigned to each pixel in a support window, based on the pixel's spatial distance, as well as on its distance in the CIELAB color space with regard to the central pixel ($p_c$ or $q_c$) in the window. The final aggregated cost is computed by summing up all the weighted pointwise scores and normalizing by the weights sum as in

$$C(p_c, q_c) = \frac{\sum_{p_i \in W_r,\; q_i \in W_t} w_r(p_i, p_c)\, w_t(q_i, q_c)\, \lvert p_i - q_i \rvert}{\sum_{p_i \in W_r,\; q_i \in W_t} w_r(p_i, p_c)\, w_t(q_i, q_c)}, \qquad w_r(p_i, p_c) = \exp\left(-\frac{d_p(p_i, p_c)}{\gamma_p} - \frac{d_c(p_i, p_c)}{\gamma_c}\right), \qquad (2)$$

with $w_t(q_i, q_c)$ defined analogously, where $d_p$ and $d_c$ are the Euclidean distances between two coordinate pairs and two triplets in the CIELAB color space, respectively, and $\gamma_p$ and $\gamma_c$ are two parameters of the algorithm.

Both the fixed-support and ADSW algorithms compute an aggregated cost for all disparity levels in the range [$d_m$, $d_M$], and the best disparity for the pixel p is found by locating the disparity with the minimum aggregated cost through a winner-takes-all (WTA) approach. In the case where pixel p does not correspond to an edge, the disparity is obtained by an interpolation method.
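To make the difference between the two aggregation schemes concrete, the following Python sketch computes the fixed-support (SAD) cost, an ADSW cost, and the WTA selection for small windows. It is a minimal software illustration, not the hardware design: the function names are hypothetical, the windows are grayscale (so the color distance is a plain intensity difference standing in for the CIELAB distance), the exponential weight form follows the usual ADSW formulation of [4], and the default values of `gamma_p` and `gamma_c` are arbitrary assumptions.

```python
import math

def sad_cost(win_r, win_t):
    """Fixed-support cost: sum of absolute differences over two m x m windows."""
    return sum(abs(p - q)
               for row_r, row_t in zip(win_r, win_t)
               for p, q in zip(row_r, row_t))

def support_weights(win, gamma_p, gamma_c):
    """Weight of every pixel in a window relative to the window's central pixel."""
    m = len(win)
    c = m // 2
    center = win[c][c]
    weights = []
    for y in range(m):
        row = []
        for x in range(m):
            d_p = math.hypot(x - c, y - c)   # spatial (Euclidean) distance
            d_c = abs(win[y][x] - center)    # assumption: grayscale stand-in for CIELAB distance
            row.append(math.exp(-(d_p / gamma_p + d_c / gamma_c)))
        weights.append(row)
    return weights

def adsw_cost(win_r, win_t, gamma_p=5.0, gamma_c=10.0):
    """ADSW cost: weighted absolute differences, normalized by the sum of the weights."""
    w_r = support_weights(win_r, gamma_p, gamma_c)
    w_t = support_weights(win_t, gamma_p, gamma_c)
    num = den = 0.0
    m = len(win_r)
    for y in range(m):
        for x in range(m):
            w = w_r[y][x] * w_t[y][x]
            num += w * abs(win_r[y][x] - win_t[y][x])
            den += w
    return num / den

def wta(costs):
    """Winner-takes-all: index of the minimum cost (ties go to the lowest index)."""
    return min(range(len(costs)), key=costs.__getitem__)
```

In use, one cost would be computed per candidate disparity and `wta` applied to the resulting list; the index it returns is relative to the start of the disparity range.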

Discussion.
The edge-based stereo correspondence algorithm described above applies an edge detection process over the input image $I_r$ prior to performing the correlation step. Edge detection returns locations in the image that indicate the presence of an edge [1]. These locations, described by the edge points, determine the outline (i.e., perimeter) of an object and distinguish it from the background and other objects [1,32]. The correlation window $W_r$ in $I_r$ is moved only to the edges (and not to each possible pixel) along the working scanline, resulting in a considerable reduction in the search space. As a result, the correlation step computes a disparity value only for the pixels that correspond to an edge, while the disparity values of the remaining pixels are computed using interpolation, which is computationally less expensive than stereo correspondence.
Another consideration of the edge-based stereo correspondence process is whether it uses the edges only as directive points or as matching primitives as well (instead of using pixel intensities). Since edges are encoded using only one bit per pixel, rather than 8 bits (as used in grayscale images), matching edges instead of pixel intensities reduces the computational space and complexity of the search and match operations, as well as the data path requirements. However, this can be applied only to the fixed-support algorithm. Computing pointwise scores using binary data returns only zeros or ones, and this would invalidate the weights in the ADSW algorithm.
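The edge-directed flow discussed in this section can be sketched for a single scanline as follows. This is only an illustrative Python sketch: `edge_directed_scanline` and its `match_at` callback are hypothetical names, the matcher itself is left abstract, and nearest-neighbor filling is assumed as the interpolation method (ties resolved toward the left edge), which is one possible choice.

```python
def edge_directed_scanline(edges, match_at):
    """Compute disparities for one scanline, matching only at edge pixels.

    edges:    list of 0/1 flags, one per pixel of the reference scanline.
    match_at: callable x -> disparity; it is invoked only at edge positions.
    """
    n = len(edges)
    disp = [None] * n

    # Stereo correspondence step: restricted to the edge points.
    for x in range(n):
        if edges[x]:
            disp[x] = match_at(x)

    edge_xs = [x for x in range(n) if edges[x]]
    if not edge_xs:
        return [0] * n  # assumption: default disparity when a scanline has no edges

    # Interpolation step: non-edge pixels copy the disparity of the nearest edge.
    for x in range(n):
        if disp[x] is None:
            nearest = min(edge_xs, key=lambda e: abs(e - x))
            disp[x] = disp[nearest]
    return disp
```

The number of calls to `match_at` equals the number of edge points on the scanline, which is where the reduction of the search space comes from.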

Hardware Realization of Edge-Accelerated Disparity Map Computation Architectures
This section presents the hardware architectures of two different edge-based stereo correspondence algorithms: an SAD-based fixed-support algorithm and an ADSW-based algorithm. We present design principles and considerations about the implementation of these algorithms on hardware, and discuss issues related to the memory structures needed, the organization of the processing blocks, the parallelism exploited, and so forth.

The fixed-support architecture uses edges both as directive points and as matching primitives. This requires that the architecture be able to perform edge detection on both input images. Moreover, the use of binary data during the matching process reduces the computation of the SAD values (cost computation step) to a Hamming distance operation that can be directly implemented in hardware using only addition and 1-bit subtraction operations. The simplicity of the logic required during the cost computation step enables a parallel design of the architecture, so that a large number of search and match operations can be performed in a single cycle. However, the amount of parallelism that can be extracted depends on the ability of the edge detector to generate edge points (as the matching process is performed using edges). In this work we integrate the Sobel detector mainly due to its simple hardware structure, which can be easily parallelized to provide more than one edge point per cycle. While the use of edge information reduces the search space and simplifies the logic required, the hardware implementation of the algorithm is not straightforward; the overall design poses significant challenges in terms of the memory structures used to account for the irregular data rates caused by the edge-only computation. The architecture requires a clever arrangement of the on-chip memory in order to be able to process multiple support windows centered at the edge points, while skipping, every clock cycle, the nonedge points found between successive edges.

The EDU and the DCU, which communicate through the use of internal memory (FIFO queues), are pipelined, and thus operate concurrently. They are also provided with scanline buffers, which temporarily store the pixels needed to perform convolution (in the case of the EDU) or correlation (in the case of the DCU). This reduces the clock cycles required to load image data from the input port, by exploiting the fact that working windows moved over the image use overlapping pixels. The scanline buffers are organized into FIFO structures, and their size depends on the size of the working window (3 × 3 for the EDU, m × m for the DCU) and the width of the image, N. The delay to fill the scanline buffers is determined by the I/O bandwidth.
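The role of the scanline buffers can be illustrated with a small software analogue. The Python sketch below streams pixels in raster order through a single FIFO and emits every m × m window once enough data has arrived, so each pixel is read from the input only once. The function name `stream_windows` and the buffer depth of (m − 1) full scanlines plus m pixels are assumptions of this sketch; the hardware buffers described above store m scanlines and expose their contents in parallel.

```python
from collections import deque

def stream_windows(pixels, width, m):
    """Yield (x, y, window) for every m x m window as pixels arrive in raster order.

    (x, y) is the top-left corner of the window; window is a list of m rows.
    """
    # Assumption: (m - 1) scanlines plus m pixels is the minimum FIFO depth
    # needed to hold one complete m x m window.
    buf = deque(maxlen=(m - 1) * width + m)
    for i, px in enumerate(pixels):
        buf.append(px)
        y, x = divmod(i, width)
        # A window is complete once the FIFO is full and the newest pixel
        # is at least m - 1 columns into its scanline.
        if len(buf) == buf.maxlen and x >= m - 1:
            window = [[buf[r * width + c] for c in range(m)] for r in range(m)]
            yield x - m + 1, y - m + 1, window
```

Consecutive windows on the same scanline share m × (m − 1) pixels, which is the overlap the buffers exploit.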

EDU Architecture Overview. The Edge Detection Unit (EDU) integrated into the system implements the Sobel edge detector, which performs a 2D spatial gradient measurement on the input grayscale images using a pair of 3 × 3 convolution masks. The masks hold data values between −2 and 2; thus the overall convolution can be implemented in hardware using shifters instead of multipliers. We integrated this detector mainly due to its simple hardware structure, which can be easily parallelized to provide more than one edge point per cycle. This is important, as the ability of the DCU to perform parallel computations depends on the ability of the EDU to provide multiple edge points per cycle. The architecture of the EDU is shown in Figure 2.

The DCU was designed with emphasis on parallelism, targeting a large number of search and match operations performed in a single clock cycle. This is facilitated by the simplicity of the adders and subtractors used (due to the use of binary data), and by the organization of the adders and comparators in tree structures. Due to its pipelined and parallel structure, the DCU presents good scalability in terms of correlation window size and disparity range. In particular, it can compute the SAD values for all possible positions of the shifting window with a maximum size of 11 × 11 in two clock cycles, and the minimum SAD value for a maximum of 120 disparity levels in three cycles. Once the pipeline fills up, and as long as the FIFO queues are not empty, the DCU can provide a disparity value at the output every clock cycle.
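The multiplier-free Sobel convolution used by the EDU can be sketched in Python as follows. Because the mask coefficients are limited to 0, ±1, and ±2, the only "multiplication" is a left shift by one. The function names are hypothetical, and two details are assumptions of this sketch rather than statements about the EDU: the gradient magnitude is approximated as |Gx| + |Gy|, and the edge decision uses a single fixed threshold.

```python
def sobel_gradients(win):
    """Horizontal and vertical Sobel responses of a 3x3 window, using shifts only."""
    (a, b, c), (d, _e, f), (g, h, i) = win
    gx = (c + (f << 1) + i) - (a + (d << 1) + g)   # right column minus left column
    gy = (g + (h << 1) + i) - (a + (b << 1) + c)   # bottom row minus top row
    return gx, gy

def sobel_edge(win, threshold):
    """Return 1 if the window's central pixel is classified as an edge, else 0."""
    gx, gy = sobel_gradients(win)
    magnitude = abs(gx) + abs(gy)   # assumption: |G| approximated as |Gx| + |Gy|
    return 1 if magnitude > threshold else 0
```

Since each window is processed independently, several instances of this computation can run side by side on neighboring windows, which is the kind of parallelism that yields more than one edge point per cycle.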

DCU Memory Architecture and Edge Tracking Process.
To keep a constant flow of data in the pipeline, the DCU must be able to locate one edge in the reference image every clock cycle, while discarding the non-edge points found between successive edges. At the same time, the DCU must have parallel access to the m × m window surrounding the edge found in the reference image, as well as to the corresponding $d_M$ windows from the target image. For these reasons, each scanline buffer used in the DCU consists of a series of 16-bit registers and can store m scanlines from an input image. We avoid using 1-bit registers in order to facilitate more parallelism and to make the process of discarding the non-edge points fast. The 16-bit registers are organized into FIFO structures and allow parallel access to their elements. Specifically, the scanline buffers for the reference image (SBR) output 16 successive m × m windows (stored in the candidate window buffers), while the scanline buffers for the target image (SBT) output 16 successive m × (m + $d_M$) windows (stored in the candidate disparity range buffers). The SBR also outputs a 16-bit vector (search vector) from positions 1 + w to 16 + w of the (w + 1)th scanline, where w = (m − 1)/2. The search vector is searched for potential edge points by the edge tracking unit (ETU), which is the connection point between the INPUT/OUTPUT stage and the remaining stages. The ETU works by locating an edge and its corresponding position in the 16-bit search vector every clock cycle. The positions of the edges found during the searching process are used to select the window and disparity range buffers (among the 16 candidates) corresponding to the edge points found; the selected buffers become the input of the next pipeline stage. It must be noted that the ETU requires from 1 cycle (in the best case) to 16 cycles (in the worst case) to locate all edges in the search vector. During this period, the edge found signal is set to 1 and the scanline buffers are disabled, so that the content of the candidate window and candidate disparity range buffers remains constant. When all edges in the search vector are located, the ETU sets the edge found signal to 0. This informs the I/O controller to fetch new edges from the input FIFO queues and to shift the scanline buffers to the right.

(i) INPUT/OUTPUT. This pipeline stage fetches pixel data from the I/O port, executes the edge tracking process, and selects the windows corresponding to the edges found. The data fetched from the input port (the two 16-bit vectors produced by the EDU) is stored into FIFO queues. The I/O controller reads data from the input FIFO queues (if they are not empty) and forwards data to the scanline buffers (16 bits to each scanline buffer), until the first m scanlines from both images are stored into the scanline buffers. After the scanline buffers are filled, the I/O controller reads new pixel data from the queues only if the edge found signal is set to 0 by the ETU. While the edge found signal is set to 1, the scanline buffers are disabled, and during this period the edge tracking process described above is performed. If there is new data available at the input during this period, this data is written to the input queues. Furthermore, during this pipeline stage, the I/O controller writes the disparity value computed in the previous cycle to the output port.
(ii) SADs COMPUTATION. The SAD values for all disparity levels are computed during this pipeline stage. The stage consists of $d_M$ absolute difference (ABDIF) units, which compute the absolute difference between the m × m correlation window (stored in the window buffer) and the $d_M$ windows of size m × m (stored in the disparity range buffer). Each ABDIF unit receives as input two $m^2$-bit vectors, whose elements are the edge points of the correlation windows, and consists of $m^2$ 1-bit subtractors that compute the absolute difference of the edge points. The output of each ABDIF unit is an $m^2$-bit vector, whose bits are then summed using binary tree adders (BTA). Given the 11 × 11 maximum supported correlation window size and 1-bit pixel intensities, the result of the addition operation cannot be greater than 121. As such, the outputs of the BTA units are 7-bit values.
(iii) MINSAD COMPUTATION. The SAD values for all disparity levels in the range [1 : $d_M$] are compared with each other in order to compute the minimum value and its disparity. The comparison is carried out by a collection of 7-bit comparators and registers, arranged in a tree structure to reduce the delay of the longest path. As stated previously, this stage was further divided into 3 pipeline stages in order to meet the targeted operating frequency (100 MHz). Figure 3 shows the circuit that computes the minimum SAD value and its disparity. Each minSAD unit receives as input two 14-bit vectors, each of which is a concatenation of an SAD value (7 bits) and its corresponding disparity (7 bits, for up to 120 disparity levels). The minSAD unit compares the two SAD values and outputs the minimum of them along with its disparity. The entire circuit for computing the minimum SAD value and its disparity consists of multiple minSAD units, arranged in a binary tree of $\log_2(d_M)$ levels.
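The three operations at the heart of the DCU pipeline (edge tracking, binary SAD, and tree-based minimum selection) can be modeled in a few lines of Python. This is a behavioral sketch under stated assumptions, not a description of the actual circuits: all function names are hypothetical, windows are packed into Python integers with one bit per pixel, the search vector is scanned from the least significant bit (the scan order is not specified in the text), disparities are indexed from 0, and ties in the minimum search go to the lower disparity.

```python
def edge_positions(search_vector, width=16):
    """ETU analogue: bit positions of the edges in a search vector.

    Each loop iteration finds one edge, mirroring one edge per clock cycle.
    """
    positions = []
    v = search_vector & ((1 << width) - 1)
    while v:
        low = v & -v                          # isolate the lowest set bit
        positions.append(low.bit_length() - 1)
        v ^= low                              # clear it and continue
    return positions

def binary_sad(win_r_bits, win_t_bits):
    """SAD of two 1-bit windows packed into integers (a Hamming distance)."""
    return bin(win_r_bits ^ win_t_bits).count("1")

def min_sad_tree(sads):
    """Pairwise (tree) reduction returning (minimum SAD, disparity index)."""
    level = [(s, d) for d, s in enumerate(sads)]
    while len(level) > 1:
        nxt = []
        for k in range(0, len(level) - 1, 2):
            a, b = level[k], level[k + 1]
            nxt.append(a if a[0] <= b[0] else b)   # one minSAD unit
        if len(level) % 2:
            nxt.append(level[-1])                  # odd element passes through
        level = nxt
    return level[0]
```

For an 11 × 11 window, `binary_sad` can return at most 121, which is why 7 bits suffice for each SAD value.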

Design Issues and Requirements. This architecture implements an ADSW algorithm, which is computationally more expensive than the fixed-support algorithm in terms of both calculations and memory requirements. This is largely due to the complex cost function involved in the algorithm (2). In this work, we adopted some hardware optimization techniques that aim to simplify the algorithm and make it hardware-friendly and suitable for embedded constraints. We first eliminate the use of the spatial distance during the weight computation step, based on our observation that it only slightly affects the accuracy of the disparity maps. We also use the YUV instead of the CIELAB color representation during the computation of the weight coefficients. This allows the use of unsigned integers instead of signed floating-point numbers, which are complex and hardware-unfriendly. Furthermore, the computation of the color distance between two YUV triplets is performed using the Manhattan rather than the Euclidean distance. In this way, the square and square root operations are replaced by simple absolute difference and addition operations. In addition, the exp(−x) function is approximated by the $2^{8-x}$ function, which assigns a maximum weight of 256 if the color distance is zero and a weight of 0 if the color distance is greater than 8. This function simplifies the circuits that implement the multiplication of the weight coefficients with the pointwise scores, as multiplications are reduced to left shift operations. The cost function is further simplified by setting $\gamma_c$ to a power of 2 (32 in our case). This converts the division to a right shift operation. Lastly, the denominator of (2) is approximated by the nearest power of 2 during the cost aggregation step, allowing the division to be replaced by a right shift operation.

[...] ensuring that there is always sufficient data for the CS, which is responsible for the calculation of the disparities.
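The simplified weight computation described in this section can be illustrated with the following Python sketch. It combines the Manhattan distance between two YUV triplets, the division by $\gamma_c = 32$ implemented as a right shift by 5, and the $2^{8-x}$ approximation of the exponential. The function names are hypothetical, and the truncating behavior of the shift (rounding the scaled distance down) is an assumption of this sketch.

```python
def manhattan_yuv(a, b):
    """Manhattan distance between two (Y, U, V) triplets of unsigned integers."""
    return sum(abs(x - y) for x, y in zip(a, b))

def approx_weight(pixel, center, gamma_shift=5):
    """Approximate support weight 2**(8 - x), where x = color distance / 32.

    The division by gamma_c = 32 is a right shift by gamma_shift = 5.
    Weights for x > 8 would be below 1 and are therefore set to 0.
    """
    x = manhattan_yuv(pixel, center) >> gamma_shift
    return (1 << (8 - x)) if x <= 8 else 0
```

Because every nonzero weight is a power of 2, multiplying a pointwise score by a weight reduces to a left shift by 8 − x positions.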

General
The overall system architecture also consists of a control unit that coordinates data transfers and handshakes between the different system units. Figure 4 shows a block diagram of the architecture and the data flow between units. The data from the queues is forwarded to the window buffers, which form the inputs of the CS. The window buffer of the reference MA consists of m × m 8-bit registers, while the window buffer of the target MA consists of $m \cdot (m + d_M - 1)$ registers ($d_m$ is set to 0). The use of registers allows parallel access to the window buffers; after an initial delay of $m \cdot (m + d_M - 1)$ cycles per scanline (dominated by the cycles needed to fill in the window buffer of the target MA), both on-chip MAs can provide an m × m window per cycle. The window buffer of the target MA is organized in a cyclic structure and is provided with a series of multiplexers at its input, which determine whether the input comes from the FIFO queues or from the rightmost registers of the window buffer. This structure is adopted to enable data reuse.

Calculation Stage (CS). The CS consists of an Edge Detection Unit, two units for the generation of the weight coefficients (weight generators), a unit that computes the aggregated costs and selects the disparity with the minimum cost, and a unit responsible for the nearest neighbor (NN) interpolation step. The Edge Detection Unit implements the Canny operator, which performs image smoothing using Gaussian convolution, vertical and horizontal gradient and angle calculation, nonmaximum suppression, and thresholding. The detector employs hardware features such as parallelism and pipelining in order to provide an edge point every clock cycle. The weight generator computes the weight coefficients $w_{r,t}$ for a support window $W_{r,t}$ in parallel, based on the YUV color values fetched from the [Y,U,V] MAs, by utilizing $m^2$ instances of the circuit shown in Figure 6. That circuit consists of a Manhattan distance core and a weight table (LUT). Since the multiplication of the pointwise scores by the weight coefficients (cost aggregation step) is performed using shifters instead of multipliers, each location x of the LUT stores the shift amount corresponding to the weight coefficient $2^{8-x}$. This shift amount is equal to the binary logarithm of the weight, that is, $8 - x$, except for values of x greater than 8, for which the weight is zero and a binary logarithm does not exist. In that special case, the corresponding entries in the LUT are set to a number that is large enough for the result of a shift operation by that number to be equal to zero. The weight coefficient $w_{r,t}(i, j)$ is looked up in the LUT using the color distance generated by the Manhattan distance core as the index.
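The shift-amount LUT can be modeled in software as follows. This is a sketch under stated assumptions: the 16-bit shifter output width, the LUT depth of 32 entries, and all names are hypothetical choices made for illustration, since the text does not give them. A fixed-width mask emulates the hardware behavior in which shifting by a sufficiently large amount yields zero.

```python
ACC_WIDTH = 16                     # assumed width of the shifter output, in bits
ACC_MASK = (1 << ACC_WIDTH) - 1
ZERO_SHIFT = ACC_WIDTH             # shifting an 8-bit score this far clears every bit
LUT_DEPTH = 32                     # assumed number of LUT entries

# Entry x holds log2(2**(8 - x)) = 8 - x; entries beyond 8 force a zero result.
SHIFT_LUT = [8 - x if x <= 8 else ZERO_SHIFT for x in range(LUT_DEPTH)]


def apply_weight(score, x):
    """Weight an 8-bit pointwise score by left-shifting it by the LUT entry for x."""
    shift = SHIFT_LUT[min(x, LUT_DEPTH - 1)]
    return (score << shift) & ACC_MASK


print(apply_weight(3, 0), apply_weight(255, 8), apply_weight(255, 9))
```

With these widths the largest legitimate product, 255 shifted left by 8, still fits in the 16-bit result, while any index above 8 produces zero without a separate comparison.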
The architecture of the cost aggregator and WTA unit is shown in Figure 7. The unit has been designed in a parallel manner and utilizes $m^2$ absolute difference circuits that compute the pointwise scores between corresponding pixels in $W_r$ and $W_t$. Those scores are then shifted by the shift amounts corresponding to the weight coefficients $w_r$ and $w_t$ using a series of left shifters (equivalent to multiplying the scores by $w_r$ and $w_t$). The final aggregated cost is computed by summing the outputs of the left shifters using an adder tree, and then normalizing (dividing) it by the sum of the weights, which, before being used for division, is rounded to the nearest power of 2 by a tree of comparators. This enables a cost-effective implementation of the division using a right shifter. Finally, the WTA unit is responsible for selecting the disparity with the minimum cost.
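The data path of the cost aggregator and WTA unit can be sketched sequentially in software as shown below. The sketch assumes that the normalizing weight sum is the sum of the products $w_r \cdot w_t$ over the window, that zero weights are represented by `None`, and that ties in the power-of-two rounding go upward; these choices and all function names are hypothetical and not taken from the paper.

```python
def nearest_pow2_exponent(n):
    """Exponent k such that 2**k is the power of two nearest to n (n >= 1)."""
    lower = n.bit_length() - 1
    # Round up when n is at least halfway between 2**lower and 2**(lower + 1).
    if n - (1 << lower) >= (1 << (lower + 1)) - n:
        return lower + 1
    return lower


def aggregated_cost(ref_win, tgt_win, ref_shifts, tgt_shifts):
    """Weighted cost of one support window pair, using shifts only."""
    numerator = 0
    weight_sum = 0
    for r, t, sr, st in zip(ref_win, tgt_win, ref_shifts, tgt_shifts):
        if sr is None or st is None:      # zero weight: contributes nothing
            continue
        shift = sr + st                   # multiply by w_r * w_t = 2**(sr + st)
        numerator += abs(r - t) << shift  # absolute difference, then left shift
        weight_sum += 1 << shift
    if weight_sum == 0:
        return 0
    # Division by the weight sum becomes a right shift.
    return numerator >> nearest_pow2_exponent(weight_sum)


def winner_takes_all(costs):
    """Index (disparity) of the minimum aggregated cost."""
    return min(range(len(costs)), key=costs.__getitem__)
```

In hardware the loop body is replicated $m^2$ times and the two sums are formed by adder trees, so the sequential loop here only mirrors the arithmetic, not the timing.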

Experimental Platform and Methodology.
The architectures presented in this paper have been implemented on the ML505 Evaluation Platform, which features a Virtex-5 LX110T FPGA. The basic features of the edge-directed disparity map computation architectures are listed in Table 1, while more details are presented in the following subsections. We evaluated the architectures using rectified synthetic and real-world data, initially stored on the compact flash memory card. The synthetic data includes stereo images from the Middlebury database [2] and the pedestrian dataset in [33], and the real-world data includes stereo images taken in the lab. The images were loaded into the on-board DRAM using the MicroBlaze soft processor and were used as input to the systems shown in Figures 2 and 4, respectively. The resulting disparity maps were directed to a TFT monitor.
The evaluation results for the synthetic images are shown in Figure 8, while the evaluation results for the real-world images are shown in Figure 9. The architectures were compared in terms of processing speed, disparity map accuracy, and FPGA resource utilization. They were also compared against their equivalent hardware systems that do not integrate the edge detectors, in order to emphasize the benefits of the edge-based approach adopted by the architectures.

Disparity Map Quality-Impact of Edge Detection.
To evaluate the quality of the disparity maps generated by the proposed edge-directed hardware architectures, and to examine the impact of the edge detection algorithm on overall system quality, we use Middlebury stereo pairs, as well as stereo pairs from [33], for which the ground truth disparity maps are known. We measure the incorrect disparity estimates using the percentage of bad pixels, a commonly accepted metric [2].
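For reference, the percentage of bad pixels can be computed as in the sketch below. The error threshold of one disparity level is a common choice for this metric but is an assumption here, as the text does not state the threshold used; the sketch also ignores occlusion and border masks, and the function name is hypothetical.

```python
def percent_bad_pixels(disparity, ground_truth, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds the threshold."""
    total = 0
    bad = 0
    for row_d, row_g in zip(disparity, ground_truth):
        for d, g in zip(row_d, row_g):
            total += 1
            if abs(d - g) > threshold:
                bad += 1
    return 100.0 * bad / total


computed = [[5, 6], [7, 10]]
truth = [[5, 7], [7, 7]]
print(percent_bad_pixels(computed, truth))  # one of four pixels is off by more than 1
```

A lower value indicates a more accurate disparity map, which is the sense in which the detectors and systems are compared in the tables.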
In our previous work [7], we investigated the impact of three different edge detectors (Sobel, Canny, and Evolvable [34]) on the accuracy of the disparity maps generated by an SAD-based, fixed-support algorithm, in order to select the best detector to be integrated into the system described in [7]. As discussed in that paper, we selected the Sobel detector, both for its simplicity on an FPGA and for its good accuracy. In this paper, we also investigate the impact of the same detectors on the ADSW algorithm. The percentages of bad pixels for the Tsukuba and Venus stereo image pairs are given in Table 2. As can be seen, the Canny detector achieves on average better accuracy (a lower percentage of bad pixels) than the other two edge detectors when applied to the ADSW algorithm, but the quality of the disparity maps is lower when the Canny detector is used in the fixed-support algorithm. This difference in the behavior of the Canny detector in the two algorithms lies in the fact that the fixed-support algorithm utilizes edges both as directive points and as matching primitives. As a result, the fixed-support algorithm works better if the edge detector preserves not only strong edges, but also "busy" texture regions (edge regions with numerous small edge elements), which aid the matching process. The Canny detector does not work well with the fixed-support algorithm precisely because it is less likely to be fooled by noise and more likely to detect only true weak edges, so it discards much of the busy texture on which the fixed-support matching relies. The ADSW algorithm does not use edges as matching primitives, and we have therefore selected the Canny detector to be integrated into the ADSW-based disparity map computation architecture.
For a detailed quality analysis, we compare the disparity maps generated by the edge-directed architectures to the disparity maps generated by their equivalent hardware architectures that do not integrate the edge detectors (stand-alone fixed-support and ADSW architectures). The stand-alone fixed-support architecture has a structure similar to that of the edge-directed architecture shown in Figure 2 (without the EDU, the ETU, and the candidate window and disparity range buffers) and processes 8-bit pixels instead of edges. The stand-alone ADSW architecture has a structure similar to that of the edge-directed architecture shown in Figure 4 (without the edge detection and interpolation units). For simplicity, we will refer to each hardware system configuration as "System 1-4," as defined in Table 3.
The results obtained by comparing the disparity maps of Systems 1-4 to the ground truth disparity maps are presented in Table 4 for two sample image pairs. The resulting disparity maps are also given in Figure 10 for a qualitative comparison. The disparity maps generated by the systems that implement the ADSW algorithm are more accurate, particularly at depth discontinuity regions and at regions with repetitive patterns, where ADSW algorithms traditionally work better. The results also indicate that the use of the edge detectors does not negatively impact the accuracy of the disparity maps; in some cases it improves the accuracy, especially at depth borders. This is because the stand-alone architectures suffer from changes in pixel intensities, while the edge-directed architectures potentially mitigate this problem by limiting the correspondence search to specific reliable features in the images (edges in our case). It must be noted that the aforementioned results are based on a nearest neighbor interpolation method. It is anticipated that the proposed edge-directed hardware systems can produce disparity maps of better quality by integrating more complex interpolation methods such as bilinear and bicubic interpolation; this, however, is left as part of future work.

Processing Speed.
The processing speed of the edge-directed disparity computation architectures described in Section 4 is measured in frames per second (fps) and is affected by several algorithmic and implementation-specific parameters. The algorithmic parameters include the amount of data reduction (the percentage of non-edge points over the total image points) obtained from the edge detectors, the image size, the representation of the input images (grayscale or color), and the disparity range. The implementation-specific parameters include the operating frequency, the I/O bandwidth from the external memory, and the available FPGA resources.
The architecture based on the fixed-support approach (System 2) has lower requirements in terms of input bandwidth, as it processes grayscale stereo images. Therefore, the external memory bandwidth can be exploited to fetch multiple pixels per clock cycle, whereas the architecture that adopts the ADSW algorithm (System 4) fetches one pixel per input image. The frame rate of System 2 is also independent of the support window size and the disparity range, as that system utilizes the edge information not only as directive points, but also as matching primitives, thus exploiting more parallelism. System 2 can compute the aggregated costs for all $d_M$ support windows in a single clock cycle. System 4 requires $d_M$ clock cycles to compute the costs for all disparity levels, as that system processes color stereo images and uses grayscale pixels as matching primitives. Thus, it can exploit much less parallelism for a constant amount of FPGA resources; in particular, it processes a support window every clock cycle, but requires $d_M$ clock cycles to compute a disparity value when the pixel being processed is an edge. System 4, however, performs the matching process on a smaller number of edge points, since it is directed by the Canny detector, which achieves a larger percentage of data reduction than the Sobel detector. With respect to the I/O bandwidth and the image size, the performance of both system architectures decreases as the image size increases and as the I/O bandwidth decreases: in the former case there is more data to be processed, while in the latter case the amount of data flowing into the system limits the system throughput.
To evaluate the processing speed of the edge-based architectures, we compare them with their equivalent system architectures that do not integrate the edge detectors (stand-alone architectures). We report the speedup of the proposed architectures over the stand-alone architectures and provide results for increasing input image sizes and disparity ranges, in order to illustrate the impact of the edge detector in all cases. We used synthetic stereo images from [2, 33] as benchmarks in order to extract the processing speed of the systems mentioned above. The results are given in Tables 5 and 6.
As can be seen, the processing speed is inversely proportional to the image size in all cases. (The frame rates for the systems without the edge detectors are projected frame rates.) Another particular observation is that the architectures based on the ADSW approach are slower even though they operate at higher operating frequencies and have larger memory bandwidth; this is due to the high computational complexity of the ADSW algorithm and to the limited parallelism that such an algorithm can exploit on a resource-constrained implementation platform such as an FPGA. The most important observation is that the use of edge detection offers significant speedups over the stand-alone architectures for all image sizes. The impact of the edge detector is even more pronounced in the architecture based on the ADSW algorithm, as the data reduction of the Canny detector is larger than that of the Sobel detector. While the use of the edge detector in System 4 yields a speedup of ∼5, the speedup obtained in System 2 is ∼2. This is because System 2 was designed with more parallelism and provides a disparity value every clock cycle, irrespective of whether or not the pixel being processed is an edge. As a result, the edge detector in System 2 needs to provide multiple edges per clock cycle, so the bottleneck in this case is the Edge Detection Unit. Of course, the EDU in System 2 can be parallelized further, so the real limitation in terms of parallelism is the external memory I/O. On the other hand, System 4 exploits parallelism only at the level of a support window. If the pixel being processed is an edge, it requires $d_M$ cycles to compute the disparity, so the edge detector used in System 4 needs to provide only one edge every clock cycle. Table 6 indicates that the disparity range affects only the ADSW-based system architectures, due to the limited parallelism that can be exploited on the targeted FPGA. This, however, does not necessarily imply that the architecture is not scalable. The parallelism is limited by the amount of hardware resources; the ADSW-based architecture could be implemented to process multiple support windows in parallel on a larger FPGA, which would increase the frame rate as well.
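The order of magnitude of the speedup for the ADSW-based system can be checked with a first-order cycle model. The sketch below is an illustrative assumption, not a model given in the paper: it charges one cycle to every non-edge pixel and $d_M$ cycles to every edge pixel, assumes the stand-alone system spends $d_M$ cycles on every pixel, and ignores memory stalls, pipeline fill, and interpolation. The function names are hypothetical; the data reduction of 82.95% is the average reported for the Canny detector.

```python
def cycles_per_pixel(data_reduction, d_max):
    """Average cycles per pixel: 1 for a non-edge pixel, d_max for an edge pixel."""
    return data_reduction * 1 + (1.0 - data_reduction) * d_max


def estimated_speedup(data_reduction, d_max):
    """Speedup over a system that spends d_max cycles on every pixel."""
    return d_max / cycles_per_pixel(data_reduction, d_max)


# Average Canny data reduction (82.95%) and a disparity range of 64.
print(round(estimated_speedup(0.8295, 64), 2))
```

Under these assumptions the model gives a speedup of roughly 5.5 for a disparity range of 64, which is in line with the ∼5 reported for System 4. The model also shows why the benefit grows with the disparity range: the cost of each skipped pixel is proportional to $d_M$.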
Table 7 presents a comparison between existing implementations ([12, 13, 15, 16, 19, 20, 22-31]) and the proposed FPGA prototypes (System 2 and System 4). Performance is provided in frames per second (fps), as well as in points × disparity estimates per second (PDS). The proposed edge-directed architecture that implements the fixed-support algorithm (System 2) achieves a PDS of $23592 \times 10^6$ and is the fastest implementation listed in the table, when considering the input image size and the maximum disparity range supported. System 4 achieves a PDS of $922 \times 10^6$ for an image size of 800 × 600 and a disparity range of 64. Such a PDS seems sufficient considering that the algorithm implemented by System 4 is much more complex than the algorithms implemented by the other implementations in the table; only [30] implements an ADSW algorithm, and it achieves lower performance rates. In conclusion, the performance results indicate that the proposed edge-directed approach for accelerating hardware implementations of stereo correspondence algorithms exhibits high potential for applications with hard real-time constraints. System 2 can meet the real-time requirements of such applications, even for high-resolution images. Moreover, the combination of the edge-directed approach with the ADSW algorithm enables the realization of accurate disparity computation systems that can also satisfy the real-time requirements of many embedded vision applications such as pedestrian and obstacle detection.
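The relation between PDS and frame rate can be made explicit with a small sketch. It assumes the usual definition of the metric, PDS = width × height × disparity range × fps, which the text implies but does not write out; the function names are hypothetical.

```python
def pds(width, height, disparity_range, fps):
    """Points x disparity estimates per second."""
    return width * height * disparity_range * fps


def fps_from_pds(pds_value, width, height, disparity_range):
    """Frame rate implied by a PDS figure for a given image size and disparity range."""
    return pds_value / (width * height * disparity_range)


# System 4: PDS of 922e6 at 800 x 600 with a disparity range of 64.
print(round(fps_from_pds(922e6, 800, 600, 64), 1))
```

Under this definition, the PDS reported for System 4 corresponds to roughly 30 frames per second at 800 × 600, which is consistent with the real-time claim made for that system.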

Hardware Overheads.
The proposed FPGA systems were evaluated for relevant metrics such as area and operating frequency. Table 8 gives the overall hardware demands of the FPGA prototype that is based on the fixed-support algorithm (System 2). The table also lists the hardware overheads associated with each of the implemented components. A closer look at Tables 8 and 9 leads us to the following useful observations.
(i) Even though System 2 exploits parallelism aggressively in order to compute the cost values for all support windows in the range $[1, d_M]$ in a single clock cycle, it requires only ∼10% more FPGA LUTs than System 4. This indicates that the edge-based approach for stereo correspondence is much more efficient (in terms of hardware usage) when applied to a fixed-support algorithm, mainly because it performs correlation using binary data (edges), thus requiring very simple circuits during the cost computation and aggregation steps.
(ii) While the Sobel detector is computationally less demanding than the Canny detector, it requires ∼1% more FPGA LUTs. This is because the EDU is able to process 4 pixels per clock cycle (2 per input image); thus, the hardware resources have been replicated by a factor of 4. The Canny detector, on the other hand, outputs only one pixel per cycle.
(iii) The EDU operates at a slower clock frequency than the Canny detector. It would be possible to divide the EDU into more pipeline stages in order to increase its frequency, at the expense of further hardware usage. However, this would be pointless considering that the DCU works at a slower frequency. Moreover, the frame rates achieved by System 2 are extremely high, and therefore any further increase of the operating frequency to achieve higher frame rates would not justify the extra hardware cost.

Conclusion
Stereo correspondence is an important algorithm in several embedded vision applications that require real-time computation of depth information. This paper presented an overview of the use of edge information as a means to accelerate hardware implementations of stereo correspondence algorithms. The paper analyzed design issues and considerations associated with the edge-directed approach when applied to hardware implementations of both fixed-support and more complex ADSW algorithms. Our immediate plans involve the integration of other interpolation methods into the proposed architectures in order to achieve better disparity map accuracy. We also plan to investigate the applicability of the presented edge-directed approach to global stereo correspondence algorithms, which yield higher-quality disparity maps. Furthermore, the FPGA prototypes will be extended to full-custom ASIC designs in order to evaluate large-scale performance and power.

Figure 1: Steps of the edge-based stereo correspondence process.

4.1. Disparity Map Computation Hardware Architecture Based on the Fixed-Support Algorithm
4.1.1. Design Issues and Requirements. This architecture implements an SAD-based, fixed-support algorithm and uses edges both as directive points and as matching primitives.
(a). The EDU employs hardware features, such as parallelism and pipelining, in an effort to parallelize the repetitive calculations involved in the Sobel operation, and uses optimized memory structures in order to reduce redundant memory reads. The detector architecture consists of an I/O

Figure 2: Block diagram of the edge-accelerated hardware architecture that implements the fixed-support algorithm. (a) Edge Detection Unit, (b) Disparity Computation Unit.

4.1.6. DCU Pipeline Stages Description. The major pipeline stages of the DCU are described below.

Figure 4: The architecture of the ADSW algorithm.

Figure 6: Circuit that computes the weight of a single pixel.

Figure 7: Architecture of the cost aggregator and WTA.

Figure 8: Evaluation results for synthetic images. From left to right: Tsukuba, Venus, and pedestrian stereo images. From top to bottom: reference image, target image, output of the Sobel detector, output of the Canny detector, disparity maps generated by the fixed-support-based architecture, disparity maps generated by the ADSW-based architecture, ground truth disparity maps.

Figure 9: Evaluation results for real-world images. From left to right: reference image, target image, output of the Sobel detector, output of the Canny detector, disparity map generated by the fixed-support-based architecture, disparity map generated by the ADSW-based architecture.

Figure 10: Disparity maps of the Tsukuba and Venus image pairs for the system configurations listed in Table 3. From left to right: ground truth disparity map and disparity maps generated by System 1, System 2, System 3, and System 4, respectively.
Architecture. The IS and CS stages communicate through the use of on-chip buffers (memory arrangements (MAs)), which temporarily store the pixels required to perform correlation between $W_r$ in $I_r$ and the $d_M$ candidate support windows $W_t$ in $I_t$.
buffers. The column buffer consists of m 8-bit registers that store the pixels of an entire column of a support window. It receives one pixel per clock cycle and outputs a column every m cycles. The output column is stored in a series of m FIFO queues (1 pixel per queue), which allow the memory controller to fetch data continuously from the external memory to the on-chip MAs (given that there is free space in the queues), irrespective of the data consumption rate of the CS. The CS consumes data at irregular rates; it consumes a column in $d_M$ cycles if p is an edge and in a single cycle if p is not an edge.
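The decoupling role of the column buffer and the FIFO queues can be illustrated with a small behavioral model. This is a sketch only: the class name, the method names, and the queue depth are hypothetical, and the model captures the data movement described above rather than the actual register-transfer design.

```python
from collections import deque


class ColumnBuffer:
    """Behavioral model: m registers feeding m FIFO queues (one pixel per queue)."""

    def __init__(self, m, fifo_depth):
        self.m = m
        self.fifo_depth = fifo_depth          # assumed queue depth, in columns
        self.column = []                      # pixels of the column being assembled
        self.fifos = [deque() for _ in range(m)]

    def can_accept(self):
        """The memory controller may fetch only while the queues have free space."""
        return len(self.fifos[0]) < self.fifo_depth

    def push_pixel(self, pixel):
        """Store one pixel (one per cycle); return True when a column is emitted."""
        if not self.can_accept():
            raise RuntimeError("FIFO queues are full")
        self.column.append(pixel)
        if len(self.column) < self.m:
            return False
        for fifo, value in zip(self.fifos, self.column):
            fifo.append(value)                # one pixel of the column per queue
        self.column = []
        return True

    def pop_column(self):
        """Called by the consumer (the CS) whenever it needs a new column."""
        return [fifo.popleft() for fifo in self.fifos]
```

The producer side advances at one pixel per cycle while `can_accept()` holds, and the consumer calls `pop_column()` at its own irregular rate, so the queues absorb the difference between the two.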

Table 1: Features of the FPGA Prototypes.

Table 2: Quality comparison for different edge detectors.

The average percentages of data reduction associated with the results presented in Table 2 are 82.95%, 70.23%, and 57.67% for the Canny, Sobel, and Evolvable detectors, respectively.

Table 4: Comparison of disparity map quality between stand-alone and edge-directed architectures. Disparity map quality results for System 1 and System 3 were extracted using Matlab simulation.

Table 9: Hardware overheads of the edge-directed ADSW-based architecture (System 4) (image size = 800 × 600, max disparity range = 64, support window size = 11 × 11).
The slice LUTs are dominated by the cost aggregator and the weight generators, which consume ∼44% of the available LUTs. The slice registers are dominated by the on-chip MAs, which consume ∼28% of the available slice registers.