Dense stereo correspondence, which enables reconstruction of depth information in a scene, is of great importance in the field of computer vision. Recently, some local solutions based on matching-cost filtering with an edge-preserving filter have proved capable of achieving higher accuracy than global approaches. Unfortunately, the computational complexity of these algorithms is quadratic in the window size used to aggregate the matching costs. The recent trend has been to pursue higher accuracy with greater execution efficiency. This paper therefore proposes a new cost-aggregation module that computes the matching responses for all image pixels at a set of sampling points generated by a hierarchical clustering algorithm. The complexity of this implementation is linear both in the number of image pixels and in the number of clusters. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art local methods in terms of both accuracy and speed. Moreover, performance tests indicate that parameters such as the height of the hierarchical binary tree and the spatial and range standard deviations have a significant influence on time consumption and on the accuracy of the disparity maps.
Stereo correspondence between stereo images yields a depth image, also called a disparity map, which can be categorized as sparse or dense. Sparse disparity maps are obtained mainly using feature-based methods derived from human vision research [
Dense stereo correspondence algorithms can be classified as global or local according to whether they obtain disparities from global or local information. Global (energy-based) methods minimize a global cost function which combines matching costs and smoothness terms, using information derived from the whole image. These methods are time-consuming but very accurate [
This paper proposes a dense stereo correspondence approach very similar to the original adaptive support weight (ASW) method [
The main contributions of this paper include the following.
A novel matching-cost filtering model is proposed based on an edge-preserving filter for which the adaptive support weights are computed using a hierarchical clustering algorithm (as shown in Section
The computational complexity of the proposed method is essentially linear both in the number of image pixels and in the number of clusters, regardless of the matching window size and the intensity range (as described in Section
A new disparity refinement method is presented, which has been proved to be robust and effective for improving the accuracy of coarse disparity maps (as presented in Section
The influence of algorithm parameters on accuracy and efficiency is discussed, especially regarding the weight coefficient, the height of the hierarchical binary tree, and the size of the spatial and range standard deviations (as discussed in Section
The rest of this paper is organized as follows: Section
A disparity map is obtained by determining, within each local matching window, the disparity with the lowest matching cost, a strategy widely used in local algorithms. Many local methods for obtaining dense disparity maps have been proposed recently. For instance, adaptive-window methods [
Instead of searching for an optimal matching window of arbitrary size and shape, it is possible to aggregate costs after local smoothing within a matching window to reduce matching noise. Most noise can be reduced effectively by a linear filter, such as a Gaussian filter, but linear filtering produces the well-known “edge-fattening” phenomenon in the disparity map. The local filtering result is therefore a poor neighborhood representative near edge regions. To address this problem, the recently proposed ASW algorithm [
A literature review has provided a taxonomy and an evaluation of typical matching algorithms and has emphasized that such a coarse-to-fine algorithm generally performs the following four steps [
cost initialization, in which the matching costs for assigning different disparity hypotheses to different pixels are calculated;
cost aggregation, in which the initial matching costs are aggregated spatially over matching windows;
disparity optimization, in which a cost function is minimized to obtain the best disparity hypothesis for each pixel;
disparity refinement, in which the coarse disparity maps are post-processed to remove mismatches or to generate fine disparity maps.
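As an illustration, the four steps above can be sketched in a few lines of numpy; the `aggregate` and `refine` callables stand in for the aggregation and refinement modules, and all names here are ours, not the paper's (a minimal sketch, assuming a simple absolute-difference cost and horizontal disparity):

```python
import numpy as np

def stereo_pipeline(left, right, max_disp, aggregate, refine):
    """Sketch of the generic four-step local stereo pipeline."""
    # 1) Cost initialization: absolute-difference cost volume,
    #    one slice per disparity hypothesis (np.roll is used for
    #    brevity; it wraps around at the image border).
    costs = np.stack([np.abs(left - np.roll(right, d, axis=1))
                      for d in range(max_disp)])
    # 2) Cost aggregation: spatial filtering of each slice.
    costs = aggregate(costs)
    # 3) Disparity optimization: winner-take-all over the volume.
    disparity = np.argmin(costs, axis=0)
    # 4) Disparity refinement: post-process the coarse map.
    return refine(disparity)
```

With identity placeholders for `aggregate` and `refine`, this already recovers the disparity of a synthetically shifted image pair.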
Following these four steps, the cost aggregation with local filtering proposed in this paper consists of five parts: matching-cost initialization, cost aggregation with filtering, clustering of range values for the sampling points, disparity selection, and refinement. In addition, the computational complexity is discussed.
Generally, it is possible to identify matching pairs in stereo images by measuring their similarity. The most common matching cost functions for establishing a correspondence between two points are the sum of absolute intensity differences (SAD), the sum of squared intensity differences (SSD), and the normalized cross-correlation (NCC) [
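These three similarity measures can be written down directly; a minimal sketch over two grayscale windows `p` and `q` (the helper names are ours):

```python
import numpy as np

def sad(p, q):
    """Sum of absolute intensity differences over a window."""
    return np.sum(np.abs(p - q))

def ssd(p, q):
    """Sum of squared intensity differences over a window."""
    return np.sum((p - q) ** 2)

def ncc(p, q):
    """Normalized cross-correlation; 1.0 indicates a perfect
    match up to an affine change in brightness/contrast."""
    p0, q0 = p - p.mean(), q - q.mean()
    return np.sum(p0 * q0) / (np.linalg.norm(p0) * np.linalg.norm(q0))
```

Note that NCC, unlike SAD and SSD, is invariant to linear intensity changes between the two windows.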
The cost initialization module computes the initial matching cost
The original local filtering approach computes weights with which the adjacent matching costs are averaged. The costs aggregated with these weights can therefore be expressed as
The weights
The Gaussian over the range similarity
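A brute-force sketch of this bilateral-style aggregation follows, with a spatial Gaussian times a range (intensity) Gaussian as the support weights; `sigma_s` and `sigma_r` correspond to the spatial and range standard deviations, while the function name and default values are our assumptions. The per-pixel loop is quadratic in the window radius, which is exactly the cost the proposed method avoids:

```python
import numpy as np

def asw_aggregate(costs, guide, radius=3, sigma_s=5.0, sigma_r=0.1):
    """Aggregate one cost slice with adaptive support weights:
    spatial Gaussian x range Gaussian, computed per pixel."""
    h, w = guide.shape
    out = np.zeros_like(costs, dtype=float)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))
    g = np.pad(guide, radius, mode='edge')
    c = np.pad(costs, radius, mode='edge')
    for y in range(h):
        for x in range(w):
            gw = g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            cw = c[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            # Range weight: penalize intensity differences from the
            # center pixel so that edges are preserved.
            rng = np.exp(-(gw - guide[y, x]) ** 2 / (2 * sigma_r ** 2))
            wgt = spatial * rng
            out[y, x] = np.sum(wgt * cw) / np.sum(wgt)
    return out
```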
As mentioned before, the key point of Yang’s algorithm [
Assume that pixel
Having introduced the improved cost aggregation and its complexity analysis, the algorithm for clustering the range values can be summarized as follows.
Generate the first sampling point
Generate the
Segment the pixels into two clusters
Compute a new sampling point
The number of sampling range values
Remember that Steps
At the top of the tree, the sampling points are best adapted to smooth regions; points further down the tree become gradually better adapted to edge regions.
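Under the assumption that each node's sampling point is the mean of its cluster and that the cluster is then split about this mean, as the steps above suggest, the hierarchical binary tree of range sampling points can be sketched recursively; a tree of height H yields at most 2^H − 1 sampling points (function name and split rule are ours):

```python
import numpy as np

def sample_tree(values, height):
    """Recursively build the list of range sampling points for a
    hierarchical binary tree of the given height."""
    values = np.asarray(values, dtype=float)
    if height == 0 or values.size == 0:
        return []
    s = values.mean()              # sampling point for this node
    low = values[values <= s]      # cluster of values below the mean
    high = values[values > s]      # cluster of values above the mean
    return [s] + sample_tree(low, height - 1) + sample_tree(high, height - 1)
```

For a bimodal intensity distribution, the root point lands between the modes while the children converge onto them, which matches the intuition that deeper levels adapt to edge regions.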
Figure
Hierarchical binary tree generated by the clustering algorithm.
The first three levels of sampling points
Once the matching costs have been filtered using the clustering method, the disparity optimization step computes an optimal disparity map
The coarse disparity maps generated by WTA may contain mismatches because local optimization does not obey the smoothness constraint. Therefore, a two-step post-processing method for producing fine disparity maps is proposed.
The first step is a left-right cross-checking procedure for detecting mismatches. Two corresponding disparity maps are obtained, with the left and the right images in turn as reference. The left-right consistency check then divides all pixels into stable and unstable pixels. Note that stable pixels have the same disparity value in the left and right disparity maps and that the remaining pixels are labeled unstable, represented by a value of zero for all disparity levels.
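A minimal sketch of this consistency check (the array layout, function name, and tolerance parameter are our assumptions):

```python
import numpy as np

def cross_check(disp_left, disp_right, tol=0):
    """Left-right consistency check: a pixel is stable if warping by
    its disparity lands on a right-view pixel carrying (nearly) the
    same disparity; unstable pixels are set to zero."""
    h, w = disp_left.shape
    xs = np.arange(w)
    stable = np.zeros((h, w), dtype=bool)
    for y in range(h):
        xr = xs - disp_left[y]            # matching column in right view
        ok = (xr >= 0) & (xr < w)         # warped position inside image
        stable[y, ok] = np.abs(
            disp_right[y, xr[ok]] - disp_left[y, ok]) <= tol
    out = np.where(stable, disp_left, 0)
    return out, stable
```

Pixels whose warped position falls outside the image, typically occluded border columns, are also marked unstable.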
Secondly, let
In this section, the performance of the proposed method is evaluated using the Middlebury stereo benchmark, which provides stereo images with known ground truth [
The proposed method was run with constant parameter settings for all four testing images:
The GIF-based cost-aggregation method and the proposed hierarchical clustering method were first used to aggregate matching costs. Then winner-take-all and refinement operations were used to obtain the dense disparity maps. As shown in Figure
Experimental results on the Middlebury benchmark. Dense disparity maps from the first to the last row are the “Tsukuba,” “Venus,” “Cones,” and “Teddy” images. ((a) and (b)) The results of GIF and the proposed method without the refinement procedure. ((c) and (d)) The disparity maps obtained using GIF and the proposed method with the refinement procedure. (e) Ground truth.
The corresponding quantitative results are presented in Table
Quantitative evaluation for the Middlebury image pairs.
Method   Tsukuba               Venus                 Teddy                  Cones                 AE
         Non   All   Disc      Non   All   Disc      Non    All    Disc    Non   All    Disc
HCR      1.56  1.78  8.07      0.22  0.34  2.96      6.36   11.93  15.62   2.88  8.14   8.22
GIR      1.87  2.23  7.92      0.27  0.47  2.60      6.74   12.28  16.20   2.94  8.35   8.36
ASW      1.38  1.85  6.90      0.71  1.19  6.13      7.88   13.30  18.60   3.97  9.79   8.26
HCN      2.14  2.94  9.16      1.25  1.94  9.36      7.22   15.28  17.96   3.41  12.93  9.61
GIN      2.53  3.32  8.63      1.98  3.13  15.81     8.35   16.87  18.81   3.64  12.64  9.70
DCBG     5.90  7.26  21.0      1.35  1.91  11.20     10.5   17.2   22.2    5.34  11.9   14.9

To verify algorithm stability, the performance of the GIF and the proposed methods was compared on an additional 27 Middlebury stereo images [
Evaluation for stereo methods on all 27 Middlebury stereo pairs.
Method  Aloe    Baby1   Baby2   Baby3   Bowling1  Bowling2  Cloth1  Cloth2  Cloth3  Cloth4
HCN     12.71   11.14   11.81   17.87   26.70     19.10     9.71    16.37   11.15   14.95
GIN     13.42   12.39   12.88   17.98   27.37     19.19     10.36   16.56   11.30   15.40
HCR     8.19    4.99    7.24    9.74    20.69     14.38     4.61    10.96   5.34    10.66
GIR     8.78    5.41    7.59    9.99    20.26     14.61     5.03    11.43   5.43    10.81

Method  Flowerpots  Lampshade1  Lampshade2  Midd1   Midd2   Monopoly  Plastic  Rocks1  Rocks2  Wood1
HCN     23.60       23.03       30.95       45.66   41.66   36.51     43.62    11.90   12.24   16.78
GIN     23.41       24.13       32.64       46.58   42.90   34.78     47.81    11.72   11.83   17.61
HCR     18.48       15.86       23.46       43.95   37.30   22.71     35.60    5.72    5.19    5.26
GIR     18.81       16.67       24.01       44.35   38.62   25.01     38.33    5.55    5.02    5.57

Method  Wood2   Art     Books   Dolls   Laundry  Moebius  Reindeer  AE
HCN     15.43   26.26   21.59   17.32   27.98    20.32    21.96
GIN     14.83   26.41   21.10   16.68   29.19    20.06    23.12
HCR     0.64    18.60   17.85   11.90   20.80    14.68    8.19
GIR     0.57    18.59   17.50   11.96   22.80    14.22    8.36

We have implemented two versions of the local matching filter described in this paper and tested them on the four benchmark images: a CPU version written in MATLAB and a GPU version written in CUDA. The performance numbers reported in this paper were measured on a 2.99 GHz Intel Core 2 Duo processor with 3.25 GB of memory and on a GPU (GeForce 9500GT) with 512 MB of memory. Note that all algorithms were run on the same test platform to ensure a fair comparison.
As demonstrated by the results shown in Table
Run time comparison of the GIF and the proposed method in seconds.
Version  Method  Tsukuba  Venus   Teddy   Cones
CPU      GIF     32       80      251     257
         HC      28       62      204     203
GPU      GIF     0.330    0.543   1.677   1.695
         HC      0.270    0.434   1.307   1.315
As expected, all run times increase with the dimensions of the disparity volume, where the “Tsukuba,” “Venus,” “Cones,” and “Teddy” volumes are 384 × 288 × 15, 434 × 383 × 19, 450 × 375 × 59, and 450 × 375 × 59, respectively. Our CPU implementation processes a 1-megapixel image in about 16 to 20 seconds, which is time-consuming. Owing to the simple, parallel operations used by our approach, the filter achieves significant performance gains on the GPU platform: the total time required to filter a 1-megapixel image ranges from 0.1 to 0.2 seconds, a speedup of 80× to 200× over our CPU implementation.
Consequently, the proposed approach performs slightly better than the others in terms of both accuracy and computational efficiency.
All of the stereo benchmark images used in Section
To confirm that the proposed method is robust when applied to illumination-variant stereo pairs, PBP results for the altered Tsukuba images with different weight coefficients are presented in Table
Evaluation on illumination-variant stereo pairs with different weight coefficients.

         0                     0.1                   0.5                   0.9                   1
         Non   All   Disc      Non   All   Disc      Non   All   Disc      Non   All   Disc      Non   All   Disc
−25%     3.33  4.24  12.27     3.86  4.82  13.50     15.7  16.8  27.2      40.9  41.6  48.1      67.0  67.3  69.9
−20%     3.12  4.00  11.96     3.47  4.36  12.80     17.0  18.1  27.9      40.8  41.5  46.5      60.2  60.6  62.6
−15%     2.97  3.84  11.65     3.27  4.10  11.86     18.8  19.7  29.0      44.8  45.3  46.6      55.4  55.8  55.4
−10%     3.03  3.93  11.87     3.43  4.22  11.23     18.0  18.7  25.3      41.4  41.8  41.9      44.7  45.0  43.7
−5%      3.08  4.07  12.24     3.25  4.01  10.50     12.5  13.3  19.2      25.1  25.7  27.9      26.4  27.0  29.0
0%
5%       3.33  4.32  12.47     2.25  3.09  9.79      10.9  11.7  16.3      21.3  21.9  23.5      22.4  23.0  24.2
10%      3.51  4.54  13.12     3.22  4.10  11.45     19.8  20.5  23.6      41.2  41.4  38.1      42.6  42.8  39.3
15%      3.85  4.88  13.74     3.88  4.82  13.03     20.5  21.3  26.3      49.5  49.9  46.0      52.3  52.6  48.8
20%      4.09  5.14  14.39     4.23  5.24  14.17     19.0  20.1  29.6      54.4  54.9  53.7      59.2  59.5  59.0
25%      4.54  5.62  15.32     4.67  5.76  15.37     20.0  21.1  31.6      53.1  53.7  54.3      62.0  62.5  62.6
It can be seen from Table
The first step is to discuss how tree height affects the performance of the proposed method. “Tsukuba” was chosen as the test image, and the GPU run time and PBP of the disparity maps were recorded with increasing tree height, as shown in Table
Run time and PBP of the disparity maps as tree height increases.
Height  Time (s)  Non   All   Disc
1       0.031     4.39  5.91  20.63
2       0.056     3.00  3.99  13.78
3       0.128     2.32  3.21  9.93
4       0.270     2.14  2.94  9.16
5       0.499     2.05  2.78  8.90
6       1.203     2.02  2.68  8.99
It is clear from the second column that the computation time of the proposed algorithm increases greatly with tree height, because the number of sampling points
The first three levels of weight (
However, accuracy improves only slightly, or even degrades, between
The results obtained from varying
More sampling points will be needed for good accuracy when the range spread
Using (
The filter weights (
Using (
“Tsukuba” was chosen as the test image. A fast way to determine the best choice of
The PSNR decreases as
The PSNR decreases with increasing
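For reference, the PSNR between a disparity map and ground truth can be computed as follows (a minimal sketch; taking the maximum disparity as the peak value is our assumption):

```python
import numpy as np

def psnr(disparity, ground_truth, max_disp):
    """Peak signal-to-noise ratio in dB between a disparity map and
    ground truth; higher means closer agreement."""
    mse = np.mean((disparity.astype(float)
                   - ground_truth.astype(float)) ** 2)
    if mse == 0:
        return float('inf')   # identical maps
    return 10 * np.log10(max_disp ** 2 / mse)
```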
PSNR with different parameters

        1      10     20     30     40     50     60     70     80     90
0.01    11.63  12.99  13.10  13.21  13.24  13.28  13.46  13.38  13.08  12.73
0.1     12.71
0.2     12.87         13.99  13.97  13.89  13.85  13.83  13.67  13.64  13.62
0.3     12.93         13.93  13.88  13.77  13.70  13.65  13.47  13.43  13.39
0.4     12.96         13.88  13.81  13.70  13.62  13.56  13.38  13.33  13.29
0.5     12.98         13.85  13.77  13.64  13.56  13.50  13.32  13.27  13.23
0.6     12.99         13.83  13.73  13.60  13.52  13.45  13.27  13.22  13.18
0.7     12.99         13.81  13.71  13.57  13.49  13.42  13.24  13.19  13.15
0.8     13.00         13.80  13.69  13.55  13.46  13.39  13.21  13.16  13.12
0.9     13.00         13.79  13.68  13.53  13.44  13.37  13.19  13.13  13.09
From these two findings, it can be confirmed that the optimal values for
The PBP distributions for the “non,” “all,” and “disc” disparity maps were then recorded with
All the PBP distributions behave consistently with the PSNR results; the PBP values increase as
Figure
PBP for the (a) “non,” (b) “all,” and (c) “disc” disparity maps with various
Consequently, accuracy was reduced when
In this paper, a new local solution for fast, high-quality dense stereo correspondence has been proposed, focusing on a matching-cost filtering method based on a high-performance hierarchical clustering algorithm. Instead of filtering the matching costs with an edge-preserving smoothing operator such as the popular bilateral filter, the cost-aggregation model was adjusted to compute the matching responses for all image pixels at a set of sampling points generated by a clustering method. The computational complexity of this filtering is linear both in the number of image pixels and in the number of clusters. Comparative experimental results have demonstrated that the proposed method outperforms the GIF-based matching algorithm, one of the best local methods on the Middlebury benchmark, in terms of both speed and accuracy. Moreover, the performance tests, which provide effective guidelines for parameter selection, indicate that good accuracy is highly dependent on the weight coefficient, the height of the hierarchical binary tree, and the spatial and range standard deviations. It can therefore be confirmed that the proposed approach is capable of high-speed processing and offers high-quality disparity maps for dense stereo correspondence.
The experimental results show that both the GI and HC filtering methods produce some erroneous disparity values due to lack of texture, a traditional challenge for stereo algorithms. The reason is that each pixel's disparity value is obtained by selecting the point of highest matching score, independently of the disparity assignments of neighboring pixels. Hence, most disparity values in low-texture areas may be incorrect when a local matching method is used. To overcome this bottleneck, the authors plan to make the algorithm capable of handling large untextured regions, which remains an active area for future research [
This work was supported by the open project of Beijing Key Laboratory on Measurement and Control of Mechanical and Electrical System (no. KF20121123206), Key Laboratory of Modern Measurement and Control Technology (BISTU), Ministry of Education, Funding Project for Academic Human Resources Development Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (no. PHR201106130), and Funding Project of Beijing Municipal Science & Technology Commission (no. Z121100001612011).