A Nonlocal Method with Modified Initial Cost and Multiple Weight for Stereo Matching

This paper presents a new nonlocal cost aggregation method for stereo matching. The minimum spanning tree (MST) method employs color difference as the sole component of its weight function, which often fails to give satisfactory results in boundary regions with similar color distributions. In this paper, a modified initial cost is used. Erroneous pixels are often caused by pairs of pixels, one from an object and one from the background, that have similar color distributions. Inner color correlation is therefore employed as a new component of the weight function to eliminate such errors effectively. In addition, the segmentation method of the tree structure is improved. Thus, a more robust and reasonable tree structure is developed. The proposed method was tested on the Middlebury datasets. Experimental results show that the proposed method outperforms the classical nonlocal methods.


Introduction
Dense two-frame stereo matching is one of the most extensively researched topics in machine vision. Finding corresponding points in two or more images is its most important step. After the disparities of these points are computed, the results are used to distinguish objects from the background. Moreover, depth information is derived from the resulting disparity map. Scharstein and Szeliski [1] observed that stereo algorithms generally perform the following four steps: (1) matching cost computation, (2) cost aggregation, (3) disparity computation and optimization, and (4) disparity refinement. Additionally, they separated stereo matching algorithms into local methods and global methods. Local methods require cost aggregation, which makes the disparity of a pixel more accurate and reliable than a calculation based on that pixel alone. Therefore, in local methods, the support window used for the cost aggregation of each pixel is significant. Global methods, on the other hand, construct a global energy function, so that the matching problem is replaced by an optimization problem. In these methods, the global energy function consists of a data term and a smoothness term. The former measures how well the disparity function agrees with the guidance image, whereas the latter embodies the constraints of the assumed model. An important problem for these methods, however, is balancing the two terms: it is difficult to obtain a matching result that satisfies both measures perfectly. A number of global methods have been developed, such as dynamic programming [2], graph cut [3], and belief propagation [4].
The semiglobal matching (SGM) algorithm by Hirschmüller [5] offers a good trade-off between matching accuracy and speed. SGM performs energy minimization along several 1D paths across the image and thus approximates the otherwise NP-complete two-dimensional energy minimization problem. However, its high computational complexity and memory demand are a challenge for fast implementations. SGM can be implemented relatively efficiently by parallelization schemes. Real-time designs are possible and have been reported for CPU and GPU systems [6]. There also exist some real-time embedded system designs, for example, on FPGA [7]. Schumacher and Greiner designed an FPGA architecture for SGM with higher data throughput [8].
As for local methods, the problem of finding the correspondence between two pixels can be reduced to a similarity comparison of the two local patches that surround them [9]. Hence, the problem becomes how to compute the cost value between the two surrounding patches. It is therefore necessary to gather the cost of each pixel during the cost aggregation procedure. Yoon and Kweon [10] proposed an adaptive support weight (ASW) method, which has higher matching accuracy but low efficiency. It uses large support windows for robust cost aggregation, which causes a huge computational burden [11] and fails to obtain satisfactory results on large planar surfaces.
For this reason, matching windows with an appropriate size and shape should be selected to obtain accurate results. However, the fixed-window method (shown in Figure 1(a)) is restrictive. It may produce incorrect matches in low-texture areas if the support windows are not large enough, and windows that cross the boundaries between object and background degrade the results in depth-discontinuity regions [12].
To this end, many methods for constructing matching windows have been proposed recently. For instance, Qu et al. [13] presented an algorithm that filters out the inappropriate pixels around the matching point by using the color similarity of the pixels around the central matching point. The remaining pixels, which are helpful to the matching point, form the adaptive support window. Zhang et al. [14] proposed a cross-based structure (Figure 1(b)) and constructed it as an adaptive support window by comparing the color similarity of adjacent pixels. Both methods calculate the disparity of pixels with the assistance of adaptive support windows, which makes the operations more specific and suitable than approaches that use a predefined fixed-size window. These computations, however, depend on the construction of each support window, and the time consumed by cost aggregation still does not satisfy the real-time requirement. Therefore, Mei et al. [11] designed an accurate stereo matching system with an accelerated CUDA implementation on the basis of the previously proposed methods, which significantly improved the efficiency of the algorithm with the help of hardware.
Recently, Yang [15] proposed a nonlocal cost aggregation (NLCA) method and then relied on it to perform tree-based filtering [16]. The NLCA algorithm is a novel cost aggregation method that operates on a tree structure instead of support windows. It has also been demonstrated to outperform traditional window-based cost aggregation methods in terms of both speed and accuracy. In the NLCA algorithm, the nodes of the tree are all the image pixels, and the candidate edges are all the edges between nearest neighboring pixels. The similarity between any two pixels is decided by their shortest distance on the tree. All the pixels are connected to form a tree, as shown in Figure 1(c); each node is aggregated directly only with its parent and children, and every node on the tree thereby contributes to the final result. Hence, both accuracy and efficiency are improved by this method. Nevertheless, this method does not perform well when the scene contains boundaries between object and background areas with similar color distributions, because it uses color correlation as the only component of the weight function.
Mei et al. [17] proposed segment-tree cost aggregation (STCA), which segments the guidance image into several independent trees and then links the independent segment graphs to form the segment-tree structure. In addition, they selected the initial depth as a new component when computing the weight function. This method involves a new process and leads to consistent scene segmentation, and only one judgement condition is adopted during the three-step image segmentation process. More recently, a cross-scale framework that unifies aggregation-based algorithms was also proposed [18]. With the proposed color-depth weight, Peng et al. [19] further rebuilt the tree iteratively to improve the matching efficiency in textureless regions. In addition, based on a minimum spanning tree, Pham et al. [20] proposed a robust nonlocal stereo matching algorithm that improves the performance of nonlocal approaches on outdoor driving images.
In this paper, we propose an improved nonlocal cost aggregation algorithm that modifies the original algorithm in both cost computation and cost aggregation. The additional vertical gradient is used as one of the components of the initial cost of each pixel. We also employ a known robust function [21] to deal with outliers. Furthermore, we add the inner color correlation and combine it with the color correlation, and we compute the weight function from the mixture of both correlations. Moreover, to segment the guidance image more reasonably, we also provide a new segmentation method.
We evaluate our proposed method on the standard and extra Middlebury datasets and compare it with ST and MST. Experimental results show that our method achieves acceptable disparity accuracy, especially in some representative regions. The average number of erroneous pixels around discontinuous regions is reduced, while the disparities of flat regions become more stable. Compared with NLCA and STCA, a performance evaluation on the Middlebury datasets shows that the proposed method has a higher correct matching rate. With our method, the percentage of matching errors declined to between 5% and 15%. Additionally, the computational cost of the new segmentation method can usually be ignored, and the inner color correlation employed in our cost aggregation procedure has only a weak impact on the computational complexity, because its complexity is of the same order of magnitude as that of the color correlation. Therefore, the total computational complexity remains of the same order as that of the STCA algorithm, while the result is slightly improved.
The main contribution of this paper is to improve the original nonlocal cost aggregation method with the following advantages:
(1) It has higher accuracy because the vertical gradient is added as one of the components in the cost computation. It is shown to perform better in some discontinuous areas. Its initial cost is also more stable owing to the robust function used to handle outliers.
(2) Inner color correlation is employed in the computation of the weight function to make the construction of the tree structure more robust and reasonable.
(3) The segmentation method of STCA is improved and achieves a better result. Moreover, irrelevant pixels contribute less to each other.
The rest of this paper is organized as follows. In Section 2, we briefly introduce related work on local methods. Then, our proposed improved method is described in Section 3. Section 4 describes and analyzes the experimental results, and Section 5 discusses the parameter settings. Finally, we conclude in Section 6.

Related Work
Cost aggregation, which consists of constructing support regions and aggregating the cost of each pixel within those support regions, is one of the important processes in stereo matching. Efficiency and effectiveness depend on the aggregation method used and therefore differ from method to method. In this section, we review the related work on cost aggregation, especially the traditional local methods and the nonlocal cost aggregation methods based on a tree structure.

The Traditional Local Methods.
The simplest local method of cost aggregation uses a fixed support window with a fixed weight for each pixel. However, this method fails in many specific regions, including occlusion regions and low-texture areas. Furthermore, it is not robust, and its matching accuracy falls well short of the ideal result. To resolve this dilemma, there are usually two approaches: (1) make the fixed support window alterable by using shiftable windows, multiple windows [22], or variable windows [23,24] or (2) concentrate on varying the weights to achieve better matching accuracy.
The algorithms based on adaptive weights treat every pixel in the support window as a unique unit and calculate its weight with respect to the central point individually. A pixel has a strong effect on the final result only if it is similar to the central point. Hence, every pixel receives proper contributions from all its neighboring pixels. This approach blurs the boundary between local methods and global methods because of its remarkable accuracy and its obvious increase in computational cost.
Yoon and Kweon [10] first proposed an adaptive weight method, and Gu et al. [25] further enhanced it by introducing the rank transform and disparity refinement. Tombari et al. [26] obtained the cost value after segmenting the image with the Meanshift [27] algorithm, which markedly changes the performance of the ASW algorithm in repetitive-texture regions and discontinuous regions. Hosni et al. [28] enforced connectivity by using the geodesic distance transform; nevertheless, the computational efficiency of their strategy remains similar to that of the other methods.
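The adaptive-weight idea reviewed above can be illustrated with a short sketch. It assumes the commonly cited form of the ASW weight, in which the weight of a neighbor decays exponentially with its color distance and its spatial distance to the central pixel. The function names and the values of `gamma_c` and `gamma_p` are illustrative assumptions, and the sketch uses the weights of the reference image only, which is a simplification.

```python
import math


def support_weight(color_p, color_q, pos_p, pos_q, gamma_c=7.0, gamma_p=36.0):
    """Weight of neighbor q with respect to the central pixel p.

    Assumed form: exp(-(color_distance / gamma_c + spatial_distance / gamma_p)).
    gamma_c and gamma_p are illustrative values, not taken from the paper.
    """
    color_distance = math.dist(color_p, color_q)
    spatial_distance = math.dist(pos_p, pos_q)
    return math.exp(-(color_distance / gamma_c + spatial_distance / gamma_p))


def aggregate(raw_costs, weights):
    """Weighted average of the raw matching costs inside one support window."""
    total_weight = sum(weights)
    return sum(w * c for w, c in zip(weights, raw_costs)) / total_weight
```

A neighbor with the same color and position as the central pixel receives weight 1, and the weight shrinks as either distance grows, so dissimilar or distant pixels contribute little to the aggregated cost.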

Nonlocal Cost Aggregation Based on Tree Structure.
Even though great progress has been made in local algorithms, they still aggregate pixels within local regions. As mentioned above, a nonlocal cost aggregation (NLCA) method has been proposed that breaks through the boundary between local and global methods. This method transforms the guidance image into a graph and constructs a tree structure so that all the image pixels become the nodes of the tree. Before aggregating, a minimum spanning tree (MST) must be constructed. The nodes attached to the edges with the lowest weights (calculated from differences in color) are connected to one another until all the pixels are included in the tree. Converting the guidance image into a cost tree once all the pixels have been connected is an important step. Then, the whole process is separated into three steps: (1) traversing the cost tree, (2) assigning an appropriate value to each node, and (3) calculating each node's disparity level with its relatives. After the tree structure is constructed, the aggregated costs can be computed efficiently by executing a tree filter, which traverses the MST from the leaf nodes to the root node and then from the root node to the leaf nodes. Hence, the aggregation is complete after only two tree traversals, and every pixel receives a larger or smaller contribution from every node in the constructed tree. Based on the tree structure, some effective disparity refinement methods are proposed as follows.
Chen et al. [29] improved the NLCA by adding depth information to the weight function, which improves the results in regions around borders. Mei et al. [17] proposed a new segment-tree (ST) method that divides the construction of the tree structure into two rounds. In the first round, it merges subtrees in homogeneous regions and keeps subtrees that belong to different regions separate from each other if they violate the predefined criterion. In the second round, to ensure that the different regions have little impact on each other, it connects the remaining subtrees with a penalty value. However, the segmentation performance is not robust because the segmentation criterion is too simple. Therefore, the performance of this method falls short of expectations.
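The two-round construction described above can be sketched as follows. This is a minimal sketch, not the authors' implementation. It assumes grayscale intensities, a 4-connected grid, and a merge criterion of the usual segment-tree form, in which an edge of weight w joins two subtrees only if w does not exceed the largest internal edge weight of either subtree plus k divided by that subtree's size. The function names `grid_edges` and `build_segment_tree` and the `penalty` value are hypothetical. With a very large k every edge passes the first round, so the result reduces to the plain MST used by NLCA.

```python
def grid_edges(image):
    """Edges (weight, u, v) between 4-connected neighbors of a grayscale image."""
    h, w = len(image), len(image[0])
    edges = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                edges.append((abs(image[y][x] - image[y][x + 1]), y * w + x, y * w + x + 1))
            if y + 1 < h:
                edges.append((abs(image[y][x] - image[y + 1][x]), y * w + x, (y + 1) * w + x))
    return edges


def build_segment_tree(num_nodes, edges, k, penalty):
    """Return the tree edges chosen in two rounds (assumed criterion, see lead-in)."""
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    internal = [0.0] * num_nodes  # largest edge weight inside each subtree

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b, w):
        if size[a] < size[b]:
            a, b = b, a
        parent[b] = a
        size[a] += size[b]
        internal[a] = max(internal[a], internal[b], w)

    tree, rejected = [], []
    # Round 1: Kruskal-style scan that merges only sufficiently similar subtrees.
    for w, u, v in sorted(edges):
        a, b = find(u), find(v)
        if a == b:
            continue  # the edge would close a cycle
        if w <= min(internal[a] + k / size[a], internal[b] + k / size[b]):
            union(a, b, w)
            tree.append((w, u, v))
        else:
            rejected.append((w, u, v))
    # Round 2: link the remaining subtrees, penalising the connecting edges.
    for w, u, v in rejected:
        a, b = find(u), find(v)
        if a != b:
            union(a, b, w + penalty)
            tree.append((w + penalty, u, v))
    return tree
```

On a one-row image such as `[[10, 11, 12, 200, 201, 202]]` with a small k, the first round groups the three dark and the three bright pixels separately, and the second round links the two groups through the penalised edge.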

Our Proposed Method
Our work is directly motivated by the above two nonlocal cost aggregation methods. We further improve these methods in the cost computation and tree construction processes, respectively. We include the vertical gradient as a new component in the cost computation. In addition, owing to its stability and versatility, inner color correlation is employed rather than a single color component. Moreover, we modify the structure of the segment tree, which improves its validity and robustness. In this section, we divide our method into five parts: (1) cost computation, (2) tree construction, (3) cost aggregation, (4) disparity computation and refinement, and (5) computational complexity. More details can be found in the following subsections.

Cost Computation.
Traditional nonlocal methods employ the truncated absolute differences of the color and of the horizontal gradient as the initial cost. However, the performance of this cost measure is unstable in marginal areas. Hence, we additionally employ the vertical gradient so that the cost measure captures a more detailed description of the reference images. We first compute the individual cost values $C_c(p, d)$, $C_{g_x}(p, d)$, and $C_{g_y}(p, d)$ for a pixel $p = (x, y)$ in the guidance image at a disparity level $d$. Let $I_i$ denote the RGB color component in channel $i$. $C_c(p, d)$ is defined in (1) as the average, over the channels $i$, of the absolute difference between $p$ and its corresponding pixel in the other image. Then, we compute the gradient cost values $C_{g_x}(p, d)$ and $C_{g_y}(p, d)$ using (2) and (3), respectively. Our proposed method already works well when truncated values are used to discard the extrema of the initial cost; however, the improvement that truncation yields is not obvious. Therefore, we employ the robust function of [21] to handle the outliers, as shown in (4), where $C^{tran}_c$ and $C^{init}_c$ denote the final and initial cost values of the color, respectively, and $C^{tran}_g$ and $C^{init}_g$ denote the final and initial cost values of the gradient, respectively. In addition, $\lambda_c$ and $\lambda_g$ are user-specified parameters for adjustment: the former is related to the color and the latter to the gradient. $\lambda_c$ is set to 7, and $\lambda_g$ is set to 2 in our experiments. The effect of this function declines smoothly once the initial cost reaches a certain value, and the final cost value converges to 1 under the control of $\lambda$. By using the three cost components together, the final initial cost value is expressed in (5), where $\alpha$ and $\beta$ are the weights of the components. Figure 2 shows a comparison between the traditional cost computation and our method, which demonstrates the improvement in the discontinuous regions after the vertical gradient is added.
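The cost computation described above can be sketched in a few lines. Because the displayed equations are not reproduced in this text, several points are assumptions rather than the authors' definitions: the saturating function is taken to be 1 - exp(-c / lambda), which is 0 for a zero cost and tends to 1 for large costs; gradients are central differences of the channel mean; and the horizontal gradient receives the remaining weight 1 - alpha - beta. The values 7 and 2 for the two lambda parameters come from the text, whereas alpha = beta = 0.2 is only an illustrative choice. All function names are hypothetical.

```python
import math


def robust(cost, lam):
    """Assumed saturating function: 0 at cost 0, tends to 1 as the cost grows."""
    return 1.0 - math.exp(-cost / lam)


def gray(pixel):
    """Mean of the three color channels."""
    return sum(pixel) / 3.0


def grad_x(img, x, y):
    """Horizontal central difference with border clamping."""
    w = len(img[0])
    return (gray(img[y][min(x + 1, w - 1)]) - gray(img[y][max(x - 1, 0)])) / 2.0


def grad_y(img, x, y):
    """Vertical central difference with border clamping."""
    h = len(img)
    return (gray(img[min(y + 1, h - 1)][x]) - gray(img[max(y - 1, 0)][x])) / 2.0


def initial_cost(left, right, x, y, d, alpha=0.2, beta=0.2, lam_c=7.0, lam_g=2.0):
    """Initial cost of left pixel (x, y) at disparity d (assumed combination)."""
    xr = max(x - d, 0)  # candidate column in the right image, clamped at the border
    c_color = sum(abs(a - b) for a, b in zip(left[y][x], right[y][xr])) / 3.0
    c_gx = abs(grad_x(left, x, y) - grad_x(right, xr, y))
    c_gy = abs(grad_y(left, x, y) - grad_y(right, xr, y))
    return (alpha * robust(c_color, lam_c)
            + beta * robust(c_gy, lam_g)
            + (1.0 - alpha - beta) * robust(c_gx, lam_g))
```

Because each component passes through the saturating function before weighting, a single outlier in the color or gradient difference cannot dominate the combined cost, which stays between 0 and 1.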

Tree Construction.
According to Yang's contribution [15], we treat the guidance image $I$ as a graph $G = (V, E)$ in this paper, where each node in $V$ denotes the corresponding pixel in $I$ and each edge in $E$ connects two neighboring nodes and carries a weight. A flow chart of the construction of our tree structure is shown in Figure 3. The weight $w_e$ of an edge $e$ is determined by the two nodes $p$ and $q$ that it joins, as described in (6), in which the predefined weight of the inner color correlation is set to 0.2 in this paper; the inner color correlation itself is defined in (7). Then, the edges in $E$ are sorted in ascending order according to their weights, and a subtree is created for each node in $V$.
In the criterion that decides whether two subtrees are merged, $w_e$ denotes the weight of the edge $e$ that connects the two nodes $p$ and $q$, and the weight sequences of the edges in the subtrees $T_p$ and $T_q$ that contain them are also taken into account. $w_{Avg}$ denotes the average weight of all the edges, and $k$ is a predefined parameter. We employ $w_{Avg}$ and divide the criterion into two cases, which guarantees that the constraint is not lost in boundary regions with high weights and makes the segmentation of the tree more precise and robust.
After all the edges have been traversed, a large number of subtrees have been merged into new subtrees that are larger in structure but fewer in number.
Note that the integrated graph $G$ has now been segmented into several smaller pieces. We then traverse the edges once again and merge the remaining subtrees. Meanwhile, we add a penalty value to the weights of these edges to ensure that boundary regions do not interact with each other. Finally, all the nodes are assembled into a segment tree $T$, in which there is only one path between any two nodes. The segment tree $T$ is used to aggregate the final cost value.

Cost Aggregation.
The nonlocal cost aggregation method is a linear-time method with extremely low computational complexity. We employ a weighting function $S(p, q)$ to compute the contribution of pixel $q$ to pixel $p$, as given in (10):

$$S(p, q) = \exp\left(-\frac{D(p, q)}{\sigma}\right),$$

where $D(p, q)$ denotes the distance from $p$ to $q$ in the tree structure, which is built from the edge weights in (6), and $\sigma$ is a predefined parameter for adjustment. Because our initial matching cost differs from that of the original methods, $\sigma$ is set to 0.08 in our experiments; the setting of $\sigma$ is discussed in Section 5. Let $C_d(p)$ denote the cost value of pixel $p$ at disparity level $d$; the aggregated cost value $C^A_d(p)$ is computed as

$$C^A_d(p) = \sum_{q \in G} S(p, q)\, C_d(q),$$

where $G$ denotes the whole graph, so that $C^A_d(p)$ is aggregated over all the nodes in $G$. Yang employs a tree filter to compute the cost aggregation, which traverses the tree structure from leaves to root and then from root to leaves [15], as shown in Figure 4. A node is affected by all the other nodes in the segment tree $T$ but aggregates directly with only its children and parent. For a pixel $p$, the aggregated value in the leaves-to-root pass is calculated as

$$C^{A\uparrow}_d(p) = C_d(p) + \sum_{q \in \mathrm{Child}(p)} S(p, q)\, C^{A\uparrow}_d(q),$$

where the set $\mathrm{Child}(p)$ contains the children of node $p$, and the computation for a node can be completed only after its child nodes have been computed. After this pass, every node has been aggregated with all of its descendants. Then, the tree structure is traversed from root to leaves, and the final aggregated cost value of pixel $p$ is computed as

$$C^A_d(p) = S(\mathrm{Parent}(p), p)\, C^A_d(\mathrm{Parent}(p)) + \left[1 - S^2(p, \mathrm{Parent}(p))\right] C^{A\uparrow}_d(p),$$

where $\mathrm{Parent}(p)$ denotes the parent node of pixel $p$. After that, all the pixels have obtained a reliable aggregated cost. The computational complexity is $O(N \cdot L)$, where $N$ denotes the number of pixels in the guidance image and $L$ denotes the number of disparity levels.
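The two-pass tree filter of [15] described above can be sketched for a single disparity level as follows. The sketch assumes that the similarity between a node and its parent is exp(-w / sigma) for an edge of weight w and that the tree is given as a parent array in which the root is node 0 and every parent has a smaller index than its children. The function names are hypothetical, and `brute_force` is included only to check that two passes reproduce the direct sum over all node pairs.

```python
import math


def tree_filter(parent, edge_w, cost, sigma):
    """Two-pass aggregation on a rooted tree for one disparity level.

    parent[v] < v for v >= 1, node 0 is the root, and edge_w[v] is the weight
    of the edge between v and parent[v] (edge_w[0] is unused).
    """
    n = len(cost)
    sim = [0.0] + [math.exp(-edge_w[v] / sigma) for v in range(1, n)]
    # Pass 1, leaves to root: each node collects the support of its subtree.
    up = list(cost)
    for v in range(n - 1, 0, -1):
        up[parent[v]] += sim[v] * up[v]
    # Pass 2, root to leaves: add the support coming from outside the subtree.
    agg = list(up)
    for v in range(1, n):
        agg[v] = sim[v] * agg[parent[v]] + (1.0 - sim[v] ** 2) * up[v]
    return agg


def brute_force(parent, edge_w, cost, sigma):
    """Direct sum of exp(-D(p, q) / sigma) * cost[q] over all q, for checking."""
    n = len(cost)

    def distances_to_ancestors(v):
        dist, d = {v: 0.0}, 0.0
        while v != 0:
            d += edge_w[v]
            v = parent[v]
            dist[v] = d
        return dist

    out = []
    for p in range(n):
        dp = distances_to_ancestors(p)
        total = 0.0
        for q in range(n):
            dq = distances_to_ancestors(q)
            # The path length is minimised at the lowest common ancestor.
            d = min(dp[a] + dq[a] for a in dp if a in dq)
            total += math.exp(-d / sigma) * cost[q]
        out.append(total)
    return out
```

The filter touches every node twice, whereas the direct sum visits every pair of nodes, which is why the tree formulation remains linear in the number of pixels for each disparity level.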

Disparity Computation and Refinement.
The universal winner-takes-all strategy is employed to select the disparity level of each pixel, namely, the level that carries the lowest aggregated matching cost: $D(p) = \arg\min_{d \in \mathrm{dislevel}} C^A_d(p)$, where the set dislevel denotes the disparity levels. We then employ the tree structure to refine the coarse disparity map. First, we use the left and right images as guidance images, respectively, and execute the tree filter twice to obtain two corresponding disparity maps. Then, we employ left-right consistency checks to mark the mismatched pixels and store them in a mismatch set. For the left disparity map $D$, the cost value $C^{new}(p, d)$ for each pixel $p$ at each disparity $d$ is then recalculated, where $D(p)$ denotes the initial disparity of pixel $p$. This method reuses the tree structure mentioned above to execute the tree filter, so creating the new model requires no extra computation. The total running time is spent on recalculating the cost values and executing the tree filter. Furthermore, all the pixels with unstable disparity are marked as mismatched pixels.

Table 2: Comparison of the four nonlocal algorithms (MST [15], ST-1 [17], ST-2 [17], and the proposed method) on the Middlebury datasets under the standard benchmark. The error threshold is set to 1, and three regions (nonocc, all, and disc) are used to evaluate the performance of the methods. Our proposed method exhibits the best accuracy in every region.

The proposed method achieves superior results, yielding a more accurate disparity map and more reliable boundaries. In Lampshade1, the results are adversely affected by illumination. Although the other methods fail to detect the true boundaries, our method produces a better result. For example, the boundaries of the yellow trapezoid block are extremely close to the ground-truth map. In another dataset, nearly the entire image has a similar color intensity. Therefore, it is crucial to obtain a reasonable result in the discontinuous regions. Unfortunately, all the other methods fail to detect clear boundaries on these datasets. However, the percentage of erroneous pixels declined to 2.49% with our proposed method, which improves on the other nonlocal methods.
We mentioned the computational complexity in Section 3.5. In this section, we measure the average time consumption of each nonlocal method on 4 datasets. The results are listed in Table 4. Most of the time is consumed by tree construction, and the tree filter requires only a small amount of time. MST has the shortest running time among the four methods, while our proposed method is slightly faster than ST-2. The superiority of the proposed method over the MST, ST-1, and ST-2 methods is demonstrated by the experimental results (Tables 2 and 3, Figures 5 and 6). Moreover, compared with MST and ST-1, the overall runtime of our proposed method does not increase noticeably and is even shorter than that of ST-2. Compared with the color-gradient based matching cost computation method proposed by Rhemann et al. [31], our method also has higher accuracy.

Parameter Setting
Several parameters are used in our proposed method. The parameters $\lambda_c$ and $\lambda_g$, the user-specified color and gradient parameters used for adjustment in (4), follow the truncation values in [31], while the predefined parameter $k = 1200$ in the tree construction follows the settings of the segment-tree method [17]. In this section, we discuss the rationale and sensitivity of the remaining four parameters: the weights of the components ($\alpha$ and $\beta$) in the initial cost computation, the predefined weight of the inner color correlation in the tree construction, and the adjustment value $\sigma$ of the weight function. First, we test the adjustment value $\sigma$ of (10). The results are shown in Figure 7(a). When $\sigma \in [0.04, 0.14]$, the error rates for most of the images are low and vary only slightly. The number of erroneous pixels reaches its minimum when $\sigma \in [0.06, 0.08]$, which is due to the change in the initial cost value. We employ the robust function of (4) to protect the initial cost value from extreme values, so that the initial cost value converges to 1. Because the initial cost value is adjusted in this way, the parameter $\sigma$ must be adjusted accordingly; otherwise, disparity boundaries will be unclear and foreground objects will be confused with the background.
As for the weight of the inner color correlation, the parameter range in this experiment is 0 to 1. More details are shown in Figure 7(b). The experimental results show that employing inner color correlation is clearly reasonable, but its weight should be confined to 0.5 or below.
Figure 7(c) evaluates the sensitivity of the initial component weights $\alpha$ and $\beta$ on the four original Middlebury datasets; for clarity, the final results (percentage of erroneous pixels) are processed by an exponential function. The figure shows that the algorithm achieves its best performance when $\alpha$ and $\beta \in [0.15, 0.3]$. The range of parameters that achieves good performance is much larger than that of the original nonlocal methods. Figure 7 further demonstrates that employing the robust function of (4) helps to resolve the errors caused by outliers more effectively and robustly than the methods described above, which use truncated values.

Conclusion
In this paper, our work is directly motivated by two original algorithms [15,17]. We propose an improved nonlocal cost aggregation algorithm based on them. The proposed method uses a modified initial cost and multiple weights for stereo matching, and it modifies the original algorithm in both cost computation and cost aggregation. Our method has several advantages. First, it has higher accuracy because the vertical gradient is added as one of the components in the cost computation. In particular, its performance near some discontinuous areas is much better than that of the other methods. Second, owing to its stability and versatility, inner color correlation is employed rather than a single color component, which makes the construction of the tree structure more robust and reasonable. In addition, we modify the structure of the segment tree.
The performance was tested on a PC with a 3.40 GHz CPU and 4 GB of memory. The proposed method was evaluated on the Middlebury datasets. The experimental results verified that our proposed method achieves better accuracy at the cost of a minor increase in execution time. In the near future, we would like to focus on more novel tree structures. We will also continue to study nonlocal methods and image segmentation and propose new ideas to resolve the issues mentioned above.

Figure 1: The support regions of cost aggregation. The red circle denotes the central pixel, and the red squares denote the pixels in the support region. The blue pixels are irrelevant. (a) Fixed support window; (b) cross-based support window; and (c) tree structure.

Figure 2: Cost measure comparison. (a) The input image; the black boxes mark the target areas. (b) Insets of the target areas; (c) and (d) show the results of the traditional cost measure and our method, respectively.

Figure 4: The tree filter for cost aggregation; $p$ denotes the matching pixel. (a) Leaves-to-root pass; (b) root-to-leaves pass.

Figure 5: The final disparity maps of the four most common images in the standard Middlebury datasets. (a) shows the guidance images. From top to bottom, these are Tsukuba, Venus, Teddy, and Cones. Subfigures (b) to (e) show the disparity maps computed by different nonlocal methods: (b) shows the results of MST [15]; (c) and (d) show the results of the two segment-tree cost aggregations [17], respectively; and (e) shows the results of our proposed method.

Figure 6: The final disparity maps of the extra Middlebury datasets. Four representative images were selected to show the superiority of our proposed method. (a) shows the guidance images. From top to bottom, these are Baby3, Flowerpots, Lampshade1, and Laundry. Subfigures (b) to (e) show the disparity maps computed by different nonlocal methods: (b) shows the results of MST [15]; (c) and (d) show the results of the two segment-tree cost aggregations [17], respectively; and (e) shows the results of our proposed method.

Figure 7: Parameter sensitivity analysis of our experiments. Four standard datasets (Tsukuba, Venus, Teddy, and Cones) were used in this experiment. (a) is a line chart of the percentage of erroneous pixels as the parameter $\sigma$ increases from 0.02 to 0.22, whereas (b) is a line chart of the changes as the weight of the inner color correlation increases from 0 to 1. Avg denotes the average percentage of erroneous pixels over the three evaluation regions (nonocc, all, and disc), and All denotes the four standard datasets. (c) shows the average number of erroneous pixels over the four standard datasets for different weights of the initial components; an exponential function is employed to make the results more intuitive; $\alpha \in [0, 0.4]$ denotes the weight of the color cost, and $\beta \in [0, 0.4]$ denotes the weight of the vertical gradient cost.

Table 3: Comparison of the four nonlocal algorithms (MST [15], ST-1 [17], ST-2 [17], and the proposed method) on 16 extra Middlebury datasets. The error threshold is set to 1, and only nonoccluded regions are used to evaluate the performance of the methods.

Table 4: Average time consumption of each nonlocal method on 4 Middlebury datasets.