An Occlusion Approach with Consistency Constraint for Multiscopic Depth Extraction



Introduction
Augmented reality has many applications in several domains, such as games or medical training. Autostereoscopic display, meanwhile, is an emerging technology which adds a perception of depth, enhancing the user's immersion. Augmented reality can be applied to autostereoscopic displays in a straightforward way by adding virtual objects to each image. However, it is much more interesting to use the depth-related information of the real scene, so that virtual objects can be hidden by real ones.
To that end, we need to obtain one depth map for each view. The particular context of images destined to be viewed on autostereoscopic displays allows us to work with a simplified geometry (e.g., no rectification is needed, epipolar pairs are horizontal lines of the same rank, and disparity vectors are thus aligned along the abscissa). However, our aim is to obtain a good assessment of depth in all kinds of scenes, without making any assumption about their contents. Indeed, images may have homogeneous colors as well as varied colors. Also, due to the principle of autostereoscopic displays, a user sees two images at the same time. It is then crucial to have strongly consistent depth maps: for example, if a virtual object is drawn in front of a real object in one view, it has to be drawn in the same order in all views. Therefore we introduce a new occlusion approach for multiview stereovision algorithms, which aims to ensure the consistency of the depth maps.
We propose an example of the application of our approach in a correlation-based method and in a symmetrical graph-cuts-based method. Finally, we discuss their results.

Related Work
Stereovision algorithms aim to find disparity maps, from which depth maps are deduced. That is why we will use the phrase "disparity maps" instead of "depth maps" in the following. Depth maps can be easily obtained from disparity maps using a triangulation step.
Let us assume we have a set of N images, numbered from the left (0) to the right (N − 1), shot with a parallel capture system specifically designed for autostereoscopic displays. Figure 1 illustrates these shooting conditions with N = 4; f is the focal distance and dioc is the Distance Intra Optical Centers. I_i designates the set of pixels of the ith image. d_i(p) is a function associating a disparity vector with any pixel p from I_i. This vector is the difference between the coordinates of the corresponding pixel of p in image i + 1 and those of p in image i. The corresponding pixel of p in the next image is then at position p + d_i(p). Since we are using a simplified geometry, d_i(p) is of dimension one. It is also an integer (d_i(p) ∈ Z), so that we do not have to deal with subpixels. Moreover, it is possible to find the corresponding pixel of p in any image using the single disparity d_i(p): in any image k, the corresponding pixel is given by p + (k − i)d_i(p). For example, the corresponding pixel of p in the previous image is p − d_i(p).
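Under this simplified geometry, locating correspondents reduces to integer arithmetic on the abscissa. The following minimal sketch (the function name `corresponding_x` is ours, not the paper's) illustrates the rule p + (k − i)d_i(p):

```python
def corresponding_x(x, i, k, d):
    """Abscissa of the pixel in image k corresponding to the pixel at
    abscissa x in image i, given the integer disparity d = d_i(p).
    With the rectified parallel geometry, only the x-coordinate
    changes: x_k = x + (k - i) * d."""
    return x + (k - i) * d
```

For instance, moving one image to the right adds d, and moving one image to the left subtracts it.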
Optical flow algorithms are based on a cost L(p, q), which evaluates the color dissimilarity between two pixels p and q. In the case of color images, it is given by

L(p, q) = |r_p − r_q| + |g_p − g_q| + |b_p − b_q|,    (1)

where r_p, g_p, and b_p (resp., r_q, g_q, and b_q) are the red, green, and blue components of p (resp., q). Several methods use the multiview aspect in order to make such algorithms more robust and less sensitive to noise. The local cost M^i_α of a pixel p (∈ I_i) according to disparity α is then given by

M^i_α(p) = Σ_{k=0, k≠i}^{N−1} L(p, p + (k − i)α),    (2)

where the second argument of L is taken in image k. The reader can refer to Scharstein and Szeliski [1] for a complete taxonomy of dense stereovision algorithms. In many recent publications in this domain, authors use color segmentation in their methods [2-5]. However, color segmentation and other primitive extraction methods are independent for each view. It is then impossible to ensure the consistency between the disparity maps. Moreover, it may be a cause of errors when applied to images with mainly homogeneous colors. Therefore these methods are incompatible with the objective presented previously, which imposes working in a local colorimetric context. We cannot make any assumption about the content of the images and, thus, we cannot extract other features. So we aim to make up for this lack of information by taking advantage of redundancies in the N images.
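As an illustration, the dissimilarity L and the multiview cost M^i_α of (1) and (2) can be sketched as follows (the function names and the out-of-bounds convention are our assumptions; the paper does not discuss image borders):

```python
import numpy as np

def L(p_rgb, q_rgb):
    """Dissimilarity of (1): absolute differences summed over the
    three color components."""
    return sum(abs(int(a) - int(b)) for a, b in zip(p_rgb, q_rgb))

def M(images, i, x, y, alpha):
    """Multiview local cost of (2) for pixel (x, y) of image i and
    disparity alpha: dissimilarities accumulated against every other
    view, sampling image k at abscissa x + (k - i) * alpha.
    `images` is a list of H x W x 3 arrays; out-of-bounds samples are
    skipped (one possible convention)."""
    _, w, _ = images[i].shape
    cost = 0
    for k in range(len(images)):
        if k == i:
            continue
        xk = x + (k - i) * alpha
        if 0 <= xk < w:
            cost += L(images[i][y, x], images[k][y, xk])
    return cost
```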
A lot of algorithms deal with occlusions in order to obtain better disparity maps that preserve discontinuities at object boundaries. The first step in dealing with occlusions is to be able to detect them. Egnal and Wildes [6] compare five approaches. Some of them are based on the idea that color discontinuities correspond to depth discontinuities; these are called photometry-based approaches. Alvarez et al. [7] use the gradient of the gray levels in order to locally adjust the smoothness constraint of their energy: the lower the value of the gradient, the stronger the smoothness constraint. On the other hand, geometry-based approaches use disparities in order to detect occlusion areas. The reader can refer to [6] for a complete comparison of such methods. We prefer geometry-based approaches to photometry-based ones, since they do not make any assumption on colors and they allow the disparity maps from the N images to interact and to be linked. This link ensures the consistency of the disparities and makes up for the lack of information previously discussed. The most widely used geometry-based method is the Left-Right Checking (LRC) approach [8-10]. The principle is that a pixel should match a pixel from another image with the same disparity; otherwise an occlusion occurs. In the case of two images (numbered 0 and 1), this can be expressed by

ε_0(p) = |d_0(p) − d_1(p + d_0(p))|,    (3)

where d_1 denotes the disparity of image 1 taken with the same sign convention as d_0, and ε_i(p) is close to zero when there is no occlusion and high when pixel p is occluded in image i. There have been several attempts to improve the robustness of this method [11, 12]. However, the original LRC is still the most popular approach [2, 5, 13, 14]. We propose an LRC-based approach, which differs in several points: (i) it is extended to the multiview context, and (ii) it ensures a geometric consistency between the depth maps.
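A minimal sketch of two-view LRC along one epipolar line, under the sign convention stated above (the function name is ours):

```python
def lrc_error(d01, d10, x):
    """Left-Right Checking for two views along one epipolar line.
    Pixel x of image 0, with disparity d01[x], maps to q = x + d01[x]
    in image 1; d10 gives, for each pixel of image 1, the disparity of
    its match in image 0 under the same sign convention, so consistency
    means d10[q] == d01[x]. Returns 0 when consistent, a large value
    when pixel x is likely occluded."""
    q = x + d01[x]
    if not (0 <= q < len(d10)):
        return float('inf')       # correspondent leaves the other view
    return abs(d01[x] - d10[q])
```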
After the detection step, the main difficulty is to handle occlusions in the matching algorithms. Woetzel and Koch [15] propose a correlation-based algorithm which does not add up dissimilarity costs over the whole set of image pairs but over a subset of it. They replace (2) by

M^i_α(p) = Σ_{(k,l)∈S} L(p + (k − i)α, p + (l − i)α),    (4)

where S is the set of chosen image pairs. The authors propose two methods to choose the set S. The first one is to select the furthest left or the furthest right images. The second one is to select the pairs with the smallest L costs. This method reduces the impact of occlusions on the results but introduces a lot of errors in the images which are the furthest away.
There are two categories of energy-minimization-based methods that perform the matching while taking occlusions into account.
The first category contains iterative methods [8, 10] based on Algorithm 1. In order to apply this principle, Proesmans et al. [8], who work with two images, use four maps: one disparity map plus one occlusion map per image. The occlusion maps are computed using the LRC approach. Strecha and Van Gool [10] extend this principle to the multiview context. First, N disparity maps are computed from the N views. Then, for each view, N − 1 occlusion maps are computed as the LRC evaluation against all other views.
In order to obtain better results, some methods return to step 2 when step 3 is over and loop until the system converges. The problem with iterative methods is that the disparity and occlusion estimations are independent, do not interact with each other, and thus do not ensure a global geometric consistency.
The second category is then composed of methods that estimate occlusions and disparities simultaneously. In the context of two views, Alvarez et al. [9] introduce an energy of the form

E^k = E^k_d + χ E^k_s + β E^k_o,    (5)

where E^k_d evaluates dissimilarities between corresponding pixels, E^k_s corresponds to the smoothness constraint, and E^k_o is the sum of ε_k for every pixel of image k; χ and β are weighting factors. Note that even if pixels are detected as occluded, their dissimilarities are still taken into account in the dissimilarity term. That means this term contains dissimilarities of mismatching pixels, which have nothing in common. This is a problem, since it introduces noise into the energy. In order to solve that, Ince and Konrad [13] use an energy similar to (5) with dissimilarity terms of the form

E^k_d = Σ_{p∈I_k} F(ε_k(p)) L(p, p + d_k(p)),    (6)

where F is a weighting function which approaches zero when ε_k(p) is high (i.e., occlusions are detected). Moreover, they use the smoothness term in order to extrapolate the disparities in occluded areas.
By the same token, we have proposed a multiview graph-cuts-based method in [14], which integrates occlusion penalties in its energy function without integrating dissimilarities of mismatching pixels.
In spite of the fact that these methods use smooth, discontinuity-preserving functions, they can still contain inconsistencies, which we detail in Section 3.

A New Approach for Occlusion Detection
3.1. Overview. Let us imagine a standard scene with a man behind a wall. Four views of this scene are shot; Figure 2(a) shows the corresponding epipolar lines from each image and superimposes them. This representation, which we call a matching graph, is useful in order to define matches and occlusions. In Figure 2(b), pixel p_man ∈ I_3 (green circle) has disparity d_3(p_man). The corresponding pixel in I_2 is also a pixel of the man, so there is a match between these pixels. In image 1, the corresponding pixel of p_man is part of the wall (blue square). It has a larger disparity (since the wall is nearer). Therefore there is an occlusion between these two pixels, as shown in Figure 2(b) with an orange diamond. The scene described in Figure 2(c) is an example of an impossible matching graph. This graph shows that the wall pixel in image 2 corresponds to a pixel in image 1 which is not a pixel of the wall but a representation of the man; it has a smaller disparity. This situation is impossible, as it supposes that the wall is hidden by the man, whose disparity implies that he is behind it. This is what we call a consistency error. The LRC does not handle this kind of inconsistency.
We propose the following rules in order to define our approach to occlusions. Let us assume p ∈ I_i and its corresponding pixel q ∈ I_j:
(i) if d_j(q) = d_i(p), pixels p and q match;
(ii) if d_j(q) > d_i(p) (q belongs to a nearer object), there is an occlusion;
(iii) if d_j(q) < d_i(p), there is a consistency error.
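A hedged sketch of this three-way classification, assuming the match/occlusion/consistency-error reading of Figure 2 described above (`classify` is our name, not the paper's):

```python
def classify(d_p, d_q):
    """Classify the link between pixel p and its corresponding pixel q
    from their disparities: equal disparities -> match; q nearer
    (larger disparity) -> occlusion; q farther (smaller disparity) ->
    consistency error, which the rules forbid."""
    if d_q == d_p:
        return "match"
    if d_q > d_p:
        return "occlusion"
    return "consistency error"
```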

International Journal of Digital Multimedia Broadcasting
In order to simplify notation, we say that θ is the occlusion image when an occlusion occurs at image number θ, that is, when the corresponding pixels from images θ and θ + 1 do not match. In Figure 2(b), for instance, there is an occlusion between images 1 and 2; θ is then equal to 1.

Energy Function.
In order to take the rules presented above into account, we use an energy function of the form

E(d) = E_m(d) + Σ_{i=0}^{N−1} E^i_s(d_i),    (7)

where d is the set of all disparity functions d_i. E^i_s is the smoothness term, which may be the same as the one used in (5). This smoothness constraint is applied to each disparity function. E_m contains all the dissimilarity, occlusion, and consistency penalties.
In the case of Figure 2(b), E_m includes the three dissimilarities between the four pixels of the wall, plus one dissimilarity between the two pixels of the man, plus one occlusion penalty between the two mismatching pixels. In Figure 2(c), E_m is the sum of the three dissimilarities between the man pixels, the dissimilarity between the wall pixels, and the consistency penalty. Of course, these examples take a very small number of pixels into account, whereas E_m considers all of them.
Finally, this term is given by

E_m(d) = Σ_{i=0}^{N−1} Σ_{p∈I_i} [E^{i+}_local(d, p, q) + E^{i−}_local(d, p, q)],    (8)

where E_local is the local cost between p ∈ I_i and its corresponding pixel either in the previous (i−) or the next (i+) image. Let us call this pixel q. We have

E^{i+}_local(d, p, q) = L(p, q) if d_{i+1}(q) = d_i(p) (match); K_occ if d_{i+1}(q) > d_i(p) (occlusion); K_con if d_{i+1}(q) < d_i(p) (consistency error),    (9)

with q = p + d_i(p). The term E^{i−}_local(d, p, q) is the same, except that q is in image i − 1 instead of i + 1. K_occ and K_con are two constant values corresponding to the occlusion penalty and the consistency penalty, respectively. Due to our specific domain of application (augmented reality on autostereoscopic displays), the consistency constraint must be very strong; the value of K_con is then very high to ensure that this case never happens.
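The accumulation of these penalties along one epipolar line can be sketched as follows (a simplification: we accumulate only the forward (i+) terms, represent each view's disparities as a 1-D integer array, and skip correspondents that fall outside the next view; these conventions are ours, not the paper's):

```python
def matching_term(disp, diss, k_occ=100, k_con=3000):
    """E_m-style sum along one epipolar line for consecutive image
    pairs (i, i+1): pixel x of image i, with disparity disp[i][x], has
    its correspondent at x + disp[i][x] in image i+1; the pair
    contributes the dissimilarity diss(i, x) on a match, K_occ on an
    occlusion, K_con on a consistency error.
    disp: list of 1-D integer disparity arrays, one per view;
    diss: callable returning the color dissimilarity L(p, q)."""
    total = 0
    for i in range(len(disp) - 1):
        for x in range(len(disp[i])):
            q = x + disp[i][x]
            if not (0 <= q < len(disp[i + 1])):
                continue                    # correspondent leaves the view
            dp, dq = disp[i][x], disp[i + 1][q]
            if dq == dp:
                total += diss(i, x)         # match
            elif dq > dp:
                total += k_occ              # occlusion
            else:
                total += k_con              # consistency error
    return total
```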
This energy function can be used in different methods as we will see in Section 4.

Application
In this section, we present two applications of our occlusion approach. The first one does not use any smoothness constraint and focuses on our occlusion approach in order to emphasize its relevance to a correlation-based method. The second one is an application of the energy function defined in the previous section to a graph-cuts-based method.
Both methods use the same constant K_occ. We found empirically that a value of 100 gives good results on our different sets. We give K_con a value of 3000.

Correlation-Based Method.
This method uses two distinct local costs. The first one supposes there is no occlusion, and the second one supposes there is exactly one occlusion. These two costs compete by means of a Winner-Takes-All (WTA) algorithm.
The first cost could be any local cost found in the literature. Our implementation uses the cost M^i_α described in (2), where L is the absolute difference of intensities summed over the three color components.
The second cost, C_local, is a local subset of E_m (8). Indeed, the cost of a pixel p includes the E_local (9) energies for all pixels linked to p in the N images, assuming that there is only one occlusion and two disparities (on the left and on the right of the occlusion). In order to ensure the consistency of the matching graph implicitly, only values meeting the consistency condition are tested; therefore the constant K_con never appears. C_local finally contains one penalty K_occ due to the occlusion, plus the dissimilarities between matching pixels. The local cost for p in image i is

C_local(p) = C_left + K_occ + C_right,    (10)

where θ is the occlusion image and δ_θ− and δ_θ+ are the two disparities, respectively on the left and on the right of the occlusion. Figure 3 shows the three terms of C_local with i = 2 and θ = 1, in the same configuration as in Figure 2(b); that is, it shows the cost of each term. q is the corresponding pixel of p in image θ: q is equal to p + (θ − i)δ_θ− if i is on the left of the occlusion (i < θ), and equal to p + (θ − i)δ_θ+ if it is on the right (i > θ). C_left and C_right contain the dissimilarity costs on the left and right sides of the occlusion, respectively. They are given by

C_left = Σ_{k=0}^{θ−1} L(q + (k − θ)δ_θ−, q + (k + 1 − θ)δ_θ−),
C_right = Σ_{k=θ+1}^{N−2} L(q + (k − θ)δ_θ+, q + (k + 1 − θ)δ_θ+).    (11)

In order to ensure the consistency of the matching graph, only disparities meeting the following condition are tested:

δ_θ− > δ_θ+.    (12)

Finally, the selection is based on a WTA algorithm: if the minimum cost for a pixel is obtained using M^i_α, then the disparity α is assigned to the pixel. If it is obtained using C_local, then the disparity assigned to the pixel is either δ_θ− or δ_θ+, depending on whether i is, respectively, on the left or on the right of the occlusion.
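A sketch of the one-occlusion cost C_local for one epipolar line follows (the exact chain positions, the bounds handling, and the form of the consistency test are our assumptions for illustration):

```python
def sad(p, q):
    """Absolute differences summed over the color components."""
    return sum(abs(int(a) - int(b)) for a, b in zip(p, q))

def c_local(line_pixels, i, x, theta, d_minus, d_plus, k_occ=100):
    """One-occlusion local cost of pixel x in image i: images 0..theta
    share disparity d_minus, images theta+1..N-1 share d_plus, and a
    single K_occ penalty sits between images theta and theta+1.
    Configurations violating the consistency condition are rejected.
    line_pixels: list of N lists of RGB pixels for one epipolar line."""
    if d_minus <= d_plus:
        return float('inf')            # inconsistent, never selected
    n = len(line_pixels)
    width = len(line_pixels[0])
    d_here = d_minus if i <= theta else d_plus
    q = x + (theta - i) * d_here       # chain pixel of p in image theta
    cost = k_occ                       # the single occlusion penalty
    # dissimilarities along the left chain (disparity d_minus)
    for k in range(theta):
        a = q + (k - theta) * d_minus
        b = q + (k + 1 - theta) * d_minus
        if 0 <= a < width and 0 <= b < width:
            cost += sad(line_pixels[k][a], line_pixels[k + 1][b])
    # dissimilarities along the right chain (disparity d_plus)
    for k in range(theta + 1, n - 1):
        a = q + (k - theta) * d_plus
        b = q + (k + 1 - theta) * d_plus
        if 0 <= a < width and 0 <= b < width:
            cost += sad(line_pixels[k][a], line_pixels[k + 1][b])
    return cost
```

With N = 4 and θ = 1, this yields one left dissimilarity, the K_occ penalty, and one right dissimilarity, i.e., the three terms shown in Figure 3.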

Graph-Cuts-Based Method.
Our method is based on the energy function previously described in (7) and (8). We use the graph-cuts method in order to minimize our energy. Please refer to the publications by Boykov et al. [16] for a complete presentation of the graph-cuts method, and by Kolmogorov and Zabih [17] for an explanation of the graph construction. We use the α-expansion algorithm and, unlike others, we loop only once over each disparity. Moreover, we always loop from the highest disparity to the lowest one, since we have found that this gives more accurate occlusion areas in the results. The graph we present in this section is based on this assumption, as we will see further. Our graph is composed of one node for each pixel of all the images. It also has a source (s) and a sink (t), which mean "keep the same disparity" and "change to disparity α", respectively. Now, we will see how to construct the graph corresponding to E_local (9). In fact, p will not have the same corresponding pixel in the other image whether it changes its disparity or not. The graph corresponding to E_local is then composed of three nodes p, q_s, and q_t, where q_s and q_t are the corresponding pixels when p is cut from the source or from the sink, respectively. However, it can be separated into two graphs using the pairs (p, q_s) and (p, q_t). Figure 4 shows the general graph corresponding to E_local. The cut C^{pq_s}_{ss} is equal to either L(p, q_s) or K_occ, depending on the previous disparities of p and q_s. Since our main loop goes from the highest disparity to the lowest one, the cut C^{pq_s}_{st} means that q_s will have a lower disparity than p; it is then equal to K_con. As mentioned by Kolmogorov and Zabih [17], the following constraint has to be checked in order to ensure that the graph is achievable:

C_st + C_ts ≥ C_ss + C_tt.    (13)

In the pair (p, q_t), the cut C^{pq_t}_{ss} is equal to the dissimilarity penalty L(p, q_t) and C^{pq_t}_{ts} is equal to the occlusion penalty. Note that the value of K_occ has to be greater than or equal to L(p, q_t) in order to respect the constraint given by (13). To that end, we truncate L(p, q_t) to the value of K_occ.
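The representability constraint and the truncation step can be stated compactly (a sketch; `submodular_pair` and `truncated_dissimilarity` are our names):

```python
def submodular_pair(c_ss, c_tt, c_st, c_ts):
    """Kolmogorov-Zabih condition for a pairwise term to be
    representable in a graph cut: C_st + C_ts >= C_ss + C_tt."""
    return c_st + c_ts >= c_ss + c_tt

def truncated_dissimilarity(l_pq, k_occ=100):
    """Truncate the color dissimilarity so that it never exceeds the
    occlusion penalty, keeping the pair representable."""
    return min(l_pq, k_occ)
```

For instance, a raw dissimilarity of 250 with K_occ = 100 would violate the constraint, while its truncated value satisfies it.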
The smoothness of the result is ensured by the term E_s of the energy given in (7). We use a definition of the form

E^i_s(d_i) = Σ_{(p,q)∈Ψ_i} V(d_i(p), d_i(q)),    (14)

where Ψ_i is the set of neighbouring pixel pairs in image i and V is a discontinuity-preserving penalty function. We use two implementations of the smoothness constraint. In the first one, Ψ_i contains only horizontal neighbours, in order to obtain a 1D smoothness constraint. Such a constraint allows an independent selection of epipolar lines and thus a parallel implementation. The second one uses a 2D smoothness constraint and includes both horizontal and vertical neighbours.

Results
To compare our methods, first with each other and then with existing ones, we use two sets of 8 images. The first one is a set of images of a virtual scene, which allows us to compare results against ground truth. The second one is a set of photographs taken at the Palais du Tau in Reims [18]. The dimensions of the images in both sets are 512 × 384 pixels. Figure 5 shows one image of each set. The photograph has homogeneous colors whereas the virtual scene has varied colors. This allows us to test our methods in both cases. We compare three pairs of methods. The first pair is composed of correlation-based methods: one uses the cost of (2) and the other is our own correlation-based method. The second and third pairs are graph-cuts-based methods.
One with a smoothness constraint along epipolar lines, and one with a 2D smoothness constraint. Each pair is composed of a method using our energy function and one using a standard energy of the form

E(d) = Σ_{i=0}^{N−1} Σ_{p∈I_i} M^i_{d_i(p)}(p) + Σ_{i=0}^{N−1} E^i_s(d_i).    (15)

Using the ground truth of the virtual scene, we give the error rate corresponding to each method in Table 1. This error results from the absolute differences between the real disparities and our disparities, summed over all pixels and divided by the number of pixels. The result is thus the average disparity error per pixel. Our local methods have the highest error rates, and our global method has the smallest. In each category, the error rate of the method using our occlusion approach is always lower than that of the method which does not use it. We observe that the error rate of the correlation-based method without occlusion handling is lower than that of the method using the 1D smoothness constraint. We think this is due to the fact that our virtual images contain no noise and no specular light at all, which is why the correlation-based method gives particularly good results in this case. Therefore, the comparison between correlation-based methods and energy-minimization-based methods is not meaningful here.
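The error measure can be sketched as follows (the function name is ours):

```python
import numpy as np

def mean_disparity_error(estimated, ground_truth):
    """Average absolute disparity error per pixel: the sum of
    |d_est - d_gt| over all pixels, divided by the number of pixels."""
    est = np.asarray(estimated, dtype=np.int64)
    gt = np.asarray(ground_truth, dtype=np.int64)
    return np.abs(est - gt).mean()
```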
Table 1 also gives computation times on both sets of images. We used an Intel Core 2 Duo CPU E4700 and 2 GB of memory. Both correlation-based methods are implemented using CUDA on an NVIDIA Quadro FX 3700 graphics card. Times include the computation of the whole set of N disparity maps. Globally, methods using the occlusion approach are slower than the others, because they have more possibilities to take into account. Indeed, the correlation-based method has more tests to carry out, and the graph-cuts-based method has a more complicated graph structure, which is slower to solve. Figure 6 shows results obtained without and with occlusion handling (K_occ = 100). Figure 6(a) illustrates the standard method, whereas Figure 6(b) corresponds to the method we have presented. We can see in these images that our method has precisely detected occlusions on the boundaries of the columns. However, areas without depth discontinuities, like the background, contain errors. We think this is due to the noise sensitivity of our occlusion detection. The first three extracts given in Figures 7(a), 7(b), and 7(c) show details of these results on the front column. We observe that our correlation-based method has accurately defined discontinuities fitting the real boundaries of the column, but disparities in occluded areas contain errors (noise in Figure 7(c)), since the method has difficulty finding them in such areas. Finally, our approach is not well suited to the principle of correlation-based methods: in such methods, the selection of disparities is independent for each pixel, and the consistency from one disparity map to another cannot be ensured.
On the other hand, graph-cuts-based methods allow the symmetrical minimization of the energy, ensuring a strong consistency. Figures 6(c) and 6(d) show results with the 1D smoothness constraint, and Figures 6(e) and 6(f) show results with the 2D smoothness constraint. Again, the method using our occlusion approach has very accurate depth discontinuities at object boundaries. In our 1D smoothness method (Figure 6(d)), the horizontal line effect is more visible than in the classical method (Figure 6(c)). This is due to the fact that our occlusion approach penalizes any disparity variation, because it is detected as an occlusion; our method thus tends to add a strong smoothness constraint along epipolar lines. The 2D smoothness constraint compensates for this artifact. The images in Figures 7(d) and 7(e) show the front column obtained with these methods. We notice that, without occlusion handling, the method cannot find the disparities of some pixels which are not visible in all images. With our occlusion approach, on the other hand, it accurately detects the occlusion. However, we observe in Figure 6(f) that the column in the background (in a white ellipse) is not well defined. This is due to the principle of plane-sweeping algorithms and to the fact that this column actually lies between two planes. The reader can refer to [14] for a presentation of our refinement step, which solves this problem.

Conclusion
We have introduced a new approach to handle the occlusions of a scene in a multiview context. As a proof of the relevance of this new detection rule, we have presented two methods with the particularity of handling object boundaries very accurately. Even if these methods can handle two-view stereovision, they are designed for the multiview context with any number of views. The results we obtain show that our occlusion approach succeeds in detecting object boundaries, to the detriment of computation times, and can still find disparities even for pixels that are not visible in all views. Moreover, used in symmetrical energy-minimization-based methods, our approach ensures a geometric consistency, which is crucial for autostereoscopic displays. However, computation time is the main problem of our methods. That is the reason why our objective is to find a means to minimize the energy faster. One idea is a GPU implementation of the graph cuts. Some work has already been done in this domain [19, 20], but we are not using it for the moment since it induces a lot of constraints on the graph structure and must be adapted to our energy. Another possibility we are working on is to reduce the number of nodes used in our graph, in order to simplify the maximum flow problem.

Figure 1:
Figure 1: Illustration of the parallel camera configuration with four optics (illustrated by black dots).

Figure 3:
Figure 3: Example of the components of C_local (10).
(i) Pair (p, q_s): this pair makes sense only if p is cut from the source. It means that the cuts C^{pq_s}_{ts} and C^{pq_s}_{tt} are equal to 0.

Figure 5:
Figure 5: A photograph shot at the Palais du Tau (a) and a virtual scene (b).

Table 1:
Errors with respect to the ground truth on the virtual scene, plus computation times on both the Palais du Tau and virtual sets.