Based on the framework of structural support vector machines (SSVMs), this paper proposes two approaches to depth restoration for different scenes: margin rescaling and slack rescaling. The results show that both approaches converge in a small number of iterations, while the slack approach yields better prediction accuracy. However, because its objective is nondecomposable, the application of the slack approach is limited. This paper therefore introduces a novel approximate slack method to solve this problem, in which we propose a modified way of defining the loss function to ensure the decomposability of the objective function. During training, a bundle method is used to improve computational efficiency. The results on the Middlebury datasets show that the proposed depth inference method overcomes the nondecomposability of the slack scaling method and achieves relatively acceptable accuracy. Our approximation can serve as an alternative to the slack scaling method when efficient computation is required.
1. Introduction
Learning for stereo vision has long been a challenging subject. Owing to the growing number of ground-truth datasets, considerable progress has been achieved, for example, using the scene structure of input images to learn a probability distribution model for matching [1–4], and adopting an expectation maximization algorithm to estimate disparity and then relearn the model parameters based on the estimate [5]. Although these methods have shown exciting results, they share an obvious shortcoming: the parameters must be preset or initialized manually on the basis of prior knowledge. In [6], a supervised machine learning method based on conditional random fields (CRFs) was proposed to handle this problem, and the results were promising.
As mentioned above, supervised image labeling has been a long-standing problem in computer vision. In recent years, CRFs have become a popular way to address this problem [7, 8]: the spatial correlations among neighboring pixels are incorporated by defining proper unary and pairwise potential functions on the related pixels. Support vector machines have also been widely used in image labeling [9], but they are less successful because they ignore spatial correlations and therefore produce noisy labelings.
Recently, structured prediction has attracted widespread attention, and many new approaches have been proposed. Structured learning approaches address the problems mentioned above. Both inputs and outputs are structured, and the strong internal correlations among them are exploited. The task is formulated as learning complex functional dependencies between multivariate input and output representations. Structured learning has had a significant impact on important computer vision tasks, including image denoising [10], stereo [11], segmentation [12, 13], object localization [14, 15], and human pose estimation [16, 17]. A common approach is to generalize max-margin binary/multiclass classification to incorporate structured information [14, 18–20]. It has been applied in many areas, such as sequence labeling, image segmentation, grammar parsing, dependency parsing, bipartite matching, and text segmentation [21]. Furthermore, introducing structured information into SVMs has produced two variants of the structural SVM (SSVM), known as the max-margin-based and slack-based SSVMs, respectively.
Because its error function is decomposable, the max-margin method can find the most violated constraint using a maximum a posteriori (MAP) inference algorithm for prediction [21]. However, its shortcomings are also obvious: it requires the error function to be linearly comparable with the features, and it is sensitive to the most violating label. A label with a large error would greatly decrease the separability of all other labels. An alternative is the slack scaling method. It has a fixed margin of 1 and penalizes violations in proportion to their errors, which provides excellent accuracy. However, because its error function is nondecomposable, the slack method is not widely used. Therefore, we propose an approximation method which modifies the slack method while preserving its basic properties. Depending on the task, the proposed approximation makes it possible to design suitable loss functions and to generate the corresponding solver.
This paper is organized as follows. Section 2 briefly discusses the principles of the SSVM. Section 3 presents our approach, including the construction of the structural support vector machine, the typical max-margin method, and the slack scaling formulation. Section 4 elaborates an approximation of the slack method. Section 5 describes the feature vectors used in our algorithm. Section 6 discusses the conditions and strategies relevant to training and how to make training more efficient. Finally, Section 7 applies both methods to depth restoration and compares them in detail.
2. Structural Support Vector Machine
Derived from statistical machine learning, discriminative models focus on the posterior probability $p(y \mid x, \omega)$ and have been among the most successful techniques for structured prediction. Here $x$ is an input sample in the input space χ and $y$ is the associated output in the output space γ. Given a training set with samples $x_i$ and their ground-truth outputs $y_t$, a model for $p(y \mid x, \omega)$ is first learnt such that the correct label $y_t$ has a higher probability than any wrong label $y$, that is, $p(y_t \mid x, \omega) \ge p(y \mid x, \omega)$. Prediction for a new sample $x$ is then performed by MAP estimation:
(1) $y^{*} = \arg\max_{y} p(y \mid x, \omega).$
Under the framework of CRFs, $p(y \mid x)$ is modeled by a log-linear model, which is often assumed to take the following form:
(2) $\log p(y \mid x, \omega) = \langle \omega, \Phi(x,y) \rangle - A_{\omega}(x),$
where $\Phi(x,y)$ encodes the relationship between the input and its output, and the second term, $A_{\omega}(x)$, is the normalization factor that makes $p(y \mid x, \omega)$ a valid probability distribution.
By adopting the max-margin framework, the structural support vector machine learns the weight vector $\omega$ that parameterizes the model so that the correct output labels are predicted. The resulting optimization problem can be written as
(3) $\min_{\omega,\xi} \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{n}\xi_i$
subject to
(4) $\langle \omega, \Phi(x_i, y_t) \rangle - \langle \omega, \Phi(x_i, y) \rangle \ge \Delta(y, y_t) - \xi_i.$
Here $i = 1, \dots, n$ indexes the samples, $y$ is any label different from the true label $y_t$, $\Delta(y, y_t)$ denotes the loss between the two labels, and $\xi_i$ are the slack variables. The most violated constraint can then be found by solving
(5) $y^{*} = \arg\max_{y} \left( \Delta(y, y_t) + F(x, y) \right),$
where $F(x,y) = \langle \omega, \Phi(x,y) \rangle$ is the discriminative function. Maximizing $F$ can equivalently be posed as an energy minimization problem, that is, $\arg\max_{y} F(x,y) = \arg\min_{y} E(x,y)$.
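The search for the most violated constraint in (5) can be illustrated on a toy problem. The following is a minimal sketch, assuming a Hamming loss and a hypothetical joint feature map (`joint_feature`) that sums the input vectors per label; the label space is enumerated by brute force, which is only feasible for very small problems and is not the inference method used in this paper.

```python
import itertools
import numpy as np

def hamming_loss(y, y_true):
    # Delta(y, y_t): number of positions where the labels disagree
    return sum(int(a != b) for a, b in zip(y, y_true))

def joint_feature(x, y, num_labels):
    # Hypothetical Phi(x, y): per-label sum of input vectors, flattened
    phi = np.zeros((num_labels, x.shape[1]))
    for x_i, y_i in zip(x, y):
        phi[y_i] += x_i
    return phi.ravel()

def most_violated_label(w, x, y_true, num_labels):
    # Brute-force version of eq. (5): argmax_y Delta(y, y_t) + <w, Phi(x, y)>
    best_y, best_val = None, -np.inf
    for y in itertools.product(range(num_labels), repeat=len(y_true)):
        val = hamming_loss(y, y_true) + float(w @ joint_feature(x, y, num_labels))
        if val > best_val:
            best_y, best_val = y, val
    return best_y, best_val

x = np.ones((3, 2))          # three toy "pixels" with two features each
w = np.zeros(4)              # untrained weights: only the loss term matters
y_star, value = most_violated_label(w, x, (0, 1, 0), num_labels=2)
```

With zero weights the score term vanishes, so the most violated label is simply the one that disagrees with the ground truth everywhere.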
3. Our Approach
3.1. Problem Formulation
In stereo matching tasks, stereo images are two (or more) images of the same object taken from different views, named the left image (reference image) and the right image, respectively. Assume that the right view is just a horizontal shift of the left view and that both images have the same size $R \times C$. Let $I(r,c)$ denote the pixel at the $r$th row and $c$th column of the reference image, and $I'(r,c)$ the pixel at the same position in the right image. Matching aims at finding the pixel-wise disparity which minimizes the energy
(6) $Y^{*} = \arg\min_{Y} E(I, I', Y) = \arg\min_{Y} \left( \sum_{r,c} \| I(r,c) - I'(r, c - y_{r,c}) \|^{2} + E_{\mathrm{smooth}}(Y) \right),$
where $y_{r,c}$ denotes the local disparity and $E_{\mathrm{smooth}}(Y)$ is the smoothness term, which usually takes the form of the Potts model
(7) $E_{\mathrm{smooth}}(Y) = \begin{cases} 0, & y_i = y_j, \\ p, & \text{otherwise}, \end{cases}$
where $i$ and $j$ are the indices of neighboring pixels, $y_i$ and $y_j$ are the disparity labels of these neighbors, and $p$ is a constant penalty.
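As an illustration of (6) and (7), the energy of a candidate disparity map can be evaluated directly. This is a minimal sketch under stated assumptions: grayscale images, a 4-connected neighborhood, the Potts penalty summed over all neighboring pairs, and shifted columns clamped at the image border (the border handling is our assumption, not specified in the text). The function name `stereo_energy` is hypothetical.

```python
import numpy as np

def stereo_energy(left, right, disp, p):
    """Data term of eq. (6) plus a Potts smoothness term (eq. (7))."""
    rows, cols = left.shape
    r_idx, c_idx = np.indices((rows, cols))
    # I'(r, c - y_rc); columns falling outside the image are clamped (assumption)
    shifted = np.clip(c_idx - disp, 0, cols - 1)
    data = np.sum((left - right[r_idx, shifted]) ** 2)
    # Potts model: penalty p for every pair of 4-neighbours with different labels
    horizontal = np.count_nonzero(disp[:, 1:] != disp[:, :-1])
    vertical = np.count_nonzero(disp[1:, :] != disp[:-1, :])
    return float(data + p * (horizontal + vertical))

left = np.full((3, 3), 5.0)
right = left.copy()
disp = np.zeros((3, 3), dtype=int)
flat_energy = stereo_energy(left, right, disp, p=2.0)
disp[1, 1] = 1               # one pixel disagrees with its four neighbours
bumpy_energy = stereo_energy(left, right, disp, p=2.0)
```

On a constant image the data term is zero for any disparity, so the second energy consists only of the four Potts penalties around the modified pixel.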
Normally the features of $I$ and $I'$ represent certain categories of visual information, for example, color, texture, or gradient. However, each category suits different situations. Texture features work well in boundary regions, which are usually richly textured, but are not applicable in weakly textured regions. Gradient-based features have the opposite characteristics. In addition, different categories of features are not easy to combine for learning. Simply expanding the dimension of the feature vector to include features from different categories is dangerous because of sampling effects and differences in scale: highly weighted features would dominate the final results and suppress the other features. Therefore, the data term should be constructed in the form $\langle \omega_n, \phi(I, I', Y) \rangle$, where $\omega_n$ is the unary weight parameter which balances the components of the combined feature vector against sampling effects and different scales. These parameters $W$ can be learnt from training examples.
Expanding the squared difference in the data term gives three terms, that is, $I^{2}(r,c)$, $I'^{2}(r, c - y_{r,c})$, and $-2 I(r,c) I'(r, c - y_{r,c})$. We use $\phi(I, I', Y) - \phi(I, I', Y_t)$ as the constraints in the training phase, where $Y_t$ is the ground truth. The term $I^{2}(r,c)$ is independent of the label $Y$ and would therefore be canceled out by the subtraction, so we use $\| I(r,c) - I'(r, c - y_{r,c}) \|^{2}$ in its place. The parameters acting on these terms can balance the difference between $I(r,c)$ and $I'(r, c - y_{r,c})$ caused by sampling effects and camera settings. Overall, the data term is built as $\phi(I, I', Y) = [\, \| I(r,c) - I'(r, c - y_{r,c}) \|^{2},\; I'^{2}(r, c - y_{r,c}),\; -2 I(r,c) I'(r, c - y_{r,c}) \,]^{T}$.
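The expansion and the cancellation argument can be written out explicitly. The following worked step is a sketch for a single pixel, using the shorthand $I = I(r,c)$, $I'_Y = I'(r, c - y_{r,c})$ and $I'_{Y_t}$ for the same quantity evaluated at the ground-truth disparity (these shorthands are ours).

```latex
\begin{align*}
\| I - I'_Y \|^{2} &= I^{2} - 2\, I\, I'_Y + I'^{2}_Y, \\
\| I - I'_Y \|^{2} - \| I - I'_{Y_t} \|^{2}
  &= \bigl( I'^{2}_Y - I'^{2}_{Y_t} \bigr) - 2\, I \bigl( I'_Y - I'_{Y_t} \bigr).
\end{align*}
```

The $I^{2}$ term does not depend on the label and drops out of the difference, which is why a weight attached to it could not be learnt from the training constraints.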
3.2. Max-Margin Formulation for Stereo Learning
Assuming a learnt pairwise weight $\omega_e = p$, the parameter $W$ can be denoted as $W = (\omega_n, \omega_e)^{T}$, and the energy is written as $E(I, I', Y) = \langle W, \Phi(I, I', Y) \rangle$. Here $\Phi(I, I', Y)$ is the vector that includes the data term $\phi(I, I', Y)$ as well as the smoothness term. The energy of the ground truth $Y_t$ should be minimal, that is, for all possible $Y$ we have $E(I, I', Y) \ge E(I, I', Y_t)$. By adopting margin scaling and adding the slack variables $\xi_t$ to account for violations, the optimization problem reads, for $\eta > 0$,
(8) $\min_{W, \xi_t} \frac{\|W\|^{2}}{2} + \frac{\eta}{T} \sum_{t=1}^{T} \xi_t, \quad \text{s.t. } \langle W, \Phi(I, I', Y) \rangle - \langle W, \Phi(I, I', Y_t) \rangle \ge \Delta(Y_t, Y) - \xi_t \;\; \forall t, Y, \quad \xi_t \ge 0.$
3.3. Slack Scaling Formulation
The margin rescaling method requires the label loss $\Delta(Y_t, Y)$ to be linearly comparable with the feature values $\Phi(I, I', Y)$. However, this is normally hard to satisfy in structured learning, since $\Delta(Y_t, Y)$ accumulates the loss over every pixel in the image, and thus the aggregate value is much larger than the feature values. In stereo matching tasks in particular, the pixel-wise loss may reach into the hundreds, which makes the overall loss even larger. Thus, we would like to adopt slack scaling, as it is invariant to the scale of the label loss. Nevertheless, the slack rescaling formulation is difficult to solve, because no efficient algorithm for finding $Y^{*}$ exists. We follow the method introduced in [21] to solve this problem.
The slack rescaling optimization formulation is as follows:
(9) $\min_{W, \xi_t} \frac{\|W\|^{2}}{2} + \frac{\eta}{T} \sum_{t=1}^{T} \xi_t, \quad \text{s.t. } \langle W, \Phi(I, I', Y) \rangle - \langle W, \Phi(I, I', Y_t) \rangle \ge 1 - \frac{\xi_t}{\Delta(Y_t, Y)} \;\; \forall t, Y, \quad \xi_t \ge 0.$
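The difference between the two formulations can be seen by computing the smallest slack that satisfies a single constraint of (8) and of (9). The sketch below assumes a positive loss; `energy_gap` stands for $\langle W, \Phi(I,I',Y) \rangle - \langle W, \Phi(I,I',Y_t) \rangle$, and both function names are hypothetical.

```python
def min_slack_margin(energy_gap, loss):
    # Eq. (8): gap >= loss - xi   =>   xi >= loss - gap
    return max(0.0, loss - energy_gap)

def min_slack_slack(energy_gap, loss):
    # Eq. (9): gap >= 1 - xi / loss   =>   xi >= loss * (1 - gap), for loss > 0
    return max(0.0, loss * (1.0 - energy_gap))

# A wrong labelling with a large aggregate loss, already separated by a gap of 2
margin_needed = min_slack_margin(2.0, 100.0)
slack_needed = min_slack_slack(2.0, 100.0)
# The same labelling when the gap is only 0.5
slack_needed_small_gap = min_slack_slack(0.5, 100.0)
```

Under margin rescaling the gap must grow to the size of the loss before the constraint is satisfied, whereas under slack rescaling a gap of 1 suffices regardless of the loss scale.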
4. The Approximation for Slack Scaling
For the slack scaling optimization formulation, the inference problem is to find
(10) $y_S = \arg\min_{y} \left( s(y) + \frac{\xi}{L(y)} \right),$
where the minimization is over the set of violating labels $\{ y : s(y) - s(y_i) < 1 - \xi / L(y) \}$, $\xi$ is the slack variable, $s(y) = \langle W, \Phi(I, I', Y) \rangle$, and $L(y) = \Delta(Y_t, Y)$.
As can be seen from the formulation, $L(y)$ must be evaluated on the whole label, so the second part of the formula cannot be decomposed easily. Thus, an approximation $y_A$ is used in place of $y_S$, which makes it possible to decompose the problem into local parts.
It should be noted that $s(y) + \xi / L(y)$ is concave, and it has been shown that it can be approximated by a function that is linear with respect to $L(y)$ [22]. The linearization and approximation procedure is described in the following parts.
4.1. Linearization and Approximation
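According to [22], a concave function can be expressed in a linear form. Therefore, (10) is expressed as
(11) $s(y) + \frac{\xi}{L(y)} \ge \max_{\lambda \ge 0} \left( s(y) - \lambda L(y) + 2\sqrt{\xi\lambda} \right).$
The aim of the inference problem is to find the optimal label $y$ which minimizes the left side of (11). Therefore, we have
(12) $\min_{y} \left( s(y) + \frac{\xi}{L(y)} \right) = \min_{y} \left( s(y) - \min_{\lambda \ge 0} \left( \lambda L(y) - 2\sqrt{\xi\lambda} \right) \right) = \min_{y} \max_{\lambda \ge 0} \left( s(y) - \lambda L(y) + 2\sqrt{\xi\lambda} \right) = \max_{\lambda \ge 0} \min_{y} \left( s(y) - \lambda L(y) + 2\sqrt{\xi\lambda} \right).$
Here, let $F'(y; \lambda) = s(y) - \lambda L(y) + 2\sqrt{\xi\lambda}$; thus
(13) $F(\lambda) = \min_{y} F'(y; \lambda),$
which leads to the simplified formulation
(14) $\min_{y} \left( s(y) + \frac{\xi}{L(y)} \right) = \max_{\lambda \ge 0} \min_{y} F'(y; \lambda) = \max_{\lambda \ge 0} F(\lambda).$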
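The replacement of $\xi / L(y)$ by a maximization over $\lambda$ can be checked with a short calculation. The sketch below assumes that the last term inside the maximization is read as $2\sqrt{\xi\lambda}$, with $\xi \ge 0$ and $L > 0$; $g$ is a helper symbol introduced here for illustration.

```latex
\begin{align*}
g(\lambda) &= 2\sqrt{\xi\lambda} - \lambda L, \qquad
g'(\lambda) = \sqrt{\xi/\lambda} - L = 0
\;\Longrightarrow\; \lambda^{*} = \frac{\xi}{L^{2}}, \\
g(\lambda^{*}) &= \frac{2\xi}{L} - \frac{\xi}{L} = \frac{\xi}{L}.
\end{align*}
```

Since $g$ is concave in $\lambda$, the stationary point is its maximum, so under this reading $\max_{\lambda \ge 0} \bigl( 2\sqrt{\xi\lambda} - \lambda L \bigr) = \xi / L$, and for any fixed $\lambda$ the expression depends on $L$ only linearly.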
For a fixed $\lambda$, the optimal label $y_\lambda$ is first computed through the minimization
(15) $y_\lambda = \arg\min_{y} \left( s(y) - \lambda L(y) \right).$
Then, $y_\lambda$ can be substituted into $F(\lambda)$. A $\lambda$ that maximizes $F(\lambda)$ can be found because $F(\lambda)$ is concave with respect to $\lambda$: by (13) it is the pointwise minimum over $y$ of functions that are each concave in $\lambda$, and such a minimum is concave as well.
With a line search algorithm such as golden search, the maximum of $F(\lambda)$ can be found efficiently. The search procedure visits many different values of $\lambda$, and evaluating $F(\lambda)$ for each of them yields a label $y_\lambda$. Among these labels, the one that gives the smallest value of $s(y) + \xi / L(y)$ is selected and denoted as $y_A$.
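The selection of $y_A$ from the labels visited during the search can be sketched as follows. This is a toy illustration, not the implementation used in this paper: the candidate labels are given as an explicit list of `(name, s, L)` tuples with $L > 0$, the inner minimization of (15) is done by enumeration instead of a decomposable inference algorithm, and the $\lambda$ values are passed in as a plain list rather than produced by a line search. The function name is hypothetical.

```python
def approximate_slack_inference(candidates, xi, lambdas):
    """Return (label, value) minimising s + xi / L among the labels y_lambda."""
    best = None
    for lam in lambdas:
        # Eq. (15): y_lambda = argmin_y s(y) - lambda * L(y)
        name, s, loss = min(candidates, key=lambda c: c[1] - lam * c[2])
        # Score the visited label with the original slack objective of eq. (10)
        value = s + xi / loss
        if best is None or value < best[1]:
            best = (name, value)
    return best

candidates = [("a", 1.0, 1.0), ("b", 2.0, 4.0), ("c", 5.0, 10.0)]
y_A, objective = approximate_slack_inference(candidates, xi=4.0,
                                             lambdas=[0.0, 0.4, 1.0])
```

In this example the three values of $\lambda$ select the labels a, b, and c in turn, and b has the smallest slack objective.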
4.2. The Determination of Interval for λ
Since the constraint $\lambda \ge 0$ is given, $\lambda = 0$ is an obvious candidate for the lower bound $\lambda_l$. However, with $\lambda = 0$ the loss $L(y)$ is ignored, so it is hard to distinguish $F'(y; \lambda)$ between different labels in the early iterations. We therefore let $\lambda_l = \varepsilon / L_{\max}$, where $L_{\max}$ is the maximal possible label loss and $\varepsilon$ is the tolerance on the difference between two consecutive iterations of the algorithm. In this way, a proper lower bound $\lambda_l$ is obtained.
Next we determine the upper bound of $\lambda$. It is sufficient to find an upper bound $\lambda_u$ such that $F(\lambda) = F(\lambda_u)$ for any $\lambda \ge \lambda_u$. It also satisfies
(16) $y' = \arg\min_{y} \left( s(y) - \lambda_u L(y) \right),$
which leads to the following formula:
(17) $s(y') - \lambda_u L(y') \le \min_{y:\, L(y) < L(y')} \left( s(y) - \lambda_u L(y) \right).$
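Here, let $y_1 = \arg\min_{y} s(y)$ and let $L_\varepsilon$ be the minimal difference between $L(y)$ and $L(y')$, for example $L_\varepsilon = 1$ for the Hamming loss. Then the right side of (17) becomes $s(y_1) - \lambda_u (L(y') - L_\varepsilon)$. This requires $\lambda_u \ge s(y') - s(y_1)$ (for $L_\varepsilon = 1$).
Since $s(y') < s(y_i) + 1 - \xi / L(y') < s(y_i) + 1 - \xi / L_{\max}$, $\lambda_u$ can be set as $\lambda_u = s(y_i) + 1 - \xi / L_{\max} - s(y_1)$.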
5. Construction of Feature Vector
Image features are the quantities used to describe images and the cues used to distinguish one image from another. Some image features are basic visual features, while others are defined for specific applications. Three types of features are used in this paper, that is, color, texture, and edge features.
5.1. Color Features
Color features are the basic visual description of images. Generally, color features are based on the characteristics of pixels, and each pixel in the image or image region makes its own contribution to the color features. However, as a global feature, color is not sensitive to changes in the size of the image or image region, nor to directions in the image. In other words, color features cannot capture the local characteristics of the image. Moreover, color features are not unique: pixels in different objects may share the same color features. Two basic color descriptions are the RGB color space and the YCbCr color space. While RGB concentrates on the gray levels of the pixels, YCbCr focuses on the intensity, the chromaticity, and the color difference. In the YCbCr color space, the channel Y represents the intensity of the color, while the channels Cb and Cr denote the chromaticity for blue and red, respectively. The YCbCr color space can be obtained from the RGB color space by a simple linear transformation. Both the RGB and YCbCr color features are shown in Figure 1. In this paper, we use both RGB and YCbCr as the color features in the training process.
Figure 1: The color features of the image: RGB color features (first row) and YCbCr color features (second row). From left to right, first row: the original image in RGB color space, R channel, G channel, and B channel; second row: the original image in YCbCr color space, Y channel, Cb channel, and Cr channel.
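The transformation from RGB to YCbCr can be sketched as a matrix product plus an offset. The text does not state which variant of the transform was used; the coefficients below are the full-range ITU-R BT.601 values used by JPEG/JFIF for 8-bit data and are an assumption, as is the function name.

```python
import numpy as np

# Assumption: full-range BT.601 (JPEG/JFIF) coefficients for 8-bit channels
_RGB_TO_YCBCR = np.array([
    [0.299, 0.587, 0.114],            # Y: intensity
    [-0.168736, -0.331264, 0.5],      # Cb: blue chromaticity
    [0.5, -0.418688, -0.081312],      # Cr: red chromaticity
])
_OFFSET = np.array([0.0, 128.0, 128.0])

def rgb_to_ycbcr(rgb):
    """Convert an array of shape (..., 3) from RGB to YCbCr."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb @ _RGB_TO_YCBCR.T + _OFFSET

gray_pixel = rgb_to_ycbcr([128, 128, 128])
image = rgb_to_ycbcr(np.zeros((4, 5, 3)))
```

A neutral gray pixel keeps its intensity in Y and maps to the mid-point 128 in both chroma channels, and the six channels (R, G, B, Y, Cb, Cr) can then be stacked as per-pixel color features.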
5.2. Texture Features
Similar to color features, texture features are also global features. The major difference is that texture features describe the statistical characteristics of the pixels in an image region. Texture features are rotationally invariant and robust to noise, but they are sensitive to the resolution of images: if the resolution changes, different features may be generated. In addition, lighting and reflections on object surfaces can make texture features hard to compute.
In [23], Laws developed a method for computing texture features. In this method, different convolution kernels, named Laws' masks, are applied to the image, and the responses describe certain characteristics of the image. The 2D Laws' masks can be generated from the following small kernels of length 3 and 5: L3 = [1 2 1], E3 = [1 0 -1], S3 = [1 -2 1], L5 = [1 4 6 4 1], E5 = [-1 -2 0 2 1], S5 = [-1 0 2 0 -1], W5 = [-1 2 0 -2 1], R5 = [1 -4 6 -4 1].
Here, L denotes the average gray level, E denotes the edge features, S extracts spots in the image, W extracts the wave feature, and R extracts ripples in the image.
To generate the 2D Laws' masks, a vertical 1D kernel is multiplied by a horizontal 1D kernel, for example, L5E5 = L5^T × E5. Taking the 3 × 3 masks as an example, all the possible masks are listed in Table 1. After convolving an image of size M × N with one of these masks, a gray-scale texture feature image of size (M - mask_size + 1) × (N - mask_size + 1) is generated. Figure 2 shows the texture features generated by the 3 × 3 Laws' masks.
Table 1: The possible 3 × 3 Laws' masks.
Masks | Method | Description
L3L3 | L3^T × L3 | Gray level intensity within 3 neighboring pixels in both vertical and horizontal directions
L3E3 | L3^T × E3 | Edge detection in the horizontal direction and gray level intensity in the vertical direction
L3S3 | L3^T × S3 | Spot detection in the horizontal direction and gray level intensity in the vertical direction
E3L3 | E3^T × L3 | Gray level intensity in the horizontal direction and edge detection in the vertical direction
E3E3 | E3^T × E3 | Edge detection in both vertical and horizontal directions
E3S3 | E3^T × S3 | Spot detection in the horizontal direction and edge detection in the vertical direction
S3L3 | S3^T × L3 | Gray level intensity in the horizontal direction and spot detection in the vertical direction
S3E3 | S3^T × E3 | Edge detection in the horizontal direction and spot detection in the vertical direction
S3S3 | S3^T × S3 | Spot detection in both vertical and horizontal directions
Figure 2: The outputs after convolution with all the 3 × 3 Laws' masks.
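The construction of the nine 3 × 3 masks and the size of the resulting feature images can be sketched with NumPy. The kernel values are those listed in the text; the helper names `laws_masks_3x3` and `convolve_valid` are hypothetical, and the convolution keeps only positions where the mask fits entirely inside the image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

KERNELS_3 = {
    "L3": np.array([1, 2, 1]),     # local average (gray level)
    "E3": np.array([1, 0, -1]),    # edge
    "S3": np.array([1, -2, 1]),    # spot
}

def laws_masks_3x3():
    # Each 2D mask is (vertical kernel)^T x (horizontal kernel), e.g. L3E3 = L3^T x E3
    return {a + b: np.outer(ka, kb)
            for a, ka in KERNELS_3.items() for b, kb in KERNELS_3.items()}

def convolve_valid(image, mask):
    # Windows have shape (M - k + 1, N - k + 1, k, k); the mask is flipped
    # so that the result is a true convolution rather than a correlation.
    windows = sliding_window_view(image, mask.shape)
    return np.einsum("ijkl,kl->ij", windows, mask[::-1, ::-1])

masks = laws_masks_3x3()
image = np.arange(48, dtype=float).reshape(6, 8)
texture = np.stack([convolve_valid(image, m) for m in masks.values()], axis=-1)
```

For a 6 × 8 image the stacked result has shape (4, 6, 9), that is, one 9-dimensional texture vector for every position where a 3 × 3 mask fits.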
5.3. Edge Features
An object edge is a visual feature that marks a discontinuity in a local image region, where the intensity changes significantly. Generally, the gray levels of pixels change smoothly along an edge, whereas in the direction perpendicular to the edge the intensity changes sharply.
The features described above are local visual features that describe the surfaces of objects. Edge features, on the other hand, measure local compatibility. In this paper, 4 Prewitt edge detectors oriented at 0°, 45°, 90°, and 135° are adopted to extract the edge features. The detectors in the different directions and the corresponding results are shown in Figure 3. By applying the 4 detectors, almost all the edges in the images can be captured.
Figure 3: The results achieved by different edge detectors in 4 directions.
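6. Parameter Learning and Inference Problem
6.1. Bundle Method for Parameter Learning
For parameter learning, this paper utilizes the bundle method. Recall the margin rescaling formulation:
(18) $\min_{W, \xi_t} \frac{\|W\|^{2}}{2} + \frac{\eta}{T} \sum_{t=1}^{T} \xi_t, \quad \text{s.t. } \langle W, \Phi(I, I', Y) \rangle - \langle W, \Phi(I, I', Y_t) \rangle \ge \Delta(Y_t, Y) - \xi_t \;\; \forall t, Y, \quad \xi_t \ge 0.$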
In order to obtain the optimal parameter, the constraints can be rearranged in the following form:
(19) $\langle W, \Phi(I, I', Y_t) \rangle \le \langle W, \Phi(I, I', Y) \rangle - \Delta(Y_t, Y) + \xi_t.$
This means that the right side is lower bounded by $\langle W, \Phi(I, I', Y_t) \rangle$. The most violated constraint is therefore the label that minimizes the right side:
(20) $Y^{*} = \arg\min_{Y} \left( \langle W, \Phi(I, I', Y) \rangle - \Delta(Y_t, Y) \right).$
This is an inference problem. The bundle method reaches the optimal solution in a small number of iterations, so the problem can be solved efficiently. Algorithms 1 and 2 give the parameter learning algorithms for the margin and slack methods, respectively.
Algorithm 1: The parameter learning for margin method.
Input: data Xt, label Yt, size T, tolerance ε
Initialize parameter W ← 0, constraint set R ← ∅
Repeat
  for t = 1 to T
    Y* = arg min_Y [s(y) - Δ(yt, y)]
  end for
  increase constraint set R ← R ∪ {Y*}
  (W, ξ) ← solve the QP using all the existing Y*
Until ∑_t [Δ(yt, y) + s(yt) - s(y)] ≤ ξ + ε
Algorithm 2: The parameter learning for slack method.
Input: data Xt, label Yt, size T, tolerance ε
Initialize parameter W ← 0, constraint set R ← ∅
Repeat
  for t = 1 to T
    Y* = arg min_Y [s(y) + ξ/Δ(yt, y)]
  end for
  increase constraint set R ← R ∪ {Y*}
  (W, ξ) ← solve the QP using all the existing Y*
Until ∑_t [ξ/Δ(yt, y) + s(yt) - s(y)] ≤ ξ + ε
Both the margin and slack methods lead to an optimal inference problem, which can be solved with a standard graph-cuts algorithm (see [8] for details). The two frameworks look the same, but the inference step in Algorithm 2 differs from that in Algorithm 1: it must first be approximated in a linear form, and the best λ in the interval is then found by the golden search algorithm.
6.2. Golden Searching
In this paper, we adopt the golden searching algorithm to search for the best approximation of the optimal label.
First, suppose that $f$ is a continuous function over the interval $[a, b]$ with exactly one minimum or maximum in the interval. Taking the minimum case as an example, the binary searching algorithm described below is not optimal for locating the minimum.
Take the middle point as
(21) $m = \frac{a + b}{2},$
then two different points $x_1$ and $x_2$ are determined by
(22) $x_1 = m - \frac{\delta}{2}, \quad x_2 = m + \frac{\delta}{2},$
such that $f(x_1) \ne f(x_2)$. If $f(x_1) < f(x_2)$, the interval is updated to $[a, x_2]$; otherwise $[x_1, b]$ becomes the new interval. Each iteration therefore requires two new function evaluations, which is not optimal.
In order to optimize the iteration process, a factor $c$ is introduced by which the interval is reduced in every step. For $x_1$ and $x_2$ in the interval $[a, b]$, there are two different cases.
(1) If $f(x_1) < f(x_2)$, then the interval becomes $[a, x_2]$, and the interval size is compressed by $c$ as follows:
(23) $\frac{x_2 - a}{b - a} = c,$
as a result,
(24) $x_2 = (1 - c)\,a + c\,b.$
(2) If $f(x_1) > f(x_2)$, similarly the interval is compressed by $c$ and the new interval is $[x_1, b]$, then
(25) $\frac{b - x_1}{b - a} = c,$
and $x_1$ is obtained by
(26) $x_1 = (1 - c)\,b + c\,a.$
Once the factor $c$ is determined, it is easy to locate the points $x_1$ and $x_2$ in the interval. There is one rule for each of Cases (1) and (2), and Algorithm 3 gives the resulting golden searching algorithm.
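The value of $c$ is not given explicitly above. The following short derivation shows the standard choice, under the assumption that one of the two interior points is to be reused in the next iteration so that only one new function evaluation is needed per step.

```latex
\begin{align*}
\text{Case (1): new interval } [a, x_2], \quad
x_2^{\text{new}} &= a + c\,(x_2 - a) = a + c^{2}(b - a), \\
\text{reuse } x_2^{\text{new}} = x_1 = a + (1 - c)(b - a)
\;&\Longrightarrow\; c^{2} = 1 - c
\;\Longrightarrow\; c = \frac{\sqrt{5} - 1}{2} \approx 0.618.
\end{align*}
```

Case (2) gives the same condition by symmetry, so each iteration shrinks the interval by the factor $c \approx 0.618$ at the cost of a single evaluation of $f$.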
Algorithm 3: The algorithm for golden searching.
Rule 1.
If f(x1) < f(x2), the interval becomes [a, x2]: set x2 = x1, then compute another new x1.
Rule 2.
If f(x1) > f(x2), the interval becomes [x1, b]: set x1 = x2, then compute another new x2.
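The two rules can be put together in a short routine. The following is a minimal sketch for the minimum case, assuming a unimodal function on $[a, b]$ and the golden ratio value $c = (\sqrt{5} - 1)/2$; the function name and the stopping tolerance `tol` are illustrative choices. To maximize a concave function such as $F(\lambda)$, the same routine can be applied to its negative.

```python
import math

def golden_section_min(f, a, b, tol=1e-6):
    """Locate the minimiser of a unimodal function f on [a, b]."""
    c = (math.sqrt(5.0) - 1.0) / 2.0        # interval reduction factor
    x1 = (1.0 - c) * b + c * a              # eq. (26)
    x2 = (1.0 - c) * a + c * b              # eq. (24)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:
            # Rule 1: keep [a, x2]; the old x1 becomes the new x2
            b, x2, f2 = x2, x1, f1
            x1 = (1.0 - c) * b + c * a
            f1 = f(x1)
        else:
            # Rule 2: keep [x1, b]; the old x2 becomes the new x1
            a, x1, f1 = x1, x2, f2
            x2 = (1.0 - c) * a + c * b
            f2 = f(x2)
    return (a + b) / 2.0

x_min = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
x_max = golden_section_min(lambda lam: -(4.0 * math.sqrt(lam) - lam), 0.0, 10.0)
```

Only one new function value is computed per iteration, in contrast to the two evaluations needed by the binary scheme of (21) and (22).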
7. Experiments and Results
We test the proposed methods on the Middlebury stereo datasets. The dataset contains many different scenes, including art, books, dolls, laundry, moebius, and reindeer. Each scene consists of 2 ground-truth images, corresponding to view 1 and view 5, and several images captured from different views. The ground-truth images are used as the label images of each scene, with the labels compressed from 0–255 to 0–22 for computational efficiency, and two neighboring view images are used to extract the features.
Two groups of features are used in our experiments. The first group consists of local visual features, such as colors and textures: the 3 RGB color channels, the 3 YCbCr color channels, the 9-dimensional texture features (the outputs of the 3 × 3 Laws' masks), and the 4-dimensional edge features (the outputs of the different Prewitt edge detectors). The second group consists of the graph edge features, which are the absolute difference between the labels of neighboring pixels and a one-dimensional bias constant. In practice, this way of constructing features can produce a large number of dimensions, which supplies a rich set from which suitable features can be chosen to learn the parameters of the desired model. Using these features with the max-margin method, reasonable depth maps can be obtained for different scenes, as shown in Figure 4.
Figure 4: Inferred depth maps obtained by the max-margin method for different scenes. Rows 1 to 3: images, ground truth, and the obtained depth map. Columns 1 to 4 show four scenes: art, book, laundry, and reindeer.
7.1. Comparison on Inference Accuracy with Different Feature Combination
Suppose that the ground truth is denoted as $y_t$ and the output as $y_o$. Defining $C_y$ as the number of pixels in $y_o$ that match $y_t$ and $C_n$ as the number of pixels in $y_o$ that differ from $y_t$, the inference accuracy can be written as
(27) $\mathrm{acc} = \frac{C_y}{C_y + C_n},$
which is the ratio of correctly labeled pixels.
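The accuracy measure of (27) is straightforward to compute from two label images. The sketch below assumes both label maps are integer arrays of the same shape; the function name is hypothetical.

```python
import numpy as np

def inference_accuracy(y_out, y_true):
    """Eq. (27): fraction of pixels whose inferred label matches the ground truth."""
    y_out, y_true = np.asarray(y_out), np.asarray(y_true)
    c_y = np.count_nonzero(y_out == y_true)   # matched pixels
    c_n = y_true.size - c_y                   # mismatched pixels
    return c_y / (c_y + c_n)

acc = inference_accuracy([[0, 1], [2, 3]], [[0, 1], [2, 5]])
```

Here three of the four pixels agree, so the accuracy is 0.75.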
In order to study the effects of different features, we tested different combinations of image features. For convenience, 1 denotes that a feature is selected and 0 that it is not. Figure 5 shows the inference accuracy of different feature combinations for the second scene, book. Note that the features are ordered from left to right as RGB, YCbCr, 3 × 3 Laws' masks, and edge features. For example, 1000 denotes that only the RGB feature was chosen.
Figure 5: The inference accuracy of different feature combinations.
Combining features does not always boost the accuracy of the results. In short, some features have a negative effect on the results while others have a positive effect. To test this, feature sets that include a certain feature were compared with sets that do not. The results show that an offset effect does exist between features, such as between color and edge features, and that some features do boost the result, such as textures in most situations (see Figures 6(a), 6(b), 6(c), and 6(d)).
Figure 6: (a) The effect of the edge features. 1 to 7 mean 1000, 0100, 0010, 1100, 1010, 0110, 1110. The edge features can boost the accuracy. (b) The effect of the RGB features. 1 to 7 mean 0100, 0010, 0001, 0110, 0101, 0011, 0111. The RGB features can reduce the accuracy, while color features combined with texture features can boost the inference accuracy. (c) The effect of the YCbCr features. 1 to 7 mean 1000, 0010, 0001, 1010, 1001, 0011, 1011. The YCbCr features have an effect on the accuracy similar to that of the RGB features. (d) The effect of the 3 × 3 Laws' masks. 1 to 7 mean 1000, 0100, 0001, 1100, 1001, 0101, 1101. The texture features boost the accuracy in most situations.
7.2. Comparison between Margin and Slack Methods
To overcome the above-mentioned shortcomings of the max-margin method, this paper adopts the slack scaling method to improve the results. To solve the nondecomposability problem, we introduce the approximation algorithm described in Section 4, which makes the slack method feasible. Both methods are tested on the Middlebury database (see Figure 7). As shown in Figure 8, the comparison of inference accuracy for the scene art indicates that the slack method performs better than the margin method.
Figure 7: Depth inference results for different images by the max-margin method and the proposed slack method. Columns 1 to 4: images, ground truth, the result of the margin method, and the result of the proposed slack method. Rows 1 to 3 show three scenes from the Middlebury datasets: art, book, and laundry.
Figure 8: Comparison of inference accuracy, showing that the proposed slack method performs better than the margin method.
7.3. Comparison on the Convergent Properties
Going a step further, the convergence properties of the margin and slack methods are compared. In the training procedure, the convergence of both methods relies on the bundle method and the one-slack trick. Taking the margin method as an example, rearranging the terms for the bundle method gives the constraints
(28) $\xi \ge \Delta(y, y_t) + s(y_t) - s(y).$
This means that the right side is upper bounded by $\xi$. Given the current parameter, the objective function can be optimized using the bundle method, where the most violated constraint is
(29) $y^{*} = \arg\min_{y} \left( s(y) - \Delta(y_t, y) \right).$
While the bundle method is able to reach the optimal solution, the one-slack trick makes the procedure converge in a small number of iterations. The computing processes of the margin and slack methods are examined to observe the convergence speed of the iteration. The error in the objective function between two consecutive iterations is denoted as itaeps. Figure 9 shows the convergence behavior, indicating that both methods converge within several iterations, while the slack method produces better accuracy without much loss in convergence.
Figure 9: Comparison of the convergence properties of the margin and slack methods. Both methods converge in a small number of iterations. With increasing accuracy, the slack method has a pronounced advantage in convergence compared to the margin method.
8. Conclusion
This paper presented two methods for the depth restoration of different scenes using the structural support vector machine. The two methods, margin rescaling and slack rescaling, each have their own advantages and disadvantages. While the margin rescaling formulation can be decomposed into local parts easily, this is hard to do for the slack rescaling method. In contrast, the slack method clearly outperforms the margin rescaling method in accuracy. Besides the gain in accuracy, the slack rescaling method does not have to give up much convergence speed when computing the parameters. The proposed approximation of the slack rescaling approach solves the decomposability problem and makes the method computable in an efficient way. A limitation is that the approximation requires the formulation to be concave, which may be an overly strong constraint. Our future work will focus on these optimization algorithms, including improving the computing speed and enhancing the accuracy of the results.
References
[1] Kong D., Tao H. A method for learning matching errors in stereo computation. Proceedings of the British Machine Vision Conference (BMVC '04), 2004.
[2] Chen S. Y., Wang Z. J. Acceleration strategies in generalized belief propagation.
[3] Wang W. L., Xia B. B., Guan Q., Shengyong C. Neural network based 3D model reconstruction with highly distorted stereoscopic sensors. Proceedings of the 8th International Workshop on Artificial Neural Networks (IWANN '05), June 2005, Lecture Notes on Computer Science, vol. 3512, pp. 661–668.
[4] Fridman T., Razumovskaya J., Verberkmoes N., Hurst G., Protopopescu V., Xu Y. The probability distribution for a random match between an experimental-theoretical spectral pair in tandem mass spectrometry.
[5] Zhang L., Seitz S. M. Parameter estimation for MRF stereo. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, pp. 288–295.
[6] Li Y., Huttenlocher D. P. Learning for stereo vision using the structured support vector machine. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008. doi: 10.1109/CVPR.2008.4587699.
[7] Batra D., Sukthankar R., Chen T. Learning class-specific affinities for image labelling. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008, pp. 1–8. doi: 10.1109/CVPR.2008.4587432.
[8] Boykov Y., Veksler O., Zabih R. Fast approximate energy minimization via graph cuts.
[9] Cristianini N., Taylor J. S.
[10] McAuley J. J., Caetano T. S., Smola A. J., Franz M. O. Learning high-order MRF priors of color images. Proceedings of the 23rd International Conference on Machine Learning (ICML '06), June 2006, pp. 617–624.
[11] Carr P., Hartley R. Minimizing energy functions on 4-connected lattices using elimination. Proceedings of the 12th International Conference on Computer Vision (ICCV '09), October 2009, pp. 2042–2049. doi: 10.1109/ICCV.2009.5459450.
[12] Szummer M., Kohli P., Hoiem D. Learning CRFs using graph cuts. Proceedings of the European Conference on Computer Vision (ECCV '08), 2008, pp. 582–595. doi: 10.1007/978-3-540-88688-4_43.
[13] Anguelov D., Taskar B., Chatalbashev V., Koller D., Gupta D., Heitz G., Ng A. Discriminative learning of Markov Random fields for segmentation of 3D scan data. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, pp. 169–176.
[14] Taskar B., Chatalbashev V., Koller D. Learning associative Markov networks. Proceedings of the 21th International Conference on Machine Learning (ICML '04), July 2004, pp. 807–814.
[15] Ionescu C., Bo L., Sminchisescu C. Structural SVM for visual localization and continuous state estimation. Proceedings of the 12th International Conference on Computer Vision (ICCV '09), September 2009, pp. 1157–1164. doi: 10.1109/ICCV.2009.5459346.
[16] Blaschko M., Lampert C. Learning to localize objects with structured output regression. Proceedings of the European Conference on Computer Vision (ECCV '08), 2008, pp. 2–15.
[17] Kim M., Pavlovic V. Dimensionality reduction using covariance operator inverse regression. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008, pp. 1–8. doi: 10.1109/CVPR.2008.4587404.
[18] Bo L., Sminchisescu C. Structured output-associative regression. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), 2009, pp. 1–8.
[19] Tsochantaridis I., Joachims T., Hofmann T., Altun Y. Large margin methods for structured and interdependent output variables.
[20] Taskar B., Guestrin C., Koller D. Max-margin Markov networks.
[21] Sarawagi S., Gupta R. Accurate max-margin training for structured output spaces. Proceedings of the 25th International Conference on Machine Learning, July 2008, pp. 888–895.
[22] Taskar B., Guestrin C., Koller D. Max-margin Markov networks.
[23] Laws K.