Approximated Slack Scaling for Structural Support Vector Machines in Scene Depth Analysis

Based on the framework of structural support vector machines, this paper proposes two approaches to depth restoration for different scenes, namely, margin rescaling and slack rescaling. The results show that both approaches converge quickly, while the slack approach yields better prediction accuracy. However, because its loss-augmented objective is not decomposable, the application of the slack approach is limited. This paper therefore introduces a novel approximate slack method to solve this problem, in which we propose a modified way of defining the loss functions to ensure the decomposability of the objective function. During training, a bundle method is used to improve computational efficiency. The results on the Middlebury datasets show that the proposed depth inference method resolves the nondecomposability of the slack scaling method and achieves acceptable accuracy. Our approximation can serve as an alternative to the slack scaling method when efficient computation is required.


Introduction
Learning for stereo vision has been a challenging subject for a long time. Owing to the growing number of ground-truth datasets, considerable progress has been achieved, for example, using the scene structure of input images to learn a probability distribution model for matching [1-4] and adopting an expectation maximization algorithm to estimate disparity and then relearn the model parameters based on the estimation [5]. Although these methods have shown exciting results, they share an obvious shortcoming: the parameters must be preset or initialized manually on the basis of prior knowledge. In [6], a supervised machine learning method based on conditional random fields (CRFs) was proposed to handle this problem, and the results were promising.
As mentioned above, supervised image labeling is a long-standing problem in computer vision. In recent years, CRFs have become a popular approach to this problem [7, 8], where the spatial correlations among neighboring pixels are incorporated by defining proper unary and pairwise potential functions on the related pixels. Support vector machines have also been widely used in image labeling [9], but they are less successful because they ignore spatial correlations and therefore produce noisy labelings.
Recently, structured prediction has attracted widespread attention, and many new approaches have been proposed. Structured learning approaches address the above-mentioned problems. In structured learning, both inputs and outputs are structured, and strong internal correlations are exploited. It is formulated as the learning of complex functional dependencies between multivariate input and output representations. Structured learning has had a significant impact on important computer vision tasks including image denoising [10], stereo [11], segmentation [12, 13], object localization [14, 15], and human pose estimation [16, 17]. A common way is to generalize max-margin binary/multiclass classification to incorporate structured information [14, 18-20]. It has been utilized in many areas, such as sequence labeling, image segmentation, grammar parsing, dependency parsing, bipartite matching, and text segmentation [21]. Furthermore, introducing structured information into SVMs has produced two variants of structural SVMs (SSVMs), based on margin rescaling (max-margin) and slack rescaling, respectively.
Because its error function is decomposable, the max-margin method can find the most violated constraint using the same maximum a posteriori (MAP) inference algorithm that is used for prediction [21]. But the shortcomings of the max-margin method are also obvious: it requires the error function to be linearly comparable with the features, and it is sensitive to the most violating label. A label with a large error can greatly decrease the separability of all other labels. An alternative is the slack scaling method. It has a fixed margin of 1 and scales the violations in proportion to their errors, which provides excellent accuracy. However, due to the nondecomposability of its error function, the slack method is not widely used. Therefore, we propose an approximation method which modifies the slack method while preserving its properties. Depending on the given task, the proposed approximation makes it possible to design suitable loss functions and generate the corresponding solver.
This paper is organized as follows. In Section 2, we briefly discuss the principles of the SSVM. Our approach is proposed in Section 3, including the steps to construct the structural support vector machine, the typical max-margin method, and the slack method. Section 4 elaborates an approximation of the slack method. Section 5 describes the feature vectors used in our algorithm. Section 6 discusses the conditions and strategies that make training more efficient. Finally, we apply both methods to depth restoration and make a detailed comparison between them.

Structural Support Vector Machine
Derived from statistical machine learning, discriminative models focus on the posterior probability $P(y \mid x, w)$ and have been viewed as the most successful techniques for structured prediction. Here $x$ is the input sample in the input space $\mathcal{X}$ and $y$ is the associated output in the output space $\mathcal{Y}$. Given a training set with samples $x_i$ and their associated ground-truth outputs $y_i$, a model for $P(y \mid x, w)$ is first learnt such that the correct labels $y_i$ have a higher probability than any wrong labels $y$, that is, $P(y_i \mid x_i, w) \ge P(y \mid x_i, w)$. Secondly, prediction for a new sample $x$ is performed by MAP estimation:

$$y^* = \arg\max_{y \in \mathcal{Y}} P(y \mid x, w).$$

Under the framework of CRFs, $P(y \mid x, w)$ is modeled by a log-linear model, which is often assumed as follows:

$$P(y \mid x, w) = \frac{1}{Z_w(x)} \exp\left(\langle w, \Phi(x, y)\rangle\right),$$

where $\Phi(x, y)$ is a feature map describing the relationship between the input and its output, and $Z_w(x)$ is the normalization factor that makes $P(y \mid x, w)$ a valid probability distribution. By adopting the max-margin framework, the structural support vector machine tries to learn the weight vector $w$, which parameterizes the model, to predict the correct output labels. The optimization problem that results from the learning can be written as

$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i \quad \text{s.t.}\quad \langle w, \Phi(x_i, y_i)\rangle - \langle w, \Phi(x_i, y)\rangle \ge \Delta(y, y_i) - \xi_i,\quad \xi_i \ge 0.$$

Here, $i$ from 1 to $N$ indexes the samples, $y$ is any label that is not equal to the true label $y_i$, $\Delta(y, y_i)$ denotes the loss between the two labels, and $\xi_i$ are the slack variables. Thus, the most violated constraints can be found by solving

$$y^* = \arg\max_{y} \left(\Delta(y, y_i) + F(x_i, y)\right),$$

where $F(x, y) = \langle w, \Phi(x, y)\rangle$ is the discriminative function. Therefore, $y^*$ is reformulated as the minimization problem of an energy $E$, that is, $\arg\max_y F(x, y) = \arg\min_y E(x, y)$.
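As a minimal sketch of the log-linear model and MAP prediction described above, the following Python snippet enumerates a tiny output space. The feature vectors, the weight vector, and all function names are hypothetical and chosen only for illustration; real structured output spaces are far too large to enumerate.

```python
import math


def score(w, phi):
    """Discriminative function F(x, y) = <w, Phi(x, y)> for one candidate."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def posterior(w, features):
    """Log-linear posterior P(y | x, w) over an enumerated output space.

    `features` maps each candidate label y to its feature vector Phi(x, y).
    """
    scores = {y: score(w, phi) for y, phi in features.items()}
    shift = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - shift) for s in scores.values())  # normalizer Z_w(x)
    return {y: math.exp(s - shift) / z for y, s in scores.items()}


def map_predict(w, features):
    """MAP estimate: the label with the highest score (and highest posterior)."""
    return max(features, key=lambda y: score(w, features[y]))
```

Because the normalizer is shared by all candidates, the label that maximizes the posterior is the one that maximizes the score, so prediction never needs the normalizer.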

Our Approach
The energy of a disparity labeling consists of a data term and a smooth term. Here $d_{x,y}$ denotes the local disparity and $E_{smooth}(y)$ is the smooth term, which usually takes the form of the Potts model

$$E_{smooth}(y) = \sum_{(p,q)} c \cdot [y_p \ne y_q],$$

where $p$ and $q$ are the indices of neighboring pixels, $y_p$ and $y_q$ are the neighboring disparity labels, $[\cdot]$ equals 1 when its argument is true and 0 otherwise, and $c$ is a constant penalty. Normally the features extracted from the input images represent certain categories of visual information, for example, color, texture, or gradient. However, each category suits different situations. Texture features work well in boundary regions, which are usually richly textured, but are not applicable in weakly textured regions. Gradient-based features behave in the opposite way. In addition, different categories of features are not easy to combine for learning. Simply expanding the dimension of the feature vector to involve more features from different categories is dangerous because of sampling effects and differences in scale. Highly weighted features will dominate the final results and suppress the other features. Therefore, the data term should be constructed as an inner product between a unary weight parameter $w_u$ and the combined feature vector, so that $w_u$ balances the components of the combined feature vector against sampling effects and different scales. These parameters can be learnt from training examples.
Parameters working on these terms can balance the difference between the features observed at a pixel in one view and at its counterpart, shifted by the local disparity $d_{x,y}$, in the other view, which is caused by sampling effects and camera settings. Overall, the data term is built as the weighted combination of features described above.
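The Potts smooth term mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 4-connected pixel grid is assumed (the neighborhood system is not specified here), and the function name `potts_energy` is hypothetical.

```python
def potts_energy(labels, c=1.0):
    """Potts smoothness energy on a 4-connected grid.

    `labels` is a list of rows of disparity labels. Every pair of
    horizontally or vertically adjacent pixels with different labels
    contributes the constant penalty `c`; equal labels contribute nothing.
    """
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for r in range(rows):
        for k in range(cols):
            # right neighbor
            if k + 1 < cols and labels[r][k] != labels[r][k + 1]:
                energy += c
            # bottom neighbor
            if r + 1 < rows and labels[r][k] != labels[r + 1][k]:
                energy += c
    return energy
```

For example, a 2 × 2 labeling in which a single pixel differs from the other three has two disagreeing neighbor pairs and therefore an energy of 2c.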

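Max-Margin Formulation for Stereo Learning.
Assuming a learnt pairwise weight $w_p = c$, the parameter $w$ can be denoted as $w = (w_u, w_p)^T$, and the energy of a labeling $y$ is written as the inner product $E(y) = \langle w, \Phi(y)\rangle$, where the dependence on the input images is omitted for brevity. Here $\Phi(y)$ is the vector including the data term and also the smooth term. The energy of the ground truth $y_i$ should be minimal, that is, for all possible $y$ we have $E(y) \ge E(y_i)$. By adopting margin scaling and adding the slack variables $\xi_i$ to account for violations, the optimization problem reads, for $C > 0$,

$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i \quad \text{s.t.}\quad E(y) - E(y_i) \ge \Delta(y_i, y) - \xi_i,\quad \xi_i \ge 0,\quad \forall i,\ \forall y \ne y_i.$$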

Slack Scaling Formulation.
The margin rescaling method requires the label loss $\Delta(y_i, y)$ to be linearly comparable with the feature values $\Phi$. However, this is normally hard to satisfy in structured learning, since $\Delta(y_i, y)$ accumulates the loss over every pixel in the image, and thus the aggregate value is much larger than the feature values. Especially in stereo matching tasks, the pixel-wise loss may reach up to hundreds, which makes the overall loss even larger. Thus, we would like to adopt slack scaling, as it is invariant to the scale of the label loss. Nevertheless, the slack rescaling formulation is difficult to solve, because no efficient exact algorithm for finding $y^*$ exists. We follow the method introduced in [21] to solve this problem.
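The slack rescaling optimization formulation is as follows, where $E(y)$ denotes the energy of labeling $y$ and $y_i$ is the ground truth of the $i$th sample:

$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i \quad \text{s.t.}\quad E(y) - E(y_i) \ge 1 - \frac{\xi_i}{\Delta(y_i, y)},\quad \xi_i \ge 0,\quad \forall i,\ \forall y \ne y_i.$$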
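To illustrate how the two constraint types treat the same wrong label, the sketch below computes the smallest slack that satisfies each constraint for one candidate. In margin rescaling the constraint is E(y) - E(y_i) >= Delta - xi, and in slack rescaling it is E(y) - E(y_i) >= 1 - xi/Delta. The function names and the numbers in the usage note are hypothetical.

```python
def margin_rescaling_slack(e_true, e_other, loss):
    """Smallest xi >= 0 with  e_other - e_true >= loss - xi."""
    return max(0.0, loss - (e_other - e_true))


def slack_rescaling_slack(e_true, e_other, loss):
    """Smallest xi >= 0 with  e_other - e_true >= 1 - xi / loss  (loss > 0)."""
    return max(0.0, loss * (1.0 - (e_other - e_true)))
```

With a ground-truth energy of 0, a wrong label of energy 2 and loss 5 is already separated by more than the unit margin, so slack rescaling asks for no slack, whereas margin rescaling still demands a slack of 3 because the required margin grows with the loss. This is the sensitivity to large label losses discussed above.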

The Approximation for Slack Scaling
For the slack scaling optimization formulation, the inference problem is to find

$$y^{S} = \arg\min_{y \in S} F_s(y),$$

where $S = \{y : E(y) - E(y_i) < 1 - \xi/\Delta(y)\}$ is the set of violating labels, $\xi$ is the slack variable, $\Delta(y)$ is shorthand for $\Delta(y_i, y)$, and

$$F_s(y) = E(y) + \frac{\xi}{\Delta(y)}. \tag{10}$$

As the formulation shows, because $\Delta(y)$ must be evaluated on the entire labeling, the second part of the formula cannot be decomposed easily. Thus, an approximation $F'_s$ is used in place of $F_s$, which makes it possible to decompose the objective into local parts.
It should be noted that $-\xi/\Delta(y)$ is concave in $\Delta(y)$, and it has been shown that such a term can be written in terms of functions that are linear in $\Delta(y)$ [22]. The linearization and approximation procedure is described in the following parts.

Linearization and Approximation.
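According to [22], a concave function can be expressed through linear functions. Therefore, (10) is expressed as

$$E(y) + \frac{\xi}{\Delta(y)} = \max_{\lambda \ge 0} \left[E(y) - \lambda\Delta(y) + 2\sqrt{\xi\lambda}\right], \tag{11}$$

where the maximum is attained at $\lambda = \xi/\Delta(y)^2$. The aim of the inference problem is to find the optimal label $y$ which minimizes the left side of (11). Therefore, we have

$$\min_{y} \left[E(y) + \frac{\xi}{\Delta(y)}\right] = \min_{y} \max_{\lambda \ge 0} \left[E(y) - \lambda\Delta(y) + 2\sqrt{\xi\lambda}\right].$$

Here, let $F'_s(y; \lambda) = E(y) - \lambda\Delta(y) + 2\sqrt{\xi\lambda}$; thus

$$\min_{y} \max_{\lambda \ge 0} F'_s(y; \lambda) \ge \max_{\lambda \ge 0} \min_{y} F'_s(y; \lambda),$$

which leads to the simplified formulation

$$\max_{\lambda \ge 0} F(\lambda), \qquad F(\lambda) = \min_{y} F'_s(y; \lambda).$$

For a fixed $\lambda$, the optimal label $y_\lambda$ is first computed through the minimization

$$y_\lambda = \arg\min_{y} \left[E(y) - \lambda\Delta(y)\right],$$

which decomposes into local parts whenever $E$ and $\Delta$ do. Then $y_\lambda$ is substituted into $F(\lambda)$. We can find a $\lambda$ at which $F(\lambda)$ attains its maximum, because each $F'_s(y; \lambda)$ is concave in $\lambda$ (a linear term plus the concave term $2\sqrt{\xi\lambda}$) and $F(\lambda)$ is the pointwise minimum of these functions; therefore, $F(\lambda)$ is concave as well.
With the help of a line search algorithm such as golden search, the maximum of $F(\lambda)$ can be found efficiently. During the search, many different values of $\lambda$ are encountered. Evaluating $y_\lambda$ for each $\lambda$ yields different labels. The goal is to find, among them, the label that attains the minimum of $E(y) + \xi/\Delta(y)$, which is denoted as $y^{S}$.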

The Determination of Interval for 𝜆.
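Since the simple constraint $\lambda \ge 0$ is given, $\lambda = 0$ is the obvious lower bound $\lambda_l$. However, if $\lambda = 0$, it is hard to distinguish $F'_s(y; \lambda)$ between different labels in the early iterations, because the loss $\Delta(y)$ is ignored. We therefore let $\lambda_l = \epsilon/\Delta_{\max}$, where $\Delta_{\max}$ is the maximal possible label loss and $\epsilon$ is the tolerance on the difference between two consecutive iterations of the algorithm. In this way, a proper lower bound $\lambda_l$ is obtained.
Then we determine the upper bound of $\lambda$. It is sufficient to find an upper bound $\lambda_u$ such that the inference returns a label with $\Delta(y) = \Delta(y_m)$ for any $\lambda \ge \lambda_u$, where $y_m$ is a label attaining the maximal loss $\Delta_{\max}$. Such a bound satisfies, for every $y$ with $\Delta(y) < \Delta(y_m)$,

$$F'_s(y_m; \lambda_u) \le F'_s(y; \lambda_u),$$

which leads to the following formula:

$$E(y_m) - \lambda_u\Delta(y_m) \le E(y) - \lambda_u\Delta(y).$$

Here, let $y_1 = \arg\min_y E(y)$ and let $\delta$ be the minimal difference between $\Delta(y)$ and $\Delta(y_m)$, such as $\delta = 1$ for the Hamming loss. Then the right side of the inequality is bounded below by $E(y_1) - \lambda_u(\Delta(y_m) - \delta)$. That requires $\lambda_u \ge E(y_m) - E(y_1)$ (for $\delta = 1$).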
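The linearization and the interval for λ can be put together in a small sketch. It rests on several assumptions that are not part of the method itself: the candidate labels are enumerated in dictionaries (in practice the inner minimization would be a decomposable inference such as graph cuts), every candidate has positive loss, and a coarse geometric grid over the interval stands in for the golden search. All names are hypothetical.

```python
import math


def lambda_interval(energies, losses, eps=1e-3, delta=1.0):
    """Lower and upper bounds for lambda as described in the text."""
    d_max = max(losses.values())
    lam_lo = eps / d_max                      # lambda_l = eps / Delta_max
    y_m = max(losses, key=losses.get)         # a label with maximal loss
    y_1 = min(energies, key=energies.get)     # label with minimal energy
    lam_hi = max((energies[y_m] - energies[y_1]) / delta, lam_lo)
    return lam_lo, lam_hi


def approx_slack_inference(energies, losses, xi, steps=25, eps=1e-3, delta=1.0):
    """Approximate argmin of E(y) + xi / Delta(y) via the linearized form.

    For each lambda on a geometric grid, solve the decomposable problem
    argmin_y E(y) - lambda * Delta(y), then keep the visited label with
    the smallest true objective.
    """
    lam_lo, lam_hi = lambda_interval(energies, losses, eps, delta)
    best, best_val = None, math.inf
    for k in range(steps):
        lam = lam_lo * (lam_hi / lam_lo) ** (k / (steps - 1))
        y_lam = min(energies, key=lambda y: energies[y] - lam * losses[y])
        val = energies[y_lam] + xi / losses[y_lam]
        if val < best_val:
            best, best_val = y_lam, val
    return best, best_val
```

Only labels that minimize E(y) - λΔ(y) for some visited λ are ever considered, so the returned label is an approximation: its objective is never below the exact minimum, and it coincides with the exact minimizer when that label is reached for some λ in the interval.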

Construction of Feature Vector
Image features are the terms used to describe images, as well as the cues for distinguishing images from one another. Some image features are basic visual features, while others are defined for specific applications. Three types of features are used in this paper, namely, color, texture, and edge features.

Color Features.
Color features are the basic visual description of images. Generally, color features are based on the characteristics of pixels, and each pixel in the image or the image region makes its own contribution to the color features. However, as a global feature, color is not sensitive to changes in the size or orientation of the image or image region. In other words, color features cannot capture the local characteristics of the image. Moreover, color is not unique to an object: pixels in different objects may share the same color features. Two basic color descriptions are the RGB color space and the YCbCr color space. While RGB stores the levels of the red, green, and blue channels of each pixel, YCbCr separates intensity from color. In the YCbCr color space, channel Y represents the intensity, while channels Cb and Cr represent the blue-difference and red-difference chroma components, respectively. The YCbCr color space can be obtained from the RGB color space by a linear transformation. Both the RGB and YCbCr color features are shown in Figure 1. In this paper, we use both RGB and YCbCr as the color features in the training process.
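The linear transformation from RGB to YCbCr can be sketched as follows. The exact variant used in the experiments is not stated, so this example assumes the full-range ITU-R BT.601 coefficients used by JPEG/JFIF for 8-bit values; the function name is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (full-range BT.601, as in JFIF).

    Y is the intensity; Cb and Cr are the blue-difference and
    red-difference chroma components, offset by 128.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

Gray pixels (r = g = b) map to Cb = Cr = 128, which shows how the transform isolates intensity in the Y channel.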

Texture Features.
Similar to color features, texture features are also global features. The major difference is that texture features describe the statistical characteristics of the pixels in an image region. Texture features are rotationally invariant and robust to noise, but they are sensitive to image resolution: if the resolution changes, different features may be generated. In addition, lighting and reflections on object surfaces can make texture features hard to compute.
In [23], Laws developed a method for computing texture features. According to this method, different convolution kernels, named Laws' masks, are applied to the images, and the results capture certain characteristics of the images. Here, the 2D Laws' masks can be generated from the following small kernels of length 3 and 5:

$$L3 = [1, 2, 1], \quad E3 = [-1, 0, 1], \quad S3 = [-1, 2, -1],$$

$$L5 = [1, 4, 6, 4, 1], \quad E5 = [-1, -2, 0, 2, 1], \quad S5 = [-1, 0, 2, 0, -1], \quad W5 = [-1, 2, 0, -2, 1], \quad R5 = [1, -4, 6, -4, 1].$$

Here, $L$ denotes the average gray level, $E$ denotes the edge features, $S$ extracts the spots in the image, $W$ extracts the wave features, and $R$ extracts the ripples in the image.
In order to generate the 2D Laws' masks, we multiply a vertical 1D kernel by a horizontal 1D kernel, such as $L5E5 = L5^{T} \times E5$. Taking the 3 × 3 masks as an example, all the possible masks are listed in Table 1. After convolving an image of size M × N with one of these masks, a gray-scale texture feature image of size (M - mask size + 1) × (N - mask size + 1) is generated. Figure 2 demonstrates the texture feature results generated by the 3 × 3 Laws' masks.
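The construction of the nine 3 × 3 masks and the size of the filtered output can be sketched as below. This is an illustrative sketch, not the implementation used in the experiments: the filter is applied as a cross-correlation without flipping the mask (for these symmetric or antisymmetric kernels this differs from true convolution by at most a sign), and the function names are hypothetical.

```python
KERNELS_3 = {"L3": [1, 2, 1], "E3": [-1, 0, 1], "S3": [-1, 2, -1]}


def laws_masks_3x3():
    """All 3 x 3 Laws' masks, keyed as <vertical kernel><horizontal kernel>."""
    return {
        v + h: [[a * b for b in KERNELS_3[h]] for a in KERNELS_3[v]]
        for v in KERNELS_3
        for h in KERNELS_3
    }


def filter_valid(image, mask):
    """Slide `mask` over `image` (list of rows) without padding.

    An image with R rows and C columns and an n x n mask yield an output
    with (R - n + 1) rows and (C - n + 1) columns.
    """
    n = len(mask)
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(mask[i][j] * image[r + i][c + j] for i in range(n) for j in range(n))
            for c in range(cols - n + 1)
        ]
        for r in range(rows - n + 1)
    ]
```

Applying all nine masks to a gray-scale image gives the nine texture feature maps; only the L3L3 mask responds to a constant image, since every other mask sums to zero.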

Edge Features.
Table 1: Laws' masks scaled 3 × 3.

| Masks | Method Description |
| --- | --- |
| L3L3 | The gray level intensity within 3 neighboring pixels in both vertical and horizontal directions |
| L3E3 | Edge detection in the horizontal direction and gray level intensity in the vertical direction |
| L3S3 | Spot detection in the horizontal direction and gray level intensity in the vertical direction |
| E3L3 | Gray level intensity in the horizontal direction and edge detection in the vertical direction |
| E3E3 | Edge detection in both vertical and horizontal directions |
| E3S3 | Spot detection in the horizontal direction and edge detection in the vertical direction |
| S3L3 | Gray level intensity in the horizontal direction and spot detection in the vertical direction |
| S3E3 | Edge detection in the horizontal direction and spot detection in the vertical direction |
| S3S3 | Spot detection in both vertical and horizontal directions |

The object edge is the visual feature of a discontinuity in a local image region where the intensity changes significantly. Generally, the pixels along an edge change smoothly in gray level; however, in the direction perpendicular to the edge, the intensity of the pixels changes sharply.
The features described previously are local visual features; by their description, they are the surface features of the objects. On the other hand, the edge features are the measurement of the local compatibility. In this paper, 4 different Prewitt edge detectors oriented at 0°, 45°, 90°, and 135° were adopted in order to extract the edge features. The detectors in different directions and the corresponding results are shown in Figure 3. By applying the 4 detectors, almost all the edges in the images can be captured.
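The four directional Prewitt detectors can be sketched as 3 × 3 kernels applied to a local patch. The kernel entries below are the standard Prewitt compass kernels; which kernel is labeled 0°, 45°, 90°, or 135° is a convention assumed here for illustration, and the names `PREWITT` and `edge_responses` are hypothetical.

```python
# Assumed angle labels: the 0-degree kernel responds to intensity changes
# along the horizontal axis, the 90-degree kernel along the vertical axis.
PREWITT = {
    0: [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    45: [[0, 1, 1], [-1, 0, 1], [-1, -1, 0]],
    90: [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    135: [[-1, -1, 0], [-1, 0, 1], [0, 1, 1]],
}


def edge_responses(patch):
    """Four-dimensional edge feature of one 3 x 3 gray-level patch.

    Returns the responses of the 0, 45, 90, and 135 degree kernels,
    in that order.
    """
    return [
        sum(PREWITT[angle][i][j] * patch[i][j] for i in range(3) for j in range(3))
        for angle in (0, 45, 90, 135)
    ]
```

A patch containing a vertical step edge produces its strongest response in the 0° kernel and no response in the 90° kernel, while a constant patch produces zero in all four, since every kernel sums to zero.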

Parameter Learning and Inference Problem
In order to obtain the optimal parameter, the constraints can be rearranged in the following form:

$$E(y) - \Delta(y_i, y) + \xi_i \ge \langle w, \Phi(y_i)\rangle,$$

where $E(y) = \langle w, \Phi(y)\rangle$ is the energy of labeling $y$. This formula means that the left side is lower bounded by $\langle w, \Phi(y_i)\rangle$. It then yields the objective function for finding the most violated constraint,

$$y^* = \arg\min_{y} \left(E(y) - \Delta(y_i, y)\right).$$

Thus, this forms an inference problem. The bundle method can reach the optimal solution in a small number of iterations, so the problem can be solved efficiently. Algorithms 1 and 2 provide the parameter learning algorithms for the margin and slack methods, respectively.
Both the margin and slack methods involve inference problems, whose solutions can be obtained via a standard graph-cuts algorithm (see [8] for details). The two frameworks look the same, but the inference engine in Algorithm 2 differs from that in Algorithm 1. In the slack case, the objective needs to be approximated by a linear form, and the best $\lambda$ in the interval is then searched for by the golden search algorithm.
Figure 2: The outputs after the convolution of all the Laws' masks scaled 3 × 3.

Golden Searching.
In this paper, we adopted the golden searching algorithm to search for the best approximation of the optimal label. Suppose that there exists a continuous function $f$ over the interval $[a, b]$ that has only one minimum or maximum in the interval. Taking the minimum case as an example, a binary searching algorithm is not optimal for locating the minimum, as shown in the following. Take the middle point as

$$m = \frac{a + b}{2};$$

then two different points $x_1 < x_2$ are placed on either side of $m$ such that $f(x_1) \ne f(x_2)$. If $f(x_1) < f(x_2)$, the interval is updated to $[a, x_2]$; otherwise $[x_1, b]$ becomes the new interval. Obviously, each iteration step requires two new function evaluations, which is not optimal.
In order to optimize the iteration process, there should be a factor that reduces the interval, named $c$. For $x_1$ and $x_2$ in the interval $[a, b]$, there are two different cases.
(1) If $f(x_1) < f(x_2)$, then the interval becomes $[a, x_2]$, and the interval size is compressed by $c$ as follows: $x_2 - a = c(b - a)$; as a result, $x_2 = a + c(b - a)$.
(2) If $f(x_1) > f(x_2)$, similarly the interval is compressed by $c$ and the new interval is $[x_1, b]$; $x_1$ is obtained by $x_1 = b - c(b - a)$.
Obviously, once the factor $c$ is determined, it is easy to locate the points $x_1$ and $x_2$ in the interval. There are two rules for Cases (1) and (2), respectively, while Algorithm 3 shows the algorithm for golden searching.
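The two update rules above can be sketched as a golden section search for the minimum of a unimodal function. This is a generic textbook version and not necessarily identical to Algorithm 3; it uses the golden ratio factor c = (√5 - 1)/2 ≈ 0.618, for which c² = 1 - c, so that one interior point and its function value can be reused in every iteration. The function name and tolerance are assumptions.

```python
import math


def golden_section_min(f, a, b, tol=1e-6):
    """Locate the minimizer of a unimodal function f on [a, b]."""
    c = (math.sqrt(5.0) - 1.0) / 2.0   # interval compression factor
    x1 = b - c * (b - a)
    x2 = a + c * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:
            # Case (1): keep [a, x2]; the old x1 becomes the new x2.
            b, x2, f2 = x2, x1, f1
            x1 = b - c * (b - a)
            f1 = f(x1)
        else:
            # Case (2): keep [x1, b]; the old x2 becomes the new x1.
            a, x1, f1 = x1, x2, f2
            x2 = a + c * (b - a)
            f2 = f(x2)
    return (a + b) / 2.0
```

Each iteration costs a single new function evaluation, compared with two for the binary scheme. To maximize a concave function, such as the function of λ in the slack approximation, the same routine can be applied to its negation.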

Experiments and Results
We test the proposed methods on the Middlebury stereo datasets. The dataset contains many different scenes, such as art, books, dolls, laundry, moebius, and reindeer. Each scene consists of 2 ground-truth images, corresponding to view 1 and view 5, and several images captured from different views. The ground-truth images are used as the label images of each scene, with the labels compressed from 0-255 to 0-22 for computational efficiency, and two neighboring view images are used to extract the features. Two groups of features are used in our experiments. The first group is the local visual features: the 3 dimensions of the RGB color channels, the 3 dimensions of the YCbCr color channels, the 9 dimensions of texture features (the outputs of the 3 × 3 Laws' masks), and the 4 dimensions of edge features (the outputs of the Prewitt edge detectors in different directions). The second group is the graph edge features, which are the absolute difference between the labels of neighboring pixels and a one-dimensional bias constant. In practice, this way of constructing features yields a large number of dimensions, which supplies a rich set from which suitable features can be chosen to learn the parameters of the desired model. Using these features with the max-margin method, reasonable depth maps can be obtained for different scenes, as shown in Figure 4.

Comparison on Inference Accuracy with Different Feature Combination.
Suppose that the ground truth is denoted as $y_g$ and the output result as $y_o$. Defining $N_m$ as the number of pixels in $y_o$ that match $y_g$ and $N_d$ as the number of pixels in $y_o$ that differ from $y_g$, the inference accuracy can be denoted as

$$\text{accuracy} = \frac{N_m}{N_m + N_d},$$

which stands for the ratio of correct output.
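The accuracy measure above amounts to the fraction of pixels whose predicted label equals the ground truth. A minimal sketch, assuming both label maps are lists of rows with identical shape (the function name is hypothetical):

```python
def inference_accuracy(predicted, truth):
    """Ratio of matched pixels: N_m / (N_m + N_d)."""
    matched = sum(
        p == t
        for pred_row, true_row in zip(predicted, truth)
        for p, t in zip(pred_row, true_row)
    )
    total = sum(len(row) for row in truth)  # N_m + N_d
    return matched / total
```

For instance, a 2 × 2 prediction that disagrees with the ground truth at one pixel has an accuracy of 0.75.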
In order to study the effects of different features, we tested different combinations of image features. For convenience, 1 denotes that a feature is selected, and 0 otherwise. Figure 5 shows the inference accuracy of different feature combinations for the 2nd scene, book. Note that the features are ordered from left to right as RGB, YCbCr, Laws' masks scaled 3 × 3, and the edge features. For example, 1000 denotes that only the RGB feature was chosen.
The combination of features does not always boost the accuracy of the results. In short, some features have a negative effect on the results while others have a positive effect. In order to test this, a comparison between the set with a certain feature and another without it has been carried out. The results show that an offset effect does exist between

Comparison between Margin and Slack Methods.
To overcome the above-mentioned shortcomings of the max-margin method, this paper adopts the slack scaling method to improve the results. In order to solve the nondecomposability problem, we introduce the approximation algorithm described in Section 4 to make the slack method feasible. Both methods are tested on the Middlebury database; see Figure 7. As shown in Figure 8, the comparison of inference accuracy for the scene art shows that the slack method performs better than the margin method.

Comparison on the Convergent Properties.
To take a step further, the convergence properties of the margin and slack methods are compared. In the training procedure, the convergence of both methods relies on the bundle method and the one-slack trick. Taking the margin method as an example, the bundle method is applied by rearranging the terms, so that the constraints become

$$\xi \ge \Delta(y, y_i) + E(y_i) - E(y). \tag{28}$$

This means that the right-hand side of the constraints is upper bounded by $\xi$. Given the current parameter, the objective function can be optimized using the bundle method, where the most violated constraint is

$$y^* = \arg\min_{y} \left(E(y) - \Delta(y_i, y)\right). \tag{29}$$

While the bundle method is able to reach the optimal solution, the one-slack trick makes the procedure converge in a small number of iterations. The computing process of the margin and slack methods is examined to observe the convergence speed of the iteration. The error in the objective function between two consecutive iterations is denoted as itaeps. Figure 9 shows the convergence behavior, indicating that both methods converge within several iterations, while the slack method produces better accuracy without much loss in convergence.

Conclusion
This paper presented two methods for the depth restoration of different scenes using structural support vector machines. The proposed methods, margin rescaling and slack rescaling, each have their own advantages and disadvantages. While the margin rescaling formulation can be decomposed into local parts easily, this is hard for the slack rescaling method. In contrast, the slack rescaling method clearly outperforms the margin rescaling method in accuracy. Besides this gain in accuracy, the slack rescaling method does not sacrifice much convergence speed when computing the parameters. The proposed approximation of the slack rescaling approach solves the decomposability problem and makes the method efficiently computable. A limitation is that the approximation requires the formulation to be concave, which may be an overly strong constraint. Our future work will focus on these optimization algorithms, including improving the computing speed and enhancing the accuracy of the results.

Figure 9: The comparison of the convergence properties of the margin and slack methods. Both methods converge in a small number of iterations. With increasing accuracy, the slack method has a pronounced advantage in convergence compared to the margin method.

Figure 1: The color features of the image: RGB color features (first row) and YCbCr color features (second row). From left to right, first row: the original image in RGB color space, R channel, G channel, and B channel; second row: the original image in YCbCr color space, Y channel, Cb channel, and Cr channel.

Figure 3: The results achieved by different edge detectors in 4 directions.

Figure 4: Inferred depth maps obtained by the max-margin method for different scenes. Rows 1 to 3: images, ground truth, and the obtained depth map. Columns 1 to 4 are four scenes: art, book, laundry, and reindeer.

Figure 6: (a) The effect of the edge features. 1 to 7 means 1000, 0100, 0010, 1100, 1010, 0110, 1110. It shows that the edge features can boost the accuracy. (b) The effect of the RGB features. 1 to 7 means 0100, 0010, 0001, 0110, 0101, 0011, 0111. It shows that the RGB features can reduce the accuracy, whereas combining the color feature with the texture features can boost the inference accuracy. (c) The effect of the YCbCr features. 1 to 7 means 1000, 0010, 0001, 1010, 1001, 0011, 1011. The YCbCr features have an effect on the accuracy similar to that of the RGB features. (d) The effect of the Laws' masks scaled 3 × 3. 1 to 7 means 1000, 0100, 0001, 1100, 1001, 0101, 1101. The texture features boost the accuracy in most situations.

Figure 7: Depth inference results for different images by the max-margin method and the proposed slack method. Columns 1 to 4: images, ground truth, the result of the margin method, and the result of the proposed slack method. Rows 1 to 3 are three scenes from the Middlebury datasets: art, book, and laundry.