A Mean-Shift-Based Feature Descriptor for Wide Baseline Stereo Matching

We propose a novel Mean-Shift-based approach for building feature descriptors in wide baseline stereo matching. First, the scale-invariant feature transform (SIFT) is used to extract relatively stable feature points. Each point to be matched then needs a reasonable neighborhood range from which its set of SIFT feature points is chosen. To select repeatable and highly robust feature points, Mean-Shift controls the corresponding feature scale. Finally, our approach is applied to depth image acquisition in wide baseline, and the Graph Cut algorithm optimizes the disparity information. Compared with existing methods such as SIFT, speeded up robust feature (SURF), and normalized cross-correlation (NCC), the presented approach offers higher robustness and accuracy. Experimental results on low resolution images with weak feature description in wide baseline confirm the validity of our approach.


Introduction
Stereo matching [1,2] analyzes and represents the different relative depths of the objects in a real scene. Its key technology is the disparity map acquired from two images of the same scene taken at different angles. There are two kinds of approaches so far: local matching [3,4] and global optimization matching [5]. The former sets a fixed-size window at each pixel and then searches for the corresponding pixel in the other image using the pixels' gray values. The main advantage of this approach is its simplicity, but it has several drawbacks. It gives rise to large errors when the gray value at the window center is very close to that of its surrounding pixels. An oversized window increases the matching search time and, at low resolution, the error. Conversely, the depth image is distorted when the window is so small that it contains no matching feature point. To avoid the problem of window size, many researchers focus on global optimization approaches, which choose the assignment with the least error over the whole image. Nevertheless, the above algorithms are limited to narrow baseline. The essential reason is that, in wide baseline, a point in the reference image and its corresponding point in the searched image do not lie on the same horizontal line. This enlarges the search range and easily leads to several-to-one correspondences, which destroy the uniqueness of point matching. Moreover, given the complexity of images of real scenes, a purely pixel-based algorithm produces large errors once the search area expands beyond a certain degree (the search is 1D in narrow baseline but 2D in wide baseline [6][7][8]). In summary, the two approaches above can generate depth images in narrow baseline but not in wide baseline.
Generally speaking, there are two ways of acquiring depth images: active and passive. In recent years, active depth image sensors have been promoted to a certain degree. In [9], Shieh and Hsieh used structured light projected onto the 3D scene and then a complex triangulation algorithm based on geometric relations to compute the depth image. However, there is no denying that engineering applications of this approach are hindered by high cost and stringent calibration conditions. In contrast, passive depth image acquisition is the basis of 3D stereo vision, and its development is supported by relatively mature theories. Typically, the depth image is computed from photographs taken by a binocular camera in narrow baseline. The point matching problem is then simple because corresponding points in the captured photographs lie on the same horizontal line. Thus, the depth of points in weak texture images can be computed by using different disparity scales [10,11] and the continuity hypothesis. In wide baseline, however, the disparity map cannot be obtained from horizontal disparity alone because the matching point is not on the same horizontal line. Hence, matching between the feature points of different images usually relies on feature detection operators with high repeatability and robustness, such as SIFT [12,13], SURF [14], and Harris [15]. However, those operators usually collect only feature points with a high rate of gray-level change, whereas a depth map requires a depth estimate for every image point. As a result, the above algorithms cannot compute depth maps in wide baseline, because traditional feature extraction and matching algorithms cannot deal with the weak texture points in an image.
In fact, many researchers describe a feature by the local information surrounding the feature point instead of focusing on the feature point itself. The binary affine invariant descriptor (BAND), which is defined on a local affine invariant area, is a good example [16] and yields good matching results. In this paper, a feature descriptor based on Mean-Shift [17,18] is presented to overcome the abovementioned problem; it is especially adapted to real situations.
The main contributions of this paper are summarized as follows. (1) A novel feature description approach based on Mean-Shift is proposed. It provides a feature point description operator for weak texture regions that works not only for weak texture feature points but also for strong texture feature points. (2) The Bhattacharyya coefficient is used in feature point matching, and a fast matching search scheme is proposed. The Bhattacharyya coefficient plays an important role in estimating the similarity between two multidimensional vectors. Moreover, to decrease the matching search time in wide baseline, we give a search constraint: the search area of each point is restricted to the corresponding epipolar line according to the fundamental matrix computed by the eight-point approach. (3) For the uncalibrated wide baseline situation, we embed our descriptor in a stereo matching algorithm. Experiments show that our algorithm, with high matching accuracy and low complexity, outperforms many state-of-the-art stereo matching algorithms and is especially suited to engineering applications. To facilitate understanding, we collect the important abbreviations used in this paper in the Nomenclature.
The rest of this paper is organized as follows. In Section 2, we present a novel stereo matching algorithm suited to wide baseline and give its detailed derivation. In Section 3, we describe how to compute the depth image based on Graph Cut and how our descriptor is embedded. In Section 4, experimental results are reported to test the effectiveness and efficiency of the proposed algorithm. Section 5 concludes the paper and gives directions for future work.

Mean-Shift-Based Feature Point Description Operator
2.1. Texture-Based Feature Point Description. To obtain accurate matching point pairs, it is preferable to collect the texture pattern near a point and then describe the corresponding features. However, images captured in wide baseline situations are strongly affected by changes in lighting, viewing angle, scale, and so on, so the texture is not robust across image pairs. As shown in Figure 1, the texture (circled area) around the same-name image point differs under different angles. Obviously, such texture cannot be used directly to describe feature points. In contrast, classic feature point detection algorithms can obtain stable matching feature points in wide baseline situations, so the feature points they produce can be used for feature description. We therefore use the classical SIFT approach to obtain robust feature points. On the other hand, acquiring image depth information under wide baseline requires every image point to be matched. SIFT yields a large number of feature points, and involving all of them in the description of each point would bring in a great deal of information and decrease computational efficiency [19,20]. A compromise is to select only part of the feature points to build each feature description vector. To make sure that every image point corresponds to a feature points set, we use the feature points within a circular area around the point as the members of its set and compute the scale of this circular area with the Mean-Shift approach. Finally, the statistical information of the feature points set is used to describe the point's feature. The overall building process of the feature descriptor is shown in Figure 2.
First, image pairs are captured at different wide baselines and viewing angles. Stable SIFT feature points are extracted from each of the two images by the SIFT method. To improve efficiency, we delete redundant SIFT feature points and improve the coverage of the feature descriptor. Then, a circular feature template containing many SIFT feature points is randomly generated. The Bhattacharyya coefficient between two feature points is computed to check their similarity, and the size of the circular area is changed when the similarity exceeds a threshold. The scale is computed by Mean-Shift. At the end of feature extraction, the feature descriptor is obtained by arranging the feature points set within the feature pattern in a regular order. Finally, we apply the above methods to the two captured images and use the Graph Cut algorithm for global optimization to obtain the disparity map.

The Building Approach for Mean-Shift-Based Feature Point Descriptor. Careful analysis of Figure 1 indicates that a corresponding feature points set should be found around each point and then used to build its feature descriptor. Thus, we use the Mean-Shift algorithm to compute the corresponding feature scale and determine the feature points set.

Feature Scale Based on Mean-Shift.
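The Mean-Shift algorithm can be used to detect and track moving objects. It locates a moving object at the maximum of the sample density distribution [21]. If a set $\{x_i\}_{i=1,\ldots,n}$ belongs to the space $R^d$, the corresponding density estimation function $f(x)$ is computed with a density kernel $K(x)$ and window radius $h$:

$$f(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right). \tag{1}$$

In practice, the density kernel $K(x)$ can take many forms. The most useful expression is

$$K(x) = c_k\, k\left(\|x\|^{2}\right), \tag{2}$$

where $k$ is the kernel profile and $c_k$ is a normalization constant. According to the above two equations, the density estimation function becomes

$$f(x) = \frac{c_k}{n h^{d}} \sum_{i=1}^{n} k\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right). \tag{3}$$

If $g(x)$ is equal to $-k'(x)$, a kernel $G(x)$ can be defined as

$$G(x) = c_g\, g\left(\|x\|^{2}\right), \tag{4}$$

where $c_g$ is a normalization constant. Now we can deduce the gradient of the density function,

$$\nabla f(x) = \frac{2 c_k}{n h^{d+2}} \sum_{i=1}^{n} (x_i - x)\, g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right), \tag{5}$$

which can be written in another form as follows:

$$\nabla f(x) = \frac{2 c_k}{n h^{d+2}} \left[\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)\right] \left[\frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)} - x\right]. \tag{6}$$

This expression indicates that the gradient of the density estimation function can be expressed through the kernel $G$. Furthermore, the coordinate of the maximum point of the density estimation function is obtained through iterative computation.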
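The iterative computation mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes a Gaussian profile for $g$ and points given as coordinate tuples, and the function name `mean_shift_mode` and its parameters are hypothetical.

```python
import math

def mean_shift_mode(points, start, h, tol=1e-6, max_iter=200):
    """Move `start` uphill on the kernel density estimate of `points`.

    Assumption: Gaussian profile g(s) = exp(-s / 2); h is the window radius.
    """
    y = tuple(start)
    for _ in range(max_iter):
        # Weight of each sample: g(||(y - x_i) / h||^2)
        weights = [
            math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(y, x)) / h ** 2)
            for x in points
        ]
        total = sum(weights)
        # The weighted mean of the samples is the mean-shift update
        new_y = tuple(
            sum(w * x[d] for w, x in zip(weights, points)) / total
            for d in range(len(y))
        )
        shift = math.dist(new_y, y)
        y = new_y
        if shift < tol:
            break
    return y

# Four samples cluster near the origin; one outlier lies far away.
samples = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2), (5.0, 5.0)]
mode = mean_shift_mode(samples, start=(1.0, 1.0), h=1.0)
```

Each update moves the estimate to the kernel-weighted mean of the samples, which is the direction of the density gradient, so the estimate settles on the dense cluster and the distant outlier has almost no influence.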
In real applications, the problem at hand must be translated into the mathematical model described above. The Bhattacharyya coefficient [22] is usually used to estimate the corresponding position coordinates. Suppose an object feature is $z$ and its density function is $q_z$. Meanwhile, the center position of the candidate object is $y$ and the corresponding density estimation function is $p_z(y)$. To find the center position of the candidate object that is closest to the original object, we use the Bhattacharyya coefficient as follows to estimate the distance between $q_z$ and $p_z(y)$:

$$d(y) = \sqrt{1 - \rho[p(y), q]}, \tag{7}$$

where

$$\rho[p(y), q] = \int \sqrt{p_z(y)\, q_z}\, dz. \tag{8}$$

If $p(y)$ and $q$ are both $m$-dimensional vectors, $\rho[p(y), q]$ can be regarded as the cosine of the angle between the vectors $(\sqrt{p_1(y)}, \ldots, \sqrt{p_m(y)})$ and $(\sqrt{q_1}, \ldots, \sqrt{q_m})$. So (7) can denote the vector similarity.
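For discrete $m$-dimensional vectors, the coefficient and the resulting distance can be computed as in the sketch below. It assumes that both inputs are normalized histograms (nonnegative entries summing to 1); the function names are illustrative only.

```python
import math

def bhattacharyya_coefficient(p, q):
    """rho = sum_u sqrt(p_u * q_u) for two normalized histograms."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya_distance(p, q):
    """d = sqrt(1 - rho); max() guards against tiny negative round-off."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))

model = [0.5, 0.3, 0.2]
same = bhattacharyya_distance(model, model)             # identical histograms
other = bhattacharyya_distance(model, [0.1, 0.2, 0.7])  # dissimilar histograms
```

Identical histograms give a coefficient of 1 and therefore a distance of approximately 0, while dissimilar histograms give a larger distance.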
Here, we need to estimate a reasonable area for every point in the image so that it contains enough SIFT feature points to describe the point's features. According to the Mean-Shift approach, an object feature must be given initially.
A key point deserves attention: the problem considered here is not a tracking problem, so there is no corresponding pattern to be found in the image. On the other hand, we believe that a unique feature descriptor can be obtained if the SIFT features around the point are uniformly distributed (the detailed construction of the descriptor is given in the next section). Figure 3 gives several feature area pattern types.
We use the Mean-Shift approach to estimate the area and the covered angle. Unlike in tracking algorithms, the coordinates of the covered area are expressed in polar form $x = (r, \theta)$, where $r$ is the radius and $\theta$ is the angle variable. Let $\{x_i\}_{i=1,\ldots,n}$ denote the polar coordinates of the feature points in the pattern, and define the index function $b: R^2 \to \{1, \ldots, m\}$, where $m$ is the number of areas in the circular area and $b(x_i)$ is the sequence number of the area containing $x_i$. Then, the density estimation function [23,24] can be written as

$$q_u = C \sum_{i=1}^{n} k\left(\|x_i\|^{2}\right) \delta\left[b(x_i) - u\right], \tag{9}$$

where $\delta(\cdot)$ is the impulse function and $q$ should satisfy

$$\sum_{u=1}^{m} q_u = 1. \tag{10}$$

The normalization constant $C$ can therefore be written as

$$C = \frac{1}{\sum_{i=1}^{n} k\left(\|x_i\|^{2}\right)}. \tag{11}$$

As shown in Figure 4, to estimate the area radius we assume $y = (r, 0)$, where $r$ is the estimated radius. If $\{x_i\}_{i=1,\ldots,n_h}$ are the polar coordinates of the real feature points and $h$ is the width of every small area, then the density estimation function of the candidate radius is

$$p_u(y) = C_h \sum_{i=1}^{n_h} k\left(\left\|\frac{y - x_i}{h}\right\|^{2}\right) \delta\left[b(x_i) - u\right], \tag{12}$$

where $\sum_{u=1}^{m} p_u = 1$. The normalization constant $C_h$ can be written as

$$C_h = \frac{1}{\sum_{i=1}^{n_h} k\left(\left\|\frac{y - x_i}{h}\right\|^{2}\right)}. \tag{13}$$

The following iterative procedure can be used to compute the Bhattacharyya coefficient.
Otherwise, set $\hat{y}_0 \leftarrow \hat{y}_1$ and return to Step 1.
According to the above algorithm, we obtain the estimate of the radius $r$. The final radius is the sum of two parts: the estimate $r$ and a term determined by the layer index $l$ of the area and the width $h$ of every layer. Figure 5 shows the results of the Mean-Shift algorithm; the radius differs from pattern to pattern.
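To make the role of the area index function and the histogram comparison concrete, the sketch below bins polar feature points into concentric rings and picks, from a few candidate radii, the one whose ring histogram is closest to a target pattern. It is a simplification under stated assumptions: a flat kernel is used, only the radial index is binned, and a brute-force search over candidate radii stands in for the Mean-Shift iteration. The names `ring_histogram` and `best_radius` are hypothetical.

```python
import math

def ring_histogram(polar_points, radius, rings):
    """Normalized count of points per concentric ring inside `radius`.

    polar_points: iterable of (r, theta); rings: number of equal-width rings.
    Assumption: flat kernel, so every point inside the radius has weight 1.
    """
    width = radius / rings
    counts = [0] * rings
    for r, _theta in polar_points:
        if r < radius:
            # min() protects against round-off at the outer boundary
            counts[min(int(r / width), rings - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def best_radius(polar_points, target, candidates):
    """Candidate radius whose ring histogram best matches `target`."""
    def rho(radius):
        hist = ring_histogram(polar_points, radius, len(target))
        return sum(math.sqrt(p * q) for p, q in zip(hist, target))
    return max(candidates, key=rho)

points = [(1.0, 0.1), (2.0, 1.0), (3.0, 2.0), (5.0, 3.0), (7.0, 4.0), (9.0, 5.0)]
uniform = [1 / 3] * 3  # target pattern: points spread evenly over three rings
radius = best_radius(points, uniform, candidates=[3.0, 6.0, 9.5])
```

The selection criterion is the Bhattacharyya coefficient between the candidate histogram and the target pattern, so the chosen radius is the one that makes the surrounding feature points look most like the desired distribution.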

Acquirement of Feature Point Descriptor. In real images, there are points whose neighborhoods offer no usable texture features. We use the Mean-Shift approach to obtain the set of SIFT feature points around such points. Since the SIFT approach collects repeatable and robust feature points [25], this set of points can describe the feature of the point. Here, we build the feature descriptor of the point from the spatial and pixel information of the feature points. For feature matching, the feature point descriptor must satisfy two conditions.
(2) The feature descriptor of each point should be strongly discriminative [27].
Given a feature distribution pattern, each area can be cut into several fan areas. Then, according to the positions of the texture features in every fan area, the descriptor is defined as follows.
Step 1. According to the feature point distribution pattern, we use the Mean-Shift algorithm to search the corresponding area and then count the SIFT texture feature points in each fan area.
Step 2. According to the distribution of feature points in each fan area, we define a line from the detected point to the center of the feature points (the average position of the feature points in the fan area), as shown in Figure 6.
Step 3. According to the gray value at the end of the line and the gradient information, the feature description vector is formed from the differences between the real feature values and the linearly interpolated values, with one group of differences per fan area. The length of the feature description vector is therefore the number of differences per fan area multiplied by the number of fan areas.
Step 4. Taking the fan area whose first difference component is the largest as the initial area, the components of the feature vector are ranked counterclockwise.
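Steps 1 and 2 can be sketched as follows: feature points are assigned to fan areas by their angle around the detected point, and the center of each fan area is the average position of its points. This is an illustrative sketch only; it assumes equal-angle fan areas measured counterclockwise from the positive x-axis, and the name `fan_centers` is hypothetical.

```python
import math

def fan_centers(center, feature_points, n_fans):
    """Average position of the feature points falling in each fan area.

    Returns a list of length n_fans; an entry is None for an empty fan.
    """
    buckets = [[] for _ in range(n_fans)]
    cx, cy = center
    fan_angle = 2 * math.pi / n_fans
    for x, y in feature_points:
        angle = math.atan2(y - cy, x - cx) % (2 * math.pi)
        # min() protects against round-off when angle is almost 2*pi
        index = min(int(angle / fan_angle), n_fans - 1)
        buckets[index].append((x, y))
    centers = []
    for pts in buckets:
        if pts:
            centers.append((sum(p[0] for p in pts) / len(pts),
                            sum(p[1] for p in pts) / len(pts)))
        else:
            centers.append(None)
    return centers

centers = fan_centers((0, 0), [(1, 1), (3, 1), (-2, 2), (-1, -1)], n_fans=4)
```

The line of Step 2 then runs from the detected point to each non-empty entry of `centers`; fan areas without feature points would need separate handling, which the text does not specify.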
After the above steps, each point obtains a feature vector of the length given in Step 3. Significantly, we do not choose a single feature point as the fundamental element of the description. Within the self-adaptive Mean-Shift feature pattern area, our descriptor is represented by the structure of the texture distribution around the feature point. This structure does not change with scale; that is, the operator is scale invariant. As for image rotation, the descriptor could not be applied to image matching if the feature vectors were unordered, because the feature vector of the same point would vary with rotation. We therefore fix the initial position and the rank order in Step 4, so the vectors produced by this algorithm are rotation invariant.
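The ordering rule of Step 4 can be illustrated with a short sketch. It assumes that the per-fan vectors are already listed in counterclockwise order and that the fan area with the largest first component is unique; `canonical_order` is a hypothetical name.

```python
def canonical_order(fan_vectors):
    """Rotate the per-fan vectors so that the fan with the largest first
    component comes first, then flatten them into one descriptor."""
    start = max(range(len(fan_vectors)), key=lambda j: fan_vectors[j][0])
    rotated = fan_vectors[start:] + fan_vectors[:start]
    return [value for vec in rotated for value in vec]

# The same three fan areas, listed from two different starting fans,
# as would happen when the image is rotated.
descriptor = canonical_order([[0.2, 0.1], [0.9, 0.4], [0.5, 0.3]])
rotated_input = canonical_order([[0.5, 0.3], [0.2, 0.1], [0.9, 0.4]])
```

Both calls produce the same descriptor because only the cyclic starting position differs between the inputs, which is the sense in which fixing the initial fan area removes the dependence on rotation.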

Wide Baseline Depth Image Based on Graph Cut Algorithm
In traditional narrow baseline disparity maps, it is always assumed that the captured images differ only by a horizontal deviation. Figure 7(a) shows that the disparity value is the horizontal deviation of the same-name point in the image pair. The dot denotes the position of the point in the reference image, and the square shows its real position in the disparity map. Under this assumption, the disparity value of a point is easily obtained by searching for the same-name point along a 1D horizontal line. This approach is easy to implement and computationally efficient. However, real captured images hardly ever satisfy the horizontal deviation assumption. The stereo matching algorithm therefore needs to rectify the captured images, and the rectification inevitably introduces some error, which largely restricts its application. On the other hand, the algorithm complexity necessarily increases when the disparity map is computed directly in a wide baseline situation. The key problem is that the search for the same-name point is no longer limited to a horizontal line; the whole image space must be searched.
Taking the above two problems into account, we give a disparity map building algorithm that overcomes them. Figure 7(b) sketches the disparity in wide baseline. The dot and the square are defined as in Figure 7(a), and their distance is the disparity value. Finding the same-name point then requires tackling two critical problems:
(1) How to search for the same-name point in the disparity map.
(2) How to adjust a noninteger disparity value to integer form.
As to the first problem, we use the eight-point approach and the weak feature description (WFD) descriptor to search for the position of the same-name point in the disparity map. The dotted line in Figure 7(b) is the epipolar line constraint, obtained with the eight-point approach, along which the same-name point is sought. For accuracy, we use the WFD descriptor to match it.
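The epipolar constraint can be sketched as follows, assuming a fundamental matrix $F$ is already available (for example, from the eight-point approach, which is not shown here). The helper names and the distance tolerance `max_dist` are hypothetical.

```python
def epipolar_line(F, point):
    """Line l' = F x in the second image for pixel `point` in the first.

    F is a 3x3 fundamental matrix given as nested lists; the returned line
    (a, b, c) satisfies a*u + b*v + c = 0 for points (u, v) on it.
    """
    x = (point[0], point[1], 1.0)
    return tuple(sum(F[r][c] * x[c] for c in range(3)) for r in range(3))

def distance_to_line(line, point):
    """Perpendicular distance from `point` to a non-degenerate line."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c) / (a * a + b * b) ** 0.5

def candidates_near_line(F, point, candidates, max_dist=1.5):
    """Keep only candidate matches close to the epipolar line of `point`."""
    line = epipolar_line(F, point)
    return [q for q in candidates if distance_to_line(line, q) <= max_dist]

# Pure horizontal camera translation: epipolar lines are image rows.
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
kept = candidates_near_line(F, (10, 20), [(50, 20), (50, 21), (50, 30)])
```

Restricting the candidates to a narrow band around the line turns the 2D search into an essentially 1D one, after which the descriptor comparison decides among the remaining candidates.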
As to the second problem, we should first recall how traditional disparity map processing works: the disparity value is represented by a discrete integer. With the above definitions, however, we obtain noninteger disparity values. Our approach therefore needs to convert them to integer form for the subsequent optimization process. The distance from the reference point along the epipolar line of each point is segmented by the given disparity step, and the resulting distance values are output in integer form.
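One possible reading of this quantization is sketched below: the Euclidean offset between a reference point and its match is divided by a fixed step and rounded to an integer label. The step size `step`, the label count `n_labels`, and the clamping of large offsets are assumptions made for illustration, not values from the paper.

```python
import math

def disparity_label(ref_point, match_point, step, n_labels):
    """Map the offset between matched points to an integer disparity label."""
    distance = math.dist(ref_point, match_point)
    label = int(round(distance / step))
    # Offsets beyond the label range are clamped to the last label
    return min(label, n_labels - 1)

label = disparity_label((100, 80), (112.4, 89.3), step=2.0, n_labels=16)
```

Here the offset is 15.5 pixels, which corresponds to label 8 with a step of 2 pixels.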
Chen and Jhang [28] used an improved Graph Cut algorithm to transform a stereoscopic image sequence into a depth map sequence. Bleyer and Gelautz [29] used their Graph Cut-based algorithm to overcome three classical problems in stereo methods. We briefly review the Graph Cut algorithm to make the disparity map generation process easier to follow. In stereo matching, the disparity is usually treated as a label of a pixel. Suppose there are two images to be matched. Let $P$ denote the set of all pixels in the first image. The goal of matching is to assign a label $f(p) \in \{l_1, l_2, \ldots, l_k\}$ to each pixel $p$ in $P$, where each label represents a possible disparity of pixel $p$ and $L$ denotes the set of all possible labels. Under the narrow baseline assumption above, the disparity changes in only one component, the horizontal one, so the elements of $L$ are one-dimensional. In the wide baseline case, however, $L$ is a set of two-dimensional vectors. The matching problem thus becomes finding an appropriate label $f(p)$ such that pixel $p$ in the first image matches pixel $p + f(p)$ in the second image.
To obtain proper labels for the pixels, an energy function of the labeling is constructed from two matching rules. First, the data of matched pixels should not differ too much. Second, adjacent pixels should have similar disparity values. The former is called the data constraint and the latter the smoothness constraint. The energy function is

$$E(f) = \sum_{p \in P} D_p\left(f(p)\right) + \sum_{(p, q) \in N} V_{p,q}\left(f(p), f(q)\right), \tag{16}$$

where $N$ denotes the set of adjacent pixel pairs. To minimize $E(f)$, an undirected network $G$ is constructed such that labelings and cuts of $G$ are in one-to-one correspondence and the energy $E(f)$ equals the capacity of the cut corresponding to the labeling $f$. Thus, minimizing the energy function $E(f)$ reduces to finding a minimum cut of the undirected network $G$, and the minimum cut problem in turn reduces to computing the maximum flow. We use the Potts model [30], which is described as follows:

$$E(f) = \sum_{x} D_x(f_x) + \sum_{(x, y) \in N} \lambda\, T(f_x \neq f_y), \tag{17}$$

where $f_x$ and $f_y$ are the indices of the adjacent image points $x$ and $y$, respectively, $T(\cdot)$ equals 1 if its argument is true and 0 otherwise, and $\lambda$ is a constant penalty. The data support term $D_x(f_x)$ denotes the degree of support from the real data when the index of point $x$ is $f_x$. Generally, the support degree is computed from the similarity of points in the two images. The more similar the features of two points are, the smaller the support degree and, consequently, the smaller the energy function. Here, we define the support degree of index $f$ as follows:

$$C_{fwd}(x, y, f) = \min_{y_i:\ f - \frac{1}{2} \le \|y_i - x\| \le f + \frac{1}{2}} \left\|w_x - w_{y_i}\right\|, \tag{18}$$

where $w_x$ is the WFD feature vector of point $x$, and the neighborhood setting allows the accuracy to reach subpixel level. $C_{fwd}$ is the minimum error between the feature vector at position $x$ and the feature vectors of all points of the set $\{y_i\}$ whose distance from $x$ lies within $f \pm 1/2$.
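The structure of the energy under the Potts model can be illustrated by evaluating it for a given labeling on a small grid. The sketch only evaluates the energy; the minimization by Graph Cut (minimum cut or maximum flow) is not shown. The 4-connected neighborhood, the weight `smooth_weight`, and the toy data cost are assumptions for illustration.

```python
def potts_energy(labels, data_cost, smooth_weight=1.0):
    """Energy of a labeling on a 4-connected grid under the Potts model.

    labels[r][c] is the label of pixel (r, c); data_cost(r, c, label)
    returns the data term for that assignment.
    """
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            energy += data_cost(r, c, labels[r][c])
            # Count each neighboring pair once (right and down neighbors)
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                energy += smooth_weight
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                energy += smooth_weight
    return energy

observed = [[0, 0, 1], [0, 1, 1]]

def cost(r, c, label):
    """Toy data term: absolute difference from the observed label."""
    return abs(label - observed[r][c])

smooth = potts_energy([[0, 0, 1], [0, 0, 1]], cost, smooth_weight=2.0)
exact = potts_energy(observed, cost, smooth_weight=2.0)
```

With this weight, the smoother labeling has lower energy (5.0) than the labeling that reproduces the observations exactly (6.0), which shows how the smoothness term trades data fidelity for piecewise-constant disparity.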
Similarly, position $x$ in the reference map needs to be rectified. That is,

$$C_{rev}(x, y, f) = \min_{x_i:\ f - \frac{1}{2} \le \|x_i - y\| \le f + \frac{1}{2}} \left\|w_{x_i} - w_y\right\|. \tag{19}$$

Considering (18) and (19) together, we obtain the final form of the data support degree:

$$D(x, y, f) = \min\left\{C_{fwd}(x, y, f),\ C_{rev}(x, y, f),\ \mathrm{const}\right\}. \tag{20}$$

Because the search area is 2D, the accuracy of 0.5 refers to the four neighbors of the point. In addition, we need to assign the noninteger index $f$ to the integer index sets $\{f_i\}$, so (20) simplifies to

$$D(x, y, f_i) = \min_{f \in f_i} D(x, y, f). \tag{21}$$
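The truncated, symmetric data term can be sketched as follows. The sketch assumes that descriptors are compared with the Euclidean distance and that the neighboring descriptors have already been collected; both the distance measure and the helper names are assumptions, since the text does not fix them.

```python
import math

def min_descriptor_error(w_ref, candidate_descriptors):
    """Smallest Euclidean distance between w_ref and any candidate."""
    return min(math.dist(w_ref, w) for w in candidate_descriptors)

def data_term(w_x, near_y, w_y, near_x, cap):
    """Minimum of the forward cost, the reverse cost, and a constant cap.

    near_y: descriptors of points around the match in the second image;
    near_x: descriptors of points around the reference point.
    """
    forward = min_descriptor_error(w_x, near_y)
    reverse = min_descriptor_error(w_y, near_x)
    return min(forward, reverse, cap)

cost = data_term([1.0, 0.0], [[1.0, 0.5], [0.0, 0.0]],
                 [1.0, 0.5], [[1.0, 0.0], [1.0, 0.4]], cap=0.3)
```

In this example the forward cost is 0.5 and the reverse cost is about 0.1, so the reverse direction determines the data term; the cap only matters when both directions match poorly, which limits the influence of outliers.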
In summary, we have given the definitions of the Graph Cut parameters. The resulting stereo matching effects are presented in detail in Experimental Results.

Experimental Results
In this section, we evaluate the performance of our feature point descriptor. We compare it with the traditional scale-invariant feature transform (SIFT), speeded up robust feature (SURF), and normalized cross-correlation (NCC) and estimate the range of baseline changes over which the accuracy remains acceptable.

Comparisons of Different Descriptors.
Figure 8 compares our descriptor with other descriptors, including SIFT, SURF, and NCC. The upper-left image is the reference image, and each depth image is computed from the image pair formed by the reference image and one of the images to its right. All methods use the same Graph Cut algorithm in wide baseline, so the differences stem only from the feature descriptors. To improve efficiency, the matching range of each point is restricted to the corresponding epipolar line in the other image, and the baseline increases from left to right. We also provide the depth image in narrow baseline as the standard image. The second row shows the depth image in narrow baseline, and the third to sixth rows show the depth images of weak feature description (WFD), SIFT, SURF, and NCC. The dots mark the differences between each depth image and the standard image.
Figure 9 shows the depth error curves, that is, the error rate between the standard depth image and the depth images obtained by the different descriptors as the baseline changes. We give the curves for depth value deviations of five and ten percent. To test the performance of WFD, we chose many weak texture images as test images. The comparison shows that WFD clearly surpasses the other descriptors. In addition, when the error threshold is 5 percent, the accuracy rate of WFD is higher than that of the other descriptors; when the threshold is 10 percent, the results of WFD and SIFT are very close.
We also use different colors to mark correct and incorrect matching points. Figure 10 shows an example selected from a set of image pairs, with a depth matching error threshold of 5 percent.
To give a comprehensive view of our algorithm, Figure 11 shows an example of the complete process for a pair of images. The left column shows the processing of the ground truth image, and the right column shows the processing of the image captured at another baseline. The first row gives the original images. The second row shows the feature points obtained by SIFT, and the third row shows the feature descriptor built by our proposed algorithm. The fourth row shows the depth of the two images found through matching with our feature descriptor. In the fifth row, the left image is the depth image computed by the narrow baseline algorithm, while the right image is obtained by combining Mean-Shift with Graph Cut global optimization.

Depth Image in Different Baselines. Figure 12 shows the depth images of different image pairs obtained with the WFD operator; the depth images on the diagonal are the standard images in narrow baseline. Table 1 gives the change in the accuracy rate of the depth image across baselines. We capture images from six viewing angles, each differing from the next by 10 degrees horizontally, as shown in Figure 12. Figure 13 shows the average accuracy rate of WFD as the baseline changes. Here, the depth image deviation threshold is 10 percent; that is, a depth matching point is counted as correct if it deviates by less than 10 percent from the reference image. It is not difficult to see that the accuracy is higher when the capture angle changes between plus and minus 20 percent.
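The accuracy rate defined by the deviation threshold can be computed as in the sketch below. It assumes that the deviation is measured relative to the reference depth and that pixels with zero reference depth are skipped; both conventions are assumptions, as the text does not spell them out.

```python
def accuracy_rate(depth, reference, threshold=0.10):
    """Fraction of pixels whose depth deviates from the reference by less
    than `threshold` (relative deviation)."""
    correct = total = 0
    for d_row, r_row in zip(depth, reference):
        for d, r in zip(d_row, r_row):
            if r == 0:
                continue  # relative deviation is undefined here
            total += 1
            if abs(d - r) / abs(r) < threshold:
                correct += 1
    return correct / total if total else 0.0

reference = [[10.0, 20.0], [30.0, 40.0]]
estimate = [[10.4, 23.0], [30.0, 37.0]]
rate_10 = accuracy_rate(estimate, reference, 0.10)
rate_5 = accuracy_rate(estimate, reference, 0.05)
```

For this toy example, three of the four pixels fall within the 10 percent threshold and two within the 5 percent threshold, so the rates are 0.75 and 0.5.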

Robustness of WFD in Different Image Changing.
From Figure 8, we can see from the error deviation of the depth images at different baselines that our descriptor is strongly robust to changes in image scale, resolution, and luminance. The error rate curves in Figure 9 also indicate strong robustness. In addition, to test the effect in real scenes, we compute depth images for several image pairs (see Figure 14) with our algorithm. Because weak texture areas are difficult to match over a large range, stereo matching generally uses high resolution images (such as 3027 × 2048). Here, we use images with a resolution of 684 × 512 and still obtain good stereo matching results.

Conclusions and Future Work
This paper presents an algorithm that uses feature point extraction and description to solve the depth image acquisition problem in wide baseline. First, we introduce a Mean-Shift-based feature descriptor that builds a feature description vector for each point according to the distribution of the texture feature points around it. The approach can describe points in weak texture areas and locate them correctly in wide baseline. Second, we extend the narrow baseline disparity map to the wide baseline situation. By using the Graph Cut algorithm to optimize the disparity information, the depth of arbitrary image pairs can be extracted without calibration. Experimental data show that the WFD descriptor produces good depth images from low resolution images, which means that the algorithm can be adapted to video streams. In addition, because the Mean-Shift algorithm is applied in WFD to obtain the texture feature of each point, the feature descriptor is strongly robust to image changes, and arbitrary patterns can be defined according to the situation, which makes it highly flexible.
This research can be extended in two directions. First, because of various restrictions, we did not have access to high resolution images, so our approach does not achieve perfect performance on large weak texture areas. We hope that subsequent work can be carried out on high resolution images to extend our stereo matching strategy. Second, the window scale of Mean-Shift must be adjusted dynamically to obtain a stable feature descriptor, and this adjustment accounts for a large share of the algorithm's run time. The approach will be more suitable for practical engineering applications if the frequency of dynamic adjustment can be decreased and the window scale can be fixed appropriately when the scene changes slowly.

Figure 1: Texture information around the same name image point under different angles.

Figure 2: Flowchart of the proposed approach and its application.

Figure 4: Mean-Shift tracking parts of object.

Figure 5: Given the feature area pattern in (a), Mean-Shift provides the corresponding feature points in (b).

Figure 6: (a) Dotted lines with arrows denote the paths from the feature points to the center of the fan area. The center is determined by the positions of the SIFT feature points in the fan area. (b) The differences between the real feature and its linear interpolation at each point (the image gradient feature is used to represent each point's characteristics).

Figure 7: (a) Disparity map in narrow baseline; (b) disparity map in wide baseline.

Figure 8: Depth image in different baselines.

Figure 9: Accuracy rate of the feature descriptors as the baseline changes.

Figure 10: Depth image based on WFD feature.

Figure 11: An example of the complete process.

Figure 12: The depth image based on WFD feature descriptor in wide baseline.

Nomenclature
SIFT: Scale-invariance feature transform
SURF: Speeded up robust feature
NCC: Normalized cross-correlation
WFD: Weak feature description
Harris: Harris corner detector
BAND: Binary affine invariant descriptor

Figure 14: The stereo matching effect map using WFD for wide baseline and low resolution image.

Table 1: Depth image accuracy estimation percentage.