The Dynamic Model Embedded in Augmented Graph Cuts for Robust Hand Tracking and Segmentation in Videos

Segmenting human hands is important in computer vision applications such as sign language interpretation, human-computer interaction, and gesture recognition. However, serious bottlenecks remain in hand localization systems, including fast hand motion capture, hands over the face, and hand occlusions, which are the focus of this paper. We present a novel method for hand tracking and segmentation based on augmented graph cuts and a dynamic model. First, an effective dynamic model for state estimation is generated, which correctly predicts the location of hands even under fast motion or shape deformations. Second, new energy terms based on several cues, namely spatial information, hand motion, and chamfer distance, are brought into the energy function to develop augmented graph cuts. The proposed method successfully achieves hand segmentation even when a hand passes over other skin-colored objects. Experiments are provided on challenging videos involving hands over the face, hand occlusions, dynamic backgrounds, and fast motion. Experimental results demonstrate that the proposed method is much more accurate than other graph cuts-based methods for hand tracking and segmentation.


Introduction
Recent surveys [1,2] distinguish four main kinds of object tracking methods: point, skeleton, contour, and silhouette tracking. As an important branch of tracking, hand tracking is a critical step in computer vision systems such as human-computer interaction (HCI) [3], sign language interpretation [4], and gesture recognition [5]. Moreover, vision-based hand gesture recognition [3] is a meaningful direction for enabling computers to understand human intent in robot systems, where the first key step is robust hand tracking. Hence, we concentrate on silhouette tracking, which means that the hand silhouette or region must be separated from cluttered backgrounds.
Over the last decade [6], human hand motion capture has gained widespread interest in the pattern recognition community. For example, Yang et al. [5] presented a method to obtain hand trajectories based on pixel matches with affine transformations. Later, an optical flow-based method [7] was proposed for hand tracking. Although the method [7] can capture quick motion and fast hand shape deformations, it still fails when hands are occluded by skin-colored objects. Other works [4,8] use a linear quadratic estimation model (e.g., the Kalman filter) or a sequential Monte Carlo model (e.g., the particle filter) to track hand trajectories. Later on, a real-time hand tracking method was applied to a mechanical device by the authors of [9], who combined the advantages of the particle filter and mean shift (MS): they incorporate MS optimization into the particle filter to considerably improve sampling efficiency. Although these approaches have delivered promising results, they have difficulty handling occlusions.
In recent years, graph cuts-based methods have been applied in tracking and segmentation systems. Xu and Ahuja [10] first proposed a method to track object contours by graph cuts. They dilate the object contour into a narrow band and construct a graph only on this band. Nevertheless, this cannot deal with large displacements because there is no dynamic model to estimate the object location. Freedman and Turek [11] presented a graph cuts-based method to track objects under drastic illumination changes; however, their experimental results do not show object segmentation. Later, Malcolm et al. [12] incorporated a distance penalty into graph cuts to realize object segmentation and used a simple filter to estimate the location of the objects of interest. Although this method can achieve multiobject tracking, it still cannot deal with occlusions. Bugeau and Pérez [13] proposed a method based on optical flow and graph cuts to simultaneously track and segment objects. However, this method needs a reference background image, which restricts its application. In the work of [14], the authors managed to track objects in live videos via a reseeding strategy. And Papadakis and Bugeau [1] modelled the object of interest as visible and occluded parts that are tracked separately. Although these methods have achieved success in some areas, they still have drawbacks in situations such as hands over the face and hand occlusions.
Hand tracking is a challenging problem because the hand has 27 degrees of freedom (DOFs): 21 DOFs for the joint angles and 6 DOFs for orientation and location [6]. Therefore, hand shape and motion are far less constrained than those of rigid objects. In this paper, we present an effective approach to track and segment hands even under arbitrary shape deformations. Similar to the methods of [12,13], a dynamic model and graph cuts are used. However, compared with these methods, the key contributions of our method are summarized as follows.
(i) To avoid the degeneracy problem of interest points [12,13], we combine a resampling strategy with an optical flow algorithm to robustly track interest points in hand regions.
(ii) An augmented graph cuts method is introduced to track and segment hand regions, with different hands labelled in different colors.
(iii) The proposed method can track and segment hands in challenging situations such as overlapping hands, fast hand motion, and hands over the face. It can also track and segment hands in dynamic backgrounds where skin-colored objects may be present.
The framework of our method is shown in Figure 1, which consists of optical flow estimation and augmented graph cuts introduced in Section 3.
This paper is a substantial extension of our conference paper [15]. Compared with [15], further details of our method are presented, and a more extensive performance evaluation is conducted. We also give a more comprehensive literature review to introduce the background of our method and make the paper more self-contained. Therefore, this paper provides a more comprehensive and systematic report of our work. The rest of the paper is organized as follows. We describe basic notions of multiobject tracking based on graph cuts in Section 2. The proposed method is described in Section 3. Section 4 shows the experimental results and the performance evaluation. The conclusion is given in Section 5.

Notion of Traditional Graph Cuts
Here, we describe the basic principle of graph cuts-based methods for object tracking and segmentation. We first review image segmentation via graph cuts. Then, object tracking via graph cuts and a dynamic model is described.

Segmentation via Graph Cuts.
We briefly outline the multilabel graph cuts technique; detailed information can be found in [16,17]. With notation reconstructed here for readability, let I be an image, P its pixel set, k the number of tracked objects, and f a labelling function that assigns each pixel p a label f_p ∈ {0, 1, ..., k} (0 for the background). A simple segmentation of the background and objects can be obtained by minimizing the following energy with respect to the labelling f:

E(f) = Σ_{p ∈ P} E_D(f_p) + λ Σ_{(p,q) ∈ N} E_S(f_p, f_q),   (1)

where the data term E_D evaluates the likelihood P_i(I_p) of a pixel p belonging to the i-th object and is defined as

E_D(f_p) = −Σ_{i=0}^{k} δ(f_p, i) ln P_i(I_p),   (2)

where δ(a, b) is the delta function (equal to 1 if a = b and 0 otherwise), I_p is the color of pixel p, and P_i(·) is computed from a normalized histogram of the i-th object.
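As an illustration, the histogram-based data term can be sketched as follows. This is a minimal pure-Python sketch under our own assumptions (grayscale intensities, a fixed bin count, and all function names are ours, not from the paper):

```python
import math

def normalized_histogram(intensities, bins=16, max_val=256):
    """Normalized intensity histogram of one object's pixels: the model P_i."""
    hist = [0] * bins
    for v in intensities:
        hist[v * bins // max_val] += 1
    total = float(len(intensities))
    return [c / total for c in hist]

def data_term(pixel_value, hist, bins=16, max_val=256, eps=1e-9):
    """E_D for assigning a pixel to the object modelled by `hist`:
    the negative log-likelihood -ln P_i(I_p); eps avoids log(0)."""
    p = hist[pixel_value * bins // max_val]
    return -math.log(p + eps)

# A pixel whose intensity is common in the object model costs little;
# a rare intensity costs much more, steering the labelling away from it.
skin = normalized_histogram([200, 210, 205, 198, 202] * 20)
assert data_term(201, skin) < data_term(10, skin)
```

In practice the histograms would be built over color channels of the manually initialized hand regions at t = 0.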
The smooth term E_S evaluates the penalty for assigning two neighboring pixels (p, q) in the neighborhood system N to different labels:

E_S(f_p, f_q) = (1 − δ(f_p, f_q)) · exp(−(I_p − I_q)^2 / (2σ^2)) / dist(p, q),   (3)

where σ controls the contrast sensitivity and dist(p, q) is the distance between the neighboring pixels. In [12], assuming that the mean velocity of each object is known, the authors translate the current object o_i^t at time t to obtain a prediction o_i^{t+1|t} at time t + 1. A new term called the distance term E_T is introduced, which discourages pixels from being associated with the i-th object when they do not belong to the predicted set o_i^{t+1|t}. E_T is defined as

E_T(f_p = i, p) = d_i(p),  with  d_i(p) = min_{q ∈ o_i^{t+1|t}} ‖p − q‖,   (4)

where d_i(p) can be quickly calculated with the fast marching algorithm [18]. Therefore, the energy function is reformulated as

E(f) = Σ_{p ∈ P} (E_D(f_p) + ρ E_T(f_p, p)) + λ Σ_{(p,q) ∈ N} E_S(f_p, f_q),   (8)

where ρ is a scaling function explained in Section 3. Although the methods [12,13] have managed to track and segment partly occluded objects on some occasions, they fail to track overlapping objects of similar color (e.g., overlapping hands and face). We give an example to illustrate the limitations of these methods in Figure 2. In the initialization step at time t = 0, the left/right hand is labelled blue/green as shown in Figure 2(a). At time t = 1, in Figures 2(b) and 2(c), we can see that the colors of the left and right hands are confused by [12,13]. Besides, some pixels in the red circle are wrongly labelled by [13] as shown in Figure 2(b), while pixels in the background are correctly segmented by the method [12] (see Figure 2(c)). That is because E_T is added in [12] to constrain the estimate to lie in the spatial neighbourhood of the prediction. However, the method [12] still does not distinguish the pixels of each hand (see Figure 2(c)).
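The distance term relies on the per-pixel distance to the predicted object mask. The paper computes it with fast marching [18]; as an illustrative stand-in, the sketch below uses a simple breadth-first city-block distance transform instead (all names are ours, and the city-block metric only approximates the Euclidean distance):

```python
from collections import deque

def distance_to_mask(mask):
    """City-block distance from every pixel to the nearest mask pixel.
    `mask` is a 2-D list of 0/1 (the predicted object o_i^{t+1|t});
    returns a 2-D list: 0 inside the mask, growing outside it."""
    h, w = len(mask), len(mask[0])
    INF = float("inf")
    dist = [[INF] * w for _ in range(h)]
    q = deque()
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                dist[y][x] = 0
                q.append((y, x))
    # Multi-source BFS: each step to a 4-neighbour adds distance 1.
    while q:
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] == INF:
                dist[ny][nx] = dist[y][x] + 1
                q.append((ny, nx))
    return dist

mask = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
d = distance_to_mask(mask)
assert d[1][1] == 0 and d[0][1] == 1 and d[0][0] == 2
```

Pixels inside the prediction incur no distance penalty, so the labelling is pulled toward the predicted region, which is exactly the role of E_T.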

The Proposed Method
Suppose that k hands are tracked and that each hand is fully visible at time t = 0, that is, o_{i1}^0 ∩ o_{i2}^0 = ∅ for i1 ≠ i2. The initial segmentations o_i^0, labelled in different colors, are provided manually at time t = 0. At time t > 0, our approach sequentially processes the frames for simultaneous hand tracking and segmentation.

State Prediction.
When the segmentation result o_i^t is correct at time t, the prediction set o_i^{t+1|t} at time t + 1 is estimated from the mean velocity v̄_i^t as

o_i^{t+1|t} = o_i^t + v̄_i^t.   (9)

To compute the unknown mean velocity v̄_i^t, several methods (such as the autoregression model [19] and interest point detectors [20,21]) have been proposed in the past decades. Compared with these methods, optical flow delivers excellent results on fast-moving objects with high computational efficiency [7]. Therefore, we choose optical flow (as in the method [13]) based on the pyramidal Lucas-Kanade multiresolution scheme [22] as our dynamic model. However, the method [13] exhibits two problems. The first is that some interest points may be wrongly tracked by optical flow, as shown in the first row of Figure 3; the second is the degeneracy problem illustrated in the second row of Figure 3. In our dynamic model, two strategies are introduced to avoid these problems.
To compute the unknown velocities, a set of interest points is considered. At time t = 0, the interest points {p_j^0}_{j=1,...,n_i^0} ⊂ o_i^0 are found by good-features-to-track [7,21], which seeks a steep brightness gradient along at least two directions for promising feature candidates. Then, at time t > 0, the points {p_j^t}_{j=1,...,n_i^t} are tracked [22]. The velocity of each point between two successive frames is computed as

v_j^t = p_j^t − p_j^{t−1},   (10)

and the mean velocity v̄_i^t at time t is calculated as

v̄_i^t = (1 / n_i^t) Σ_{j=1}^{n_i^t} v_j^t.   (11)

From (11), we know that every detected point contributes to the mean velocity. When some points fall outside the hand region (see the first row of Figure 3), the mean velocity may be biased away from the true velocity. Therefore, a distance penalty is introduced in (12) to eliminate outliers: we only consider points whose displacements are less than a given threshold T_1:

v̄_i^t = Σ_j v_j^t · 1(‖v_j^t‖ < T_1) / Σ_j 1(‖v_j^t‖ < T_1).   (12)

To capture fast hand motion, T_1 can be set to a large value; in our experiments, T_1 = 80 works well for all test videos.

Figure 2: (a) Initialization at time t = 0, (b) results by [13] (parameter value 5) at time t = 1, and (c) results by [12] (parameter values 10 and 1) at time t = 1.
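The outlier-rejecting mean velocity can be sketched as follows, assuming the interest points have already been tracked between frames (e.g., by pyramidal Lucas-Kanade); the pure-Python setting and all names are ours:

```python
def mean_velocity(prev_pts, curr_pts, t1=80.0):
    """Mean displacement of tracked interest points between two frames,
    ignoring points whose displacement magnitude is >= t1 (the outlier
    threshold T_1 from the distance penalty)."""
    vx_sum = vy_sum = 0.0
    kept = 0
    for (x0, y0), (x1, y1) in zip(prev_pts, curr_pts):
        dx, dy = x1 - x0, y1 - y0
        if (dx * dx + dy * dy) ** 0.5 < t1:
            vx_sum += dx
            vy_sum += dy
            kept += 1
    if kept == 0:          # every point rejected: no usable motion estimate
        return (0.0, 0.0)
    return (vx_sum / kept, vy_sum / kept)

prev = [(10, 10), (12, 10), (11, 12)]
curr = [(14, 10), (16, 10), (300, 12)]   # third point is a gross outlier
assert mean_velocity(prev, curr) == (4.0, 0.0)
```

A generous T_1 keeps genuinely fast hand motion while still rejecting points that jump far outside the hand region.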
As time goes on, the number of interest points may decrease under optical flow tracking (see the second row of Figure 3). To avoid this degeneracy problem, the second strategy is to resample interest points: when the number of interest points falls below a given threshold T_2, new interest points are redetected using good-features-to-track [21]. With these two strategies, the detected interest points are shown in the third row of Figure 3.

Error of Prediction.
In this work, we adopt the idea of [12] to handle prediction error. The prediction error is the distance between the predicted centroid c_i^{t|t−1} and the actual centroid c_i^t at time t. The scaling function is defined as

ρ^t = ρ_max · exp(−‖c_i^{t|t−1} − c_i^t‖ / τ),   (13)

where τ is a threshold based on empirical motion that controls the change rate of the penalty ρ: if τ is large, ρ changes slowly. As mentioned in [12], in practice τ = 3.5 is quite robust in our model. When the actual sets o_i^t drift away from the predictions o_i^{t|t−1}, ρ is lowered so that the motion can hopefully still be captured, and ρ automatically rises again when prediction errors decrease by (13).
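Behaviourally, the scaling can be sketched as below. The exponential form is our assumption, chosen only to match the behaviour described (full penalty at zero error, decay as the error grows, change rate set by τ); the exact expression in [12] may differ:

```python
import math

def motion_penalty(pred_centroid, actual_centroid, rho_max=10.0, tau=3.5):
    """Prediction-error scaling (our assumed form): the penalty weight
    decays as the centroid prediction error grows and recovers as the
    error shrinks; a larger tau makes the change slower."""
    ex = pred_centroid[0] - actual_centroid[0]
    ey = pred_centroid[1] - actual_centroid[1]
    err = (ex * ex + ey * ey) ** 0.5
    return rho_max * math.exp(-err / tau)

# Zero prediction error keeps the full penalty; a large error lowers it.
assert motion_penalty((5, 5), (5, 5)) == 10.0
assert motion_penalty((5, 5), (25, 5)) < motion_penalty((5, 5), (8, 5))
```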

Augmented Graph Cuts.
Now we explain how to define the new terms and incorporate them into the energy function. These new terms are the core of the augmented graph cuts.

Spatial Constraint.
Owing to the similar color of human skin, it is difficult for the methods [12,13] to separate the hands from each other, as shown in Figure 2. Here, we introduce a new energy term called the spatial term E_P:

E_P(f_p = i, p) = λ_P · g(p, c_i^{t|t−1}),   (14)

where c_i^{t|t−1} denotes the centroid of the predicted set o_i^{t|t−1} and λ_P > 0 is a weight parameter. The penalization is made through the function g(·):

g(p, c_i^{t|t−1}) = ‖p − c_i^{t|t−1}‖,   (15)

where ‖p − c_i^{t|t−1}‖ is the Euclidean distance from the location of pixel p to the centroid. When this distance is small, E_P becomes small, which encourages the pixel p to be assigned to the i-th object.
As illustrated in Figure 4(a), when both hands are visible (o_1 ∩ o_2 = ∅), g(p, c_1) < g(p, c_2) means that the pixel p tends to be assigned to o_1. Therefore, when the hands are fully visible in the same scene, the spatial term can distinguish each hand. Nevertheless, when the hands overlap (o_1 ∩ o_2 ≠ ∅), it is ambiguous whether to assign the pixel p to o_1 or o_2, as shown in Figure 4(b). This means that the spatial term E_P is only suitable when o_1 ∩ o_2 = ∅.
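The assignment preference induced by the spatial term can be sketched as follows (a minimal illustration; the centroid values and names are ours):

```python
def spatial_term(pixel, centroid, lam=1.5):
    """E_P: penalty proportional to the Euclidean distance from the
    pixel to the predicted centroid of object i (lam plays lambda_P)."""
    dx = pixel[0] - centroid[0]
    dy = pixel[1] - centroid[1]
    return lam * (dx * dx + dy * dy) ** 0.5

# A pixel near the left hand's predicted centroid is cheaper to assign
# to the left hand than to the right hand, so the labels stay separated
# as long as the hands do not overlap.
left_c, right_c = (40, 60), (200, 60)
p = (50, 58)
assert spatial_term(p, left_c) < spatial_term(p, right_c)
```

When the two centroids coincide (overlapping hands), both costs are equal and the term carries no information, which is exactly the ambiguity shown in Figure 4(b).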

Motion Constraint.
The energy function in (8) does not consider the situation in which hands pass over other skin-colored objects, such as the face. Therefore, a new energy term called the motion term E_M is introduced to handle this situation:

E_M(f_p = i, p) = λ_M · h(i, p),   (16)

where λ_M > 0 is a weight parameter. The function h is defined as h(i, p) = 0 if the pixel p belongs to the prediction o_i^{t|t−1} (i.e., p is reached from o_i^{t−1} with velocity v̄_i^t), and h(i, p) = ρ otherwise, where ρ is the motion parameter mentioned in (13).
Using motion information allows us to reject bad segmentations in the case of hands over skin-colored objects. When a pixel p comes from o_{i1}^{t|t−1} with velocity v̄_{i1}^t, it receives E_M(i1, p) = 0 for o_{i1}^{t|t−1} and E_M(i, p) > 0 for the other sets o_i^t, i ≠ i1, according to (16). Since E_M(i1, p) < E_M(i, p), the pixel p tends to be assigned to the i1-th object. The motion term therefore preserves good segmentations when hands and other skin-colored objects overlap (e.g., hands over the face).

Chamfer Distance.
The terms defined above are based on motion information and the prediction set o_i^{t|t−1}. However, the spatial and motion terms still cannot deal with hand occlusions (see Figure 4(b)). Therefore, a new term called the chamfer term E_Ch is introduced to deal with hand occlusions:

E_Ch(f_p, f_q) = λ_Ch · (1 − δ(f_p, f_q)) · C(p),   (18)

where C(·) is the chamfer distance transform and λ_Ch > 0 is a weight parameter. Before computing the chamfer distance, a binary edge image is extracted from the frame I^t at time t (e.g., using Canny edge detection [23]).
Then the chamfer distance can be quickly calculated in two passes over the frame [24], as shown in Figure 5. E_Ch encourages label discontinuities to coincide with image edges, where C is small. In particular, when hands overlap, a large value of λ_Ch can be set to reject bad segmentations in the occlusion area o_1 ∩ o_2.
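The two-pass computation can be sketched with the common 3-4 chamfer weights (a simplification of [24]; the exact weights used in the paper are not stated here, and the edge image and names are ours):

```python
def chamfer_transform(edges):
    """Two-pass 3-4 chamfer distance transform: each pixel receives an
    approximate distance to the nearest edge pixel (value 1 in `edges`).
    Orthogonal steps cost 3, diagonal steps cost 4 (approx. 3*sqrt(2))."""
    h, w = len(edges), len(edges[0])
    BIG = 10 ** 9
    d = [[0 if edges[y][x] else BIG for x in range(w)] for y in range(h)]
    # Forward pass: propagate distances from top/left neighbours.
    for y in range(h):
        for x in range(w):
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 3)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y - 1][x - 1] + 4)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y - 1][x + 1] + 4)
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 3)
    # Backward pass: propagate distances from bottom/right neighbours.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 3)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y + 1][x - 1] + 4)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y + 1][x + 1] + 4)
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 3)
    return d

edges = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
d = chamfer_transform(edges)
assert d[1][1] == 0 and d[1][2] == 3 and d[0][0] == 4 and d[1][3] == 6
```

Because the transform is zero on edges and grows away from them, penalizing label changes by C(p) makes cuts cheap exactly where image edges lie.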

Final Energy Function.
We merge all of the terms above, so that hand tracking consists in minimizing the following six-term energy function:

E(f) = Σ_{p ∈ P} (E_D(f_p) + ρ E_T(f_p, p) + E_P(f_p, p) + E_M(f_p, p)) + Σ_{(p,q) ∈ N} (λ E_S(f_p, f_q) + E_Ch(f_p, f_q)).   (19)

Compared with the energy functions (1) and (8), our model can handle hand occlusions, hands over the face, and fast hand motion. After building the graph from (19), we apply the α-expansion algorithm [16] to minimize the energy function.

Overview of the Proposed Method.
We have described the principle of our method to track and segment hands in different circumstances. Four steps achieve hand tracking and segmentation. First, initial segmentations for all tracked hands are provided manually at time t = 0. Then, at each time t > 0, the prediction o_i^{t|t−1} is estimated by the dynamic model. Next, we construct the graph with the augmented graph cuts and use α-expansion to obtain the final segmentation. Finally, we check the number of remaining interest points: if it falls below the given threshold, the interest points are resampled. An overview of our algorithm is given in Algorithm 1.

Experimental Results
To validate and evaluate the proposed approach, we provide four videos (three captured by our webcam and one American Sign Language (ASL) video from the Purdue ASL database [25]). All videos have the same frame rate of 30 fps. In this paper we present only four challenging videos, but more results (e.g., tracking and segmentation of four hands) can be found at http://joewan.weebly.com/my-research.html.

Results.
The proposed method is implemented in Microsoft Visual Studio 2008. All the videos are tested on a Core 2 Duo P8600 processor with 2 GB RAM. The initial segmentations (at time t = 0), the tracking results, and the parameters are given for each experiment. Every tracked hand is labelled with a different color. Although several methods are similar to ours, we only compare the proposed approach with the method [12]. That is because the method [13] requires a reference background image for background subtraction to obtain external observations, which is not suitable for hand tracking in dynamic backgrounds. Papadakis and Bugeau [1] proposed a framework for object tracking, but it makes the strong assumption that the occluded part of an object is a subset of the prediction of the whole object, which is not appropriate for the self-occlusions that commonly occur in hand motion, especially finger movement. To compare with the method [12], the parameters λ_P, λ_M, and λ_Ch are set to zero, which recovers the original energy function (8).
In Figure 6, we first analyze the results shown in the first row, obtained by the method [12]. The two hands are both labelled green at t = 20, which means that the right hand is wrongly segmented. Additionally, when the two hands partially overlap at t = 66, tracking of the left hand fails. In contrast, the hands are well recovered after the occlusion by our method, as shown in the second row of Figure 6, which shows that our approach solves two principal problems: dealing with hand occlusions and rejecting oversegmentation.

Hand over Face.
Now we give an example to demonstrate that our method achieves hand segmentation even when hands pass over skin-colored objects such as the face. The video, called video 2, was recorded outdoors and contains 106 frames of size 640 × 480 pixels. Our parameters are λ = 5, σ = 1.8, λ_P = 1.5, λ_M = 1.8, and λ_Ch = 2. The parameters of the method [12] are λ = 5, σ = 1.8, λ_P = 0, λ_M = 0, and λ_Ch = 0.
As shown in Figures 7 and 8, as the hands move from left to right, hand-over-face occlusion occurs at time t > 6. Figure 7 shows the results of the method [12], which fails to accurately track and segment the hands when they pass over the face. In Figure 8, hand segmentation is achieved quite well along the sequence by our method. Owing to the motion constraint in (19), when the hands pass over the face, our method can still reject the bad segmentations that may occur in the face region.

Fast Hand Tracking in Sign Language.
The video called video 3 is from the Purdue ASL database [25]. It involves fast hand motion (all frames), hand over face (t = 216), partial hand occlusions (t = 182, 199), and hand shape deformation (the entire video). This video contains 265 frames of size 640 × 480 pixels. Our parameters are λ = 5, σ = 1.8, λ_P = 1.1, λ_M = 0.8, and λ_Ch = 2. As shown in Figure 9, our method robustly segments and tracks the hands. Since the results in Figures 6-8 show that the method [12] cannot deal with hand occlusions and hand over face, we only give the results of our method.

Dynamic Background.
To further evaluate the effectiveness of the proposed method under complex situations, we test it in a dynamic background. The video called video 4 was captured in a lab environment and contains 174 frames of size 320 × 240. A moving pedestrian forms the dynamic background and causes occlusions while the tracked hand is in motion. The parameters of our method are λ = 5, σ = 1.8, λ_P = 0.1, λ_M = 1.0, and λ_Ch = 3.9. The final results are given in Figure 10, which shows that good performance is achieved in the dynamic background by our method.

Discussion and Adjusting Parameters.
The energy minimization function in (19) is composed of six different terms and has eight parameters to be tuned (three dynamic model parameters and five graph cuts parameters). However, most of them can be fixed in our experiments. The three dynamic model parameters are given constant values because the model is not sensitive to them: we set T_1 = 80, T_2 = 140, and τ = 3.5. To quantitatively evaluate segmentation accuracy, we compute the mean pixel error (MPE):

MPE = (1 / N) Σ_{t=1}^{N} (N_f^t / S) × 100%,

where N_f^t denotes the number of falsely detected pixels in frame t, S represents the frame size, and N is the number of frames in a video. Note that a false detection happens in two situations: (1) pixels in the background are detected as hand region; (2) pixels of one hand are treated as another hand or as background. Figure 12(a) shows the MPEs for the four videos, from which we draw two conclusions.
(i) When hands and other skin-colored objects are in the same scene (hand over face, hand occlusions), the MPEs of our method are much lower than those of the method [12]. In particular, the MPEs of videos 3 and 4 by the method [12] are very high (>10%) because the face region is wrongly labelled when the hands and the face overlap.
(ii) The MPE (0.1714%) of video 4 by our method is close to the ground truth (0%), which shows that the proposed method is well suited for hand tracking and segmentation in sign language videos.
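The MPE used above can be sketched as follows (a pure-Python illustration with our own names; label maps are 2-D lists, one per frame):

```python
def mean_pixel_error(results, truths):
    """MPE: fraction of falsely labelled pixels per frame, averaged over
    all frames and expressed in percent. Both kinds of false detection
    (background taken as hand, or one hand taken as another/background)
    simply show up as a label mismatch against the ground truth."""
    total_err = 0.0
    for res, gt in zip(results, truths):
        size = len(res) * len(res[0])           # S: frame size in pixels
        wrong = sum(r != g                      # N_f: mislabelled pixels
                    for row_r, row_g in zip(res, gt)
                    for r, g in zip(row_r, row_g))
        total_err += wrong / size
    return 100.0 * total_err / len(results)     # average over N frames

# One 2x2 frame with one of four pixels mislabelled -> 25% MPE.
assert mean_pixel_error([[[1, 0], [0, 0]]], [[[1, 0], [0, 1]]]) == 25.0
```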
Next, we give the running times of both the proposed method and the method [12] in Figure 12(b), which reports the average execution time (AET) per frame over all test videos. The AETs of our method are close to those of the method [12], although they are slightly higher, by about 20 to 30 milliseconds per frame. That is because the additional terms (E_P, E_M, E_Ch) incorporated into the energy function lead to slightly higher complexity. Meanwhile, the proposed method can successfully track and segment hands when the face and hands are partly occluded. The AET depends on the frame size and the number of tracked hands (see Figure 12(b)). In future research, we will consider a narrow band around the prediction sets [10] to decrease the computational cost; the study of this band will be the subject of future work for real-time performance.

Conclusion
In this paper, we presented a method based on augmented graph cuts and a dynamic model for hand tracking and segmentation in different environments. The proposed algorithm resolves three problems: fast hand motion capture, hand occlusions, and hand over face. In our method, we reformulate the energy function by adding new energy terms that make hand tracking and segmentation more robust. Additionally, the new terms can deal with occlusions and yield accurate segmentation.
Meanwhile, several aspects can still be improved. First, we could develop a method to automatically extract the hand region instead of manually segmenting the hands in the initialization step; for instance, the AdaBoost algorithm [26] could detect the region of interest (ROI) of the hands, and GrabCut [28] could then achieve hand segmentation within the ROI. Second, prior knowledge could be incorporated into the proposed method to handle total occlusion. Moreover, another important point is the tuning of the parameters of the energy function. Our future research will focus on these problems.

Figure 1: The framework of our method for hand tracking and segmentation.

Here ρ is a scaling function explained in Section 3, and d_i(p) = min_{q ∈ o_i^{t+1|t}} ‖p − q‖ constrains a new estimate to lie in the spatial neighborhood of the prediction. For example, if a pixel p is inside the mask of the predicted object o_i^{t+1|t}, then d_i(p) = 0; if p is outside the mask, d_i(p) equals the distance from p to the nearest pixel q ∈ o_i^{t+1|t}.

Figure 3: (a) Some points fall outside the hand region via optical flow at t = 75, 125. (b) The degeneracy problem occurs (the number of points is drastically reduced at time t = 125). (c) Results by our method.

Figure 5: (a) Source image and (b) result of the chamfer distance transform.

Figure 11: Results with a small parameter value (initialization at time t = 0).