This paper presents a hand detection and tracking method that operates in real time on depth data. To detect a hand region, we propose a classifier that combines boosting with a cascade structure. The classifier uses depth-difference features in both the learning and detection stages. The features of each candidate segment are computed by subtracting the average depth values of its subblocks from the central depth value of the segment. When constructing the classifier, features are selected according to their discriminating power. To predict the hand region in a successive frame, a seed point is determined in the next frame, and a region growing scheme starting from that seed point yields the hand region. To determine the central point of the hand, we propose the Depth Adaptive Mean Shift (DAM-Shift) algorithm. DAM-Shift is a variant of CAM-Shift (Bradski, 1998) in which the size of the search disk varies according to the depth of the hand. We have evaluated the proposed hand detection and tracking algorithm qualitatively and quantitatively against the existing AdaBoost algorithms (Friedman et al., 2000) and have analyzed the tracking accuracy through performance tests in various situations.
Over the past decade, the automatic analysis of human behavior has been studied intensively. Among these study areas, human-computer interaction has attracted the most attention, and many studies have addressed human gesture recognition. A gesture is an effective nonverbal communication tool that conveys simple messages and thereby supports complex human interactions. Hand gesture recognition is applied in many fields, from sign language systems for the hearing impaired to smart devices for effective interaction. Various gesture recognition approaches involving hand region detection, hand feature extraction, and learning and recognition methods have been reported. Existing studies include the use of a data glove to analyze hand images [
The methods that use color images use such information as skin color or edges. Suk and Sin [
Many studies have combined color data with depth data to offset the weakness of color-based methods under environmental changes. Park et al. [
Although methods that combine color and depth data improve the detection of the hand region, they remain limited by their dependence on color. Other studies did not require color data and used only depth data. Mo and Neumann [
Figure
System flow.
The rest of this paper is organized as follows: Section
The purpose of the features is to enable fast and accurate detection of the hand region in a depth image [
Algorithm
for for end end
The feature value array Fv.
Figure
Example of extracting features.
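The depth-difference features described in this paper (central depth value minus the average depth of each subblock) can be illustrated with a summed-area table. The following is only a minimal sketch under stated assumptions: the function names, the square segment, and its split into a `grid` x `grid` layout of equal subblocks are hypothetical choices for illustration, not the layout used in the paper.

```python
def integral_image(depth):
    """Return an (H+1) x (W+1) summed-area table for a 2D list of depth values."""
    h, w = len(depth), len(depth[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += depth[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def block_mean(ii, x0, y0, x1, y1):
    """Mean depth over the half-open rectangle [x0, x1) x [y0, y1)."""
    total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
    return total / ((x1 - x0) * (y1 - y0))


def depth_difference_features(depth, ii, x, y, size, grid=3):
    """Central depth minus the mean depth of each subblock of a square segment.

    The segment has its top-left corner at (x, y) and side length `size`.
    Splitting it into grid x grid equal subblocks is an assumed layout.
    """
    step = size // grid
    center = depth[y + size // 2][x + size // 2]
    features = []
    for gy in range(grid):
        for gx in range(grid):
            x0, y0 = x + gx * step, y + gy * step
            features.append(center - block_mean(ii, x0, y0, x0 + step, y0 + step))
    return features
```

Each subblock mean costs four table lookups regardless of the subblock size, which is why an integral image keeps the per-candidate cost constant.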
The function
The computing time for the feature values can be reduced using an integral image. A cascade method can further reduce the amount of computation needed for positive and negative decisions, enabling real-time detection. However, scale invariance is another issue to be resolved for object detection. For example, in [
Equation (
A learning phase needs features to be used when building a boosting classifier. The pool of features is made with an algorithm illustrated in Algorithm
Example of feature extraction.
The values of
Given a pool of features, the learning process proceeds through a boosting classifier. Most classifiers try to increase the detection rate (the number of detected positive samples over the total number of positive samples) while reducing the false positive rate (the number of negative samples detected as positive over the total number of negative samples). In general, these two criteria conflict. The AdaBoost-based method of Viola et al. determines the threshold value with the least error rate to form a weak classifier. A strong classifier then combines weak classifiers until it meets a desired detection rate. If the strong classifier does not reach a satisfactory false positive rate, the next stage of the cascade structure is carried out. That is, each stage of the cascade structure works as a strong classifier. As a result, a very large number of weak classifiers is needed in the overall cascade structure.
Because a hand region is fairly regular and has a simple pattern, we choose a weak classifier that acts like a strong classifier in the method of Viola et al. A single weak classifier plays the role of that strong classifier: its threshold value is set so that it meets a desired detection rate. In other words, we enforce a kind of overfitting at each weak classifier. A satisfactory false positive rate is then achieved by the succeeding stages of the cascade structure in a sequential manner. This strategy can speed up the detection procedure and dramatically reduce the number of weak classifiers:
Equations (
Method for calculating the threshold.
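The idea of setting the threshold of a single weak classifier so that it meets a desired detection rate can be sketched as follows. This is an illustration only, not the paper's equations: the function names are hypothetical, and the sketch assumes that a candidate is accepted when its feature value is greater than or equal to the threshold.

```python
import math


def threshold_for_detection_rate(pos_values, target_dr):
    """Largest threshold t such that at least target_dr of the positive
    samples satisfy value >= t. Assumes 0 < target_dr <= 1."""
    ordered = sorted(pos_values, reverse=True)
    needed = math.ceil(target_dr * len(ordered))
    return ordered[needed - 1]


def rates(pos_values, neg_values, threshold):
    """Detection rate and false positive rate of the rule value >= threshold."""
    dr = sum(v >= threshold for v in pos_values) / len(pos_values)
    fpr = sum(v >= threshold for v in neg_values) / len(neg_values)
    return dr, fpr
```

With this rule the detection rate is fixed by construction, so the false positive rate is the only quantity left to compare between candidate features.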
Figure
Structure of classifier generated using cascade.
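A cascade in which every stage is a single thresholded feature can be sketched as below. This is a simplified illustration under assumptions, not the training procedure of the paper: the names are hypothetical, each stage accepts when the feature value is at least the threshold, the feature chosen at each stage is the one that lets the fewest remaining negative samples through, and all positive samples are reused at every stage.

```python
import math


def stage_threshold(values, target_dr):
    """Largest threshold kept by at least target_dr of the positive values."""
    ordered = sorted(values, reverse=True)
    return ordered[math.ceil(target_dr * len(ordered)) - 1]


def train_cascade(positives, negatives, target_dr=0.99, max_stages=50):
    """Greedily build a cascade of (feature_index, threshold) stages.

    positives and negatives are lists of feature vectors. Only the negatives
    that still pass the current cascade are handed to the next stage.
    """
    cascade = []
    remaining = list(negatives)
    n_features = len(positives[0])
    while remaining and len(cascade) < max_stages:
        best = None
        for f in range(n_features):
            t = stage_threshold([p[f] for p in positives], target_dr)
            false_pos = sum(n[f] >= t for n in remaining)
            if best is None or false_pos < best[0]:
                best = (false_pos, f, t)
        false_pos, f, t = best
        if false_pos == len(remaining):
            break  # no feature rejects any of the remaining negatives
        cascade.append((f, t))
        remaining = [n for n in remaining if n[f] >= t]
    return cascade


def passes_cascade(cascade, features):
    """Accept a candidate only if it passes every stage (early rejection)."""
    return all(features[f] >= t for f, t in cascade)
```

Because `all` stops at the first failing stage, most non-hand candidates are rejected after evaluating only one or two features.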
Figure
Algorithm
while (1) Select the best classifier satisfying (2) If (3) Evaluate current cascaded classifier and update (4) (5) Include false positive samples and positive samples in set
Once a feature is selected, the cascade classifier, a combination of the selected feature and the previously selected features, computes
Figure
Process of hand region detection.
Algorithm
for if for if if if distance if intersect end if end if end if end if end Result end if End
TempR denotes the memory space in which the data for the quadrangles are stored. We test the depth differences and the overlap regions between the center points of the quadrangles, which are the hand region candidates, and any quadrangle that passes the test is added to TempR. When another quadrangle is tested, its depth difference and degree of overlap with the average quadrangle in TempR are compared for the merge operation. The final product of the merge process is the average quadrangle, which becomes the final hand region. The average quadrangle is obtained by averaging the vertex coordinates of the quadrangles. In the next tracking stage, the center point of the average quadrangle is used as the initial point of the tracking.
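The merge step can be sketched roughly as follows. The sketch makes several assumptions that are not taken from the paper: quadrangles are simplified to axis-aligned rectangles `(x0, y0, x1, y1)`, each candidate carries the depth at its center, and the limits `max_depth_diff` and `min_overlap` are hypothetical values.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the smaller of the two rectangle areas."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (w * h) / min(area_a, area_b)


def average_rect(rects):
    """Average the corner coordinates of several rectangles."""
    n = len(rects)
    return tuple(sum(r[i] for r in rects) / n for i in range(4))


def merge_candidates(candidates, max_depth_diff=50, min_overlap=0.5):
    """Merge candidates (rect, center_depth) that agree with the running average.

    Returns (average rectangle, its center point), or None without candidates.
    """
    if not candidates:
        return None
    temp_r = [candidates[0]]  # plays the role of TempR in the text
    for rect, depth in candidates[1:]:
        avg_rect = average_rect([r for r, _ in temp_r])
        avg_depth = sum(d for _, d in temp_r) / len(temp_r)
        close_in_depth = abs(depth - avg_depth) <= max_depth_diff
        if close_in_depth and overlap_ratio(rect, avg_rect) >= min_overlap:
            temp_r.append((rect, depth))
    merged = average_rect([r for r, _ in temp_r])
    center = ((merged[0] + merged[2]) / 2, (merged[1] + merged[3]) / 2)
    return merged, center
```

The returned center point corresponds to the initial tracking point mentioned above.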
Figure
Result of merge process.
In the previous section, the hand region and the center point of the hand region are detected. In the tracking process, the nearest point in the next frame is to be detected on the basis of the center point of the hand region in the current frame [
Figure
Transition of tracking point.
The term nearest point means the closest point in
When the nearest point
Equations (
Algorithm
Input: Initialize:
while for for for if else if else end if end if end end end Output: VB, UVB
Figure
Result of region growth.
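Region growing from a seed point on a depth image can be sketched as a breadth-first search. The acceptance criterion used here (a 4-connected neighbor joins the region when its depth differs from the pixel it is reached from by at most `max_step`) and the parameter value are assumptions for illustration; the boundary bookkeeping of the algorithm above is not reproduced.

```python
from collections import deque


def grow_region(depth, seed, max_step=10):
    """Breadth-first region growing on a 2D depth map from seed = (x, y).

    Returns the set of (x, y) pixels reached from the seed.
    """
    h, w = len(depth), len(depth[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and (nx, ny) not in region:
                if abs(depth[ny][nx] - depth[y][x]) <= max_step:
                    region.add((nx, ny))
                    queue.append((nx, ny))
    return region
```

Comparing each neighbor with the pixel it is reached from, rather than with the seed, lets the region follow a hand that is tilted in depth while still stopping at the large depth jump to the background.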
In the tracking process, the nearest point is detected from the previous tracking point, the region is grown, and the hand region is detected. For stable tracking, a point that meets certain conditions needs to be tracked. Hence, we have defined a point that converges to the center of the contour line using the DAM-Shift algorithm as the tracking point. The DAM-Shift is defined in a manner similar to the Mean Shift [
Equation (
An image of the hand region with the five fingers open is acquired from various distances, and the center point of the palm is determined manually. The distance from the center point to the farthest middle finger and the depth value of the central palm are collected. The depth value is assigned to
Figure
Decision process for tracking point.
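The core idea of DAM-Shift, a mean shift whose search disk depends on the depth of the hand, can be sketched as follows. This is a simplified illustration and not the algorithm as defined in the paper: the polynomial coefficients in `search_radius` are hypothetical placeholders for values fitted to measured data, depth is assumed to be in millimeters, and the radius is computed once from the mean depth of the grown region.

```python
import math


def search_radius(depth_mm, coeffs=(1.0e-5, -0.06, 110.0)):
    """Disk radius in pixels as a second-order polynomial of the depth.

    The coefficients are hypothetical; in practice they would be fitted to
    measured (depth, hand size) pairs.
    """
    a, b, c = coeffs
    return a * depth_mm ** 2 + b * depth_mm + c


def dam_shift(region, depth, start, max_iter=20, eps=0.5):
    """Shift `start` towards the centroid of the region pixels inside a disk
    whose radius depends on the hand depth. `region` must be non-empty."""
    hand_depth = sum(depth[y][x] for x, y in region) / len(region)
    radius = search_radius(hand_depth)
    cx, cy = start
    for _ in range(max_iter):
        inside = [(x, y) for x, y in region
                  if math.hypot(x - cx, y - cy) <= radius]
        if not inside:
            break
        nx = sum(x for x, _ in inside) / len(inside)
        ny = sum(y for _, y in inside) / len(inside)
        moved = math.hypot(nx - cx, ny - cy)
        cx, cy = nx, ny
        if moved < eps:
            break  # converged
    return cx, cy
```

With the placeholder coefficients the disk shrinks from about 60 pixels at 1 m to about 30 pixels at 2 m, which mirrors the intent that a distant hand occupies fewer pixels.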
Many unexpected events may occur during tracking. A user may want to finish a hand movement, or the hand may touch other objects. To handle situations in which tracking is impossible or unnecessary, we need to judge the success or failure of tracking whenever the region growing is completed. Through many experiments, we found that the tracking point moves inappropriately when the hand moves very swiftly, is positioned behind the face, or overlaps another object. Hence, we include a module that detects such invalid tracking cases.
As shown in Figure
Results of region growing and boundary detection.
Equation (
We used depth images of
Figure
Fifty selected features.
Table
Comparison of detection rate and false positive rate of various AdaBoost algorithms.
MFPR (maximum acceptable false positive rate) | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | 0 |
---|---|---|---|---|---|---|---|---|
Discrete AdaBoost | | | | | | | | |
DR | 0.96 | 0.99 | 0.98 | 0.99 | 1 | 0.97 | 0.99 | 0.96 |
FDR | 0.72 | 0.73 | 0.74 | 0.87 | 0.71 | 0.85 | 0.56 | 0.13 |
Computation (ms) | 25.76 | 20.13 | 19.82 | 23.70 | 32.60 | 36.18 | 39.63 | 123.56 |
Real AdaBoost | | | | | | | | |
DR | 0.98 | 0.99 | 0.97 | 0.96 | | 0.99 | 1 | 0.94 |
FDR | 0.47 | 0.39 | 0.27 | 0.45 | | 0.26 | 0.20 | 0.01 |
Computation (ms) | 15.63 | 15.77 | 15.42 | 16.85 | | 21.36 | 24.18 | 82.83 |
Gentle AdaBoost | | | | | | | | |
DR | 0.96 | 0.99 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 | |
FDR | 0.46 | 0.49 | 0.40 | 0.33 | 0.59 | 0.37 | 0.24 | |
Computation (ms) | 16.22 | 18.28 | 20.62 | 18.84 | 18.71 | 18.39 | 26.53 | |
Proposed method | | | | | | | | |
DR | | | | | | | | |
FDR | | | | | | | | |
Computation (ms) | | | | | | | | |
In general, our method surpasses the other boosting methods in every criterion. Gentle AdaBoost exhibits its best result when the stage-allowed MFPR is set to zero. As shown in the table, the detection rate in this case is 0.98, and the false discovery rate is 0.01. However, its detection speed is twenty times slower than that of our method. This is because each stage must include all the possible weak classifiers when the MFPR is set to zero, and the larger number of weak classifiers reduces the speed. Our method selects no incorrect hand image at a detection rate of 97% and is the fastest of the compared algorithms. This is because our classifier contains only 50 features and forms a cascade structure.
Figure
Computational performances.
Original image
RAB (
Proposed method
Figure
Detection results.
Table
Average error of tracking with various distance conditions (unit: pixels).
Gesture | 1 m | 1.5 m | 2 m |
---|---|---|---|
“a” | 5.7883 | 4.2890 | 2.1928 |
“b” | 4.3927 | 3.9835 | 2.3920 |
Spiral | 4.6908 | 3.2395 | 2.9803 |
Tracking results.
Figure
Figure
Tracking results of gestures “a” and “b” under various velocity conditions.
Table
Average error of tracking with various velocity conditions (unit: pixels).
Gesture | Slow | Normal | Fast |
---|---|---|---|
“a” | 3.7559 | 4.1664 | 8.1443 |
“b” | 2.3446 | 3.4423 | 3.3796 |
Figure
Example of tracking confidence.
Table
Computing time (unit: ms).
Measure | Detection stage | Tracking stage |
---|---|---|
Computing time | 5.85 | 25.96 |
We have proposed a hand detection and tracking method that works well in a real-world environment. For hand detection, we have developed effective features and a cascade structure for the classifier. The features are generated from dynamic depth differences. The cascade structure is constructed by selecting features according to their discriminating power, with the strategy of minimizing the false positive rate at each stage.
For hand tracking, we have developed the DAM-Shift algorithm, a variant of the CAM-Shift algorithm that varies its search area according to the depth of the hand. Our second-order polynomial model predicts the size of the hand well, which plays an important role in confining the search area. To handle situations in which tracking is impossible or unnecessary, we have developed a judgment module that detects inappropriate tracking and decides whether the current tracking is valid.
We have evaluated the proposed hand detection and tracking algorithm qualitatively and quantitatively against the existing AdaBoost algorithms, and we have analyzed the tracking accuracy through performance tests in various situations. This study shows that the proposed methods surpass the existing methods in terms of accuracy and computation time.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the Seoul R&BD Program (SS110013) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2013R1A1A2012012).