Hand Detection Using Cascade of Softmax Classifiers

1 Shenzhen Key Laboratory of Virtual Reality and Human Interaction Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
2 Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen 518055, China
3 Swanson School of Engineering, The University of Pittsburgh, Pittsburgh, PA 15261, USA
4 CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China


Introduction
Hand detection refers to determining the locations and shapes of hands. It is a prerequisite step for various hand gesture recognition systems [1,2], which have been widely studied because of their potential applications in entertainment and virtual reality [3], medical systems, and assistive technologies, as well as in crisis management and disaster relief [4]. However, hand detection is difficult because of hand deformation [5], the sensitivity of skin color to lighting conditions [6], and the complicated environments encountered in practical applications. As a result, robust and efficient hand detection remains a challenging task in the computer vision community.
Multiclass hand posture detection is worth investigating for several reasons: different users may be accustomed to different postures for interaction; many application systems require multiple postures to realize different functions; and robust detection of the human hand from multiple viewpoints can be achieved through multiclass hand detection by letting different posture categories represent postures captured under different viewpoints. One way to deal with multiclass hand detection is first to locate the human hand and then to determine the hand shape by classification. Such methods usually have low accuracy. For example, locating the hand using skin color cues is easily affected by lighting conditions and skin-like backgrounds, which leads to high miss and false detection rates and degrades the accuracy and speed of the follow-up classification. Another example is to train a binary classifier for sliding-window-based hand localization, in which all predefined postures are treated as a single positive class and the background is regarded as the negative class. In this method, the differences among posture shapes increase the pattern complexity of the positive space and consequently lead to a low exclusion rate for background windows. Another way to perform multiclass hand detection is to build an independent detector for each predefined posture and to detect the postures sequentially with the corresponding detectors [7,8]. This practice has several disadvantages: (a) the computing cost is high, because multiple rounds of detection are required to find postures of multiple categories; (b) a window image may be predicted into multiple posture categories, which results in heavily overlapping detections; and (c) the detectors are trained independently rather than jointly, which easily causes confusion between different postures.
To improve the performance of multiclass hand posture detection, we propose a softmax-based cascade detector that integrates several softmax-based binary (SftB) classifiers at the early stages and a softmax-based multiclass (SftM) classifier at the last stage. The advantages of the proposed method include the following: (a) the softmax-based structure makes it possible to detect multiple posture classes in parallel; (b) the cascade structure helps decompose the complexity of the background pattern space and therefore improves the detection accuracy; (c) the pass-rate of postures and the false rate of background can be adjusted easily through the binary SftB classifiers (adapted from softmax models) in the first few stages; (d) the SftB-based binary classification relies on the multiple decision surfaces implied by the softmax model and excludes background more strongly than binary classifiers trained with the examples of all defined posture categories as a single positive class; and (e) with the cascaded softmax scheme, the prediction probabilities across multiple stages can be merged to make the final decision, which helps reduce the confusion rates between posture categories. Moreover, stage-classifiers at increasing stage-levels take HOG features of increasing resolution to balance detection accuracy and efficiency. To sum up, the major contributions of this work are as follows:
(1) A softmax-based cascade architecture is proposed to detect multiple hand posture classes in parallel while decomposing the complexity of the background pattern space to improve the detection accuracy.
(2) The SftB classifier is proposed to better distinguish the predefined postures from the background regions, since it can decompose the complexity of the multiclass posture pattern space through multiclass decision boundaries that are learned jointly.
(3) The cascade is designed to take low-resolution HOG features at the lower stages and HOG features of higher resolutions for the stage-classifiers at higher levels, which helps balance detection accuracy and efficiency.
The remainder of this paper is organized as follows. Section 2 briefly reviews the existing work on vision-based human hand detection. The proposed softmax-based cascade architecture is described in detail in Section 3. Experimental results and discussions are provided in Section 4. Conclusions and future work are offered in Section 5.

Related Work
Vision-based hand detection methods can generally be separated into two groups: appearance-based methods and 3D-model-based methods [2,7]. Appearance-based methods carry out the detection by directly comparing image features with prebuilt appearance models. These methods are usually highly efficient, but their performance is easily affected by viewpoint variation and hand deformation. The 3D methods adopt a kinematic model with many degrees of freedom [5,8]. Such methods offer a richer hand description and can therefore deal with more posture categories, but they are usually computationally expensive because of the complex model matching algorithms. In this work, an appearance-based method is explored to perform multiclass hand posture detection in parallel.
The key to appearance-based methods is to find effective features for hand posture representation and to develop an efficient and expressive posture classification model. Frequently used appearance features include Haar-like [2,7,9], HOG [10-12], SIFT [13,14], and BRIEF [14,15] features. However, such features are seriously affected by cluttered backgrounds, which introduce noise into the feature encodings. For this reason, there is a recent trend toward combining multiple feature descriptions, such as the integration of HOG and skin features in [16] and the association of Haar-like and HOG features in [2]. However, the accuracy improvements of such multifeature methods are usually gained at the expense of a considerable increase in computing cost. To improve the efficiency, a two-level classifier is presented in [1], in which the possible presence of hands is determined from a global perspective at the first level, and hand regions are then precisely delineated at the pixel level by a probabilistic model at the second level. In [17], the saliency map generated by a Bayesian model is first thresholded to localize the hand regions, and shape and texture features are then extracted from the saliency map of the hand regions for hand posture recognition. More recently, deep learning (DL) methods have also been investigated for hand posture detection, such as the integration of a CNN scheme with fast candidate generation [18], the multiscale deep feature approach [19], and the deep architecture with three networks sharing convolution layers [20]. However, DL-based methods are much slower than classical methods when the algorithms run on a machine without advanced GPUs.
The multiclass posture detection problem is often addressed by two-stage methods [20-23], in which hand region proposals are first obtained by techniques such as skin, motion, or saliency detection, which are robust to hand deformation and viewpoint variation, and these regions are then classified by multiple binary models or a single multiclass model to achieve the final posture recognition. For such methods, precise region proposals are a prerequisite for satisfactory recognition rates, yet obtaining precise proposals is itself difficult if no specific posture models are utilized. As a result, the misdetection rate of such methods is often relatively high. Sliding-window-based methods usually perform multiclass posture detection with multiple posture-specific detectors [9,24]. Such methods may have relatively high recall rates, but they lack efficiency, since each window needs to be classified by multiple detectors, and they suffer from heavy confusion, because the detectors for different categories are trained independently rather than in a coordinated manner. In addition, some works adopt a tree-type structure [7], but practical experiments show no significant improvement in accuracy or efficiency. In this work, we propose a softmax-based cascade detector that detects all hand posture classes simultaneously rather than category by category. Moreover, owing to the multiclass objective function, the decision boundaries are obtained by seeking a balance among all categories and can therefore help reduce the confusion rates among different posture categories.

The Proposed Methodology
In this section, the softmax model for multiclass classification is presented first. Then, the softmax-based cascade architecture is introduced for multiclass hand posture detection. Finally, we show how multiresolution features are applied to the cascade architecture to balance detection accuracy and efficiency.

Multiclass Hand Posture Classification by Softmax Regression.
Instead of utilizing multiple independent binary classifiers, our method applies the softmax model [25] to discriminate among the background category and the multiple hand posture categories. To be specific, given the feature vector $z$ of image $I$, the distribution of the class label $l(I) \in \{k\}_{k=0}^{K}$ can be modeled as
$$y_k(z; \Theta) = P(l(I) = k \mid z; \Theta) = \frac{\exp\left(\sum_{j=1}^{M} \theta_j^{(k)} \phi_j(z)\right)}{\sum_{c=0}^{K} \exp\left(\sum_{j=1}^{M} \theta_j^{(c)} \phi_j(z)\right)}, \tag{1}$$
where $\Theta = \{\{\theta_j^{(k)}\}_{j=1}^{M}\}_{k=0}^{K}$ are the model parameters and $\{\phi_j(\cdot)\}_{j=1}^{M}$ are the basis functions used for feature transformation. $l(I) \in \{1, \cdots, K\}$ means that $I$ is an image of the $l(I)$th posture category, and $l(I) = 0$ indicates that $I$ is an image of background or undefined postures. In this work, the identity basis functions are adopted; that is, $\phi(z) = z$. For a kernelized softmax model, $\phi(z) = (\kappa(z, z_1), \cdots, \kappa(z, z_N))$, where $\kappa(\cdot, \cdot)$ is the kernel function and $\{z_n\}_{n=1}^{N}$ are the features of the training examples. To facilitate the subsequent discussion, the ground-truth label of $I$ is reformulated as a $(K+1)$-dimensional vector $t = t(I) \in \{0, 1\}^{K+1}$, whose $k$th element $t_k$ ($0 \le k \le K$) equals 1 if $l(I) = k$ and 0 otherwise. Moreover, we use $y(\cdot; \Theta)$ to denote the softmax model with parameter $\Theta$ and $y(z; \Theta)$ to denote the vector $(y_0(z; \Theta), \cdots, y_K(z; \Theta))$ for simplicity. With these notations, the distribution of the label vector $t$ can be formulated as
$$P(t \mid z; \Theta) = \prod_{k=0}^{K} y_k(z; \Theta)^{t_k}. \tag{2}$$
The model parameter $\Theta$ can be obtained by maximum likelihood estimation (MLE) [25,26]. To be specific, given the training set $\{\mathbf{z} = \{z_n\}_{n=1}^{N}, \mathbf{t} = \{t_n\}_{n=1}^{N}\}$ and assuming independent and identically distributed examples, the likelihood of the parameters $\Theta$ can be formulated as
$$P(\mathbf{t} \mid \mathbf{z}; \Theta) = \prod_{n=1}^{N} \prod_{k=0}^{K} y_k(z_n; \Theta)^{t_{nk}}, \tag{3}$$
where $z_n$ is the feature representation of $I_n$, $t_n$ is the $(K+1)$-dimensional label of example $I_n$, and $t_{nk}$ is the $k$th component of $t_n$. In implementation, $\Theta$ is acquired by minimizing the negative log-likelihood
$$E(\Theta) = -\sum_{n=1}^{N} \sum_{k=0}^{K} t_{nk} \ln y_k(z_n; \Theta). \tag{4}$$
Since the loss function in (4) remains unchanged when the same offset is added to the parameters of every class, its minimizer is not unique and the parameter magnitudes are unconstrained. A penalty on $\Theta$ is therefore added to the objective function to suppress the magnitude of the model parameters. In practice, we take the loss function with a regularization term
$$\tilde{E}(\Theta) = E(\Theta) + \lambda\, \Omega(\Theta), \tag{5}$$
where $\Omega(\Theta) = \sum_{k=0}^{K} \sum_{j=1}^{M} \big(\theta_j^{(k)}\big)^2$ and $\lambda$ is the regularization coefficient. Finally, we use the efficient iterative BFGS algorithm [27,28] to find the solution of (5). Once the model parameters $\Theta$ are obtained, the prediction of $l(I)$ can be made with the softmax model by
$$\hat{l}(I) = \arg\max_{0 \le k \le K} y_k(z; \Theta). \tag{6}$$
This prediction formula will be slightly modified in the next subsection to carry out two-class classification.
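The softmax probabilities, the regularized loss, and the arg-max prediction described above can be illustrated with a minimal Python sketch. It assumes identity basis functions and a plain sum-of-squares penalty; the function names (`softmax_probs`, `regularized_nll`, `predict`) are hypothetical, and the BFGS optimization itself is not shown.

```python
import math

def softmax_probs(theta, z):
    """Class probabilities for feature vector z with identity basis functions.
    theta[k] is the weight vector of class k; k = 0 is the background class."""
    scores = [sum(w * x for w, x in zip(theta_k, z)) for theta_k in theta]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def regularized_nll(theta, features, labels, lam):
    """Negative log-likelihood plus lam times the sum of squared weights.
    The exact penalty form is an assumption of this sketch."""
    nll = 0.0
    for z, label in zip(features, labels):
        nll -= math.log(softmax_probs(theta, z)[label])
    penalty = sum(w * w for theta_k in theta for w in theta_k)
    return nll + lam * penalty

def predict(theta, z):
    """Arg-max prediction over all classes, background included."""
    probs = softmax_probs(theta, z)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical toy model: background plus two posture classes, 2-D features.
theta = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[w0 + 5.0, w1 - 3.0] for w0, w1 in theta]  # same offset for every class
print(predict(theta, [2.0, 0.0]))
print(softmax_probs(theta, [2.0, 0.0]))
print(softmax_probs(shifted, [2.0, 0.0]))  # unchanged up to rounding
```

The last two lines print the same probabilities up to rounding, which is the invariance that motivates the penalty term: without it, the unregularized loss cannot distinguish `theta` from `shifted`.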

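Softmax-Based Cascade Architecture for Human Hand Detection.
For multiscale sliding-window-based hand detection, the background pattern space is highly complicated because of the varied background window images. To decompose the complexity of the background space, a softmax-based cascade architecture is introduced, which comprises a set of softmax-based binary (SftB) classifiers $\{H_p(\cdot)\}_{p=1}^{P}$ and a softmax-based multiclass (SftM) classifier $H_{P+1}(\cdot)$. These classifiers are obtained from the $(P+1)$ softmax regression models $\{y(\cdot; \Theta_p)\}_{p=1}^{P+1}$, which are learned with a cascade training procedure. The classifiers $\{H_p(\cdot)\}_{p=1}^{P}$, with outputs in $\{0, 1\}$, are mainly used to distinguish the defined hand postures from the background window images, where the SftB classifier $H_p(\cdot)$ is formulated as
$$H_p(z) = \begin{cases} 1, & \max_{1 \le k \le K} y_k(z; \Theta_p) - y_0(z; \Theta_p) \ge \tau_p, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$
That is to say, for stage $p$, the window $I$ is accepted if and only if the maximal probability over the posture categories exceeds the probability of the background category by at least $\tau_p$. The parameters $\{\tau_p\}_{p=1}^{P}$ are set so that most windows that properly contain the defined postures can get through, and they are determined at the training stage from the preset posture example pass-rates. For $H_p(\cdot)$, the set $\Upsilon^{(p)} = \{d_p(z_n)\}_{n=1}^{N_p}$ is computed on the posture example set $\{z_n\}_{n=1}^{N_p}$ used for learning $\Theta_p$, where $d_p(z) = \max_{1 \le k \le K} y_k(z; \Theta_p) - y_0(z; \Theta_p)$ is the margin that appears in (7). $\Upsilon^{(p)}$ is sorted in ascending order to produce the vector $v \in \mathbb{R}^{N_p}$, and the value $\tau_p = v(\mathrm{floor}((1 - \rho_p) N_p))$ is taken as the threshold, where $\rho_p$ is the preset posture example pass-rate for the $p$th-stage SftB classifier during training. The SftM classifier $H_{P+1}(\cdot)$, with output in $\{k\}_{k=0}^{K}$, takes the form described in (6), and it is mainly used to discriminate among the $(K+1)$ categories, comprising the $K$ classes of defined postures and the difficult backgrounds. To speed up the classification, the classifiers $\{H_p(\cdot)\}_{p=1}^{P+1}$ can be replaced by the classifiers $\{R_p(\cdot)\}_{p=1}^{P+1}$ defined in (8). The thresholds $\{\xi_p\}_{p=1}^{P}$ used in (8) can be determined in the same way as the thresholds $\{\tau_p\}_{p=1}^{P}$, that is, by sorting the corresponding outputs on the posture examples and applying the same preset pass-rates.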
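As an illustration of how a stage threshold follows from a preset pass-rate, the sketch below sorts the margins (largest posture probability minus background probability) of the posture examples and picks the element at position floor((1 - pass-rate) N). The zero-based indexing, the small rounding guard, and all function names are assumptions of this sketch rather than details confirmed by the text.

```python
import math

def margin(probs):
    """Largest posture probability minus the background probability.
    probs[0] is the background class; probs[1:] are the posture classes."""
    return max(probs[1:]) - probs[0]

def threshold_from_pass_rate(posture_margins, pass_rate):
    """Sort the margins in ascending order and take the element at index
    floor((1 - pass_rate) * N). Zero-based indexing is assumed, and 1e-9 is
    added to guard against floating-point round-off in the product."""
    v = sorted(posture_margins)
    idx = math.floor((1.0 - pass_rate) * len(v) + 1e-9)
    idx = min(idx, len(v) - 1)
    return v[idx]

def sftb_accept(probs, tau):
    """Binary stage decision: 1 if the window passes the stage, otherwise 0."""
    return 1 if margin(probs) >= tau else 0

# Hypothetical margins of eight posture examples at one stage.
margins = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4]
tau = threshold_from_pass_rate(margins, 0.75)
passed = sum(m >= tau for m in margins) / len(margins)
print(tau, passed)
```

With these toy values, six of the eight posture examples have a margin of at least `tau`, so the realized pass-rate matches the preset one. Lowering the pass-rate raises the threshold and rejects more background windows at the cost of more missed postures.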
The classification of a window image $I$ is achieved by a two-step decision process. In the first step, the class label of $I$ is predicted as $\tilde{l}(I)$ by (9), where $z_I^{(p)}$ represents the feature representation used by $R_p(\cdot)$. The range of $\tilde{l}(I)$ is $\{0, 1, \cdots, K\}$. When $\tilde{l}(I)$ is 0, the window $I$ is directly excluded and the second step is not carried out. In the second step, the class label of a window image $I$ accepted by (9) is reidentified by (10), where $s(I)$ is the $(K+1)$-dimensional score vector calculated in (11) using the softmax models at the high-level stages. In the experimental part, $p_0$ is set at 2. For ease of understanding, the flowchart of the window image classification is provided in Figure 1.
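The two-step decision can be sketched in Python as follows. This is only an illustration under stated assumptions: the binary stages are taken to reject a window when the margin between the best posture probability and the background probability falls below the stage threshold, the stage probabilities from stage `p0` onward are merged by summation, and the final label is the best posture class of the merged scores. The merging rule, the function name `classify_window`, and the toy models are hypothetical.

```python
def classify_window(stage_models, stage_features, thresholds, p0=2):
    """Two-step decision for one window.

    stage_models[p] maps a feature vector to class probabilities
    (index 0 = background); the last model is the multiclass stage.
    thresholds[p] is the margin threshold of binary stage p.
    Returns 0 for background or a posture index in 1..K.
    """
    probs_seen = []
    num_binary = len(stage_models) - 1
    # Step 1: each binary stage rejects the window as soon as its margin is too small.
    for p in range(num_binary):
        probs = stage_models[p](stage_features[p])
        if max(probs[1:]) - probs[0] < thresholds[p]:
            return 0
        probs_seen.append(probs)
    final = stage_models[-1](stage_features[-1])
    probs_seen.append(final)
    if max(range(len(final)), key=final.__getitem__) == 0:
        return 0
    # Step 2: merge the probabilities of stages p0..last (1-based) by summation
    # (an assumed merging rule) and pick the best posture class.
    merged = [sum(col) for col in zip(*probs_seen[p0 - 1:])]
    return max(range(1, len(merged)), key=merged.__getitem__)

# Hypothetical three-stage cascade (two binary stages, one multiclass stage)
# with fixed outputs standing in for trained softmax models.
models = [
    lambda z: [0.2, 0.5, 0.3],
    lambda z: [0.1, 0.3, 0.6],
    lambda z: [0.1, 0.5, 0.4],
]
features = [None, None, None]
print(classify_window(models, features, [0.1, 0.1], p0=2))
print(classify_window(models, features, [0.1, 0.1], p0=3))
```

In this toy case, the last stage alone favors posture 1, while merging the last two stages favors posture 2, which shows how combining stage outputs can change the final label.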

Multiresolution HOG Feature for Different Stage-Classifiers.
For sliding-window-based hand detection, tens of thousands of window images must be classified in a single frame, which makes the detection system inefficient. To improve the efficiency, multiresolution HOG features are adopted for posture representation [24]. The cascade is designed so that low-resolution HOG features are used by the classifiers at lower stage-levels and high-resolution HOG features are used by the classifiers at higher stage-levels. The feature resolution is varied by adjusting the density of cell splits in the window images, as discussed in [24]. With this multiresolution scheme, a large number of background windows can be excluded by the classifiers that use low-resolution HOG features, and only a few difficult background windows need to be further classified with the high-resolution HOG features, which are more discriminative but more computationally costly. In this way, the detection speed can be greatly improved without sacrificing the detection accuracy. Concretely, let $t_p$ denote the time consumed by classifying a single window with the $p$th stage-classifier, and let $r_p$ denote the fraction of windows that pass through the $p$th stage:
$$r_p = \frac{\text{number of windows passing the first } p \text{ stage-classifiers}}{\text{number of all windows generated from the full-sized image}}.$$
Then, under the proposed multiresolution cascade scheme, the average time expense for classifying one window image is $T_1 = t_1 + \sum_{p=1}^{P} r_p t_{p+1}$. However, if the detection system adopts a single softmax model with HOG features of the highest resolution, the time expense would be $t_{P+1}$, which is usually several times as much as $T_1$.
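The average per-window cost of the cascade can be checked with a few lines of Python. The stage times and pass fractions below are hypothetical numbers chosen only to illustrate the formula; they are not measurements from the paper.

```python
def average_window_time(stage_times, pass_fractions):
    """Average time to classify one window with the cascade.

    stage_times    = [t_1, ..., t_{P+1}]  time of each stage-classifier
    pass_fractions = [r_1, ..., r_P]      fraction of all windows that pass
                                          the first p stages
    """
    if len(stage_times) != len(pass_fractions) + 1:
        raise ValueError("need exactly one more stage time than pass fractions")
    # Every window pays t_1; only the fraction r_p pays for stage p + 1.
    return stage_times[0] + sum(
        r * t for r, t in zip(pass_fractions, stage_times[1:])
    )

# Hypothetical four-stage cascade: time doubles with resolution, and each
# stage keeps half of the windows that were still alive before it.
times = [1.0, 2.0, 4.0, 8.0]
fractions = [0.25, 0.125, 0.0625]
cascade_cost = average_window_time(times, fractions)
single_cost = times[-1]  # single softmax at the highest resolution
print(cascade_cost, single_cost / cascade_cost)
```

The gain grows as the early stages reject more windows, since the expensive high-resolution stages are then evaluated on a smaller fraction of the frame.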
(1) Prepare the multiclass posture example set $\mathcal{X}$ and the full-sized background image set $\mathcal{Z}$. Specify the control factors $\{N_0, N_b\}$, the stage number $P + 1$, the HOG resolutions for the different stages, the posture sample pass-rates for the first $P$ stages $\{\rho_p\}_{p=1}^{P}$, and the size of the training samples $w \times h$. Set the current stage level to $p = 1$ and the set of stage-classifiers to $Q = \{\}$. Note that all sub-images cropped from the full-sized background images are of size $w \times h$ in the training process.
(2) Train the first stage-classifier as follows:
(2.1) Set $\tilde{\mathcal{X}} = \mathcal{X}$, $\tilde{\mathcal{Z}} = \{N_0$ sub-images randomly cropped from images in $\mathcal{Z}\}$, and $\mathcal{S} = \emptyset$.
(2.2) Train a softmax model with the sample sets $\tilde{\mathcal{X}}$, $\tilde{\mathcal{Z}}$ and HOG features of the specified resolution, and modify the model into two SftB classifiers $H_1(\cdot)$ (Eq. (7)) and $R_1(\cdot)$ (Eq. (8)) based upon the pass-rate $\rho_1$.
To aid understanding, the details of the training process for the proposed method are described in Algorithm 1. In Step (1), the training data is prepared and the hyperparameters that control the training process are defined. In Step (2), the first stage-classifier is trained, while the remaining stage-classifiers are trained one by one in Step (3). During the training of the first stage, the initial $N_0$ negatives are randomly cropped, and all the rest are acquired using hard example mining techniques (Step (2.4)). This strategy enhances the discriminative ability of the first stage-classifier. For stages larger than 1, all the $N_b$ negatives are directly mined based upon the previous stage-classifier (Step (3.2)). In the $p$th stage, the multiclass softmax model is learned first, and then the modified SftB classifiers $H_p(\cdot)$ and $R_p(\cdot)$ are generated based upon the predefined pass-rate hyperparameter $\rho_p$. Once the stage reaches the predefined $P + 1$, the procedure stops and returns the set of cascade components $Q$.

Experimental Results and Discussions
The proposed method is evaluated on a dataset collected under various scenarios with complex backgrounds and challenging lighting conditions. In this section, we first describe the dataset and experimental settings. Then, the performance of the proposed SftB classifier and softmax-based cascade is evaluated. Finally, the influence of the settings for posture example pass-rates is discussed.
The positive samples are obtained by cropping hand regions from full images collected from ten subjects under various backgrounds and lighting conditions. The negative samples are generated during the training process by randomly cropping image regions from 500 additional full-sized pictures with complicated content. These full-sized images contain various undefined hand postures but no hand postures of the predefined categories. In addition to the training samples, we also prepare 4000 full-sized images to evaluate the performance of the proposed method, each containing at least one predefined posture instance. Examples of the defined posture categories are presented in Figure 2.

Experimental Settings.
In the experiment, training samples are normalized to a resolution of 80×80 pixels. HOG features of various resolutions are utilized for classification, where the different resolutions are achieved by adopting different cell splits. The cell splits for the three adopted resolutions are illustrated in Figure 3. The parameters for the HOG features of all resolutions are fixed as follows: unsigned gradient orientation, 9 equally spaced angle bins, 2 × 2 cells per block, and a block step equal to the cell size. In total, four stage-classifiers are incorporated into the softmax structure. The first three are SftB classifiers, and the last one is the SftM classifier. The feature configuration for each stage-classifier is presented in Table 1. In addition, to improve the detection efficiency, the multiscale search changes the window size rather than resizing the image itself. For example, window sizes of 64 × 64, 80 × 80, 96 × 96, and 120 × 120 can be used to detect hands of different scales in the frame. For a window of size $s \times s$, the region of cell $(i, j)$ is taken as $[x + x_1, x + x_2, y + y_1, y + y_2]$, where $(x, y)$ are the top-left coordinates of the window image, $x_1 = \mathrm{floor}((i - 1) \cdot s / n) + 1$, $x_2 = \mathrm{ceil}(i \cdot s / n)$, $y_1 = \mathrm{floor}((j - 1) \cdot s / n) + 1$, $y_2 = \mathrm{ceil}(j \cdot s / n)$, and $n$ is the number of cells in the horizontal or vertical direction ($n \times n$ cells in total, as shown in Figure 3). In short, the cell size changes with the window size. Although this calculation of the cell location is not exact when $s$ is not divisible by $n$, the feature is still effective. In video-based detection, if the application scenario requires the users to be near the camera, the window sizes should be larger, whereas if the users are required to stay far from the camera, the window sizes should be smaller. The window step is set to 0.05 times the window size.
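The cell-region formula and the window scan can be sketched as follows. Integer arithmetic is used for the floor and ceiling operations, cell indices are taken to be 1-based as in the formula, and rounding the window step to the nearest pixel is an assumption of this sketch; the function names are hypothetical.

```python
def cell_region(x, y, s, n, i, j):
    """Pixel bounds [x + x1, x + x2, y + y1, y + y2] of cell (i, j), 1-based,
    in an s-by-s window with top-left corner (x, y) split into n-by-n cells."""
    x1 = ((i - 1) * s) // n + 1      # floor((i - 1) * s / n) + 1
    x2 = -((-i * s) // n)            # ceil(i * s / n)
    y1 = ((j - 1) * s) // n + 1
    y2 = -((-j * s) // n)
    return [x + x1, x + x2, y + y1, y + y2]

def window_positions(frame_w, frame_h, s, step_ratio=0.05):
    """Top-left corners of all s-by-s windows inside the frame. The step is
    step_ratio * s, rounded to a whole pixel (the rounding is assumed)."""
    step = max(1, round(step_ratio * s))
    return [
        (x, y)
        for y in range(0, frame_h - s + 1, step)
        for x in range(0, frame_w - s + 1, step)
    ]

# Divisible case: an 80-pixel window with 4 cells per side gives 20-pixel cells.
print(cell_region(0, 0, 80, 4, 1, 1))
# Non-divisible case: a 96-pixel window with 5 cells per side.
print(cell_region(0, 0, 96, 5, 1, 1), cell_region(0, 0, 96, 5, 2, 1))
print(len(window_positions(320, 240, 80)))
```

In the non-divisible case, neighboring cells share one pixel column, which is the inexactness the text refers to. Because the cell bounds scale with the window, features for a larger window can be computed directly on the original frame without resizing it.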
For live hand detection, the web camera is configured to capture images at a resolution of 320 × 240. All experiments are conducted on a PC equipped with an Intel(R) Pentium(R) G3220 CPU @ 3.00 GHz and 4.00 GB RAM, under the Visual Studio 2013 platform.
By varying the threshold of each SftB stage-classifier and recording the resulting TPR and FPR on its test set, the ROC curve for the SftB classifier is produced. Similarly, we can train the LR stage-classifiers $H^{(p,\mathrm{LR})}(\cdot)$ ($1 \le p \le 3$). In this way, a total of six ROC curves are produced based upon $\{H^{(p,\mathrm{SftB})}, H^{(p,\mathrm{LR})}\}_{p=1}^{3}$. In addition, an extra ROC curve is generated for an SftB classifier based upon $\{D^{(2,\mathrm{SftB})}_{\mathrm{train}}, D^{(2,\mathrm{SftB})}_{\mathrm{test}}\}$ and using HOG features of the first resolution. All seven ROC curves are displayed in Figure 4, where the notations "stage2&Reso2&LR" and "stage2&Reso2&SftB" represent, respectively, the LR and SftB classifiers trained with HOG features of the second resolution. The other notations can be interpreted in a similar way.

Effectiveness of the Proposed SftB Classifiers.
From Figure 4, we can see that, with the same HOG resolution and for a fixed TPR (Table 4), the FPR (Table 4) of the SftB classifier is much smaller than that of the LR classifier. This is because the SftB classifier is modified from a multiclass classifier, which essentially provides decision boundaries among the different posture categories and can therefore decompose the complex space formed by the multiclass posture examples. Moreover, we find that the classifier "Stage2&Reso1&SftB" seriously underperforms the others, which indicates that increasing the resolution of the HOG features is crucial to guarantee classification accuracy. In addition, the histograms of the outputs of $R_2(\cdot)$ (see (8)) are calculated and presented in Figure 5 to give more insight into the proposed softmax-based binary classification. In the illustration, the upper histogram is calculated from the background examples and the bottom one from the predefined hand posture examples.

Effectiveness of the Proposed Softmax-Based Cascade Detector.
To fully evaluate the proposed method, we compare the performance of the softmax-based cascade and noncascade detectors based on their confusion matrices. The three noncascade softmax detectors are trained, respectively, with each of the three HOG feature resolutions illustrated in Figure 3. For the cascade detector, the posture pass-rates for the first three stage-classifiers are set to 98.0%, 98.5%, and 99.0%, respectively. In practice, multiclass posture detection is carried out on the full-sized testing images with each of the four detectors (one cascade and three noncascade) using the multiscale sliding-window scheme. For each detector, all rectangular regions classified into the same category are postprocessed by nonmaximum suppression to determine the final locations of the posture instances. The $(K+1) \times (K+1)$ confusion matrix $C$ for a detector $D$ is computed from the final results produced by $D$. With zero-based indexes, the elements of $C$ are defined by (12) as ratios of instance counts: row $i \ge 1$ describes how the instances of the $i$th posture category are predicted, and row 0 describes the false detections on pure background images. The four confusion matrices corresponding to the four detectors are presented in Table 2, where Softmax+Resolution1, Softmax+Resolution2, and Softmax+Resolution3 denote the confusion matrices computed from the three noncascade detectors. Note that the confusion matrix here differs from that of a classification problem. In sliding-window-based detection, one target instance may be covered by many windows, and the postprocessing is applied only to windows classified into the same category. As a result, one region can finally be predicted into more than one posture category. For this reason, the elements in each row do not necessarily sum to one.

FPPI: (the number of misdetected regions from pure background images) / (the number of all full-sized pure background images used for evaluation).
FPPW: (the number of misclassified window images) / (the total number of window images being classified).
Mean correct rate: (the number of full-sized images that comprise no false detections) / (the number of all full-sized images used for evaluation).
Detection rate: (the number of defined posture instances that are predicted into any of the defined posture categories) / (the total number of posture instances that belong to the defined posture categories).
Case1: the case in which the pass-rate of a higher stage is larger than the pass-rate of a lower stage.
Case2: the case in which the pass-rate of a higher stage is smaller than the pass-rate of a lower stage.
TPR: true positive rate.
FPR: false positive rate.
LR: logistic regression.
From Table 2, we can see that hand detection with the noncascade softmax detectors may cause high false detection rates in the background areas and high confusion rates among different posture categories. By contrast, the proposed softmax-based cascade significantly suppresses all kinds of false detections without sacrificing the recall rates. This is because the complexity of the background space is effectively decomposed by the multiple stage-classifiers, which makes it much easier for the final multiclass softmax model to discriminate among the predefined postures and the small number of remaining background windows.
To make more direct and intuitive comparisons, several summary measures are also computed and provided in Table 3. The measures mean recall rate and mean confusion rate represent, respectively, the averaged recall rates and the averaged confusion rates among the four predefined posture categories. For the definitions of FPPI and mean correct rate, please refer to Table 4.
From Table 3, we can see that the detection accuracy of Softmax+Resolution3 is the highest among the three noncascade classifiers. By comparison, the proposed multiclass cascade detector further improves the mean recall rate from 0.9225 to 0.9448 and boosts the mean correct rate from 0.5155 to 0.8475. Meanwhile, the mean confusion rate is reduced from 0.2182 to 0.0515, and the FPPI is reduced from 0.2248 to 0.0169. In addition, the proposed detector is almost 4 times faster than Softmax+Resolution3.
Figure 6 shows some hand posture detection results obtained with an ordinary web camera. The results show that the proposed method can detect the defined hand postures under various environments, and the system reaches a real-time running speed of 27 FPS under our experimental setup.
For the pass-rate study, the FPPW and detection rate (Table 4) are computed for each of the cascade detectors trained with the different groups of pass-rate settings, and the best group of settings is selected by comparing all FPPW values and detection rates. Note that the detection rate does not necessarily equal the mean correct rate, since confusion may exist among different posture categories.
The notation [97%#98%#99%] means that, for the first three stage-classifiers, the pass-rate of posture examples is successively set to 97%, 98%, and 99%. Each group contains exactly three pass-rate settings because there are exactly four stage-classifiers in each cascade detector, and the fourth stage is a multiclass softmax model that is not modified. The curves showing how the FPPW varies with the stage-level are presented in Figure 7, and the detection rates are illustrated in Figure 8. Besides the fact that both the FPPW and the detection rate increase with the product of the three pass-rates, there is another important observation: when the product of the three pass-rates is fixed, the detection rate in Case1 (Table 4) is significantly higher than that in Case2 (Table 4), while the FPPW values in the two cases are very close. This indicates that the detectors trained in Case1 are more discriminative than those trained in Case2. This observation suggests that, to achieve good performance, it is better to set low pass-rates for the classifiers at low stages and higher pass-rates for the classifiers at higher stages.
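The comparison between Case1 and Case2 at a fixed pass-rate product can be made concrete with a small helper. The function name and the "mixed" label for settings that are neither strictly increasing nor strictly decreasing are hypothetical additions for illustration.

```python
import math

def describe_pass_rates(rates):
    """Return (case, product) for the pass-rates of the binary stages.

    Case1: pass-rates strictly increase with the stage-level.
    Case2: pass-rates strictly decrease with the stage-level.
    Anything else is labeled "mixed" (a label introduced for this sketch).
    """
    pairs = list(zip(rates, rates[1:]))
    if all(a < b for a, b in pairs):
        case = "Case1"
    elif all(a > b for a, b in pairs):
        case = "Case2"
    else:
        case = "mixed"
    return case, math.prod(rates)

# The setting [97%#98%#99%] and its reversed counterpart share the same product.
print(describe_pass_rates([0.97, 0.98, 0.99]))
print(describe_pass_rates([0.99, 0.98, 0.97]))
```

Both settings let roughly the same overall fraction of posture examples through the three binary stages, so any difference in detection rate between them comes from the order in which the pass-rates are applied.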

Figure 1: The flowchart of window image classification using softmax-based cascade classifier.

Datasets.
The experimental dataset comprises four predefined posture categories. For each category, there are around 2000 positive examples with a normalized size of 80×80 pixels.

Figure 2: Examples of the four hand posture categories used in our experiments. From the first to the fourth row, the four posture categories are denoted, respectively, as vict, close, open, and fist.
To evaluate the proposed SftB classifier, we use the softmax and logistic regression (LR) techniques, respectively, to train the first three binary stage-classifiers of the final four-stage cascade. During the SftB cascade training period, all samples prepared for the $p$th stage-classifier $H^{(p,\mathrm{SftB})}(\cdot)$ are divided into the training set $D^{(p,\mathrm{SftB})}_{\mathrm{train}}$ and the testing set $D^{(p,\mathrm{SftB})}_{\mathrm{test}}$. $H^{(p,\mathrm{SftB})}(\cdot)$ is learned from $D^{(p,\mathrm{SftB})}_{\mathrm{train}}$, and the ROC curve for $H^{(p,\mathrm{SftB})}(\cdot)$ is calculated on the testing set $D^{(p,\mathrm{SftB})}_{\mathrm{test}}$. The ROC curve describes the relation between the false positive rate (FPR) and the true positive rate (TPR). Different TPR values on $D^{(p,\mathrm{SftB})}_{\mathrm{test}}$ are achieved by adjusting the threshold $\tau_p$, and varying $\tau_p$ in turn produces varying FPR values on $D^{(p,\mathrm{SftB})}_{\mathrm{test}}$.
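The threshold sweep that yields an ROC curve can be sketched as follows. The sketch assumes that each test example has a scalar score (for instance, the margin between the best posture probability and the background probability) and that an example is accepted when its score is at least the threshold; the function name and the toy scores are hypothetical.

```python
def roc_points(posture_scores, background_scores):
    """(FPR, TPR) pairs obtained by using every observed score as a threshold,
    from the strictest threshold to the loosest one."""
    thresholds = sorted(set(posture_scores) | set(background_scores), reverse=True)
    points = []
    for tau in thresholds:
        # TPR: fraction of posture examples accepted at this threshold.
        tpr = sum(s >= tau for s in posture_scores) / len(posture_scores)
        # FPR: fraction of background examples accepted at this threshold.
        fpr = sum(s >= tau for s in background_scores) / len(background_scores)
        points.append((fpr, tpr))
    return points

# Hypothetical test-set scores for one stage-classifier.
posture = [0.9, 0.8, 0.4]
background = [0.5, 0.2, 0.1]
for fpr, tpr in roc_points(posture, background):
    print(round(fpr, 3), round(tpr, 3))
```

Comparing two classifiers at a fixed TPR then amounts to reading off the FPR of each curve at that TPR.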

Figure 4: ROC curves for different stage-classifiers which are calculated from the test set generated during the training period.

Figure 5: Histograms of output values from $R_2(\cdot)$ in the second stage of the SftB classifier. The upper one is calculated from the background examples and the bottom one from the hand posture examples.
$$C(i, j) := \frac{|S^{(1)}_{i,j}|}{|S^{(3)}_{i}|},\ 1 \le i, j \le K; \quad C(0, 0) := \frac{|S^{(6)}|}{|S^{(5)}|}; \quad C(0, j) := \frac{|S^{(4)}_{j}|}{|S^{(5)}|},\ 1 \le j \le K; \quad C(i, 0) := \frac{|S^{(2)}_{i}|}{|S^{(3)}_{i}|},\ 1 \le i \le K, \tag{12}$$
where $S^{(1)}_{i,j}$ := {posture instances that belong to the $i$th posture category and are predicted into the $j$th category}, $S^{(2)}_{i}$ := {posture instances from the $i$th posture category that are not predicted into any of the defined categories}, $S^{(3)}_{i}$ := {all posture instances from the $i$th posture category}, $S^{(4)}_{j}$ := {all background regions that are predicted into the $j$th posture category}, $S^{(5)}$ := {all full-sized pure background images used for evaluation}, and $S^{(6)}$ := {all full-sized pure background images that do not contain false detections}. A pure background image is an image that does not contain instances of the predefined posture categories. A detected region $B$ is a correct detection of an instance $O$ if and only if (a) the predicted class of $B$ equals the ground-truth class of $O$ and (b) the overlap ratio between $B$ and the ground-truth region of $O$ is larger than 0.6.
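The correct-detection criterion can be illustrated with a short Python sketch. The text does not define the overlap ratio, so intersection over union is assumed here; the box format `(x1, y1, x2, y2)` and the function names are likewise hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).
    Using IoU as the overlap ratio is an assumption of this sketch."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(pred_box, pred_class, gt_box, gt_class, min_overlap=0.6):
    """A detection is correct when the class matches and the overlap with the
    ground-truth region is larger than min_overlap."""
    return pred_class == gt_class and overlap_ratio(pred_box, gt_box) > min_overlap

gt = (0, 0, 10, 10)
print(is_correct_detection((2, 0, 12, 10), 1, gt, 1))  # same class, large overlap
print(is_correct_detection((5, 0, 15, 10), 1, gt, 1))  # same class, small overlap
print(is_correct_detection((2, 0, 12, 10), 2, gt, 1))  # large overlap, wrong class
```

A detection that overlaps an instance well but carries the wrong class label counts toward the confusion entries of the matrix rather than toward its diagonal.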
Figure 6: Detection results based upon the proposed softmax-based cascade detector. Different predictions are marked with rectangles of different colors. (a) Detection results after postprocessing. (b) Detection results before postprocessing.
Settings for Posture Example Pass-Rates.
The performance of the proposed cascade is directly affected by the thresholds of its stage-classifiers, as shown in (8). The thresholds affect not only the detection results but also the training process, since the background samples for the $p$th stage are acquired by the previous $(p-1)$ stage-classifiers. These thresholds are determined from the pass-rates of posture samples, which are set at the training stage to control the training process. Specifically, for the $p$th stage-classifier, the set $\Upsilon^{(p)}$ of classifier outputs is computed on the posture example set $\{z_n\}_{n=1}^{N_p}$ used for learning $\Theta_p$; $\Upsilon^{(p)}$ is sorted in ascending order to produce the vector $v \in \mathbb{R}^{N_p}$, and the value $\tau_p = v(\mathrm{floor}((1 - \rho_p) N_p))$ is taken as the threshold, where $\rho_p$ is the preset posture example pass-rate for the $p$th-stage SftB classifier during training. To obtain a better cascade detector, we prepare multiple groups of pass-rate settings and train the four-stage cascade classifier with each group. After that, the FPPW and detection rate (Table 4) are computed for each resulting detector.

Table 1: The feature configuration for each cascade stage.

Table 2: The confusion matrices for the detection results computed with the single-resolution softmax detectors and the multiresolution cascade detector. Note that the row elements of a confusion matrix for a detection problem do not need to sum to 1.

Table 4: List of acronyms, definitions, and terminology interpretation.