Object Detection and Tracking-Based Camera Calibration for Normalized Human Height Estimation

This paper presents a normalized human height estimation algorithm using an uncalibrated camera. To estimate the normalized human height, the proposed algorithm detects a moving object and performs tracking-based automatic camera calibration. The proposed method consists of three steps: (i) moving human detection and tracking, (ii) automatic camera calibration, and (iii) human height estimation and error correction. The proposed method automatically calibrates the camera by detecting moving humans and estimates the human height using error correction. The proposed method can be applied to object-based video surveillance systems and digital forensics.


Introduction
Large-scale video analysis using multiple cameras is attracting growing attention in visual surveillance applications. In particular, as the use of object-based video analysis increases, the demand for extracting object information is growing. Since object information changes depending on camera parameters such as the location of the installed camera, the viewing angle, and the focal length, various normalized object feature extraction methods have been proposed.
Lao et al. proposed a human motion analysis method for consumer surveillance systems [1]. This method estimates human moving trajectories by tracking and recognizing the human motion. Del-Blanco et al. proposed a multiple object detection and tracking framework for the automatic counting of objects in a video surveillance application [2]. Lee et al. detected the object region and estimated the depth information using multiple color-filtered apertures (MCA) [3]. Chantara et al. proposed a fast object tracking method using adaptive template matching [4]. Chu and Yang detected a moving object using a background model and estimated the object velocity using an object with a previously known length [5]. Maik et al. trained typical poses in both the 2D image and the 3D space and represented the located poses as a silhouette for human pose estimation [6]. Kang et al. proposed a human gesture detection and tracking method using real-time stereo matching [7]. However, this method uses two or more cameras for the depth estimation. To estimate the 3D information using a single camera, camera calibration methods [8][9][10][11] have been proposed. Arfaoui and Thibault used a diffractive virtual grid to estimate camera parameters for a fish-eye lens camera [12]. Neves et al. corrected the fish-eye distortion using parallel lines and then calibrated static and pan-tilt-zoom (PTZ) cameras using an object height [13]. Zhang et al. tracked a homography based on the model plane and then estimated camera parameters using a maximum likelihood (ML) approach [14]. Bell et al. used a digital display to generate feature points for out-of-focus camera calibration [15]. Kual-Zheng proposed an object height estimation method that extracts feature points and estimates vanishing points using a special pattern such as a cubic box [16]. Gallagher et al. proposed a method to analyze human age and gender by calibrating a camera and analyzing distances among the eyes, nose, and mouth in a human face [17]. Since this method uses a special pattern board for the camera calibration, successful analysis is difficult when the size of a face is small. Shao et al. calibrated a camera using an optical flow method and normalized the object height to estimate a moving object at the cost of increased computational complexity [18]. Zhao and Hu used a pure translation to calibrate a camera [19], and Li et al. reduced the number of control points, given the intrinsic camera parameters, to calibrate a pan-tilt camera [20]. Andaló et al. estimated vanishing points by clustering lines in an image and then calculated the object height [21]. However, this method cannot accurately estimate vanishing points when the background does not include sufficient pairs of parallel lines. User input of the human height is another burden of this method.
To solve the abovementioned problems, the proposed method calibrates a camera by detecting and tracking the object region. In addition, the projective matrix, which is the result of the camera calibration, is applied to the proposed human height estimation method, and the estimated human heights are then accumulated and corrected using the Random Sample Consensus (RANSAC) algorithm. As a result, the proposed method can estimate the normalized human height using an uncalibrated camera for a visual surveillance system. This paper is organized as follows. Section 2 describes the camera projective model, and Section 3 presents the proposed camera calibration and human height estimation algorithms. Experimental results are shown in Section 4, and Section 5 concludes the paper.

Camera Projection Model-Based Calibration: A Review
An object is projected onto a two-dimensional image with different sizes depending on the distance between the object and the camera. To estimate the human height using a single camera, the projective relationship between the 3D space and the 2D image plane is needed. The pinhole camera projective model [22] is given as

$$ s\,\mathbf{x} = A\,[R \mid \mathbf{t}]\,\mathbf{X}, $$

where $\mathbf{X}$ represents the coordinate in the 3D space, matrix $A$ contains the intrinsic camera parameters, $R$ represents the camera rotation matrix, $\mathbf{t}$ represents the camera translation vector, $\mathbf{x}$ represents the coordinate in the 2D image plane, and $s$ represents the scale factor. The intrinsic matrix $A$ is determined by the focal length $(f_x, f_y)$, the principal point $(c_x, c_y)$, the skewness skew, and the aspect ratio $a$. To simplify the camera calibration process, the proposed method assumes that $f_x = f_y = f$, the principal point is the center of the image, skew $= 0$, and $a = 1$, so that the intrinsic matrix reduces to

$$ A = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}. $$

In the same manner, the camera rotation about the $Z$-axis is assumed to be zero, and the translations along the $X$-axis and $Y$-axis are also assumed to be zero.
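The effect of distance on projected size can be checked numerically with a minimal sketch of the pinhole model under the simplifying assumptions above. The function names, the focal length of 1000 pixels, and the identity camera pose are illustrative assumptions, not values from the paper.

```python
def intrinsic_matrix(f, width, height):
    """Simplified intrinsics: fx = fy = f, zero skew, unit aspect ratio,
    principal point at the image center."""
    return [[f, 0.0, width / 2.0],
            [0.0, f, height / 2.0],
            [0.0, 0.0, 1.0]]


def project(A, R, t, X):
    """Project a 3D point X with s * x = A [R | t] X and return pixel (u, v)."""
    # Camera coordinates: Xc = R X + t
    Xc = [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]
    # Homogeneous image coordinates: A Xc
    xh = [sum(A[i][k] * Xc[k] for k in range(3)) for i in range(3)]
    s = xh[2]  # scale factor (depth along the optical axis)
    return xh[0] / s, xh[1] / s


# Hypothetical setup: 1280 x 720 image, f = 1000 px, camera at the origin
A = intrinsic_matrix(f=1000.0, width=1280, height=720)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

# The same 1 m offset from the optical axis, seen at 5 m and at 10 m
near = project(A, R, t, [0.0, 1.0, 5.0])
far = project(A, R, t, [0.0, 1.0, 10.0])
```

In this example the 1 m offset spans 200 pixels from the principal point at 5 m and 100 pixels at 10 m, which is the distance dependence that the normalization is meant to remove.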
Using the vanishing points and line, Liu et al. [23] compute the camera parameters, namely the focal length, the roll angle in degrees, the tilt angle in degrees, and the camera height. The computation uses the horizontal vanishing line, the vertical vanishing point, the object height in world coordinates, the object foot position, the object head position, and a distance measure between two points.
To estimate the physical size of an object in the 3D space from its size in the 2D image, the proposed method detects the moving human to estimate the 3D space information. To estimate the human height, the proposed method assumes that the foot position lies on a flat ground plane. As a result, the foot position in the 2D image plane is inversely projected into the 3D space to obtain the human height information.

Normalized Human Height Estimation
The proposed human height estimation algorithm is an extended version of Jung et al.'s work [24] and consists of three steps: (i) moving human detection and tracking, (ii) automatic camera calibration, and (iii) reference object-based human height estimation with error correction. Figure 1 shows the block diagram of the proposed human height estimation method. Each input frame is processed to obtain, in order, the moving human region, the human tracking region, the projective matrix, the per-frame height estimation result of the human, and the error corrected height estimation result.

Moving Human Detection and Tracking.
The proposed method first detects a moving human to estimate its height. If the detected human region includes the background region or if the region loses some part of the human body, an accurate estimation of the human height is difficult. For this reason, the proposed method generates a background using the Gaussian mixture model (GMM) [25,26] and then detects and labels the foreground image. The regions that do not have enough pixels in the foreground image are removed to reduce the noise.
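The labeling and small-region removal step can be sketched as follows. This is a minimal illustration that assumes a binary foreground mask is already available (the GMM background model itself is not reproduced here); the function name, the 4-connectivity, and the pixel threshold are assumptions rather than details from the paper.

```python
from collections import deque


def remove_small_regions(mask, min_pixels):
    """Label 4-connected foreground regions and drop those below min_pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    cleaned = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Flood-fill one connected region starting from (r, c)
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the region only if it is large enough
            if len(region) >= min_pixels:
                for y, x in region:
                    cleaned[y][x] = 1
    return cleaned


# Toy mask: one 6-pixel blob and one isolated noise pixel
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
]
cleaned = remove_small_regions(mask, min_pixels=3)
```

The isolated pixel is discarded while the larger blob is kept. In practice the threshold would depend on the image resolution and the expected size of a person in the scene.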
The detected foreground regions include not only single human regions but also groups of humans, possibly with nonhuman objects. Such regions make human tracking difficult and, as a result, increase the human height estimation error.
For that reason, the proposed method classifies each region according to whether it is a human region or not. The proposed classification method uses the combined histogram of oriented gradients and local binary pattern (HOG-LBP) and a support vector machine (SVM)-based human detection method [27]. Using the detected human information, each foreground region is classified into two regions. The first is a single human region that contains only one human object. The second is a single nonhuman region that contains either no human or multiple humans. Figure 2 shows the moving human region detection and classification results.
The proposed method tracks the human and estimates the height in a video using the detected single human region. Although the Kalman filter tracker [28] is a popular stochastic tracking method, it assumes linear motion and therefore cannot reliably track a nonlinearly moving object. To solve this problem, the proposed method uses a particle filter tracker [29]. In a surveillance video, human appearance, such as size and shape, changes while the human is walking. For this reason, the model-based tracking method [30] models the target human using a color histogram to deal with the dynamic characteristics of the moving human. In the proposed method, the HSV color histogram is used to represent the human region to reduce the sensitivity to illumination. The particle filter tracking results may include a probabilistic error and may not cover the entire human region. Moreover, if the number of particles is increased to reduce the tracking error, the time complexity also increases. To solve these problems, the proposed method determines each tracked human region by matching the moving human regions detected using the background model with the tracking regions. The matching is defined in terms of the moving human region, the number of pixels in that region, and the tracking region of each human. After matching, the proposed method uses additional trackers for unmatched single human regions. Figure 3 shows the human tracking results using the proposed method. In Figure 3, the red box represents the particle filter tracking result for the moving human, and the white box represents the optimal rectangular region that encloses the detected human region.
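The matching between detected regions and tracker outputs can be illustrated with a simple overlap test. The criterion below (the fraction of a moving region's pixels that fall inside a tracking box, with a 0.5 threshold) is an assumption chosen for illustration; the exact matching rule of the proposed method is not reproduced here, and all names are hypothetical.

```python
def overlap_ratio(region_pixels, box):
    """Fraction of a region's pixels that lie inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    inside = sum(1 for (x, y) in region_pixels
                 if x0 <= x <= x1 and y0 <= y <= y1)
    return inside / len(region_pixels)


def match_regions(moving_regions, tracking_boxes, min_ratio=0.5):
    """Assign to each tracking box the moving region that overlaps it most.

    Returns (matches, unmatched): matches maps tracker index -> region index,
    and unmatched lists regions that would need an additional tracker.
    """
    matches = {}
    for i, box in enumerate(tracking_boxes):
        best_j, best_ratio = None, min_ratio
        for j, pixels in enumerate(moving_regions):
            ratio = overlap_ratio(pixels, box)
            if ratio >= best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is not None:
            matches[i] = best_j
    matched = set(matches.values())
    unmatched = [j for j in range(len(moving_regions)) if j not in matched]
    return matches, unmatched


# Two detected regions (sets of (x, y) pixels) and one existing tracker box
region_a = {(x, y) for x in range(10, 14) for y in range(20, 30)}
region_b = {(x, y) for x in range(50, 54) for y in range(20, 30)}
matches, unmatched = match_regions([region_a, region_b], [(9, 18, 14, 27)])
```

Here the tracker box covers 80% of the first region, so that region replaces the tracker output, while the second region is left unmatched and would receive its own tracker.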

Vanishing Point and Line Estimation Using Human Information.
The normalized human height can be estimated in meters in the 3D space by estimating the camera parameters. For automatic camera calibration, the vanishing points and line are usually estimated using parallel lines in the image. Li et al. estimated the vanishing line and point by extracting lines from the background structure [31]. However, this method cannot calibrate the camera if the background structure does not have a sufficient number of parallel lines. For automatic calibration without using parallel lines, the proposed calibration method uses the moving human information [23]. More specifically, the proposed method detects both the foot and head positions of the human in the 2D image. The head position is obtained from the coordinates of the pixels in each human region, and the foot position is computed from a sample foot region that consists of 10% of the pixels of the human region, using the number of pixels in that sample region.
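One plausible reading of this step is sketched below: the head is taken as the topmost pixel row of the region, and the foot as the centroid of the lowest 10% of its pixels. Both choices are assumptions made for illustration, since the exact definitions are not reproduced here; the function name is hypothetical, and image y is assumed to increase downward.

```python
def head_and_foot(region_pixels, foot_fraction=0.10):
    """Return (head, foot) image points for a region given as (x, y) pixels."""
    ordered = sorted(region_pixels, key=lambda p: p[1])  # top to bottom
    # Head: mean x of the topmost row, at that row's y coordinate
    top_y = ordered[0][1]
    top_row = [p for p in ordered if p[1] == top_y]
    head = (sum(p[0] for p in top_row) / len(top_row), float(top_y))
    # Foot: centroid of the lowest fraction of the region's pixels
    n_foot = max(1, round(len(ordered) * foot_fraction))
    foot_sample = ordered[-n_foot:]
    foot = (sum(p[0] for p in foot_sample) / n_foot,
            sum(p[1] for p in foot_sample) / n_foot)
    return head, foot


# Toy region: a 10-pixel-wide, 100-pixel-tall upright box
region = [(x, y) for y in range(100, 200) for x in range(40, 50)]
head, foot = head_and_foot(region)
```

Averaging over a sample of foot pixels rather than taking the single lowest pixel makes the foot point less sensitive to segmentation noise such as shadows at the bottom of the region.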
Both the vanishing points and the vanishing line can be estimated using the detected foot and head positions. The vertical vanishing point is estimated as the intersection of the foot-to-head lines, each of which passes through the foot and head points of the corresponding human region. The horizontal vanishing line is estimated using two or more horizontal vanishing points, and each horizontal vanishing point is estimated as the intersection of the foot-to-foot and head-to-head lines. The foot-to-foot and head-to-head lines include the foot points and the head points, respectively. Both the vanishing points and the vanishing line are estimated using the RANSAC algorithm to reduce the estimation error. Figure 4 illustrates the human-based vanishing point and line estimation process.
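The line intersections described above are convenient to compute in homogeneous coordinates, where both the line through two points and the intersection of two lines are cross products. The sketch below uses synthetic foot and head points constructed to converge at known vanishing points; the function names and coordinates are illustrative assumptions, and the RANSAC stage is omitted.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersect(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)


def vanishing_points(foot_a, head_a, foot_b, head_b):
    """Vertical and horizontal vanishing points from two human observations."""
    vertical = intersect(line_through(foot_a, head_a),
                         line_through(foot_b, head_b))
    horizontal = intersect(line_through(foot_a, foot_b),
                           line_through(head_a, head_b))
    return vertical, horizontal


# Synthetic observations built so that the foot-to-head lines meet at
# (0, 1000) and the foot-to-foot and head-to-head lines meet at (2000, 0)
foot_a, head_a = (100.0, 400.0), (120.0, 280.0)
foot_b, head_b = (1050.0, 200.0), (12600.0 / 11.0, 1400.0 / 11.0)
vertical_vp, horizontal_vp = vanishing_points(foot_a, head_a, foot_b, head_b)
```

With real detections, each pair of observations yields noisy candidate points, which is why a robust estimator such as RANSAC is applied over many pairs before fitting the horizontal vanishing line.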

Human Height Estimation and Error Correction.
The proposed method computes the foot point in the 3D space for the normalized human height estimation using multiple videos acquired by different cameras. To calculate the foot point in the 3D space, the foot point in the 2D image is inversely projected into the 3D space. The inversely projected point lies on the line that connects the human foot point in the 3D space with the corresponding point on the image sensor. Since the camera height is estimated with respect to the ground plane that includes the human foot points, the 3D foot point can be obtained by normalizing the inversely projected point with respect to the $Z$-axis. As a result, the foot point in the 3D space is calculated from the foot point in the 2D image, the projective matrix, and the $Z$-axis coordinate of the point that is inversely projected from the foot point in the 2D image.
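The inverse projection of a foot pixel onto the ground plane can be sketched as a ray-plane intersection. The example below assumes a camera at a known height that is tilted downward with no roll, world axes with X to the right, Y forward, and Z up, and the simplified intrinsics described in Section 2; all function names and numeric values are illustrative assumptions.

```python
import math


def tilt_rotation(tilt_deg):
    """World-to-camera rotation for a camera tilted down by tilt_deg
    (assumed world axes: X right, Y forward, Z up)."""
    s = math.sin(math.radians(tilt_deg))
    c = math.cos(math.radians(tilt_deg))
    return [[1.0, 0.0, 0.0],
            [0.0, -s, -c],
            [0.0, c, -s]]


def project(point, f, center, R, cam_height):
    """Forward projection, used here only to generate a test pixel."""
    d = (point[0], point[1], point[2] - cam_height)
    xc = [sum(R[i][k] * d[k] for k in range(3)) for i in range(3)]
    return (f * xc[0] / xc[2] + center[0], f * xc[1] / xc[2] + center[1])


def foot_to_ground(foot_2d, f, center, R, cam_height):
    """Intersect the viewing ray of a foot pixel with the ground plane Z = 0."""
    ray_cam = ((foot_2d[0] - center[0]) / f, (foot_2d[1] - center[1]) / f, 1.0)
    # Rotate the ray into world coordinates with the transpose of R
    ray = [sum(R[k][i] * ray_cam[k] for k in range(3)) for i in range(3)]
    if ray[2] >= 0.0:
        raise ValueError("pixel is at or above the horizon")
    scale = cam_height / -ray[2]  # brings the ray from Z = cam_height to Z = 0
    return (scale * ray[0], scale * ray[1], 0.0)


# Round trip: a foot at (1, 6, 0) m seen from a 3 m high camera tilted 20 deg
R = tilt_rotation(20.0)
pixel = project((1.0, 6.0, 0.0), 1000.0, (640.0, 360.0), R, 3.0)
foot_3d = foot_to_ground(pixel, 1000.0, (640.0, 360.0), R, 3.0)
```

The scaling step plays the role of the normalization with respect to the vertical axis: among all points on the viewing ray, it selects the one that lies on the ground plane.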
The reference head point in the 3D space can be estimated by translating the foot point along the direction perpendicular to the ground plane. The corresponding reference head point in the 2D image is then obtained by projecting the reference head point in the 3D space onto the image.
Using the reference head point, the human height is estimated from the reference height and three $y$-axis coordinates in the 2D image: those of the reference head point, the human head point, and the human foot point. In this work, a reference height of 1.8 meters was used. Figure 5 shows the human height estimation model using the reference height.
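A natural way to combine these quantities is a ratio of image spans measured from the foot point, sketched below. The linear ratio is an assumption based on the quantities listed in the text, not a confirmed reproduction of the paper's equation, and the function name and pixel values are hypothetical.

```python
def estimate_height(y_foot, y_head, y_ref_head, ref_height=1.8):
    """Scale the reference height by the ratio of the observed foot-to-head
    span to the foot-to-reference-head span (image y coordinates in pixels)."""
    ref_span = y_foot - y_ref_head
    if ref_span == 0:
        raise ValueError("reference head and foot coincide in the image")
    return ref_height * (y_foot - y_head) / ref_span


# A person whose head appears 10 px below the projected 1.8 m reference head
height = estimate_height(y_foot=600.0, y_head=410.0, y_ref_head=400.0)
```

In this example the observed span is 190 pixels against a 200-pixel reference span, giving an estimate of about 1.71 m. Because the reference head point is projected at the person's own foot location, the ratio compensates for the person's distance from the camera.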
The accuracy of human height estimation depends on the detected human region. To reduce the human height estimation error, the proposed method accumulates the estimated human heights in each frame and corrects the errors using the RANSAC algorithm. In the first step of the RANSAC algorithm, a fixed number of sample heights are randomly extracted. The next step computes the sum of squared differences (SSD) between the average of the sampled heights and each estimated human height. The first and second steps are repeated a fixed number of times to obtain the error corrected height.
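The sampling loop can be sketched as follows. Note one deliberate substitution: instead of a plain SSD, each squared difference is capped at a tolerance (an MSAC-style truncated score), so that gross outliers cannot dominate the score. This cap, the sample size, the iteration count, the tolerance, and the fixed seed are all assumptions for illustration.

```python
import random


def ransac_height(heights, n_samples=5, n_iters=200, tol=0.05, seed=0):
    """Return the sample average that best agrees with the accumulated heights."""
    rng = random.Random(seed)
    best_height, best_score = None, float("inf")
    for _ in range(n_iters):
        # Step 1: draw a random subset of the per-frame height estimates
        sample = rng.sample(heights, n_samples)
        candidate = sum(sample) / n_samples
        # Step 2: truncated sum of squared differences to every estimate
        score = sum(min((h - candidate) ** 2, tol ** 2) for h in heights)
        if score < best_score:
            best_height, best_score = candidate, score
    return best_height


# Per-frame estimates near 1.75 m, plus three gross errors
# (for example, caused by occlusion or pose change)
heights = [1.74, 1.76, 1.75, 1.73, 1.77, 1.75, 1.76, 1.74, 1.75, 1.75,
           1.40, 1.35, 1.95]
corrected = ransac_height(heights)
plain_mean = sum(heights) / len(heights)
```

On this synthetic data the plain mean is pulled down to about 1.71 m by the outliers, whereas the corrected value stays close to 1.75 m.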

Experimental Results
The results of the proposed human height estimation are shown in this section. The test videos were acquired using an uncalibrated camera looking down at the ground plane from heights between 2.2 and 7.2 meters. Each video sequence has a resolution of 1280 × 720 and includes moving humans. In addition, the Performance Evaluation of Tracking and Surveillance (PETS) 2009 dataset [32] was used to test the proposed algorithm.
Figure 6 shows the result of human height estimation using the proposed method. Although the size of the human in the 2D image differs from scene to scene, the resulting normalized height is correctly estimated in all scenes using a prespecified height of the reference object for the camera calibration.
Figure 7 shows the results of error correction of the estimated height. Figure 7(a) shows the human height estimation error caused by the human pose change. The human height estimation error is corrected using the proposed method, as shown in Figure 7(b).
Figure 8(a) shows the human height estimation error caused by an occlusion. The height estimation error of the occluded human is reduced by the proposed method, as shown in Figure 8(b).
Figure 9 shows estimation results of multiple human heights. As shown in Figure 9(a), the height of a separated human is estimated. Figure 10 shows the estimated human height in video frames, where the ground truth of the object height is 1.75 meters, shown as the solid curve. The dotted and daggered curves show the estimated human height without and with error correction, respectively. As shown in Figure 10, the proposed error correction method reduces the human height estimation error from 0.027 to 0.012 meters.

Conclusions
An automatic calibration method is presented using object detection and tracking with multiple uncalibrated cameras. As a result, the camera parameters, including the camera height, are estimated. Moreover, the proposed algorithm estimates the normalized human height using the calibrated parameters. The proposed method can be applied to object tracking and recognition in a very-large-area video surveillance system.

Figure 1: Block diagram of the proposed human height estimation method.

Figure 2: Moving human region detection and classification results. Figures 2(b) and 2(c), respectively, show the foreground image and the human detection result of the video shown in Figure 2(a). Figure 2(d) shows the region classification result, where single human and single nonhuman regions are, respectively, represented by red and black boxes.

Figure 3: Results of human tracking using the proposed method: (a) the 2522nd frame, (b) the 2557th frame, (c) the 2743rd frame, and (d) the 2819th frame.

Figure 4: Vanishing line estimation process using the human information.
In Figures 9(b), 9(c), and 9(d), some humans adjoin and form multiple-human regions. As shown in these figures, the proposed method estimates the human height by classifying each region.

Figure 5: Human height estimation model using reference height.

Figure 7: Height estimation error correction results of the walking human: (a) human height estimation error caused by the human pose change while the human is walking and (b) error corrected human height estimation results of (a).
Figure 10: Comparison between the estimated height and ground truth. The plotted curves show the height estimation results without error correction and with error correction.