This paper presents a normalized human height estimation algorithm using an uncalibrated camera. To estimate the normalized human height, the proposed algorithm detects a moving object and performs tracking-based automatic camera calibration. The proposed method consists of three steps: (i) moving human detection and tracking, (ii) automatic camera calibration, and (iii) human height estimation and error correction. The method calibrates the camera automatically by detecting moving humans and then estimates the human height with error correction. It can be applied to object-based video surveillance systems and digital forensics.
1. Introduction
Large-scale video analysis using multiple cameras is gaining attention in visual surveillance applications. In particular, as the use of object-based video analysis increases, the demand for extracting object information is growing. Since object information changes depending on camera parameters such as the location of the installed camera, the viewing angle, and the focal length, various normalized object feature extraction methods have been proposed.
Lao et al. proposed a human motion analysis method for consumer surveillance systems [1]. This method estimates human moving trajectories by tracking and recognizing the human motion. Del-Blanco et al. proposed a multiple object detection and tracking framework for automatic object counting in video surveillance applications [2]. Lee et al. detected the object region and estimated the depth information using multiple color-filtered apertures (MCA) [3]. Chantara et al. proposed a fast object tracking method using adaptive template matching [4]. Chu and Yang detected a moving object using a background model and estimated the object velocity using an object of previously known length [5]. Maik et al. trained typical poses in both the 2D image and the 3D space and represented the located poses as silhouettes for human pose estimation [6]. Kang et al. proposed a human gesture detection and tracking method using real-time stereo matching [7]. However, this method requires two or more cameras for depth estimation.
To estimate 3D information using a single camera, various camera calibration methods have been proposed [8–11]. Arfaoui and Thibault used a diffractive virtual grid to estimate camera parameters for a fish-eye lens camera [12]. Neves et al. corrected the fish-eye distortion using parallel lines and then calibrated static and pan-tilt-zoom (PTZ) cameras using an object height [13]. Zhang et al. tracked the homography based on the model plane and then estimated camera parameters using a maximum likelihood (ML) approach [14]. Bell et al. used a digital display to generate feature points for out-of-focus camera calibration [15]. Kual-Zheng proposed an object height estimation method that extracts feature points and estimates vanishing points using a special pattern such as a cubic box [16]. Gallagher et al. proposed a method to analyze human age and gender by calibrating a camera and analyzing the distances among the eyes, nose, and mouth in a human face [17]. Since this method uses a special pattern board for the camera calibration, successful analysis is difficult when the size of a face is small. Shao et al. calibrated a camera using an optical flow method and normalized the object height to estimate a moving object at the cost of increased computational complexity [18]. Zhao and Hu used a pure translation to calibrate a camera [19], and Li et al. reduced the number of control points given the intrinsic camera parameters to calibrate a pan-tilt camera [20]. Andaló et al. estimated vanishing points by clustering lines in an image and then calculated the object height [21]. However, this method cannot accurately estimate vanishing points when the background does not include sufficient pairs of parallel lines. User input of the human height is another burden of this method.
To solve the abovementioned problems, the proposed method calibrates a camera by detecting and tracking the object region. In addition, a projective matrix, which is a result of the camera calibration, is applied to the proposed human height estimation method, and then estimated human heights are accumulated and corrected using the Random Sample Consensus (RANSAC) algorithm. As a result, the proposed method can estimate the normalized human height using an uncalibrated camera for a visual surveillance system. This paper is organized as follows. Section 2 describes the camera projective model, and Section 3 presents the proposed camera calibration and human height estimation algorithms. Experimental results are shown in Section 4, and Section 5 concludes the paper.
2. Camera Projection Model-Based Calibration: A Review
An object is projected onto a two-dimensional image with different sizes depending on the distance between the object and a camera. In order to estimate the human height using a single camera, the projective relationship between the 3D space information and the 2D image plane is needed. The pin-hole camera projective model [22] is given as
$$s\,\mathbf{x} = \mathbf{A}\,[\mathbf{R} \mid \mathbf{t}]\,\mathbf{X}, \tag{1}$$
where $\mathbf{X}$ represents the coordinate in the 3D space, matrix $\mathbf{A}$ contains intrinsic camera parameters, $\mathbf{R}$ represents the camera rotation matrix, $\mathbf{t}$ represents the camera translation vector, $\mathbf{x}$ represents the coordinate in the 2D image plane, and $s$ represents the scale factor. The camera intrinsic parameter is determined by focal length $(f_x, f_y)$, principal point $(p_x, p_y)$, skewness $\mathit{skew}$, and aspect ratio $a$ as
$$\mathbf{A} = \begin{bmatrix} f_x & \mathit{skew} & p_x \\ 0 & f_y & p_y \\ 0 & 0 & a \end{bmatrix}. \tag{2}$$
To simplify the camera calibration process, the proposed method assumes that $f_x = f_y$, the principal point is at the center of the image, $\mathit{skew} = 0$, and $a = 1$. In the same manner, the camera rotation about the z-axis is assumed to be zero, and the translations along the x-axis and y-axis are also assumed to be zero.
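As a hedged illustration of Eqs. (1) and (2) under the simplifications above, the following Python sketch projects a 3D point into the image. The function names (`mat_vec`, `intrinsic_matrix`, `project`) and all numeric values are illustrative assumptions and do not come from the original work.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def intrinsic_matrix(f, px, py, skew=0.0, a=1.0):
    """Intrinsic matrix A of Eq. (2) with fx = fy = f (simplified model)."""
    return [[f, skew, px],
            [0.0, f, py],
            [0.0, 0.0, a]]


def project(A, R, t, X):
    """Evaluate s*x = A [R | t] X and return the pixel coordinates (u, v)."""
    Xc = [r + ti for r, ti in zip(mat_vec(R, X), t)]  # camera coordinates R X + t
    su, sv, s = mat_vec(A, Xc)                        # homogeneous image point
    if s == 0:
        raise ValueError("the point projects to infinity")
    return su / s, sv / s


# Hypothetical example: 1280 x 720 image and a focal length of 1000 pixels.
A = intrinsic_matrix(f=1000.0, px=640.0, py=360.0)
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project(A, R_identity, [0.0, 0.0, 0.0], [1.0, 2.0, 10.0])
print(u, v)
```

The division by the third homogeneous component plays the role of the scale factor s in Eq. (1).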
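Using the vanishing points and lines, Liu et al. [23] compute the camera parameters as
$$f = \sqrt{\left(\frac{a_3}{a_2} - p_y\right)\left(v_y - p_y\right)}, \quad \rho = \operatorname{atan}\left(-\frac{v_x}{v_y}\right), \quad \theta = \operatorname{atan}\left(-\frac{\sqrt{v_x^2 + v_y^2}}{f}\right), \quad h_c = \frac{h_o}{1 - \dfrac{d(o_h, v_l)\,\lVert o_f - v \rVert}{d(o_f, v_l)\,\lVert o_h - v \rVert}}, \tag{3}$$
where $f$ represents the focal length, $\rho$ the rolling angle in degree, $\theta$ the tilt angle in degree, $h_c$ the camera height, $v_l$ the horizontal vanishing line $a_1 x + a_2 y + a_3 = 0$, $v = [v_x \; v_y]^T$ the vertical vanishing point, $h_o$ the object height in the world coordinate, $o_f$ the object foot position, $o_h$ the object head position, and $d(A, B)$ the distance measure between two points $A$ and $B$.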
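The expressions in Eq. (3) can be evaluated directly, as in the minimal Python sketch below. It assumes image coordinates whose origin is the principal point (so that p_y = 0) with the y-axis pointing down, and it reads d(A, B) as a point-to-line distance when B is the vanishing line; these conventions, the function names, and the numeric values are assumptions made for illustration only.

```python
import math


def point_line_distance(p, line):
    """Distance from point p = (x, y) to the line a1*x + a2*y + a3 = 0."""
    a1, a2, a3 = line
    return abs(a1 * p[0] + a2 * p[1] + a3) / math.hypot(a1, a2)


def point_distance(p, q):
    """Euclidean distance between two image points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def camera_parameters(line, v, py, h_o, o_f, o_h):
    """Evaluate Eq. (3): focal length, roll and tilt in degrees, camera height.

    math.sqrt raises ValueError if the vanishing geometry is inconsistent
    (negative product under the square root).
    """
    a1, a2, a3 = line
    vx, vy = v
    f = math.sqrt((a3 / a2 - py) * (vy - py))
    rho = math.degrees(math.atan(-vx / vy))
    theta = math.degrees(math.atan(-math.hypot(vx, vy) / f))
    ratio = (point_line_distance(o_h, line) * point_distance(o_f, v)) / (
        point_line_distance(o_f, line) * point_distance(o_h, v))
    h_c = h_o / (1.0 - ratio)
    return f, rho, theta, h_c


# Hypothetical scene: horizon at y = -500, vertical vanishing point at (0, 2000),
# and a 1.8 m reference object with its foot at y = 300 and head at y = 100.
f, rho, theta, h_c = camera_parameters(
    line=(0.0, 1.0, 500.0), v=(0.0, 2000.0), py=0.0,
    h_o=1.8, o_f=(0.0, 300.0), o_h=(0.0, 100.0))
print(f, rho, theta, h_c)
```

With these hypothetical inputs the focal length evaluates to 1000 pixels and the camera height to about 5.47 meters.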
In order to estimate the physical size of an object in the 3D space from its size in the 2D image, the proposed method detects the moving human to estimate the 3D space information. To estimate the human height, the proposed method assumes that the foot position lies on a flat ground plane. As a result, the foot position in the 2D image plane is inversely projected into the 3D space to obtain the human height information.
3. Normalized Human Height Estimation
The proposed human height estimation algorithm is an extended version of Jung et al.’s work [24] and consists of three steps: (i) moving human detection and tracking, (ii) automatic camera calibration, and (iii) reference object-based human height estimation with error correction. Figure 1 shows the block diagram of the proposed human height estimation method, where $I_k$ represents the $k$th input frame, $R$ the moving human region, $O$ the human tracking region, $P$ the projective matrix, $H_k$ the $k$th height estimation result of the human, and $H$ the error corrected height estimation result.
Figure 1: Block diagram of the proposed human height estimation method.
3.1. Moving Human Detection and Tracking
The proposed method first detects a moving human to estimate its height. If the detected human region includes the background region or if the region loses some part of the human body, an accurate estimation of the human height is difficult. For this reason, the proposed method generates a background using the Gaussian mixture model (GMM) [25, 26] and then detects and labels the foreground image. The regions that do not have enough pixels in the foreground image are removed to reduce the noise.
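The noise-removal step described above can be sketched as connected-component labeling followed by an area threshold. The Python example below is a minimal sketch that assumes a binary foreground mask is already available (the GMM background model itself is not implemented here); the 4-connectivity, the function name `remove_small_regions`, and the `min_pixels` threshold are assumptions for illustration.

```python
from collections import deque


def remove_small_regions(mask, min_pixels):
    """Label 4-connected foreground regions and drop those below min_pixels.

    mask is a list of rows of 0/1 values. The result holds the region label
    for kept pixels and 0 elsewhere.
    """
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    cleaned = [[0] * width for _ in range(height)]
    next_label = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or labels[y][x]:
                continue
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(y, x)])
            pixels = []
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:  # keep only sufficiently large regions
                for py, px in pixels:
                    cleaned[py][px] = next_label
    return cleaned
```

For example, with `min_pixels=3` a 2 × 2 blob is kept while isolated single pixels are removed.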
The detected foreground regions include not only single human regions but also groups of humans, possibly with nonhuman objects, which makes human tracking difficult and increases the human height estimation error. For that reason, the proposed method classifies each region according to whether it is a single human region or not. The classification uses combined histogram of oriented gradients and local binary pattern (HOG-LBP) features with a support vector machine (SVM)-based human detection method [27]. Using the detected human information, each foreground region is classified into one of two types. The first is a single human region, which contains exactly one human. The second is a single nonhuman region, which contains either no human or multiple humans. Figure 2 shows the moving human region detection and classification results. Figures 2(b) and 2(c), respectively, show the foreground image and human detection result of a video shown in Figure 2(a). Figure 2(d) shows the region classification result, where single human and single nonhuman regions are, respectively, represented by red and black boxes.
Figure 2: Moving human region detection and classification results: (a) an input frame, (b) the detected and labeled foreground regions, (c) the pedestrian detection results, and (d) classified regions.
The proposed method tracks the human and estimates the height in a video using the detected single human region. Although the Kalman filter tracker [28] is a popular stochastic tracking method, it cannot track a nonlinearly moving object. To solve this problem, the proposed method uses a particle filter tracker [29]. In a surveillance video, human information such as size and shape changes while the human is walking. For this reason, the model-based tracking method [30] models the target human using a color histogram to deal with the dynamic characteristics of the moving human. In the proposed method, the HSV color histogram is used to represent the human region to reduce the sensitivity to illumination changes.
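To make the color model concrete, the following Python sketch builds a normalized histogram in HSV space and compares two histograms. It is only an illustration: dropping the V channel, the bin counts, and the use of the Bhattacharyya coefficient as a similarity measure are assumptions that the text above does not specify.

```python
import colorsys
import math


def hs_histogram(pixels, h_bins=8, s_bins=4):
    """Normalized hue-saturation histogram of (r, g, b) pixels in 0..255.

    The value (brightness) channel is ignored, which is one simple way to
    reduce the influence of illumination changes.
    """
    hist = [0.0] * (h_bins * s_bins)
    for r, g, b in pixels:
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        hist[hi * s_bins + si] += 1.0
    total = sum(hist)
    return [c / total for c in hist] if total else hist


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms (1 = identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Hypothetical patches: the same red clothing under bright and dim light,
# and a green patch of a different target.
bright_red = hs_histogram([(200, 80, 80)] * 10)
dim_red = hs_histogram([(100, 40, 40)] * 10)
green = hs_histogram([(80, 200, 80)] * 10)
print(bhattacharyya(bright_red, dim_red), bhattacharyya(bright_red, green))
```

In a particle filter, such a similarity score would typically serve as the weight of each particle's candidate region.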
The particle filter tracking results may include a probabilistic error and cannot detect the entire human region. Moreover, if the number of particles increases to reduce the tracking error, the time complexity also increases. To solve these problems, the proposed method detects the tracked human region by matching the detected human region with the tracked human regions as
$$O_i = \arg\max_{R} \frac{1}{n_j} I(R_j, T_i), \tag{4}$$
where $O_i$ represents the $i$th human region, $R_j$ the $j$th moving human region that is detected using the background model, $n_j$ the number of pixels in the moving human region $R_j$, and $T_i$ the tracking region about the $i$th human. After matching, the proposed method uses additional trackers for unmatched single human regions.
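The matching rule of Eq. (4) can be sketched as follows. Since the function I is not defined in the text, this Python example assumes that I(R_j, T_i) counts the pixels of R_j that fall inside the tracking box T_i; the function name `match_region` and the box convention (x0, y0, x1, y1) with exclusive upper bounds are also assumptions.

```python
def match_region(regions, tracker_box):
    """Return the index j maximizing I(R_j, T_i) / n_j, or None if no overlap.

    regions is a list of sets of (x, y) pixels; tracker_box is (x0, y0, x1, y1).
    """
    x0, y0, x1, y1 = tracker_box
    best_index, best_score = None, 0.0
    for j, region in enumerate(regions):
        if not region:
            continue
        inside = sum(1 for x, y in region if x0 <= x < x1 and y0 <= y < y1)
        score = inside / len(region)  # overlap normalized by the region size n_j
        if score > best_score:
            best_index, best_score = j, score
    return best_index


# Hypothetical example: the second region lies mostly inside the tracker box.
regions = [
    {(0, 0), (1, 0), (2, 0), (4, 4)},
    {(5, 5), (6, 5), (5, 6), (9, 9)},
]
print(match_region(regions, (4, 4, 8, 8)))
```

A return value of `None` corresponds to a tracker with no matching detected region in the current frame.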
Figure 3 shows the human tracking results using the proposed method. In Figure 3, the red box represents the particle filter tracking result about the moving human, and the white box represents the optimal rectangular region that encloses the detected human region.
Figure 3: Results of human tracking using the proposed method: (a) the 2522nd frame, (b) the 2557th frame, (c) the 2743rd frame, and (d) the 2819th frame.
3.2. Vanishing Point and Line Estimation Using Human Information
The normalized human height can be estimated in meters in the 3D space by estimating camera parameters. For the automatic camera calibration, the vanishing points and line should be estimated using the parallel lines in the image. Li et al. estimated the vanishing line and point by extracting the lines from the background structure [31]. However, this method cannot calibrate the camera if the background structure does not have a sufficient number of parallel lines. For automatic calibration without using parallel lines, the proposed calibration method uses the moving human information [23]. More specifically, the proposed method detects both foot and head positions of the human in the 2D image as
$$h_{2D} = \arg\min_{o}\, y_{O_{i\_k}}, \qquad f_{2D} = \frac{1}{n_{i\_s}} \sum_{k=1}^{n_{i\_s}} O_{i_s}(k), \tag{5}$$
where $h_{2D}$ represents the human head position in the 2D image, $y_{O_{i\_k}}$ the y-axis coordinate of the $k$th pixel in the $i$th human region $O_i$, $f_{2D}$ the foot position in the 2D image, $O_{i_s}(k)$ the sample foot region that consists of 10% pixels of the $i$th human region, and $n_{i\_s}$ the number of pixels in the sample region.
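A minimal Python sketch of Eq. (5) is given below. It assumes an image coordinate system with the y-axis pointing down, and it assumes that the sample foot region consists of the lowest 10% of the region's pixels; the text only states that the sample region contains 10% of the pixels, so this choice and the function name `head_and_foot` are illustrative.

```python
def head_and_foot(region, foot_fraction=0.1):
    """Return (head, foot) image positions of a human region.

    region is an iterable of (x, y) pixels. The head is the top-most pixel
    (minimum y); the foot is the mean of the lowest foot_fraction of pixels.
    """
    pixels = list(region)
    head = min(pixels, key=lambda p: p[1])
    n_s = max(1, int(foot_fraction * len(pixels)))
    lowest = sorted(pixels, key=lambda p: p[1], reverse=True)[:n_s]
    foot = (sum(p[0] for p in lowest) / n_s, sum(p[1] for p in lowest) / n_s)
    return head, foot


# Hypothetical region: a two-pixel-wide column with a single head pixel on top.
region = [(10, 0)] + [(x, y) for y in range(1, 11) for x in (10, 11)]
print(head_and_foot(region))
```

Averaging over a sample foot region makes the foot estimate less sensitive to a single noisy boundary pixel than taking the lowest pixel alone.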
Both the vanishing points and the vanishing line can be estimated using the detected foot and head positions. The vertical vanishing point is estimated as the intersection of the foot-to-head lines, each of which passes through the foot and head points of the corresponding human region. The horizontal vanishing line is estimated using two or more horizontal vanishing points, and each horizontal vanishing point is estimated as the intersection of a foot-to-foot line and a head-to-head line. The foot-to-foot and head-to-head lines, respectively, pass through foot points and head points. Both the vanishing points and the vanishing line are estimated using the RANSAC algorithm to reduce the estimation error. Figure 4 illustrates the human-based vanishing point and line estimation process.
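The geometric core of this step, intersecting two image lines, can be written compactly with homogeneous coordinates. The Python sketch below shows only a single pairwise intersection; the RANSAC selection over many such intersections is omitted, and the function names and coordinates are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points, e.g. a foot and a head."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersect(l1, l2, eps=1e-9):
    """Intersection of two homogeneous lines, or None if they are parallel."""
    x, y, w = cross(l1, l2)
    if abs(w) < eps:
        return None
    return x / w, y / w


# Hypothetical foot and head positions of two observations of a walking human.
foot1, head1 = (-170.0, 300.0), (-190.0, 100.0)
foot2, head2 = (340.0, 300.0), (380.0, 100.0)

# Vertical vanishing point: intersection of the two foot-to-head lines.
vertical_vp = intersect(line_through(foot1, head1), line_through(foot2, head2))

# Horizontal vanishing point: intersection of foot-to-foot and head-to-head lines.
horizontal_vp = intersect(line_through(foot1, foot2), line_through(head1, head2))
print(vertical_vp, horizontal_vp)
```

In this example the foot-to-foot and head-to-head lines are parallel in the image, so no finite horizontal vanishing point exists and `None` is returned; such degenerate pairs would need to be skipped before the line fitting.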
Figure 4: Vanishing line estimation process using the human information.
3.3. Human Height Estimation and Error Correction
The proposed method computes the foot point in the 3D space for the normalized human height estimation using multiple videos acquired by different cameras. To calculate the foot point in the 3D space, the foot point in the 2D image is inversely projected into 3D space. The 3D point is on the line that connects the human foot point in the 3D space with the corresponding image sensor. Since the camera height is estimated based on the ground plane that includes the human foot points, the 3D foot point can be obtained by normalizing the inversely projected point with respect to the z-axis. As a result, the foot point in 3D can be calculated as
$$f_{3D} = \frac{1}{Z}\left(P^{T} P\right)^{-1} P^{T} f_{2D}, \tag{6}$$
where $f_{3D}$ represents the foot point in the 3D space, $f_{2D}$ the foot point in the 2D image, $P$ the projective matrix, and $Z$ the z-axis coordinate of the point that is inversely projected from the foot point in 2D image.
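One common way to realize the inverse projection of a foot point is sketched below in Python. Note that this sketch does not evaluate the expression in Eq. (6); it swaps in a ground-plane homography, assuming a world frame in which the ground plane is Z = 0, so that the columns of P for X, Y, and the translation form an invertible 3 × 3 matrix. The function names and the example matrix are hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, b):
    """Solve the 3 x 3 linear system m * x = b with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("singular system")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = b[r]
        solution.append(det3(replaced) / d)
    return solution


def backproject_to_ground(P, f2d):
    """Find (X, Y) on the plane Z = 0 whose projection by P is the pixel f2d."""
    H = [[row[0], row[1], row[3]] for row in P]  # drop the column that multiplies Z
    X, Y, W = solve3(H, [f2d[0], f2d[1], 1.0])
    if abs(W) < 1e-12:
        raise ValueError("the pixel lies on the horizon of the ground plane")
    return X / W, Y / W


# Hypothetical 3 x 4 projective matrix (focal length 1000, plane at depth 10).
P = [[1000.0, 0.0, 640.0, 6400.0],
     [0.0, 1000.0, 360.0, 3600.0],
     [0.0, 0.0, 1.0, 10.0]]
print(backproject_to_ground(P, (740.0, 560.0)))
```

The division by W corresponds to the normalization of the inversely projected point described above.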
The reference head point in the 3D space can be estimated by translating the foot point to the vertical direction of the ground plane. Using the reference head point in the 3D space, the corresponding head point in the 2D image is given as
$$h_{\mathrm{ref\_2D}} = P\, h_{\mathrm{ref\_3D}}, \tag{7}$$
where $h_{\mathrm{ref\_2D}}$ represents the reference head point in the 2D image and $h_{\mathrm{ref\_3D}}$ the corresponding head point in the 3D space. Using the reference head point, the human height can be estimated as
$$H_o = \frac{y_{\mathrm{ref\_h2D}} - y_{\mathrm{f2D}}}{y_{\mathrm{h2D}} - y_{\mathrm{f2D}}}\, H_{\mathrm{ref}}, \tag{8}$$
where $H_o$ represents the estimated human height, $H_{\mathrm{ref}}$ the reference height, $y_{\mathrm{ref\_h2D}}$ the y-axis coordinate of the reference head point in the 2D image, $y_{\mathrm{h2D}}$ the y-axis coordinate of the human head point in the 2D image, and $y_{\mathrm{f2D}}$ the y-axis coordinate of the human foot point in the 2D image. In this work, the reference height of 1.8 meters was used. Figure 5 shows the human height estimation model using reference height.
Figure 5: Human height estimation model using reference height.
The accuracy of human height estimation depends on the detected human region. To reduce the human height estimation error, the proposed method accumulates the estimated human heights in each frame and corrects the errors using the RANSAC algorithm. In the first step of the RANSAC algorithm, $n_s$ sample heights are randomly extracted. The next step computes the sum of squared differences (SSD) between the average of the sampled heights and each estimated human height. The first and second steps are repeated $n_i$ times to obtain the error corrected height.
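The accumulation-and-correction idea can be sketched in Python as a RANSAC-style consensus average. The text does not state how the best candidate is selected from the SSD values, so this sketch swaps in the common inlier-counting criterion with a tolerance `tol`; that criterion, the parameter defaults, the fixed random seed, and the function name are assumptions for illustration.

```python
import random


def ransac_height(heights, n_s=3, n_i=200, tol=0.05, seed=0):
    """Robustly average per-frame height estimates (meters).

    Each iteration draws n_s heights, averages them, and counts how many of
    all accumulated heights lie within tol of that average. The mean of the
    largest consensus set is returned.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_i):
        sample = rng.sample(heights, n_s)
        candidate = sum(sample) / n_s
        inliers = [h for h in heights if abs(h - candidate) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if not best_inliers:  # no consensus found: fall back to the plain mean
        return sum(heights) / len(heights)
    return sum(best_inliers) / len(best_inliers)


# Hypothetical per-frame estimates with two outliers caused by pose or occlusion.
heights = [1.74, 1.76, 1.75, 1.73, 1.77, 1.75, 1.74, 1.76, 1.20, 1.35]
print(ransac_height(heights))
```

Compared with a plain average, which the two low outliers would pull down to about 1.66 meters in this example, the consensus average stays close to the majority of the estimates.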
4. Experimental Results
This section presents the human height estimation results of the proposed method. The test videos were acquired using an uncalibrated camera looking down at the ground plane from heights between 2.2 and 7.2 meters. Each video sequence has a resolution of 1280 × 720 and includes moving humans. In addition, the Performance Evaluation of Tracking and Surveillance (PETS) 2009 dataset [32] was used to test the proposed algorithm.
Figure 6 shows the result of human height estimation using the proposed method. Although the size of the human in the 2D image differs from scene to scene, the normalized height is correctly estimated in all scenes using a prespecified height of the reference object for the camera calibration.
Figure 6: Height estimation results of the same person in three different scenes: (a) scene 1, (b) scene 2, and (c) scene 3.
Figure 7 shows the results of error correction of the estimated height. Figure 7(a) shows the human height estimation error caused by the human pose change. The human height estimation error is corrected using the proposed method as shown in Figure 7(b).
Figure 7: Height estimation error correction results of the walking human: (a) human height estimation error caused by the human pose change while the human is walking and (b) error corrected human height estimation results of (a).
Figure 8(a) shows the human height estimation error caused by an occlusion. The height estimation error of the occluded human is reduced by the proposed method as shown in Figure 8(b).
Figure 8: Height estimation error correction results of the occluded human: (a) human height estimation error caused by an occlusion and (b) error corrected human height estimation results of (a).
Figure 9 shows the estimation results for multiple human heights. As shown in Figure 9(a), the heights of separated humans are estimated. In Figures 9(b), 9(c), and 9(d), some humans are adjacent to each other and form multiple-human regions. As shown in these figures, the proposed method estimates the human height by classifying each region.
Figure 10 shows the estimated human height in video frames, where the ground truth of the object height is 1.75 meters shown as the solid curve. The dotted and daggered curves, respectively, show the estimated human height without and with error correction. As shown in Figure 10, the human height estimation error is reduced by 0.027 to 0.012 meters using the proposed error correction method.
Figure 10: Comparison between the estimated height and ground truth.
5. Conclusions
This paper presented an automatic calibration method based on object detection and tracking with multiple uncalibrated cameras. The method estimates the camera parameters, including the camera height, and uses them to estimate the normalized human height. The proposed method can be applied to object tracking and recognition in very-large-area video surveillance systems.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (B0101-15-0525, Development of Global Multitarget Tracking and Event Prediction Techniques Based on Real-Time Large-Scale Video Analysis), by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2016-H8501-16-1018) supervised by the IITP (Institute for Information & Communications Technology Promotion), and by the Technology Innovation Program (Development of Smart Video/Audio Surveillance SoC & Core Component for Onsite Decision Security System) under Grant 10047788.
References
[1] W. Lao, J. Han, and P. H. N. de With, "Automatic video-based human motion analyzer for consumer surveillance system," vol. 55, no. 2, pp. 591–598, 2009. doi: 10.1109/TCE.2009.5174427
[2] C. R. Del-Blanco, F. Jaureguizar, and N. Garcia, "An efficient multiple object detection and tracking framework for automatic counting and video surveillance applications," vol. 58, no. 3, pp. 857–862, 2012. doi: 10.1109/TCE.2012.6311328
[3] S. Lee, J. Lee, M. H. Hayes, and J. Paik, "Adaptive background generation for automatic detection of initial object region in multiple color-filter aperture camera-based surveillance system," vol. 58, no. 1, pp. 104–110, 2012. doi: 10.1109/TCE.2012.6170061
[4] W. Chantara, J. Mun, D. Shin, and Y. Ho, "Object tracking using adaptive template matching," vol. 4, no. 1, pp. 1–9, 2015. doi: 10.5573/ieiespc.2015.4.1.001
[5] H.-C. Chu and H. Yang, "A simple image-based object velocity estimation approach," in Proceedings of the 11th IEEE International Conference on Networking, Sensing and Control (ICNSC '14), Miami, Fla, USA, April 2014, pp. 102–107. doi: 10.1109/icnsc.2014.6819608
[6] V. Maik, J. Park, D. Kim, and J. Paik, "Model-based human pose estimation and its analysis using hausdorff matching," vol. 2, no. 2, p. 51, 2015. doi: 10.15323/techart.2015.05.2.2.51
[7] S. Kang, A. Roh, C. Eem, and H. Hong, "Using real-time matching for human gesture detection and tracking," vol. 1, no. 1, pp. 60–66, 2014. doi: 10.15323/techart.2014.02.1.1.60
[8] S. Lee, S. Jeong, H. Yu, G. Kim, H. Kwak, E. Kang, and S. Lee, "Efficient image transformation and camera registration for the multi-projector image calibration," vol. 3, no. 1, p. 38, 2016. doi: 10.15323/techart.2016.02.3.1.38
[9] H. Kim, D. Kim, and J. Paik, "Automatic estimation of spatially varying focal length for correcting distortion in fisheye lens images," vol. 2, no. 6, pp. 339–344, 2013.
[10] A. Poulin-Girard, S. Thibault, and D. Laurendeau, "Influence of camera calibration conditions on the accuracy of 3D reconstruction," vol. 24, no. 3, p. 2678, 2016. doi: 10.1364/OE.24.002678
[11] C.-H. Teng, Y.-S. Chen, and W.-H. Hsu, "Camera self-calibration method suitable for variant camera constraints," vol. 45, no. 4, pp. 688–696, 2006. doi: 10.1364/AO.45.000688
[12] A. Arfaoui and S. Thibault, "Fisheye lens calibration using virtual grid," vol. 52, no. 12, pp. 2577–2583, 2013. doi: 10.1364/AO.52.002577
[13] J. C. Neves, J. C. Moreno, and H. Proença, "A master-slave calibration algorithm with fish-eye correction," vol. 2015, Article ID 427270, 8 pages, 2015. doi: 10.1155/2015/427270
[14] Y. Zhang, L. Zhou, H. Liu, and Y. Shang, "A flexible online camera calibration using line segments," vol. 2016, Article ID 2802343, 16 pages, 2016. doi: 10.1155/2016/2802343
[15] T. Bell, J. Xu, and S. Zhang, "Method for out-of-focus camera calibration," vol. 55, no. 9, pp. 2346–2352, 2016. doi: 10.1364/ao.55.002346
[16] L. Kual-Zheng, "A simple calibration approach to single view height estimation," in Proceedings of the 9th Computer and Robot Vision (CRV '12), May 2012, pp. 28–30. doi: 10.1109/CRV.2012.29
[17] A. C. Gallagher, A. C. Blose, and T. Chen, "Jointly estimating demographics and height with a calibrated camera," in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV '09), Kyoto, Japan, September 2009, pp. 1187–1194.
[18] J. Shao, S. K. Zhou, and R. Chellappa, "Robust height estimation of moving objects from uncalibrated videos," vol. 19, no. 8, pp. 2221–2232, 2010. doi: 10.1109/TIP.2010.2046368
[19] B. Zhao and Z. Hu, "Camera self-calibration from translation by referring to a known camera," vol. 54, no. 25, pp. 7789–7798, 2015. doi: 10.1364/AO.54.007789
[20] Y. Li, J. Zhang, and J. Tian, "Method for pan-tilt camera calibration using single control point," vol. 22, no. 1, pp. 156–163, 2014.
[21] F. A. Andaló, G. Taubin, and S. Goldenstein, "Efficient height measurements in single images based on the detection of vanishing points," vol. 138, pp. 51–60, 2015. doi: 10.1016/j.cviu.2015.03.017
[22] Y. Li, J. Zhang, and J. Tian, "Camera calibration from vanishing points in image of architectural scenes," in Proceedings of the British Machine Vision Conference, Nottingham, UK, September 1999, pp. 382–391.
[23] J. Liu, R. Collins, and Y. Liu, "Surveillance camera autocalibration based on pedestrian height distributions," in Proceedings of the British Machine Vision Conference, January 2011, pp. 1–11.
[24] J. Jung, H. Kim, I. Yoon, and J. Paik, "Human height analysis using multiple uncalibrated cameras," in Proceedings of the 2016 IEEE International Conference on Consumer Electronics (ICCE '16), Las Vegas, Nev, USA, January 2016, pp. 213–214. doi: 10.1109/icce.2016.7430585
[25] Z. Zivkovic, "Improved adaptive Gaussian mixture model for background subtraction," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, Cambridge, UK, August 2004, pp. 28–31. doi: 10.1109/ICPR.2004.479
[26] Y. Kim, S. Jeong, J. Oh, and S. Lee, "Fast MOG (Mixture of Gaussian) algorithm based on predicting model parameters," vol. 2, no. 1, pp. 41–45, 2015.
[27] W.-J. Park, D.-H. Kim, Suryanto, C.-G. Lyuh, T. M. Roh, and S.-J. Ko, "Fast human detection using selective block-based HOG-LBP," in Proceedings of the 19th IEEE International Conference on Image Processing (ICIP '12), Orlando, Fla, USA, September 2012, pp. 601–604. doi: 10.1109/icip.2012.6466931
[28] S. S. Pathan, A. Al-Hamadi, and B. Michaelis, "Intelligent feature-guided multi-object tracking using Kalman filter," in Proceedings of the 2nd International Conference on Computer, Control and Communication, February 2009, pp. 1–6. doi: 10.1109/ic4.2009.4909260
[29] T. Zhang, S. Fei, X. Li, and H. Li, "An improved particle filter for tracking color object," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), vol. 2, Hunan, China, October 2008, pp. 109–113. doi: 10.1109/icicta.2008.183
[30] E. J. Rhee, J. Park, B. Seo, and J. Park, "Subjective evaluation on perceptual tracking errors from modeling errors in model-based tracking," vol. 4, no. 6, pp. 407–412, 2015. doi: 10.5573/ieiespc.2015.4.6.407
[31] B. Li, K. Peng, X. Ying, and H. Zha, "Vanishing point detection using cascaded 1D Hough transform from single images," vol. 33, no. 1, pp. 1–8, 2012. doi: 10.1016/j.patrec.2011.09.027
[32] Performance evaluation of tracking and surveillance 2009, http://www.cvg.reading.ac.uk/PETS2009/a.html