We propose an approach to multi-view markerless motion capture based on a 3D morphable human model. The morphable model is learned from a database of registered 3D body scans of different shapes in different poses, and pose variation of the body shape is driven by a predefined underlying skeleton. At the initialization step, we adapt the 3D morphable model to the multi-view images by adjusting its shape and pose parameters. For the tracking step, we combine a local and a global algorithm for pose estimation and surface tracking, and we add human pose prior information as a soft constraint to the energy of each particle. When an error occurs after the local algorithm, it can be fixed with fewer particles and iterations. We demonstrate the improvements with estimation results on a multi-view image sequence.
1. Introduction
The detection and recovery of human shapes and their 3D poses in images or videos are important problems in computer vision. Potential applications span diverse fields such as motion capture, interactive computer games, industrial design, sports and medical analysis, interfaces for human-computer interaction (HCI), surveillance, and robotics. Model-based approaches are especially suited to markerless motion capture because they constrain the search space by the degrees of freedom of the skeleton. Initializing human motion capture always requires a human model that approximates the shape, kinematic structure, and initial pose of the subject. Many approaches build a scan model of the person to be tracked, while the majority of pose estimation algorithms initialize a generic model with manually set limb lengths and shape. An accurately detailed individual model, which includes information on both body shape and pose, can be used for markerless tracking of the subject. Detailed 3D human shape and pose estimation from multi-view images is still a difficult problem without a satisfactory solution.
Local optimization methods are fast, but in the presence of visual ambiguities or fast motions the tracker may fail. For more robust results, global optimization methods such as particle filtering can be used, because they represent uncertainty within a rigorous Bayesian framework. The problem is that very many particles are needed to predict correctly in a human pose parameter space that usually has more than 20 degrees of freedom. Gall et al. [1] propose an approach that combines local and global algorithms using skeleton and surface information, but it needs an accurate 3D scan model and is sensitive to silhouette noise. Since obtaining a 3D scan of a person is very expensive, we instead use a 3D morphable model, as in Jain et al. [2], to generate an individual human model. Our parametric representation of the human body is a 3D morphable model with a predefined underlying skeleton. A wide variety of human models can be generated from this deformable model, which is learned from a scan database [3] of over 550 full-body 3D scans of 114 undressed subjects. The refined shape and skeleton pose estimated from the multi-view images serve as the initialized model for the next frame to be tracked. In most controlled environments, the combined local and global method obtains the correct result, but particle filtering incurs a high running time to search the whole pose space and needs many particles and iterations to predict. We therefore add an energy function that constrains each particle by silhouettes and additional pose prior information, instead of relying only on a large number of particles to search the whole pose space; this allows far fewer particles, guided by the gradient of an evaluation function.
The remainder of this paper is organized as follows. Section 2 reviews relevant previous work on model-based markerless motion capture. Section 3 describes the 3D morphable human model and the skeleton definition. Section 4 describes the optimization and the pose soft-constraint algorithm in detail. Section 5 shows the estimation results, and Section 6 concludes the paper.
2. Related Work
The surveys [4–6] comprehensively review existing techniques in vision-based motion capture. A model-based markerless motion capture system can be divided into four steps: initialization, tracking, pose estimation, and recognition. The initialization step is concerned with two things: the initial pose of the subject and the model representing the subject. Shape and pose initialization can be obtained by manual adaptation or by automatic methods; the latter still have limitations, such as requiring a specific pose or a predefined motion style. The prior model can be of several kinds: kinematic skeleton, shape, and color priors. Many approaches employ purely kinematic body models, which struggle to capture motion, let alone detailed body shape. For improved tracking accuracy, an articulated model that approximates the shape of the specific subject is needed. From only a few images one cannot recover accurate body shape, and furthermore the shape differs from person to person. Our approach is based on a 3D morphable model of human shape and pose similar to [2]. Jain et al. use the morphable model to estimate human shape and pose simultaneously, designing both shape particles and pose particles as the search space; their computational time is high, and the method is only used to reshape humans in 2D images.
Many model-based pose estimation algorithms minimize an error function that measures how well the 3D model fits the images. A popular parametric model is SCAPE (Shape Completion and Animation for PEople) [7], a data-driven method for building body shapes with different poses and individual shape variation. This model has recently been adopted as a morphable model to estimate human body shape from monocular or multi-view images [8–12]. Bǎlan et al. [8] fit this model closely to observed silhouettes to capture more detailed body deformations; however, it cannot capture skeleton joint parameters.
Most recently, the approach has been used to infer pose and shape from a single image. Guan et al. [9] considered additional visual cues, shading and internal edges as well as silhouettes, to fit the SCAPE model to an uncalibrated single image with the body height constrained. Sigal et al. [10] describe a discriminative model based on a mixture of experts to estimate SCAPE model parameters from monocular and multi-camera image silhouettes. Chen et al. [11] proposed a probabilistic generative method that models 3D deformable shape variations and infers 3D shapes from a single silhouette image; they use nonlinear optimization to map the 3D shape data into a low-dimensional manifold, expressing shape variations by a few latent variables. Pons-Moll et al. [13] proposed a hybrid tracker that combines correspondence-based local optimization with five inertial sensors placed on the body; although the tracker is more accurate and detailed, it requires additional sensors. Existing model-based tracking research approaches the problem either by Bayesian filtering or as an optimization problem. Gall et al. [1, 14, 15] introduce a global optimization approach for human motion capture called interacting simulated annealing (ISA), which is based on a particle filter and simulated annealing. While global optimizations can run fully automatically, their computation time is very high. Several recent papers [16, 17] implement and improve on it. In our system, we automatically obtain an individual human model with the same pose as the subject to be tracked. Human shape and pose are captured by multiple synchronized and calibrated cameras. An overview of our system is shown in Figure 1.
Pipeline of our method.
3. 3D Morphable Model
Principal Component Analysis (PCA) is a popular statistical method for extracting the most salient directions of variation from large multidimensional data sets. Our morphable model is based on the scan database [3] of over 550 full-body 3D scans of 114 undressed subjects. All subjects were scanned in a base pose, and some subjects were additionally scanned in 9 poses chosen randomly from a set of 34 poses; semantic correspondence between the scans is also defined. We learn a PCA model of the shape variations of the human body from these scans. A human model is then given by
(1) \( M(\alpha) = m_0 + \sum_{i=1}^{n} \alpha_i m_i, \)
where α = (α₁, α₂, …, α_n) is the human shape parameter vector, m_i is the i-th eigen-human, and m₀ is the mean (average) human model. Similar to [1, 2], the morphable model is combined with a skeleton of bones and joints. Like Jain et al. [2] and Gall et al. [1], we drive the body pose by a predefined underlying skeleton, shown in Figure 2, and the shape is described by the PCA parameters; in our paper, we use 20 PCA components as in [2]. We define a kinematic chain, so the motion of the body model can be parameterized by joint angles; kinematic chains have been widely used in human tracking and motion capture systems for many years. The mesh deformation is controlled by the linear blend skinning (LBS) technique. If v_i is the position of vertex i, T_j is the transformation of bone j, and ω_{i,j} is the weight of bone j for vertex i, LBS gives the position of the transformed vertex i as
(2) \( v_i' = \sum_j \omega_{i,j} \, (T_j v_i). \)
The bone weights describe how much each bone transformation affects each vertex and are normalized so that \( \sum_j \omega_{i,j} = 1 \). We rig the skeleton with 22 degrees of freedom, including six for the global position and orientation of the model, using the autorigging method of Baran and Popović [18] and attaching the weight values in Maya.
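The morphable-model equation (1) and the LBS deformation of equation (2) can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation; the function names and array layouts (vertices as rows, bone transforms as 4×4 homogeneous matrices) are our own assumptions.

```python
import numpy as np

def pca_shape(m0, eigen_humans, alpha):
    """Eq. (1): M(alpha) = m0 + sum_i alpha_i * m_i.
    m0: (V, 3) mean mesh; eigen_humans: (n, V, 3) eigen-humans; alpha: (n,) shape params."""
    return m0 + np.tensordot(alpha, eigen_humans, axes=1)

def linear_blend_skinning(vertices, bone_transforms, weights):
    """Eq. (2): v_i' = sum_j w_ij * (T_j v_i), with per-vertex weights summing to 1.
    vertices: (V, 3); bone_transforms: list of (4, 4) rigid transforms; weights: (V, J)."""
    V = vertices.shape[0]
    homo = np.hstack([vertices, np.ones((V, 1))])       # homogeneous coordinates (V, 4)
    out = np.zeros((V, 3))
    for j, T in enumerate(bone_transforms):
        out += weights[:, j:j + 1] * (homo @ T.T)[:, :3]
    return out
```

With identity bone transforms, LBS leaves the mesh unchanged, which is a quick sanity check for a rigged model.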
The template human model with the underlying skeleton.
3.1. Human Adaptation for Multiview
We change the pose of the model according to the real pose estimated from the visual hull, which is reconstructed from the multi-view silhouettes. Then we estimate the shape parameters from the multi-view silhouettes; for details we refer to our earlier paper [19]. The adaptation results are shown in Figures 3 and 5.
The morphable model adaptation result for the first frame: (a) the morphable model projected into the image, (b) the scan model overlaid with the fitted estimated model, and (c) the difference between the two models at corresponding mesh points (in millimeters); the average distance is 0.075 mm and the standard deviation is 16.16 mm.
3.2. Human Kinematic Chain
Consider a 3D vertex v_i associated with kinematic chain k_i and influenced by n_{k_i} joints; each rigid body motion is represented as a twist. A joint of a body limb can be modeled by a twist θξ̂, and every 3D rigid motion can be represented in exponential form by the homogeneous matrix
(3) \( M = \exp(\theta\hat{\xi}) = \exp\!\left( \theta \begin{pmatrix} \hat{\omega} & \nu \\ 0_{1\times 3} & 0 \end{pmatrix} \right), \)
where \( \hat{\xi} \in se(3) = \{ (\nu, \hat{\omega}) \mid \nu \in \mathbb{R}^3, \hat{\omega} \in so(3) \} \) is the matrix representation of the twist, with \( so(3) = \{ A \in \mathbb{R}^{3\times 3} \mid A = -A^{T} \} \).
The coordinates of the transformed point can be described as
(4) \( T_{\chi} V_i = \prod_{j=0}^{n_{k_i}} \exp\!\big( \theta_{l_{k_i}(j)} \hat{\xi}_{l_{k_i}(j)} \big) V_i, \)
where l_{k_i} is a mapping that gives the order of the joints in the kinematic chain, k_i is the limb associated with V_i, and n_{k_i} is the number of joints influencing the position and rotation of limb k_i; for further details we refer to [20]. We denote by θ = (θ₁, …, θ_n) the joint-angle state vector and by ξ₀ the twist of the 6 parameters associated with the model reference system. The state of the human model is then represented by the vector
(5) \( \chi = (\theta_0 \hat{\xi}_0, \theta). \)
In Section 4, we describe how to compute the vector χ that makes the pose of the 3D morphable model fit the pose of the person in the images.
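The twist exponential of equation (3) and the chained product of equation (4) can be evaluated with the closed-form Rodrigues formula for a unit rotation axis. The sketch below is a minimal numpy illustration under that assumption (‖ω‖ = 1, or ω = 0 for a pure translation); the function names are ours, not from the paper.

```python
import numpy as np

def hat(omega):
    """Skew-symmetric matrix of a 3-vector, so that hat(w) @ x == np.cross(w, x)."""
    wx, wy, wz = omega
    return np.array([[0., -wz, wy],
                     [wz, 0., -wx],
                     [-wy, wx, 0.]])

def twist_exp(theta, omega, v):
    """Rigid motion exp(theta * xi_hat) for a twist xi = (v, omega), as in eq. (3).
    Assumes ||omega|| == 1 for rotations; omega == 0 gives a pure translation."""
    omega = np.asarray(omega, float); v = np.asarray(v, float)
    T = np.eye(4)
    if np.allclose(omega, 0.0):
        T[:3, 3] = theta * v
        return T
    w = hat(omega)
    R = np.eye(3) + np.sin(theta) * w + (1 - np.cos(theta)) * (w @ w)  # Rodrigues
    T[:3, :3] = R
    T[:3, 3] = (np.eye(3) - R) @ np.cross(omega, v) + np.outer(omega, omega) @ v * theta
    return T

def kinematic_chain_transform(joints, point):
    """Eq. (4): apply the ordered product of joint twists to a homogeneous point."""
    T = np.eye(4)
    for theta, omega, v in joints:        # joints ordered by l_{k_i}
        T = T @ twist_exp(theta, omega, v)
    return T @ point
```

For a rotation about an axis through the origin, v = 0 and the exponential reduces to a pure rotation, which makes the formula easy to verify.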
4. Optimization and Surface Tracking
With multiple cameras, each camera has its own coordinate system. We must transform the local coordinates (x, y, z) of the body model into image coordinates (x′, y′), which is done in three steps: first transform the local coordinates into world coordinates, then transform the world coordinates into camera coordinates, and finally project the camera coordinates into the image.
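The three transformation steps above can be sketched as a single projection function. This is a generic pinhole-camera illustration, not the paper's code; the names `M_model`, `R`, `t`, and `K` are our own conventions for the model-to-world transform, world-to-camera rotation and translation, and camera intrinsics.

```python
import numpy as np

def project_point(x_local, M_model, R, t, K):
    """Project a point from model-local coordinates to image coordinates.
    M_model: (4, 4) model-to-world transform; R: (3, 3), t: (3,) world-to-camera;
    K: (3, 3) camera intrinsic matrix."""
    x_world = (M_model @ np.append(x_local, 1.0))[:3]   # step 1: local -> world
    x_cam = R @ x_world + t                             # step 2: world -> camera
    u = K @ x_cam                                       # step 3: camera -> image plane
    return u[:2] / u[2]                                 # perspective divide
```

With identity extrinsics, a point on the optical axis projects to the principal point (c_x, c_y), which is a handy check of the calibration conventions.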
4.1. Local Optimization
For the local optimization, we use contour correspondences and texture correspondences to obtain the estimation result.
From the extracted body contour in the image and the silhouette of the projected surface mesh, closest-point correspondences between the two contours define a set of corresponding 3D rays and 3D points, and we minimize the error between the 2D and 3D point of each correspondence. For the texture correspondences, we use SIFT features between two frames taken from the same camera (Figure 4). For 3D point-line pose estimation, each ray is modeled as a 3D Plücker line L = (d, l) with direction vector d and moment l. The error of a 3D-2D point pair is the norm of the vector between the transformed 3D point T_χV_i and the 3D ray L_i = (d_i, l_i):
(6) \( \| \Pi(T_{\chi} V_i) \times d_i - l_i \|_2, \)
where Π denotes the projection from homogeneous to nonhomogeneous coordinates, since the transformed point is in homogeneous coordinates while the Plücker coordinates are not.
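The Plücker-line error of equation (6) is easy to compute once a ray is available. Below is a minimal numpy sketch; `pluecker_from_pixel` is a hypothetical helper of our own (back-projecting a pixel through calibrated intrinsics K and extrinsics R, t), and the error function evaluates \( \| X \times d - l \|_2 \) for a nonhomogeneous 3D point X.

```python
import numpy as np

def pluecker_from_pixel(pixel, K, R, t):
    """Hypothetical helper: Plücker line L = (d, l) of the viewing ray through a pixel.
    K: intrinsics; R, t: world-to-camera extrinsics."""
    d = R.T @ np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    d /= np.linalg.norm(d)                  # unit direction in world coordinates
    c = -R.T @ t                            # camera centre in world coordinates
    return d, np.cross(c, d)                # moment l = c x d

def point_line_error(X, d, l):
    """Eq. (6): ||X x d - l||, the distance of 3D point X to the Plücker line (d, l)."""
    return np.linalg.norm(np.cross(X, d) - l)
```

For a unit direction d, this norm equals the perpendicular distance of the point to the ray, so points on the ray give zero error.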
The SIFT features of neighbouring frames.
The result of shape and pose initialization from 8 view images.
For an accurate result, we must minimize the alignment error between the body contour in the image and the projected surface mesh.
To find the vector χ, the sum of errors over all correspondences is minimized. Assuming N correspondences, the error to be minimized is
(7) \( \underset{\chi}{\arg\min} \; \frac{1}{2} \sum_{i}^{N} w_i \, \| \Pi(T_{\chi} V_i) \times d_i - l_i \|_2^2, \)
where w_i is a weight for each correspondence. We linearize the model using the first-order Taylor approximation
(8) \( \exp(\theta\hat{\xi}) = \sum_{k=0}^{\infty} \frac{(\theta\hat{\xi})^k}{k!} \approx I + \theta\hat{\xi} = I + \theta_0\hat{\xi}_0 + \cdots + \theta_j\hat{\xi}_j, \)
where I is the identity matrix. Inserting the Taylor approximation into (7) yields
(9) \( \underset{\chi}{\arg\min} \; \frac{1}{2} \sum_{i}^{N} w_i \, \Big\| \Pi\Big( \big( I + \theta_0\hat{\xi}_0 + \sum_{j=1}^{n_{k_i}} \theta_{l_{k_i}(j)}\hat{\xi}_{l_{k_i}(j)} \big) V_i \Big) \times d_i - l_i \Big\|_2^2. \)
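After the linearization of equation (8), the objective (9) is linear in the unknown joint angles, so each iteration reduces to a weighted linear least-squares solve. The sketch below shows one such step with numpy, assuming the correspondence residuals have already been stacked into a matrix A and vector b (an assumption of this illustration, not a prescription of the paper's solver).

```python
import numpy as np

def solve_linearized_step(A, b, w):
    """One weighted least-squares step: argmin_x 0.5 * sum_i w_i * (A_i x - b_i)^2.
    A: (m, n) stacked linearized correspondence rows; b: (m,); w: (m,) weights."""
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w_i) turns the weighted problem into an ordinary one.
    x, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return x
```

In practice the step is repeated, re-linearizing around the updated pose until the correspondence error converges.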
Even when the local optimization has converged to a minimum, the result is not guaranteed to be correct; for example, the overall error may be low while some limbs are still misaligned. We therefore compute the error for each limb, and if a misaligned limb is detected, all subsequent limbs in the chain are labelled as misaligned. The global optimization step is then invoked to fix them.
4.2. Filter Particle with Soft Constraints
Local optimization methods are fast and can produce accurate results, but if there are visual ambiguities or fast motions, tracking will fail. To deal with these problems, we use a particle filter, which represents uncertainty within a rigorous Bayesian framework. Global optimization methods use a set of particles to estimate the pose; if the whole body pose must be estimated, the computation time becomes very large. The difficulty for global optimization is the distribution of optima in the search space. Every particle has its own state and weight, and after each iteration a particle generates a new state, constructed by linear interpolation between the predicted pose and the pose estimated from the previous frame. The computation time is usually large and is determined by the number of particles and iterations. Once all particles have been updated, they are resampled, and particles far from the correct solution are discarded; the particle weights are evaluated by the energy below and normalized to sum to 1. To reduce the number of particles and iterations, we add pose prior information to constrain the particles. Thus, to find the optimal pose χ, we define the energy function
(10) \( E(\chi) = E_S(\chi) + \lambda E_R(\tilde{\chi}) + E_P(\chi), \)
where the first term measures the silhouette error between the projected surface model and the silhouette image; the second term penalizes strong deviations from the predicted pose, with weight λ set to 0.01; and the third term is a human pose prior on the predicted pose. The silhouette error of a pose χ for camera view C is the pixelwise difference between the projected surface model and the silhouette image, where the projection is generated from the pose of the particle:
(11) \( E_S^C(\chi) = \frac{1}{|S_C^p|} \sum_{p \in S_C^p} \big| D_C^p(\chi)(p) - D_C(p) \big| + \frac{1}{|S_C|} \sum_{q \in S_C} \big| D_C^p(\chi)(q) - D_C(q) \big|, \)
where S_C^p is the projected surface silhouette and S_C the observed binary silhouette image, and D_C^p and D_C are their Chamfer distance transforms. The sums over p and q run only over pixels inside the projected silhouette S_C^p and the observed silhouette S_C, respectively, not over the background, so evaluating this term for every particle is expensive. Each pixel inside the projected surface is set to zero in the distance transform, and the silhouette energy E_S(χ) is the average of E_S^C(χ) over all views.
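The per-view silhouette term of equation (11) can be sketched with distance transforms. This illustration uses a Euclidean distance transform from scipy as a stand-in for the Chamfer transform of the paper, with each silhouette interior set to zero as described above; the function name and boolean-mask interface are our assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_energy(proj_mask, obs_mask):
    """Per-view silhouette error of eq. (11).
    proj_mask: projected-model silhouette S_C^p (bool image);
    obs_mask: observed silhouette S_C (bool image)."""
    # Distance to the nearest silhouette pixel; zero inside each silhouette.
    D_proj = distance_transform_edt(~proj_mask)
    D_obs = distance_transform_edt(~obs_mask)
    diff = np.abs(D_proj - D_obs)
    term1 = diff[proj_mask].mean() if proj_mask.any() else 0.0   # pixels inside S_C^p
    term2 = diff[obs_mask].mean() if obs_mask.any() else 0.0     # pixels inside S_C
    return term1 + term2
```

Identical silhouettes give zero energy, and the energy grows as the projected and observed regions drift apart, which matches the role of the term in equation (10).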
The second term of (10) describes a smoothness constraint in the lower dimensional parameter space as follows:
(12) \( E_R(\tilde{\chi}) = \| \tilde{\chi} - P(\hat{\chi}) \|_2^2. \)
The third term is a soft human pose constraint of the energy function. It combines human anatomical constraints with a pose probability density learned from training samples:
(13) \( E_P(\chi) = E_{prior}(\chi) + E_{learned}(\chi). \)
For human motion, the joint angles should obey anatomical limits, so
(14) \( E_{prior}(\chi) = \sum_i \frac{\max^2\!\big( 0, \; \chi_i^{min} - \chi_i, \; \chi_i - \chi_i^{max} \big)}{\sigma^2}, \)
where (χ_i^{min}, χ_i^{max}) are the joint-angle bounds. As in [21], we learn the pose probabilities from a set of training samples, using a nonparametric Parzen-Rosenblatt density estimator [22, 23] with a Gaussian kernel, as in Brox et al. [15, 20]:
(15) \( p_{learned}(\chi) = \frac{1}{\sqrt{2\pi}\,\sigma T} \sum_{j=1}^{T} \exp\!\left( -\frac{(\chi - \chi_j)^2}{2\sigma^2} \right), \)
where T is the number of training samples and σ is the kernel width, learned from motion capture data as the maximum nearest-neighbor distance between all training samples. The pose prior provides anatomical constraints and plausible pose parameters for the normal degrees of freedom (DOF). We use about 100 samples from different motions for this constraint:
(16) \( E_{learned}(\chi) = -\log\big( p_{learned}(\chi) \big). \)
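The two prior terms of equations (14)-(16) are straightforward to evaluate. The sketch below is a minimal numpy illustration of both: the joint-limit penalty is zero inside the bounds and quadratic outside, and the learned term is the negative log of the Parzen-window density over T training poses. The small epsilon guarding the logarithm is our own numerical safeguard, not part of the paper's formulation.

```python
import numpy as np

def prior_energy(chi, chi_min, chi_max, sigma=1.0):
    """Eq. (14): anatomical joint-limit penalty, quadratic outside [chi_min, chi_max]."""
    excess = np.maximum(0.0, np.maximum(chi_min - chi, chi - chi_max))
    return np.sum(excess ** 2) / sigma ** 2

def learned_energy(chi, samples, sigma):
    """Eqs. (15)-(16): negative log Parzen-window density over T training poses.
    chi: (n,) pose vector; samples: (T, n) training poses; sigma: kernel width."""
    sq = np.sum((samples - chi) ** 2, axis=1)           # squared pose distances
    T = samples.shape[0]
    p = np.sum(np.exp(-sq / (2 * sigma ** 2))) / (np.sqrt(2 * np.pi) * sigma * T)
    return -np.log(p + 1e-12)                           # epsilon guards log(0)
```

A pose near the training samples has low energy, while an implausible pose far from all samples is penalized heavily, which is exactly how the term steers the particles.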
5. Experimental Results
We test our system on the MPI08 data set [13, 24] (hb sequence) provided by the University of Hannover, Germany; the person is captured by 8 HD cameras at a resolution of 1004 × 1004 pixels. For the 3D mesh, considering the difficulty of obtaining a laser scan model, we use the 3D morphable model generated by our PCA method from the multi-view image silhouettes. For complex motion, the correspondences between the model and the silhouettes alone cannot provide enough information to estimate the correct pose.
The local optimization is capable of tracking the person, but it cannot recover from errors. These errors usually occur on smaller body parts, while larger parts are less error-prone; the number of errors depends strongly on the visibility of the body parts, the frame rate, and the speed of the moving parts. The global optimization is only initiated after a misalignment, since global algorithms are hard to use throughout because of their high computational cost. Adding the pose prior information lets the particles be optimized: Gall et al. [1] need 15 iterations and a maximum of (20 × 15 =) 300 particles to estimate the pose correctly, whereas with the pose prior constraint we use only 25 pose particles and 10 iterations. No ground truth is available for the MPI08 data set, so no quantitative evaluation can be given for these experiments; we therefore judge visually whether an estimation result is correct or incorrect, as shown in Figures 5, 6, 7, 8, and 9. Figure 5 shows the result of pose and shape initialization. To test the validity of the morphable model, we also track with the visual hull mesh. The visual hull model is very rough, and its skeleton cannot be rigged well to the mesh vertices, especially at the shoulders; when the skeleton deforms, the mesh produces errors, as shown in Figure 7. Figure 8 shows the estimation result using the morphable model from camera view 2, and Figure 9 shows the combined local and global estimation result.
The estimation result from camera view 1.
The estimation result using the visual hull mesh model from camera view 2: because of the limited number of cameras, the visual hull model is not very exact and has irregular triangle meshes, especially around the shoulders. When the mesh deforms with the pose, errors appear (e.g., at frame 45) and all following frames are erroneous, whereas the morphable model obtains correct estimation results.
The estimation result using the morphable model: (a, d) the estimation result projected onto the original image; (b, e) the silhouette difference between the estimated projected view and the original view, where the green parts belong to the original silhouette and the purple parts to the estimated projected silhouette (the estimate is nearly correct); (c, f) the output mesh with the estimated pose.
The local estimation result. When the motion is slow, the local optimization result is correct (frames 1–60); once the motion speeds up, tracking fails (frame 70) and global optimization is needed to fix it. With annealed particle filtering we would need 100 particles and 15 iterations; after adding the pose prior constraint for the particles, we need only 25 particles and 10 iterations.
5.1. Computation Time
The computation time depends on several factors, such as the quality of the model, the number of camera views, and the quality of the original images. The global optimization accounts for most of the system's computation time, determined by the number of particles and iterations. On a computer with an Intel Core 2 Duo processor at 3.0 GHz and 2 GB of memory, with a multi-threaded implementation, the local optimization of our system takes 8–10 seconds per frame; after a misalignment, the global optimization takes 180–300 seconds.
6. Conclusion
In this work, a robust and accurate human motion capture method [1] has been investigated and improved. Both the local and global optimization methods are based on image silhouettes, so a proper background subtraction is required. While most model-based methods need a 3D scan model, we have presented a method for estimating human pose and shape from multi-view imagery based on a 3D morphable human model learned by PCA, with pose prior information as a soft constraint that combines anatomical limits and a pose probability density learned from training samples.
A good initial pose and shape are very important for obtaining the correct pose estimation result, and the method is not suitable for applications requiring real-time pose estimation. Besides the body pose, we also provide a human mesh model instead of a 3D scan, giving additional information about the pose and the person. Once the approximate body shape and pose are obtained, we can compute detailed human model shapes with full correspondence; this computed shape can replace the 3D scan model in motion capture applications. The subject should wear tight clothes in the multi-view cameras. In future work, we will consider single-view person pose estimation.
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments. They would also like to thank Hasler [3], Gall [1], and Pons-Moll [13, 24] for providing their databases for research purposes. This work is supported by the National Key Technology R&D Program of China (2012BAH01F03), the National Natural Science Foundation of China (60973061), the National 973 Key Research Program of China (2011CB302203), the Ph.D. Programs Foundation of the Ministry of Education of China (20100009110004), and the Beijing Natural Science Foundation (4123104).
References
[1] J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel, "Motion capture using joint skeleton tracking and surface estimation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR '09), Miami, Fla, USA, June 2009, pp. 1746–1753.
[2] A. Jain, T. Thormählen, H.-P. Seidel, and C. Theobalt, "MovieReshape: tracking and reshaping of humans in videos," 2010, vol. 29, no. 6.
[3] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.-P. Seidel, "A statistical model of human pose and body shape," 2009, vol. 28, no. 2, pp. 337–346.
[4] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," 2001, vol. 81, no. 3, pp. 231–268.
[5] T. B. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," 2006, vol. 104, no. 2-3, pp. 90–126.
[6] G. Pons-Moll and B. Rosenhahn, Springer, New York, NY, USA, 2011.
[7] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, "SCAPE: shape completion and animation of people," 2005, vol. 24, no. 3, pp. 241–253.
[8] A. O. Bǎlan, L. Sigal, M. J. Black, J. E. Davis, and H. W. Haussecker, "Detailed human shape and pose from images," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), Minneapolis, Minn, USA, June 2007, pp. 1–8.
[9] P. Guan, A. Weiss, A. O. Balan, and M. J. Black, "Estimating human shape and pose from a single image," in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV '09), 2009.
[10] L. Sigal, A. Balan, and M. J. Black, "Combined discriminative and generative articulated pose and non-rigid shape estimation," in Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS '07), December 2007.
[11] Y. Chen, T.-K. Kim, and R. Cipolla, "Inferring 3D shapes and deformations from single views," in Proceedings of the 11th European Conference on Computer Vision, 2010, pp. 300–313.
[12] S. Zhou, H. Fu, L. Liu, D. Cohen-Or, and X. Han, "Parameter reshaping of human bodies in images," 2010, vol. 29, no. 4.
[13] G. Pons-Moll, A. Baak, T. Helten, M. Müller, H.-P. Seidel, and B. Rosenhahn, "Multisensor-fusion for 3D full-body human motion capture," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), San Francisco, Calif, USA, June 2010, pp. 663–670.
[14] J. Gall, B. Rosenhahn, and H.-P. Seidel, "An introduction to interacting simulated annealing," in Understanding, Modeling, Capture and Animation, vol. 36 of Computational Imaging and Vision, Springer, 2008, pp. 319–345.
[15] J. Gall, B. Rosenhahn, T. Brox, and H.-P. Seidel, "Optimization and filtering for human motion capture: a multi-layer framework," 2010, vol. 87, no. 1-2, pp. 75–92.
[16] D. Olga, Eidgenössische Technische Hochschule Zürich, Computer Vision and Geometry Group, 2010.
[17] Y. Liu, C. Stoll, J. Gall, H.-P. Seidel, and C. Theobalt, "Markerless motion capture of interacting characters using multi-view image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), Providence, RI, USA, June 2011, pp. 1249–1256.
[18] I. Baran and J. Popović, "Automatic rigging and animation of 3D characters," 2007, vol. 26, no. 3.
[19] D. Y. Zhang, Z. J. Miao, and S. Y. Chen, "Human model adaptation for multi-view markerless motion capture," 2013, vol. 2013, Article ID 564214.
[20] T. Brox, B. Rosenhahn, and D. Cremers, "Contours, optic flow, and prior knowledge: cues for capturing 3D human motion in videos," in Understanding, Modeling, Capture and Animation, vol. 36 of Computational Imaging and Vision, Springer, 2007, pp. 265–293.
[21] L. Sigal, A. O. Balan, and M. J. Black, "HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion," 2010, vol. 87, no. 1-2, pp. 4–27.
[22] E. Parzen, "On estimation of a probability density function and mode," 1962, vol. 33, pp. 1065–1076.
[23] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," 1956, vol. 27, pp. 832–837.
[24] A. Baak, T. Helten, M. Müller, G. Pons-Moll, H.-P. Seidel, and B. Rosenhahn, "Analyzing and evaluating markerless motion tracking using inertial sensors," 2010, pp. 139–152.