This paper studies a hierarchical (chained) structure for stochastic tracking of marked feature points while a person moves in the field of view of an RGB and depth sensor. The objective is to explore how the information between the two sensing modalities (namely, RGB sensing and depth sensing) can be cascaded in order to distribute and share the implicit knowledge associated with the tracking environment. In the first layer, the prior estimate of the state of the object is distributed based on a novel expected-motion-constraints approach associated with the movements. In the second layer, the segmented output resulting from the RGB image is used for tracking marked feature points of interest in the depth image of the person. We propose two approaches for associating a measure (weight) with the distribution of the estimates (particles) of the tracked feature points using depth data. The first measure is based on the notion of the spin-image and the second on the geodesic distance. The paper presents the overall implementation of the proposed method combined with some case study results.

In this paper, a framework is proposed which can be used to explore the information flow and sharing in a distributed Bayesian tracking using both RGB and depth sensors. The proposed hierarchical (cascaded) particle filter first tracks the human body in the RGB image by exploiting the notion of importance sampling [

Tracking the overall movements of the human body combined with tracking specific points of interest located on the tracked body has many applications, ranging from virtual/augmented reality (V/AR), surveillance, and motion analysis to human-robot/environment interaction. However, tracking with various types of ambient sensors also poses several challenges. Firstly, the human body's shape and movements are highly variable, and the body parts have many degrees of freedom. Secondly, the tracking environment is usually complex and can be subject to different illumination and background conditions. Such an environment is a common source of ambiguity that undermines the stability of RGB-only tracking. In our study, we have taken advantage of combined depth and RGB sensors. Time-of-flight sensors are able to provide dense depth measurements at a high frame rate, for example, Microsoft Kinect [

In tracking the overall coarse shape of gait, [

There also exists a large body of literature aimed at tracking selected limbs of the human body. The majority of these approaches model the whole body as articulated interconnected segments. However, the drawbacks of using an articulated model are the high dimensionality of the configuration space and the exponentially increasing computational cost. A number of effective 2D tracking methods have been proposed [

In more recent works, some researchers have combined different features which complement each other in order to implement robust tracking. For example, Xu et al. in [

This paper proposes a framework for implementing levels of detail in tracking based on a chained particle filter. A particle filter (PF) uses a dynamic model to guide the propagation of the state estimate within a limited subspace of the target measurements [

A similar hierarchical framework has been proposed in [

Our implementation of the levels of detail consists of two main cascaded layers of particle filter. The first layer is an enhanced color-based rectangular region tracker and the second is a depth-based particle filter tracking selected feature points on the body within the bounding volume obtained from the first level. The experimental set-up uses both the color and depth sensors of a single Microsoft Kinect II. The sensor is positioned on top of a tripod, directly facing the background as well as the person to be tracked. An adult is asked to walk under normal illumination conditions in a naturally cluttered environment. The color and depth video sequences are captured simultaneously by the Kinect II sensor. The hierarchical tracking is implemented in C++ and runs for a single target on a PC with an Intel 3.20 GHz CPU.

At the initialization stage, we synchronize the RGB frame and the depth frame. However, the original size of the color frame is

For the first level, we extend the results of [

Since the incremental velocity and direction of a person can be estimated through the movement dynamic model, they can also be utilized to enhance the prior distribution of particles, that is, to spread the particles in the expected direction of movement. More precisely, defining the relationship between the velocity, the position variation, and the properties of the bounding box is the main idea of the hierarchical sampling utilized in this paper. The view of the tracking area is based on perspective geometry. In general, under such projective geometry, the various movements and activities of the tracked person can be regarded as a combination of two different types of motion, that is, movements along the horizontal (

For example, if the person is moving along

(a) Result of tracking a walking person using the CPF in frames 70, 75, 80, 95, 115, and 120. (b) Samples propagated using 100 particles.

In the most general case, the person’s movement is a combination of the above two cases. A relationship between parameters
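The motion-guided prior described above can be sketched as a simple importance sampler. The following is a minimal Python illustration (function name, state layout, and noise parameters are illustrative assumptions, not the paper's implementation) that spreads particles more widely along the estimated direction of movement than across it:

```python
import numpy as np

def propagate_particles(particles, dt=1.0, sigma_along=8.0, sigma_across=2.0, rng=None):
    """Propagate bounding-box particles [x, y, vx, vy] with a constant-velocity
    model, spreading them more widely along the expected direction of motion."""
    rng = np.random.default_rng() if rng is None else rng
    pos, vel = particles[:, :2], particles[:, 2:]
    speed = np.linalg.norm(vel, axis=1, keepdims=True)
    # Unit vector of the expected motion (fall back to the x-axis when static).
    u = np.where(speed > 1e-6, vel / np.maximum(speed, 1e-6), np.array([1.0, 0.0]))
    # Perpendicular direction.
    u_perp = np.stack([-u[:, 1], u[:, 0]], axis=1)
    n = len(particles)
    noise = (rng.normal(0, sigma_along, (n, 1)) * u
             + rng.normal(0, sigma_across, (n, 1)) * u_perp)
    new_pos = pos + vel * dt + noise
    return np.hstack([new_pos, vel])
```

The anisotropic noise plays the role of the expected-motion constraint: the faster and more directed the movement, the more the samples concentrate along the predicted path.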

The second layer of tracking consists of several major steps. The segmented input depth data inside the ROI is first filtered in order to reduce the dominant noise in the data and to obtain a consistent surface point cloud. Here we use a nearest-neighbor interpolation algorithm [

The integrated depth segmentation algorithm proceeds as follows. Given the bounding box acquired in the previous layer to represent the location and coarse spatial range of the human body, the foreground segmentation fully utilizes this result and combines it with the depth information to decrease the computational cost. The key idea of this step is to check the depth continuity of neighboring pixels and return all the separated depth clusters inside the ROI. To start with, we apply a depth-first search (DFS) to these points and identify the largest depth cluster inside the bounding box area, which is considered to be the human body. DFS can enumerate the connected components of an undirected graph, from which the largest is selected. After running the DFS, we label each pixel belonging to the background as 0 and each pixel belonging to the human body surface as 1 (Figure
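The clustering step can be sketched as follows. This is a simplified Python illustration (the depth-continuity threshold, 4-connectivity, and the treatment of zero as invalid depth are assumptions) of finding the largest depth-continuous cluster inside the ROI with an iterative DFS:

```python
import numpy as np

def largest_depth_cluster(depth, max_jump=50):
    """Label the largest 4-connected cluster of depth-continuous pixels as 1
    and everything else as 0. `max_jump` is the allowed depth discontinuity
    between neighboring pixels (sensor units)."""
    h, w = depth.shape
    labels = np.zeros((h, w), dtype=np.uint8)
    visited = np.zeros((h, w), dtype=bool)
    best = []
    for sy in range(h):
        for sx in range(w):
            if visited[sy, sx] or depth[sy, sx] == 0:  # 0 = invalid depth
                continue
            stack, comp = [(sy, sx)], []
            visited[sy, sx] = True
            while stack:  # iterative depth-first search
                y, x = stack.pop()
                comp.append((y, x))
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if (0 <= ny < h and 0 <= nx < w and not visited[ny, nx]
                            and depth[ny, nx] != 0
                            and abs(int(depth[ny, nx]) - int(depth[y, x])) <= max_jump):
                        visited[ny, nx] = True
                        stack.append((ny, nx))
            if len(comp) > len(best):
                best = comp
    for y, x in best:
        labels[y, x] = 1
    return labels
```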

An example sequence associated with two-stage depth segmentation, that is, background subtraction and DFS in order to find the largest connected component.

In order to enable further appearance-based body part matching between successive occurrences of the tracked person, we extract a surface mesh on a local patch of the point cloud. This body division method is explored in order to generate a local mesh on the patch including the extremity regions. We initiate the mesh generation by first detecting the head region. This can be done by finding the minimum width of the silhouette of the body. By assuming the natural position of the head region of a walking person, we first reduce the search area to the top 1/3 of the entire segmented region. We use the silhouette width along the horizontal direction to generate the human body silhouette width curve. The variation in the silhouette width curve, representing a silhouette histogram, is shown in the middle column of Figure

Human head region detection.
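As a rough illustration of this head detection step, the sketch below (a hypothetical helper, simplified to a global minimum search over the top third rather than the paper's exact procedure) computes the per-row silhouette width curve and takes the minimum-width row in the search area as the neck, so the rows above it form the head region:

```python
import numpy as np

def detect_head_rows(mask):
    """Locate the head region from a binary body mask: compute the per-row
    silhouette width, search the top third of the body for the neck (the
    minimum-width row), and return (top_row, neck_row)."""
    rows = np.where(mask.any(axis=1))[0]
    top, bottom = rows[0], rows[-1]
    width = mask.sum(axis=1)                  # silhouette width curve
    search_end = top + (bottom - top) // 3    # restrict to the top 1/3
    # Skip the very first row so the crown itself is not picked as the neck.
    neck = top + 1 + int(np.argmin(width[top + 1:search_end + 1]))
    return top, neck
```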

The computed pixel location in the depth image is defined with respect to a local sensor coordinates

(a-b) Different views of calibrated point cloud of human body in world space; (c) mesh generation on the segmented and calibrated point cloud.

Using the definition of the 3D bounding box coordinate system, a polygonal surface mesh can be generated as a 3D undirected graph. The undirected graph in this depth segmentation algorithm is defined as

The principle behind the depth-based particle filter (DPF) is similar to the implementation in the first layer. Here the idea is to track points of interest which are initially defined on the point cloud of the segmented body, for example,

The spin-image associated with a certain point of a 3D object is used as a reference to weigh samples in the DPF [

In this equation,

An example of spin-image generated from local depth patch on the segmented body.
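A minimal sketch of spin-image construction follows, assuming a plain 2D histogram with illustrative bin size and image width (not necessarily the paper's parameters). Each surface point is mapped to its radial distance from the normal line (alpha) and its signed height along the normal (beta), and the pairs are binned:

```python
import numpy as np

def spin_image(points, p, n, bin_size=0.02, image_width=16):
    """Build a spin-image at the oriented point (p, n): each surface point x
    maps to (alpha, beta) = (radial distance from the normal line, signed
    height along the normal), then the pairs are accumulated in a 2D grid."""
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n                                           # height along the normal
    alpha = np.sqrt(np.maximum(np.sum(d * d, axis=1) - beta**2, 0.0))
    i = ((image_width / 2) - beta / bin_size).astype(int)  # row: height
    j = (alpha / bin_size).astype(int)                     # col: radius
    img = np.zeros((image_width, image_width))
    keep = (0 <= i) & (i < image_width) & (0 <= j) & (j < image_width)
    np.add.at(img, (i[keep], j[keep]), 1.0)                # accumulate hits
    return img
```

Because alpha and beta are defined relative to the point's own normal, the resulting image is invariant to rotations and translations of the surface, which is what makes it useful as a view-invariant weighting reference.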

In order to measure the similarity between spin-images, a correlation coefficient is defined and utilized by [

Comparison of spin-images using the correlation map between spin-images of similar and dissimilar points. (a) Spin-image generated from point
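The similarity score can be sketched as the standard linear correlation coefficient between the two flattened spin-images (a common choice for spin-image matching; the exact variant used in the paper may differ):

```python
import numpy as np

def spin_correlation(P, Q):
    """Linear correlation coefficient between two spin-images (flattened):
    +1 for identical images, near 0 for unrelated ones."""
    p, q = P.ravel().astype(float), Q.ravel().astype(float)
    p -= p.mean()
    q -= q.mean()
    denom = np.sqrt((p @ p) * (q @ q))
    return (p @ q) / denom if denom > 0 else 0.0
```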

Each particle sample in the DPF of this level is represented by a 3D point, whose state vector,

Figure

Example of the hierarchical implementation of particle filter. (a) State estimation of the designated point of interest on the body (solid circle). (b) Generated samples (circle point) propagated by the proposed DPF.
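The weighting-and-resampling step of the DPF can be sketched as follows. Here the per-particle scores stand in for the spin-image correlations; systematic resampling is an illustrative assumption, since the paper does not specify the resampling scheme:

```python
import numpy as np

def weight_and_resample(particles, scores, rng=None):
    """Turn per-particle similarity scores (e.g. spin-image correlations)
    into normalized weights and perform systematic resampling."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.maximum(scores, 0.0)  # clamp negative correlations to zero weight
    w = w / w.sum() if w.sum() > 0 else np.full(len(scores), 1.0 / len(scores))
    n = len(particles)
    positions = (rng.random() + np.arange(n)) / n  # systematic resampling grid
    idx = np.searchsorted(np.cumsum(w), positions)
    return particles[np.minimum(idx, n - 1)]
```

Particles whose local surface patch correlates well with the reference spin-image are duplicated, while poorly matching particles are discarded.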

In the previous section, given a point of interest on the body, we utilized the spin-image to associate weights with the sample distribution. The second implementation of the DPF studies another approach for associating weights with the sample distribution. Here, the desired feature points on the tracked body are first mapped to the vertices of the constructed surface mesh. During tracking, the distance between these points of interest along the surface mesh remains nearly unchanged. Hence, during the sample propagation, one can weight the sample distribution by how much the samples deviate from the reference geodesic distance calculated at the initial time step of the tracking process. Such an approach can yield a method robust against mesh deformations, translations, and rotations. Traditional local color-based approaches to defining features are very sensitive to such local deformations. Model-based approaches to defining features are also very restricted, mainly due to their high computational cost, since the human body must be modeled as an articulated object with high degrees of freedom. Since the Euclidean distance between two feature points can vary widely with body movement in 3D space (and also being inspired by the concept of Accumulative Geodesic Extrema [

Constructing a surface mesh from the point cloud of the whole body allows us to measure geodesic distances between any feature points selected on the body. The geodesic distance [

(a) Visualization of geodesic distance computation from the point cloud. (b) Example of surface mesh generation using adjacency relationships.
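The geodesic distance along the mesh can be approximated as the shortest path over mesh edges. The sketch below uses Dijkstra's algorithm with Euclidean edge weights (the vertex-array/edge-list representation is an assumption; a disconnected surface, e.g. under self-occlusion, yields an infinite distance, matching the failure mode discussed later):

```python
import heapq
import numpy as np

def geodesic_distance(vertices, edges, src, dst):
    """Approximate geodesic distance between two mesh vertices as the shortest
    path along mesh edges (Dijkstra with Euclidean edge weights)."""
    adj = {i: [] for i in range(len(vertices))}
    for a, b in edges:
        w = float(np.linalg.norm(vertices[a] - vertices[b]))
        adj[a].append((b, w))
        adj[b].append((a, w))
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, np.inf):
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, np.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return np.inf  # disconnected surface (e.g. self-occlusion)
```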

Figure

(a) Estimated state of the extremity point (solid circle). (b) Samples (circle point) propagated by the cascaded PF.

In this paper, we proposed an approach for tracking the movements of a person in a cluttered environment. The method is based on the notion of a hierarchical particle filter which incorporates two layers consisting of coarse-to-fine tracking subsystems. In the first layer, and in consideration of the computational time needed to converge to the true state, we proposed a sequential approach to importance sampling. This method is implemented by modeling the relationship between the movement of the person and the method of populating the particles in the system dynamic model. In the preprocessing stage of the second layer, we synchronized the depth and color frames, extracted the human body inside the bounding box, and constructed a surface mesh from it. In this layer, we also proposed and utilized two types of measures to weight the sample propagation in the depth-based particle filter implementation for tracking points of interest on the body.

In the second step of our cascaded framework we used two different features to implement the depth-based tracking. Each feature has unique properties with both advantages and drawbacks in different situations. The depth-based PF using the spin-image is a simple and fast adaptive tracking procedure, since this feature is view-invariant and robust with respect to rotation and translation of the object's posture. When the subjects have relatively smooth gait patterns, the tracking performance is satisfactory. However, one drawback of this method is that it struggles to capture complex patterns when subjects change their walking patterns significantly. For the depth-based PF using the geodesic distance, one main advantage of this feature is that it is largely invariant to surface mesh deformations and rigid transformations. More precisely, the geodesic distance from the left hand to the right hand of a person along the body surface is unaffected by her/his posture. However, when the surface is not connected, that is, when there is a self-occlusion, the method fails to calculate the distance along the surface mesh, and consequently the focus of the target may drift to other positions on the surface mesh.

Moreover, in our implementation it was observed that the spin-image is sensitive to noise arising from the computation of the surface normal, and that the computation of the geodesic distance might result in inconsistent labeling under more complex scenarios such as self-occlusion. In future work, we would like to extend the proposed framework to a network of sensors and also utilize hardware accelerators (e.g., GPUs) in order to achieve robust tracking of multiple people.

This paper is an extended version of our previous work [

The authors declare that there are no conflicts of interest regarding the publication of this paper.