Real-Time Pedestrian Tracking and Counting with TLD

This paper describes a solution to the problem of automatic multipedestrian tracking and counting. First, a background modeling algorithm is applied to obtain multipedestrian candidates, followed by a confirmation step with classification. Then each pedestrian patch is handled by real-time TLD (Tracking-Learning-Detection) to get a new predicted position according to a similarity measure. The TLD results are then compared with the classification list to determine whether a pedestrian is new, disappeared, or existing. Finally, single-line counting with a buffer zone is employed to count pedestrians. Experimental results on the public PETS database demonstrate the validity of our solution.


Introduction
Pedestrian tracking and counting is an important research topic in computer vision. It plays a crucial role in many applications, including intelligent monitoring and traffic safety. Nevertheless, challenges such as large variations in pedestrian posture, background clutter, partial occlusion, and illumination changes complicate the problem.
Current state-of-the-art tracking algorithms can be roughly divided into two categories: generative and discriminative. Generative methods [1,2] generally describe the target's appearance with a generated model and then search the candidate targets for the one that minimizes the reconstruction error. In contrast, discriminative approaches [3][4][5] train classifiers to find a decision boundary between object and background. By making full use of both target and background information, this approach achieves higher tracking accuracy. In recent years, discriminative approaches have developed rapidly.
TLD is a popular discriminative method [5]. It integrates an online detector that gives it a good ability to redetect the target, which is important for continuous tracking when the target disappears during tracking. In addition, TLD supports accurate long-term tracking through online learning.
TLD has been used in different tracking tasks. Zhang et al. [6] use TLD for dynamic gesture tracking. To initialize the TLD algorithm, a specific gesture is manually marked with a bounding box in the first frame. Chen et al. [7] improved TLD to track a human in unconstrained environments, and the object must also be initialized manually. Crane tracking and monitoring with stable TLD also shows strong robustness and accuracy [8]. However, all of these studies address single-target tracking.
Pedestrian flow statistics involves tracking pedestrians and counting them in video [9], for example in the surveillance of crossroads. Unlike in the above studies, many people may appear at the same time, which requires multiple-object tracking. Moreover, any pedestrian may appear at random. As a result, the targets cannot be extracted manually, and automatic search for tracking objects is also key to multipedestrian tracking and counting.
We propose a new method called improved TLD (ITLD), which introduces background subtraction to obtain multiple objects automatically and then uses an updating mechanism for the tracking list to manage them. Figure 1 shows the framework of our system, which also integrates a counting module.

Multipedestrian Patch Obtaining
There are several ways to obtain multiple objects. Cao et al. [10] use Haar features and the AdaBoost algorithm to automatically detect human faces. Zhou et al. [11] proposed PE-TLD, in which ViBe and a variance filter are followed by HOG features and an SVM classifier for automatic target recognition. In contrast, Sharma et al. [12] let users manually select the desired tracking objects during initialization.
In our system, a coarse-to-fine strategy is employed to obtain multiple targets. First, a dynamic average background model followed by the Otsu algorithm is used to extract candidate pedestrian patches. Next, we use pedestrian detection combining Haar-AdaBoost [13,14] and HOG-SVM [15] to exclude nonpedestrian patches.

Extraction of Candidate Pedestrian Patch.
Our dynamic average background modeling is based on Gaussian statistics and takes into account the differing variances of the three color channels. To reduce modeling error, we compute separate statistics in RGB space for every pixel. After the background model is constructed, a simple subtraction between the current frame and the latest background yields a rough foreground. In this step, the Otsu algorithm is used to obtain the segmentation threshold. Morphological processing is then used to remove small regions and fill gaps in the foreground. Finally, the minimum circumscribed rectangle is used to obtain possible pedestrian targets. The detailed process is given in [16].
The recurrences of background model updating are shown in the following:
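As a rough illustration of this kind of background model, the sketch below keeps a per-pixel, per-channel running mean and variance and thresholds the background difference with Otsu's method. It is a generic running Gaussian average, not necessarily the exact recurrences used in this work. The learning rate `alpha` and the function names are assumptions made for the example, and the morphological cleanup step is omitted.

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05):
    """One running-average step per pixel and per RGB channel (alpha is assumed)."""
    frame = frame.astype(np.float64)
    diff = frame - mean
    new_mean = mean + alpha * diff
    new_var = (1.0 - alpha) * var + alpha * diff ** 2
    return new_mean, new_var

def otsu_threshold(gray):
    """Otsu's threshold for a uint8 image: maximize the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # probability of the class "<= t"
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to t
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = np.where(denom > 0, (mu[-1] * omega - mu) ** 2 / denom, 0.0)
    return int(np.argmax(sigma_b))

def foreground_mask(mean, frame):
    """Subtract the background, then segment the difference image with Otsu."""
    diff = np.abs(frame.astype(np.float64) - mean).max(axis=2)
    diff8 = np.clip(diff, 0, 255).astype(np.uint8)
    return diff8 > otsu_threshold(diff8)
```

Taking the maximum difference over the three channels is one simple way to merge them; a variance-normalized difference would be closer in spirit to a Gaussian model.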

Pedestrian Confirmation.
Haar and HOG are two effective features for describing pedestrians. In each frame, to obtain higher performance, we integrate the HOG feature with an SVM classifier and the Haar feature with an AdaBoost classifier (shown in Figure 2) to determine whether an image patch contains a pedestrian. An image patch is considered to contain a pedestrian if either classifier outputs yes.

Multipedestrian Tracking with TLD
The output of pedestrian classification in frame $j$, the detection set $D^{(j)}$, serves as the input of the TLD algorithm, which solves the automatic selection of multiple objects. Since pedestrians appear in and disappear from the camera view at random, management of the multiobject list, including inserting, deleting, and maintaining entries, becomes important.

Single Pedestrian Tracking with TLD.
Single pedestrian tracking with TLD includes three components, namely, tracking, learning, and detection [5]. Data preparation is the first step in implementing the TLD framework. Given a target bounding box, candidate patches are scanned at different scales, and their overlap with the target is used to choose positive and negative samples. The 10 patches with the largest overlap ratio (above 0.6) are added to the positive sample set. Patches with an overlap ratio below 0.1 are added to the negative sample set. Before being added, each patch is resized to a normalized size.
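The sample selection described above can be sketched as follows. This is a minimal illustration that assumes boxes are `(x, y, w, h)` tuples and that the overlap ratio is the usual intersection area divided by union area. The function names are hypothetical; the thresholds 0.6 and 0.1 and the limit of 10 positives come from the text.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def select_samples(target, candidates, n_pos=10, pos_thr=0.6, neg_thr=0.1):
    """Split candidate boxes into positive and negative samples by overlap."""
    scored = [(overlap_ratio(target, c), c) for c in candidates]
    # Keep the n_pos best-overlapping boxes among those above pos_thr.
    above = sorted((s for s in scored if s[0] > pos_thr),
                   key=lambda s: s[0], reverse=True)
    positives = [c for _, c in above[:n_pos]]
    negatives = [c for r, c in scored if r < neg_thr]
    return positives, negatives
```

Boxes whose overlap falls between the two thresholds are used for neither set, which keeps ambiguous patches out of the training data.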
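Thus the model of the $i$th object in frame $j$ can be expressed as a set of positive and negative patches, $M = \{p_1^{+}, \ldots, p_m^{+}, p_1^{-}, \ldots, p_n^{-}\}$ (notation as in [5]), where $p_m^{+}$ is the positive sample added last so far. After learning initialization, we can use the TLD framework shown in [5]. In the model update process, the relative similarity $S(a, b)$ is used to measure the similarity between objects $a$ and $b$. The similarity between an image patch $p$ and the set $M$ is calculated as in (4)-(7):
$$
\begin{aligned}
S(p_i, p_j) &= 0.5\,\bigl(\mathrm{NCC}(p_i, p_j) + 1\bigr),\\
S^{+}(p, M) &= \max_{p_k^{+} \in M} S(p, p_k^{+}),\\
S^{-}(p, M) &= \max_{p_k^{-} \in M} S(p, p_k^{-}),\\
S^{r}(p, M) &= \frac{S^{+}(p, M)}{S^{+}(p, M) + S^{-}(p, M)}.
\end{aligned}
\tag{4-7}
$$
The output of the integrator is measured by the conservative similarity
$$
S^{c}(p, M) = \frac{S^{+}_{50\%}(p, M)}{S^{+}_{50\%}(p, M) + S^{-}(p, M)},
$$
where $S^{+}_{50\%}(p, M)$ is the similarity to the first 50% of the positive patches.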
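As an illustration of the nearest-neighbor similarities used by TLD [5], the sketch below computes a relative and a conservative similarity from normalized cross-correlation (NCC). It assumes that all patches are already resized to the same shape and stored as NumPy arrays, and that the positive list is ordered from oldest to newest. The function names are hypothetical.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def similarity(a, b):
    """Patch similarity mapped to [0, 1]."""
    return 0.5 * (ncc(a, b) + 1.0)

def _ratio(s_pos, s_neg):
    total = s_pos + s_neg
    return s_pos / total if total > 0 else 0.0

def relative_and_conservative(patch, positives, negatives):
    """Return (relative, conservative) similarity of a patch to the model."""
    s_pos = max(similarity(patch, p) for p in positives)
    s_neg = max(similarity(patch, n) for n in negatives)
    # Conservative similarity only trusts the earliest half of the positives.
    early = positives[: max(1, len(positives) // 2)]
    s_pos50 = max(similarity(patch, p) for p in early)
    return _ratio(s_pos, s_neg), _ratio(s_pos50, s_neg)
```

A value close to 1 means the patch resembles the positive samples much more than the negative ones; values near 0.5 are ambiguous.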
For each target tracked in frame $j-1$, we use the single-target TLD framework to get its new position if it is visible in frame $j$. If $n$ objects are visible in the current frame, we obtain a new list and record it as $B^{(j)} = \{b_i(x, y, w, h),\ i = 1, 2, \ldots, n\}$. Here $x$, $y$, $w$, and $h$ are the position and size of the bounding box given by TLD.

List Updating for Multiobject Tracking.
Because pedestrians appear and disappear at random, dynamic maintenance of the pedestrian list is key to multipedestrian tracking and counting. For example, we must decide whether a target is new or existing, especially when it is occluded or leaves the field of view and then returns. Assuming that every target is independent, we design a mechanism for updating the tracking list.
We record information such as position, size, and lifespan from the first appearance to the current frame $j$ as the trajectory of a tracked object. The tracking list $T$ is the set of all tracked objects. We denote its $i$th tracking target as $T_i = \{tb, tbold, len, vlen, ivlen, rd, ld\}$. Here, $tb$ is the position and size currently output by TLD in frame $j$, and $tbold$ is the last recorded position and size before the current one. $len$, $vlen$, and $ivlen$ are, respectively, the number of total frames (from first appearance to the current frame), visible frames, and consecutive invisible frames. $ld$ and $rd$ are labels that describe whether a pedestrian has passed the left or right border of the buffer zone (introduced in Section 4). By comparing the current detection set $D^{(j)}$ with the existing trajectory list $T$, the tracking list can be updated.
Our tracking list updating is divided into two steps: correlation matching and list adjustment. For each trajectory, the correlation matching step finds the most similar patch in the detection set $D^{(j)}$. The list adjustment step updates $T$. For matched trajectories, we update their records with the new parameters. Unmatched patches in $D^{(j)}$ are regarded as new pedestrian targets and are added to $T$ as new trajectory elements. Trajectories that have disappeared or remained unmatched beyond a required threshold are considered to have left the video and are removed from $T$. The detailed updating flow of the tracking list is shown in Algorithm 1.
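The two steps can be sketched as follows. This is a simplified illustration and not the exact Algorithm 1. It assumes greedy nearest-centroid matching against the detections only, and the thresholds `max_dist` and `max_invisible` as well as the dictionary layout are hypothetical.

```python
import math

def centroid(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def distance(box_a, box_b):
    """Euclidean distance between the centroids of two (x, y, w, h) boxes."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    return math.hypot(ax - bx, ay - by)

def new_trajectory(box):
    return {"tb": box, "tbold": box, "len": 1, "vlen": 1,
            "ivlen": 0, "ld": 0, "rd": 0}

def update_tracking_list(tracks, detections, max_dist=30.0, max_invisible=10):
    """Match trajectories to detections, then insert, update, or delete."""
    unmatched = list(range(len(detections)))
    kept = []
    for t in tracks:
        t["len"] += 1
        # Correlation matching: closest detection that is still unclaimed.
        best = min(unmatched, default=None,
                   key=lambda k: distance(t["tb"], detections[k]))
        if best is not None and distance(t["tb"], detections[best]) <= max_dist:
            unmatched.remove(best)
            t["tbold"], t["tb"] = t["tb"], detections[best]
            t["vlen"] += 1
            t["ivlen"] = 0
        else:
            t["ivlen"] += 1
        # List adjustment: drop trajectories that stayed invisible too long.
        if t["ivlen"] <= max_invisible:
            kept.append(t)
    # Detections that matched no trajectory start new trajectories.
    kept.extend(new_trajectory(detections[k]) for k in unmatched)
    return kept
```

Greedy matching is order dependent; a global assignment (for example, the Hungarian method) would be a natural refinement when many targets are close together.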

Pedestrian Flow Counting
Tracking trajectories are the source of pedestrian counting. So far, single-line or double-line counting is the popular way to count pedestrians (moving left or right). Considering the requirements of real-time operation, stability, and accuracy, we choose single-line counting with a buffer zone, shown in Figure 3, to count pedestrians.
To avoid direction errors caused by pedestrians wandering near the counting line, a buffer zone is introduced. An image frame has only one buffer zone, which is a small range centered on the counting line (the center line of the video). The direction count is updated only when a target crosses the whole zone.
The margins of the buffer range are its left and right borders. The rule for counting right-moving or left-moving pedestrians is based on these two margins and on the border labels of each target. Once a target has been counted, its two border labels are reset to zero immediately.
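One possible reading of this counting rule is sketched below. It assumes that a target is represented by the x coordinate of its centroid, that the label `ld` is set when the target enters the zone through the left border, and that `rd` is set when it enters through the right border. A count is made only when the target leaves through the opposite border. All names and the exact crossing test are assumptions made for the example.

```python
def update_count(track, prev_x, cur_x, x_left, x_right, counts):
    """Update border labels and direction counts for one frame step."""
    if cur_x > prev_x:                        # moving right
        if prev_x < x_left <= cur_x:
            track["ld"] = 1                   # entered through the left border
        if prev_x < x_right <= cur_x:
            if track["ld"]:
                counts["R"] += 1              # crossed the whole zone rightward
            track["ld"] = track["rd"] = 0     # reset labels on leaving the zone
    elif cur_x < prev_x:                      # moving left
        if cur_x < x_right <= prev_x:
            track["rd"] = 1                   # entered through the right border
        if cur_x < x_left <= prev_x:
            if track["rd"]:
                counts["L"] += 1              # crossed the whole zone leftward
            track["ld"] = track["rd"] = 0

def count_path(xs, x_left, x_right):
    """Count one target whose centroid x positions over time are xs."""
    track = {"ld": 0, "rd": 0}
    counts = {"L": 0, "R": 0}
    for prev_x, cur_x in zip(xs, xs[1:]):
        update_count(track, prev_x, cur_x, x_left, x_right, counts)
    return counts
```

With a zone from 40 to 60, a target that steps in from the left and turns back is never counted, whereas a target that exits on the far side is counted exactly once.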

Experiments and Analysis
Two experiments are designed to test the performance of our multiobject tracking and pedestrian counting on the PETS database [21]. PETS is a public dataset consisting of multisensor sequences that contain different crowd activities. It has been widely used to test the performance of new and existing pedestrian detection and tracking systems in real-world environments. We select ten clips from PETS for pedestrian counting. The longest clip has 1189 frames, and the shortest has 276 frames.

The Evaluation of Multitarget Tracking.
MOTA (multiple object tracking accuracy) and MOTP (multiple object tracking precision) [22] are adopted to evaluate the tracking process. MOTP evaluates the alignment of tracks with the ground truth. It measures the precision with which objects are located, using the intersection of the estimated region with the ground truth region. MOTA combines all missed targets, false positives, and identity mismatches and is normalized by the total number of targets (100% corresponds to no errors) [17]. Table 1 presents the comparison of our method with some state-of-the-art approaches for multiobject tracking.
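For reference, the two metrics can be computed from per-sequence totals as in the sketch below. It assumes that the matching between tracker output and ground truth has already been done, so that the error counts and the overlap ratios of the matched pairs are available. The function and parameter names are hypothetical.

```python
def mota(misses, false_positives, mismatches, num_gt):
    """Accuracy: 1 minus the error rate over all ground-truth objects."""
    return 1.0 - (misses + false_positives + mismatches) / num_gt

def motp(overlaps):
    """Precision: mean overlap ratio over all matched track/ground-truth pairs."""
    return sum(overlaps) / len(overlaps)
```

For example, 10 misses, 5 false positives, and 5 identity mismatches over 100 ground-truth objects give a MOTA of 80%. Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.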
From Table 1, we can observe that our tracking approach achieves competitive results, especially in MOTP. For [17,18], MOTP and MOTA do not achieve good values simultaneously: when one metric increases, the other decreases markedly. Moreover, the MOTA of the Berclaz approach is the highest because it uses a probabilistic occupancy map (POM) to create background and detection data. In contrast to our basic foreground segmentation, which uses no prior real data, the POM model uses some empty background images from PETS.
The approach of [19] obtains a good balance between MOTA and MOTP, but its metric values are a little lower than those of our method. The reason may be attributed to the performance of the TLD algorithm. TLD combines learning, detection, and tracking for a tracking task with online parameter updating. To achieve high detection accuracy, the P-N learning paradigm is used to exploit the temporal and spatial structure in the data, leading to mutual compensation of missed detections and false alarms and to improved tracking accuracy and precision. We can also see that [20] achieves slightly higher performance than our method. To use the holography information among multiple views, [20] integrates crowd simulation into the traditional single-camera method. However, the two metric values in [20] are computed by manually assigning the size and position of the initial patch for each tracker. In contrast, our method extracts the foreground and initializes each tracker automatically. This fully automatic step may decrease the performance of our tracking system to some extent. To improve performance, we can use a more suitable detection method and a better background modeling method in the initialization stage in future work.

Table 1: Comparison of our method with state-of-the-art approaches for multiobject tracking.
Algorithm          MOTP     MOTA
Andriyenko [17]    69.0%    63.7%
Berclaz [18]       63.0%    77.0%
Milan [19]         67.2%    67.0%
Jin [20]           72.4%    72.1%
Our method         69.9%    71.4%

Algorithm 1: Updating flow of the tracking list.
Input: detected targets $D^{(j)} = \{D^{(j)}_1, \ldots, D^{(j)}_n\}$ in current frame $j$; tracking list $T$; current bounding box set $B^{(j)} = \{b_1, b_2, \ldots, b_n\}$ given by TLD.
Output: updated tracking list $T$.
(1) Trajectory initialization. Initialize $T$ according to $D^{(j)}$. If the frame number is 1, then return after initialization: for $i = 1$ to $n$, set $T_i.tb = D^{(1)}_i$, $T_i.len = 1$, $T_i.vlen = 1$, $T_i.ivlen = 0$, and the border labels to $\{0\}$. For later frames, set $T_i.tb = b^{(j)}_i$.
(2) Correlation computation. For each trajectory $T_i$, compute the distance similarity to each patch in the detection set $D^{(j)}$, then find its most similar patch $D^{(j)}_h$ according to (10).

Statistics of Pedestrian Flow.
Table 2 displays the results of our pedestrian flow statistics. L and R denote the number of pedestrians walking toward the left or right. Among our 10 clips, clips 1 and 2 have inadequate illumination, clips 7 and 8 contain many pedestrians, and clips 9 and 10 have serious occlusion problems. The other clips are in normal conditions.
From Table 2, it can be observed that our system obtains higher accuracy under normal conditions (especially clips 3 and 5). For clips with crowded pedestrians or frequent occlusion (clips 7, 8, 9, and 10), the accuracy is about 81% to 84%. For clips with inadequate illumination but fewer objects and less occlusion (clips 1 and 2), the accuracy is over 86%. Over all clips, our average statistical accuracy is 87.4%.

Conclusions
To realize multiobject tracking, our system combines the TLD tracking algorithm with dynamic average background modeling. The former can track objects robustly over the long term in real time. The latter, together with a confirmation module, automatically localizes multiple candidate objects, which are further tested by pedestrian detection. By comparing the TLD output with the pedestrian detection results, we can manage pedestrian records easily, including updating parameters, inserting new objects, and deleting objects that have been missing for a long time. Counting with a buffer zone reduces the influence of pedestrians wandering around the counting line. The accuracy and stability of our system are supported by several experiments and analyses. Future work will focus on improving tracking accuracy and robustness for high-density crowds.

Figure 1: The framework of our proposed system.
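Candidate patches are compared with the target by overlap area under different scales to choose positive and negative samples. Within each scale, we obtain a collection of patches by scanning from top to bottom and left to right. We then calculate the overlap ratio between a patch and the target, $r = s_0 / (w_t h_t + w_p h_p - s_0)$, where $s_0$ is the overlap area and $w_t h_t$ and $w_p h_p$ are the areas of the target box and of the patch.
The object model is the set $M = \{p_1^{+}, \ldots, p_m^{+}, p_1^{-}, p_2^{-}, \ldots, p_n^{-}\}$, where $p_1^{+}$ is the first positive patch added to the set.
The similarity $d(a, b)$ between patches $a$ and $b$ is measured by the Euclidean distance between their centroids. If the similarity $d(T_i.tb, D^{(j)}_h)$ satisfies (9), patch $D^{(j)}_h$ is accepted as the closest match to $T_i$. If $D^{(j)}_h$ is found, the trajectory $T_i$ is updated based on $D^{(j)}_h$.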

Table 2: The statistical results of pedestrian flow counting.