Moving Camera-Based Object Tracking Using Adaptive Ground Plane Estimation and Constrained Multiple Kernels

Moving camera-based object tracking for intelligent transportation systems (ITSs) has drawn increasing attention. The unpredictability of driving environments and the noise from camera calibration, however, make conventional ground plane estimation unreliable and adversely affect the tracking result. In this paper, we propose an object tracking system using an adaptive ground plane estimation algorithm, facilitated with constrained multiple kernel (CMK) tracking and Kalman filtering, to continuously update the location of moving objects. The proposed algorithm takes advantage of structure from motion (SfM) to estimate the pose of the moving camera, and the estimated camera yaw angle is then used as feedback to improve the accuracy of the ground plane estimation. To track objects robustly and efficiently under occlusion, the constrained multiple kernel tracking technique is adopted in the proposed system to track moving objects in 3D space (depth). The proposed system is evaluated on several challenging datasets, and the experimental results show favorable performance: the system can not only efficiently track on-road objects from a dashcam mounted on a free-moving vehicle but also handle occlusion well during tracking.


Introduction
Currently, video-based traffic surveillance plays an important role in intelligent transportation systems (ITSs). As more and more people use dashcams while driving, dashcam-based video analysis has become a very important research area, and tracking objects such as pedestrians and vehicles is a crucial and unavoidable task in this field. By tracking pedestrians or vehicles, their movement trajectories can be collected from the video for advanced analysis, such as human or vehicle flow estimation, collision avoidance, abnormal behavior detection, and criminal tracking. Therefore, researchers are motivated to develop an effective tracking system, which not only can track objects in the scene but also is able to collect the information needed for higher-level analysis.
Tracking vehicles and pedestrians in moving cameras is quite challenging for several reasons. First, the appearance of these objects may change greatly due to nonrigid deformation, different viewing perspectives, and other visual attributes. Second, frequent occlusion by other objects in the scene causes severe identity switches. Last but not least, object tracking in a moving camera is more challenging than in static cameras because of the combined effects of rapidly changing lighting conditions, motion blur, and the issues mentioned above. Moreover, many robust and effective object tracking techniques used in static cameras, such as background subtraction and the constant ground plane assumption, cannot be directly applied to moving cameras, making the problem more difficult. Unlike background-based methods that extract moving object blobs under static cameras, object detection is widely used in video analysis under moving cameras. The challenge therefore becomes to successfully detect objects in the moving camera and then apply tracking techniques to the detected ones, the so-called tracking-by-detection scheme. However, when an object is partially or fully occluded, detection does not work well, which in turn degrades the tracking result. Hence, the constrained multiple kernel (CMK) tracking technique is adopted in the proposed system, facilitated with the estimated ground plane and a Kalman filter, to overcome the occlusion issue during tracking.
In this paper, we extend our previous work [1] and propose an efficient and robust 3D object tracking system based on adaptive ground plane estimation, which integrates structure from motion (SfM), object detection, CMK tracking, and a Kalman filter framework. The proposed system begins with object detection and structure from motion for estimating the camera pose. Then, adaptive ground planes are estimated based on the camera motion, and the 3D locations of the objects relative to the camera can be inferred. By taking the 3D information into account, the CMK tracking method is used to overcome the occlusion issue during tracking. Hence, the proposed system can not only handle occlusion but also estimate a reliable ground plane simultaneously. Figure 1 shows an example of the tracked objects on the estimated ground plane (the red squares on the ground). The number above each bounding box represents the distance of the detected object from the camera. The remainder of this paper is organized as follows: Section 2 gives a brief survey of the related work. In Section 3, we describe the proposed tracking system. The depth CMK tracking, which includes depth map construction, CMK tracking, hypothesized association, and Kalman filtering, is described in Section 4, and Section 5 presents the adaptive ground plane estimation algorithm. The experimental results are demonstrated in Section 6. Finally, the conclusion of this work is given in Section 7.

Related Work
Recently, ground plane estimation-based tracking methods [2][3][4][5][6] have attracted a lot of attention. By applying the ground plane estimation method to each frame of a video sequence for detecting a reliable ground plane, the relative 3D location of the camera and the objects can be inferred, thereby making the object tracking more robust.
In general, the existing ground plane estimation approaches can be roughly divided into two categories, 2D or 3D, based on the sensor type. Among 2D approaches, homography is the most popular one for ground plane estimation; it relies on feature correspondences computed between every pair of consecutive frames, and the first requisite is to find a set of reliable feature points lying on the ground plane. Usually, corner detectors such as Harris are used to extract features, followed by a robust estimation technique in which the dominant homography is estimated. Arróspide et al. [7] used Kalman filtering and Conrad and DeSouza [8] used modified expectation maximization to build confidence in the ground plane transformation across successive frames. Both methods assumed that the camera can only see the ground plane with objects above it and that the roll angle of the sensor is zero. Homography has also been successfully used as a first step, with the homography decomposition results combined with contour searching [9] or a Bayes filter [10] to estimate the ground plane in 2D images. However, again the ground plane is assumed to be the area in front of the camera, or a single-color ground plane is assumed to occupy the majority of the field of view. Other 2D approaches use depth-image data or the histogram of the disparity map [11] instead of traditional RGB image data [12, 13]. Jin et al. [14] proposed a depth-map-driven ground plane detection method, which grows a plane from the largest area having similar depth values in the depth map; the largest plane is considered to be the ground plane. Kircali and Tek [15] estimated the ground plane by comparing the depth map of each new frame with a precalibrated depth map in which the ground plane was predefined. Skulimowski et al. [16] used the gradient of the V-disparity pixel values to detect a ground plane with an arbitrary camera roll angle.
Furthermore, Cherian et al. [6] reconstructed the depth map from a single RGB image by applying multiple texture-based filters with a Markov random field and estimated the ground plane via texture-based search segmentation. Due to the intrinsic features of the algorithm, this approach assumed that the ground plane has a unique texture and that the camera is parallel to the ground plane. Dragon et al. [17, 18] formulated the ground plane estimation problem as a hidden Markov model (HMM) based on temporal sampling and decomposition of the homography; the decomposition with the highest probability indicates the orientation and ego motion of the camera. Man et al. [19] developed a ground plane estimation approach based on monocular images with a predefined region of interest, which requires a known camera pitch angle. Ground plane estimation methods in 3D commonly utilize depth sensors such as LIDAR [20] or time-of-flight (TOF) cameras [21] to obtain 3D point cloud data, which provide the 3D structure of the environment and can then be used to estimate the ground plane effectively. Borrmann et al. [22] use all points of the 3D point cloud in the calculation, which has a high computational cost. RANSAC-like approaches [23, 24] are not limited in their number of iterations, so their processing time cannot be guaranteed. A less expensive alternative for generating 3D point clouds is the use of a stereo camera, in which case the ground plane can be estimated from disparity [25]. Assuming that the scene is static, monocular approaches for simultaneous localization and mapping (SLAM) can also be used to extract the 3D shape, from which the ground plane can be estimated [26, 27]. Zhang and Czarnuch [28] proposed a perspective ground plane estimation approach that combines the robustness of 2D and 3D data analysis. Other 3D approaches [29][30][31] use the 3D normal vector of each raw data point rather than estimating from the raw points directly.
However, these approaches assume that the camera roll and pitch angles are zero. More recently, machine learning techniques have been used for ground plane estimation, which require minimal orientation variations (i.e., 0°∼15°) [32].
Although the above approaches can successfully detect the ground plane and achieve good experimental results, they are specifically designed to produce only one single ground plane from the available data and are not suitable for the unpredictability of dynamic road conditions. In addition, these approaches do not utilize the estimated camera pose information, even though the camera pose is the most significant factor for representing the ground plane in the scene.
The reliability and accuracy of the ground plane estimation can thus be improved by taking advantage of the camera pose information.
Our proposed tracking system is inspired by the approach in [33], which also mounts a monocular dashcam on a free-moving vehicle. However, because the road conditions change continuously while driving, a ground plane estimated only at the beginning may not be applicable for the entire video sequence; it is therefore very useful to take advantage of the camera pose information estimated in the essential matrix calculation phase. In contrast to most existing ground plane estimation methods, our approach introduces the estimated camera yaw angle as feedback to estimate the ground plane adaptively, which aims to overcome the deficiency of previous methods caused by using a fixed frame window for smoothing the results. Based on the reliably estimated ground plane, we can locate the detected objects in 3D space and combine CMK tracking with the 3D information, so as to deal with partial or full occlusion during tracking.

Overview of the Proposed System
The proposed tracking system is shown in Figure 2. After converting the video from the dashcam into image sequences, two parallel procedures are launched simultaneously. In the structure from motion phase, the proposed system extracts Harris corner features in the current image at time step t and matches them to the features observed in the previous N frames. By using singular value decomposition (SVD), we can estimate the camera's essential matrix for each image frame. Then, according to the camera essential matrix, the ground plane for the entire image sequence can be estimated adaptively, where we assume the dashcam is mounted on the vehicle at a fixed height. Meanwhile, a pretrained object detector is adopted to detect desired objects such as vehicles and pedestrians in the image sequences. In the pose estimation stage, the 2D locations of detected objects are back-projected to 3D locations by using the estimated ground plane. Once the 3D locations of the detected objects are obtained from the pose estimation stage, the depth CMK tracking is applied to track them in the Kalman filter framework. First, for each target, the 3D locations of its candidates are predicted by the Kalman filter prediction. Then, the CMK tracking relocates the candidate's 3D location by maximizing the similarity between the candidate and the target. The Kalman filter is continually updated and finally produces a reliable tracking result. In addition, based on each object's 3D information relative to the camera motion, a depth map can be constructed to represent the relative 3D locations of all the detected objects. Therefore, with the help of the depth information between the targets, the proposed system not only is able to track objects effectively but also can overcome occlusion during tracking.

Robust Feature Extraction.
Ideal ground plane estimation largely depends on the selected image feature detector, which should be invariant to rotation, scale, and image noise. The scale-invariant feature transform (SIFT) [34] is a very effective scale-space feature, but it can be too time-consuming for real-time applications. As for speeded-up robust features (SURFs), despite their lower computational complexity, stability is a major problem because unstable features are often detected even after edge suppression as a post-treatment. The Harris corner detector is thus introduced to address these issues, and it has also been widely studied in previous works [35][36][37][38]. First, its feature extraction is fast enough for real-time applications with reasonable robustness in accuracy. Second, to robustly estimate the ground plane, it is desirable to have more corner points on the ground plane participating in the calculation of the camera parameters. Figure 3 shows an example of using the Harris corner detector to extract feature points. The detected feature points in the current image are marked with green crosses. Feature points that are detected as outliers during processing are marked with red crosses. These points can be matched from one image frame to the next by choosing matches that have the highest cross-correlation of image intensity for the regions surrounding the points. The paths of the feature points are drawn in orange.
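To make the choice concrete, the Harris response described above can be sketched in a few lines of numpy. This is a simplified illustration only: it uses a fixed 3×3 averaging window and omits non-maximum suppression and subpixel refinement; the constant k = 0.04 and the window size are conventional defaults, not values taken from the paper.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2 per pixel,
    where M is the local structure tensor of the image gradients."""
    iy, ix = np.gradient(img.astype(float))
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def box(a):
        # Average the gradient products over a 3x3 window
        p = np.pad(a, 1, mode="edge")
        return sum(p[r:r + a.shape[0], c:c + a.shape[1]]
                   for r in range(3) for c in range(3)) / 9.0

    sxx, syy, sxy = box(ixx), box(iyy), box(ixy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2
```

A corner (strong gradients in two directions) yields a large positive response, an edge a negative one, and a flat region a response near zero, which is why thresholding this map selects corner points suitable for the camera parameter calculation.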

Essential Matrix Calculation.
Camera pose plays a crucial role in ground plane estimation for the entire image sequence, and the computation of the camera yaw angle θ is the key to calculating the camera pose. According to the study in [39], three camera parameters are used to describe two relative poses of a camera moving on a planar surface, i.e., the polar coordinates (ρ, φ) and the yaw angle θ of the second position c₂ relative to the first position c₁ (see Figure 4).
In addition, we can set ρ = v·Δt, where v is the velocity of the vehicle and Δt is the transition time between the two positions c₁ and c₂. Therefore, only two parameters (φ, θ) need to be calculated. Moreover, according to the Ackermann steering principle, circular motion about the instantaneous center of rotation (ICR) can be used to describe the motion of a camera mounted on a vehicle; straight driving is represented by a circle of infinite radius. With this assumption, we can easily obtain φ = θ/2. Thus, there is only one parameter left, the camera yaw angle θ, that needs to be calculated.
As is well known, the essential matrix can be represented by the rotation matrix R and the translation vector T, which are related to the camera pose. Considering that the camera moves on the (x, y) plane and rotates around the z axis, we have

$$R = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad T = \rho \begin{bmatrix} \cos\varphi \\ \sin\varphi \\ 0 \end{bmatrix}. \tag{1}$$

Given two corresponding points p and p′, represented as $p = [x, y, z]^{\top}$ and $p' = [x', y', z']^{\top}$ in image coordinates, they must satisfy the epipolar constraint

$$p'^{\top} E\, p = 0, \tag{2}$$

where E is the essential matrix defined as $E = [T]_{\times} R$. Note that R is the rotation matrix defined in (1), and $[T]_{\times}$ denotes the skew-symmetric matrix

$$[T]_{\times} = \rho \begin{bmatrix} 0 & 0 & \sin\varphi \\ 0 & 0 & -\cos\varphi \\ -\sin\varphi & \cos\varphi & 0 \end{bmatrix}. \tag{3}$$

Then, using the constraint φ = θ/2 together with equations (1) and (3), we obtain the essential matrix of a camera moving on a planar surface:

$$E = \rho \begin{bmatrix} 0 & 0 & \sin(\theta/2) \\ 0 & 0 & -\cos(\theta/2) \\ \sin(\theta/2) & \cos(\theta/2) & 0 \end{bmatrix}. \tag{4}$$

Substituting (4) into (2), every pair of corresponding image points contributes the homogeneous equation

$$\sin(\theta/2)\,(x'z + xz') + \cos(\theta/2)\,(yz' - y'z) = 0. \tag{5}$$

The rotation angle θ between a pair of successive images can be obtained from (5) as

$$\theta = 2\tan^{-1}\!\left(\frac{y'z - yz'}{x'z + xz'}\right). \tag{6}$$

Alternatively, given m corresponding image points, θ can be estimated by solving linearly for the vector $[\sin(\theta/2), \cos(\theta/2)]^{\top}$ using SVD. To this end, an m × 2 data matrix D is first formed, where each row contains the two coefficients of equation (5):

$$D = \begin{bmatrix} x'_1 z_1 + x_1 z'_1 & y_1 z'_1 - y'_1 z_1 \\ \vdots & \vdots \\ x'_m z_m + x_m z'_m & y_m z'_m - y'_m z_m \end{bmatrix}. \tag{7}$$

Then, the matrix D is decomposed using SVD:

$$D = U \Sigma V^{\top}, \tag{8}$$

where the columns of $V_{2\times 2}$ contain the eigenvectors $e_i$ of $D^{\top}D$. The eigenvector $e^* = [\sin(\theta/2), \cos(\theta/2)]^{\top}$ corresponding to the minimum eigenvalue minimizes the sum of squared residuals subject to $\|e^*\| = 1$. Finally, the camera yaw angle θ can be estimated from $e^*$.
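The SVD-based solution of the yaw angle can be sketched as follows. This is an illustrative reconstruction rather than the paper's implementation: it assumes homogeneous image coordinates and the planar-motion model with φ = θ/2, and the function name `estimate_yaw` and its interface are ours.

```python
import numpy as np

def estimate_yaw(p, p2):
    """Estimate the camera yaw angle theta from point correspondences under
    the planar-motion model (phi = theta/2) by solving linearly for
    [sin(theta/2), cos(theta/2)] with an SVD.
    p, p2: (m, 3) arrays of homogeneous points in the two frames."""
    x, y, z = p[:, 0], p[:, 1], p[:, 2]
    x2, y2, z2 = p2[:, 0], p2[:, 1], p2[:, 2]
    # Each row holds the coefficients of sin(theta/2) and cos(theta/2)
    # in the homogeneous epipolar equation
    D = np.stack([x2 * z + x * z2, y * z2 - y2 * z], axis=1)
    _, _, vt = np.linalg.svd(D)
    s, c = vt[-1]  # right singular vector of the smallest singular value
    return 2.0 * np.arctan2(s, c)
```

The singular vector is only determined up to sign, so the returned angle may differ from the true yaw by 2π; normalizing it to (−π, π] resolves the ambiguity.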

Object Detection.
Object detection is the first step in tracking-by-detection schemes, and the accuracy of object detection largely determines the quality of the tracking system. Unlike detection under static cameras, object detection under moving cameras is more challenging due to the dynamic background, illumination changes, and so on. Because the background is constantly changing, methods based on background extraction are no longer applicable for mobile cameras. Therefore, pretrained object detectors have been widely studied in recent years. The work in [40] proposes a human detector using the histogram of oriented gradients (HOG) as features, which can effectively represent the shape of a human. The deformable part model (DPM) [41] extends the concept of [40], using a root template and several part templates to describe different partitions of the object; the part templates are spatially connected to the root template according to a predefined geometry, thereby depicting the object accurately. In the latest research, convolutional neural network (CNN)-based object detectors have drawn increasing attention and achieved favorable performance, detecting hundreds of object classes with high accuracy. In this paper, the objects to be detected and tracked are mainly pedestrians and vehicles, which should move on the estimated ground plane; in fact, they can be any objects on the road, such as bicycles and animals. In order to avoid detecting other false objects in the field of view, we adopt the state-of-the-art pretrained YOLOv3 detector [42], which uses advanced CNN technology to help detect pedestrians and vehicles. The detector can be embedded independently in the proposed system to perform object detection. To track objects efficiently, the tracking procedure is launched only when an object has been detected in five consecutive image frames; otherwise, the detection is considered a false alarm.
Furthermore, the detected objects are refined by morphological operations to accurately locate their positions.
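The five-consecutive-frame rule used above to suppress false alarms can be sketched as a small gating class. This is a minimal sketch under stated assumptions: cross-frame identity association (e.g., by bounding-box overlap) is assumed to happen upstream, and `DetectionGate` is a hypothetical helper name, not part of the paper's system.

```python
from collections import defaultdict

class DetectionGate:
    """Confirm an object for tracking only after it has been detected in
    `confirm` consecutive frames (five in this work); shorter runs are
    treated as false alarms and discarded."""
    def __init__(self, confirm=5):
        self.confirm = confirm
        self.streak = defaultdict(int)  # provisional id -> consecutive hits

    def update(self, detected_ids):
        """Feed this frame's detections; returns the ids confirmed so far."""
        detected_ids = set(detected_ids)
        for oid in list(self.streak):
            if oid not in detected_ids:  # streak broken: drop as false alarm
                del self.streak[oid]
        for oid in detected_ids:
            self.streak[oid] += 1
        return {oid for oid, n in self.streak.items() if n >= self.confirm}
```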

Depth CMK Tracking
In this section, we describe how to track objects with constrained multiple kernels (CMKs) in 3D space under the framework of the Kalman filter. The depth CMK tracking is triggered to track an object once its 3D location is obtained from the pose estimation stage (see Figure 2). In other words, we associate the objects in the current frame with the detected objects in the next frame, facilitated with Kalman filtering. Moreover, with the help of the depth information, we can obtain the relative 3D locations between the objects to overcome occlusion during tracking. By effectively combining depth information and CMK tracking in the Kalman filter framework, the proposed system can not only track objects effectively but also handle occlusion well during tracking.

Depth Map Construction.
A depth map can be constructed based on the 3D locations of the detected objects, representing the relative 3D locations of all the tracked objects. Figure 5 shows an example of the depth map, where Figure 5(a) shows the detected objects and Figure 5(b) shows the corresponding depth map. The depth map depicts the relative distance between each detected object and the camera; higher intensity (brighter) means that the detected object is closer to the camera. By using the depth map, we can roughly assess whether an object is occluded by other objects based on its visibility $v_i \in [0, 1]$:

$$v_i = \frac{A_i^{\mathrm{vis}}}{A_i}, \tag{9}$$

where $A_i$ is the area of the $i$-th target's region in the image and $A_i^{\mathrm{vis}}$ is the portion of that area not covered by closer objects in the depth map. If $v_i = 1$, the $i$-th target is totally visible; if $0 < v_i < 1$, it is partially occluded; otherwise, it is fully occluded by other targets. As shown in Figure 5(a), all five objects are totally visible, so the visibility of each is set to $v_i = 1$.
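The visibility check against the depth map can be illustrated with axis-aligned boxes. This sketch rasterizes the boxes from far to near, so each pixel keeps the nearest object, and then measures the uncovered fraction of each box; it is a simplified stand-in for the paper's depth-map computation, with illustrative canvas dimensions.

```python
import numpy as np

def visibilities(boxes, depths, h=240, w=320):
    """Visibility v_i in [0, 1] for each target: the fraction of its box not
    covered by any closer (smaller-depth) target.
    boxes: list of (x0, y0, x1, y1) in pixels; depths: distance to camera."""
    # Paint targets far-to-near so nearer objects overwrite farther ones
    order = np.argsort(depths)[::-1]
    canvas = np.full((h, w), -1, dtype=int)
    for i in order:
        x0, y0, x1, y1 = boxes[i]
        canvas[y0:y1, x0:x1] = i
    v = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        area = max((x1 - x0) * (y1 - y0), 1)
        v.append(np.count_nonzero(canvas[y0:y1, x0:x1] == i) / area)
    return v
```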

CMK Tracking.
In traditional kernel-based tracking, a histogram including spatial and color information is usually used to represent the target and candidate models. During histogram extraction, the contribution of a pixel is determined by the distance between the pixel and the kernel center. In [43], the tracking problem of maximizing the similarity simi(x) is formulated as locating the x that maximizes the probability density function (pdf) f(x):

$$f(x) = \sum_i \omega_i\, k\!\left(\left\|\frac{x - z_i}{h}\right\|^2\right), \tag{10}$$

where x is the kernel center; the subscript i denotes each pixel location inside the kernel; k(·) is a kernel function with a convex and monotonically decreasing kernel profile; z_i and ω_i are the position under consideration and the weight of a pixel, respectively; and h is the bandwidth of the kernel. After back-projecting the 2D locations of the detected objects to 3D locations in the pose estimation stage, we use the depth CMK tracking technique to track them. The objective of depth CMK tracking is to find the candidate model that has the highest similarity to the target model, which is composed of multiple kernels with prespecified constraints in 3D space. For an object described by N_k kernels, the total cost function J(X) is defined as the sum of the N_k individual kernel cost functions J_k(X), each inversely proportional to the similarity:

$$J(X) = \sum_{k=1}^{N_k} J_k(X), \qquad J_k(X) = 1 - \mathrm{simi}_k(X), \tag{11}$$

where simi_k(X) is the similarity function at the location X ∈ R³. In addition, the constraint function C(X) is used to confine the kernels according to their spatial interrelationships; in order to maintain the relative location of each kernel, the constraint function must satisfy C(X) = 0. Thus, the problem is further formulated as

$$\min_X J(X) \quad \text{subject to} \quad C(X) = 0. \tag{12}$$

However, when the object is occluded by other objects, not all of its kernels can be used for matching. To overcome this issue, we assign an adaptively adjustable weight w_k to each kernel within the object, so the cost function for the i-th target becomes

$$J^i(X) = \sum_{k=1}^{N_k} w^i_k\, J^i_k(X). \tag{13}$$

Taking the depth information into account, the visibility of each object can be used as a weight in the global optimization. In other words, the total cost function in (11) becomes

$$J(X) = \sum_{i=1}^{N_q} v_i \sum_{k=1}^{N_k} w^i_k\, J^i_k(X), \tag{14}$$

where N_q is the number of objects in the q-th image frame and w^i_k is a weight proportional to the similarity of the k-th kernel of the i-th target.
At the same time, the constraint functions C(X) = 0 must be maintained to preserve the relative locations of the kernels. Figure 6(a) shows an example of an object described by a 2-kernel layout in 2D space.
Unlike the work in [44], which sets the constraints in 2D space, the constraints in this paper are based on 3D geometry. Without loss of generality, we discuss the 2-kernel case shown in Figure 6(b), but it can easily be extended to the multikernel case. To represent an object in 3D space, we define an object plane (−n_q, π_q) for the object in the q-th image frame, where n_q is the normal vector and π_q the offset of the plane. To set the constraints properly, we first compute two auxiliary vectors, u_q = −n_q × g_q and u_{1,2} = X₁ − X₂. First, the distance between the two kernel centers should remain at the initial distance L, which implies

$$\|u_{1,2}\| = \|X_1 - X_2\| = L. \tag{15}$$

Second, the angle ϕ_q between u_q and u_{1,2} and the angle ς_q between −n_q and u_{1,2} should be kept constant as well:

$$\cos\phi_q = \frac{u_q \cdot u_{1,2}}{\|u_q\|\,\|u_{1,2}\|} = \text{const}, \qquad \cos\varsigma_q = \frac{-n_q \cdot u_{1,2}}{\|n_q\|\,\|u_{1,2}\|} = \text{const}. \tag{16}$$

These constraints bind the kernels of the object to each other in 3D space during tracking. As shown in Figure 7(a), the constraint ϕ_q restricts the left-right movement of the kernels, and the constraint ς_q restricts the forward-backward movement of the kernels, as shown in Figure 7(b).
In order to gradually decrease the total cost function while keeping the constraints satisfied during the candidate model search, the projected gradient method in [45] is adopted to iteratively solve the constrained optimization problem. The basic idea is to project the movement vector δ_X, i.e., the gradient vector of J(X), onto two orthogonal spaces: one associated with decreasing the total cost function, and the other responsible for satisfying the constraint function C(X) = 0:

$$\delta^A_X = -\alpha\left(I - \nabla C\,(\nabla C^{\top}\nabla C)^{-1}\nabla C^{\top}\right)\nabla J(X), \qquad
\delta^B_X = -\nabla C\,(\nabla C^{\top}\nabla C)^{-1}\,C(X),$$

where α is the search step size; I is a 3N_q × 3N_q identity matrix; and C(X) = [c₁(X), …, c_m(X)]^T consists of the m constraint functions c_j(X). In the proposed weighted version, a 3N_k × 3N_k diagonal matrix V representing the visibility of the kernels in the object and a 3N_q × 3N_q diagonal matrix W representing the similarity of the objects are further applied to weight the gradients. As proved in [44], δ^A_X and δ^B_X have the following three characteristics. First, δ^A_X and δ^B_X are orthogonal to each other. Second, moving along δ^A_X decreases the total cost function J(X) while keeping the values of the constraint function C(X) unchanged. Third, moving along δ^B_X lowers the absolute values of the constraint function C(X). Owing to these three characteristics, the optimal solution can be reached iteratively. The iteration stops when either the cost function and the absolute values of the constraints are both lower than given thresholds ε_j and ε_c, respectively, or the iteration count exceeds a threshold T (Algorithm 1 in [44]).
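The projected-gradient update can be illustrated on a toy two-kernel problem. This is a minimal numpy sketch under assumed quantities: a quadratic stand-in cost (the real cost comes from kernel similarity), a single distance constraint between the two kernel centers, and step size and iteration count that are illustrative, not the paper's settings.

```python
import numpy as np

def projected_gradient_step(X, grad_J, C, grad_C, alpha=0.05):
    """One projected-gradient step: delta_A moves against the cost gradient
    within the tangent space of the constraint; delta_B pulls the solution
    back toward the constraint surface C(X) = 0."""
    G = grad_C.reshape(-1, 1)                         # constraint Jacobian
    P = np.eye(len(X)) - G @ np.linalg.inv(G.T @ G) @ G.T
    delta_A = -alpha * P @ grad_J
    delta_B = -(G @ np.linalg.inv(G.T @ G) * C).ravel()
    return X + delta_A + delta_B

def solve(targets, L, X0, iters=200):
    """Toy problem: two 3D kernel centers X = [X1, X2]; pull each center
    toward its best-similarity position while keeping ||X1 - X2|| = L."""
    X = X0.copy()
    for _ in range(iters):
        grad_J = 2.0 * (X - targets)                  # grad of ||X - targets||^2
        d = X[:3] - X[3:]
        C = np.linalg.norm(d) - L                     # distance constraint
        u = d / np.linalg.norm(d)
        grad_C = np.concatenate([u, -u])
        X = projected_gradient_step(X, grad_J, C, grad_C)
    return X
```

Because delta_A lies in the tangent space of the constraint, the kernel spacing is preserved at every step while the cost steadily decreases, mirroring the behavior proved in [44].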

Hypothesized Association.
Due to occlusion or unreliable detection, an object may not be detected for a few frames, so some tracked targets cannot be successfully associated with detections in subsequent frames. A hypothesized association, located by the CMK tracking with the best color similarity, is inserted to consistently track a nonassociated target. Inserting hypothesized associations not only improves the detection rate but also helps to track the target continuously. When an object is occluded, we can predict its 3D location by taking advantage of its 3D information, and a hypothesized association is then used to stand in for a possible detection. On the other hand, if a tracked target cannot be successfully associated with a detection for several frames (empirically set to five frames in this work), the target is considered a missed target.

Kalman Filter Prediction and Update.
The Kalman filter is a classical recursive state estimation method that predicts and updates the mean and covariance of the state of a linear dynamic system. Most tracking problems can be formulated as state estimation problems: the tracked target can be regarded as a state, and the tracking problem is to predict and locate where the target (state) will appear at the next time step. For this reason, the Kalman filter is widely used to solve tracking problems. The traditional Kalman filter is defined as

$$x_t = F_t x_{t-1} + w_{t-1}, \tag{17}$$

$$y_t = H_t x_t + v_t, \tag{18}$$

where x_t ∈ Rⁿ and y_t ∈ Rᵐ denote the state and measurement vectors at time step t, respectively; F_t is the state transition matrix; H_t is the measurement matrix; and w_{t−1} ∼ N(0, Q) and v_t ∼ N(0, R) are the system and measurement noise, two uncorrelated Gaussian white-noise sequences with covariance matrices Q and R, respectively. In the prediction stage, the predictions of the state and error covariance are

$$\hat{x}^-_t = F_t \hat{x}_{t-1}, \qquad P^-_t = F_t P_{t-1} F^{\top}_t + Q. \tag{19}$$

After the measurement is obtained, the Kalman filter is updated as

$$K_t = P^-_t H^{\top}_t \left(H_t P^-_t H^{\top}_t + R\right)^{-1}, \tag{20}$$

$$\hat{x}_t = \hat{x}^-_t + K_t \left(y_t - H_t \hat{x}^-_t\right), \qquad P_t = \left(I - K_t H_t\right) P^-_t. \tag{21}$$

The implementation of the Kalman filter algorithm is formulated as follows.

The state vector is defined as $x_t = [x_t, y_t, \dot{x}_t, \dot{y}_t, a_t, b_t]^{\top}$ and the measurement vector as $y_t = [x_t, y_t, a_t, b_t]^{\top}$, where $(x_t, y_t)$, $(\dot{x}_t, \dot{y}_t)$, and $(a_t, b_t)$ denote the object position, velocity, and size, respectively. Hence, the initial state transition matrix $F_t$ (a constant-velocity model) and the measurement matrix $H_t$ are defined as

$$F_t = \begin{bmatrix} 1 & 0 & \Delta t & 0 & 0 & 0 \\ 0 & 1 & 0 & \Delta t & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \qquad
H_t = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}. \tag{22}$$
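A minimal constant-velocity Kalman tracker over such a position-velocity-size state can be sketched as follows; the noise covariances and initial uncertainty below are illustrative defaults rather than the paper's tuned values.

```python
import numpy as np

class KalmanTracker:
    """Constant-velocity Kalman filter over the state [x, y, vx, vy, a, b]
    with measurements [x, y, a, b] (position and size)."""
    def __init__(self, x0, dt=1.0, q=1e-2, r=1.0):
        self.x = np.asarray(x0, dtype=float)
        self.P = np.eye(6)
        self.F = np.eye(6)
        self.F[0, 2] = self.F[1, 3] = dt          # position += velocity * dt
        self.H = np.zeros((4, 6))
        self.H[0, 0] = self.H[1, 1] = self.H[2, 4] = self.H[3, 5] = 1.0
        self.Q = q * np.eye(6)                    # process noise covariance
        self.R = r * np.eye(4)                    # measurement noise covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x

    def update(self, y):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ (np.asarray(y, float) - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x
```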

State Transition Matrix Update.
In addition, the size of an object in the image sequence will probably change as it moves toward or away from the camera, and the extracted color histogram used for similarity measurement is highly dependent on the kernel size. On the other hand, when multiple kernel tracking is performed, the segmentation result may no longer be reliable for estimating the similarity due to occlusion. Hence, the state transition matrix needs to be modified adaptively to reflect the potential size changes, so we embed the kernel size factor into the matrix F_t, where β is the step size, which also contains a smoothing factor, and ∇f(h) is the derivative of the pdf with respect to the kernel bandwidth h. The predicted size of the object is then adjusted accordingly. If the object is occluded so severely that the average similarity value of all kernels drops below a certain threshold, the state transition matrix update stops and F_t returns to the default setting in (22).

[ALGORITHM 1: Adaptive ground plane estimation. Input: image sequence; Output: ground plane (g_k, φ_k). (1) Initialize the frame window N = 30. (2) Load a new frame f_k, where k is the index of the input frame. …]

Measurement Noise Covariance Matrix Update.
We use the object tracking result as a measurement to update the Kalman filter during tracking. Although the system is robust under occlusion thanks to multiple kernel tracking, it still needs a mechanism to avoid errors caused by incorrect measurements. It can be seen from (19) and (20) that the Kalman gain K_t not only controls the tradeoff between the prediction and the measurement but is also inversely proportional to the measurement noise covariance matrix R. Hence, we can adaptively adjust the contribution of the measurement by changing the covariance matrix according to the total cost function J(X) of all kernels, the predefined variance value σ², and the width w and height h of the kernel. With the help of the adaptive covariance matrix, if the total similarity between the candidate and the target is high, the diagonal terms of the covariance matrix become small. In this way, the Kalman gain takes a larger value, which moves the updated state closer to the reliable measurement.
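The effect of this adaptive measurement noise can be sketched with a simple scaling rule. The linear form below (R proportional to the total kernel cost J(X)) is an illustrative stand-in for the paper's exact update, whose expression involves σ² and the kernel width and height.

```python
import numpy as np

def adaptive_R(J, sigma2=1.0, dim=4):
    """Scale the measurement noise covariance by the total kernel cost J(X):
    a high cost (low similarity) inflates R so the filter trusts the
    prediction more; a low cost shrinks R so it trusts the measurement.
    The linear scaling J * sigma2 is an illustrative choice."""
    return max(J, 1e-6) * sigma2 * np.eye(dim)
```

Plugging such an R into the gain computation of equation (20) makes K_t shrink when the tracked candidate matches the target poorly, which is exactly the behavior described above.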

Adaptive Ground Plane Estimation
Due to the unpredictability of driving road conditions, a ground plane estimated at the beginning may not be suitable for the entire image sequence. Therefore, the ground plane needs to be continuously reestimated based on the dynamic road conditions. In [33], the ground plane is reestimated and its parameters smoothed every f_g = 200 frames to mitigate the adverse impact of camera calibration noise. However, using a fixed number of frames for estimating the ground plane can hurt measurement accuracy when the camera is moving along a curve. In this paper, we propose to update the ground plane every single frame, based on an adaptively chosen window of N frames for parameter smoothing, by taking advantage of the camera yaw angle calculated in the essential matrix calculation phase. The adaptive ground plane estimation algorithm is shown in Algorithm 1.
In the algorithm, θ_k is the camera yaw angle at the k-th frame, and (g_k, φ_k) is the ground plane at the k-th frame, where g_k ∈ R³ is the normal vector and φ_k ∈ R is the offset of the plane. D is a 4 × f_N matrix whose columns are the f_N ground planes estimated from each pair of consecutive frames:

$$D = \begin{bmatrix} g_1 & g_2 & \cdots & g_{f_N} \\ \varphi_1 & \varphi_2 & \cdots & \varphi_{f_N} \end{bmatrix}.$$

Due to noisy camera calibration and the unpredictability of road conditions, some ground planes (g_q, φ_q) may be unreliable; therefore, robust principal component analysis (RPCA) [46] is applied to decompose a low-rank 4 × f_N matrix A from D. The mean vector (g_k, φ_k) of the low-rank matrix, derived from those f_N consecutive frames, is taken as our final ground plane, which is more robust to the noise contributed by the camera calibration and essential matrix calculation stage (see Section 3.2). Figure 8 shows an example of using a set of ground planes {(g_q, φ_q) | q = 1, …, f_N} to estimate the final ground plane (g_k, φ_k).
The gray planes are the image frames converted from the driving recorder, and H is the camera height. The final ground plane for the f_N consecutive frames (dotted plane) is obtained from the set of per-frame ground planes (solid planes).
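The smoothing step can be sketched as follows. To keep the sketch short, a truncated SVD stands in for the full RPCA decomposition of [46] (e.g., an inexact-ALM solver); both extract a low-rank part from the stacked per-frame estimates, but RPCA additionally isolates sparse outliers explicitly.

```python
import numpy as np

def smooth_ground_plane(D, rank=1):
    """Smooth f_N per-frame ground-plane estimates.

    D is a 4 x f_N matrix whose columns are (g_q, phi_q) estimates from
    consecutive frame pairs.  The paper decomposes D into a low-rank
    matrix A plus sparse outliers via RPCA; here a truncated SVD serves
    as a simpler low-rank stand-in.  The column mean of A is taken as
    the final plane (g_k, phi_k).
    """
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    A = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]  # low-rank part of D
    plane = A.mean(axis=1)                              # mean column = (g_k, phi_k)
    g, phi = plane[:3], plane[3]
    g = g / np.linalg.norm(g)                           # renormalize the normal
    return g, phi

# ten near-identical plane estimates with one corrupted column
base = np.array([0.0, 1.0, 0.0, 1.65])      # upward normal, camera height 1.65 m
D = np.tile(base.reshape(4, 1), (1, 10))
D[:, 5] += np.array([0.5, 0.0, 0.0, 0.5])   # one noisy estimate
g, phi = smooth_ground_plane(D)
```

The recovered normal stays close to the dominant (upward) direction despite the corrupted column, which is the robustness property the RPCA stage provides.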

Experiment Results
In this section, we show experimental results of the proposed system on the Kitti dataset [47], which was captured with high-quality dash cameras and provides camera pose ground truth and GPS information. We test eight sequences (see Figure 9(a)), which are relatively short, and in most of them the vehicle drives on a curvy road. Figure 9(b) shows the corresponding ground plane estimation results obtained with our proposed method. We also test two self-recorded video sequences captured around the University of Washington (UW) campus using a driving recorder mounted at a fixed height of 1650 mm. A more complex scenario from the ETHMS dataset, which includes multiple pedestrians in one scene, is also tested. Table 1 shows the configurations of the tested videos.

The Relative Angular and Distance Errors.
To demonstrate the accuracy of our proposed adaptive ground plane estimation, we compare its performance on the Kitti dataset with three different methods: the method in [4] is a stereo algorithm based on a graphical model; the method in [17] formulates ground plane estimation as a continuous-state hidden Markov model whose hidden state contains the ground plane; and the method in [33] adopts the simultaneous localization and mapping (SLAM) technique and estimates the ground plane using a constant number of frames.
Following [17], the average relative angular error and distance error of the camera's motion are used to evaluate the accuracy of the ground plane estimation. For the performance measurement, we calculate the camera poses and compare them with the given camera pose ground truth. The average relative angular and distance errors, normalized by the path length, are given in Tables 2 and 3, respectively. The tables show that our approach outperforms the method in [33] in relative angular error and achieves comparable relative distance error.
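The per-pair error computation can be sketched as below: the angular error is the geodesic angle between the estimated and ground-truth relative rotations, and the distance error is the translation discrepancy, which the caller then averages and normalizes by path length. This is a standard formulation, given here only as an illustration of the metric.

```python
import numpy as np

def relative_errors(R_est, t_est, R_gt, t_gt):
    """Angular error (degrees) and distance error between an estimated
    and a ground-truth relative camera pose.  The caller averages these
    over the sequence and normalizes by path length."""
    # geodesic rotation error: rotation angle of R_est^T @ R_gt
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    ang_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    dist_err = np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))
    return ang_err, dist_err

# example: identity estimate vs. a 10-degree yaw rotation of the ground truth
theta = np.radians(10.0)
R_gt = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0,            0.0,           1.0]])
ang, dist = relative_errors(np.eye(3), np.zeros(3), R_gt, np.array([1.0, 0.0, 0.0]))
```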
That is because the estimated ground plane becomes more reliable after applying the adaptive ground plane estimation algorithm. Unlike the method in [33], which uses a constant number of frames to estimate the ground plane, our proposed method takes advantage of the estimated yaw angle of the camera pose to counteract the adverse effects of changing road conditions. Compared with the method in [17], our scheme also performs better, except for the angular error on datasets 1 and 5; similarly, it performs better than the method in [4] except for the distance error on dataset 6. The main reason for the better performance is that our method suppresses the noise introduced by the camera calibration and the unpredictability of road conditions by taking advantage of the adaptive-length RPCA.

Detection Performance.
To demonstrate the detection performance of our proposed system, we compare it with three methods [33, 48, 49] using different human detectors on the ETHMS dataset, in terms of the detection rate and false positives per image (FPPI), as shown in Table 4. The results show that both the proposed method and the method in [33] are superior to the methods in [48, 49]: both further utilize the 3D information of the detected objects, instead of only the 2D information used in [48, 49], and can therefore effectively handle occlusion. Compared with the method in [33] using the DPM detector, the proposed method performs much better because the adaptive ground plane estimation improves the tracking, which increases the detection rate and decreases the FPPI. Between the DPM and YOLOv3 detectors, the proposed method with YOLOv3 performs better due to the lower false positive rate of YOLOv3. Thanks to the proper insertion of hypothesized associations and the successive tracking, the detection rate of the proposed method reaches about 78%. This implies that missed detections can be recovered by the tracking techniques, and better detection results in turn benefit the tracking performance.
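For clarity, the two metrics in Table 4 can be computed as follows; the counts in the usage line are hypothetical, chosen only to illustrate the arithmetic (a 78% detection rate, as reported above).

```python
def detection_metrics(tp, fn, fp, num_images):
    """Detection rate (recall) and false positives per image (FPPI),
    the two quantities reported in Table 4."""
    detection_rate = tp / (tp + fn)   # fraction of ground-truth objects found
    fppi = fp / num_images            # false alarms averaged over all frames
    return detection_rate, fppi

# hypothetical counts, only to illustrate the computation
rate, fppi = detection_metrics(tp=780, fn=220, fp=150, num_images=1000)
```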

Multiple Object Tracking Result.
To demonstrate the tracking performance of our proposed system, we compare it with three different tracking methods: the method in [44] is a kernel-based human-tracking system that tracks a human in 2D space without estimating the ground plane; the method in [50] uses the tracking-by-detection scheme to associate detected objects by calculating their similarity; and the method in [33] is a human-tracking system that uses a constant number of frames to estimate the ground plane. To fairly evaluate the tracking performance of each method, we manually labeled 7302 locations as ground truth, covering 31 moving vehicles and 89 pedestrians across 3393 frames, and adopt the metrics widely used in the multiple object tracking (MOT) challenge [51]. The results are summarized in Table 5. The proposed method achieves the best performance in all of the metrics except for FN. The reason is that the CNN-based tracking-by-detection retains more foreground around the object regions; however, the extra extracted background information also increases FP and IDS. The ability of the proposed depth CMK to deal with occlusion can be seen from its lower number of identity switches, while the other methods tend to generate new object identities when occlusion occurs. To facilitate the comparison of experimental results, the red entries in Table 5 indicate the best results in the corresponding columns and the blue italic entries the second best.
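Among the MOT-challenge metrics referred to above, the summary accuracy score combines the three error types (FN, FP, IDS) discussed in this paragraph; a minimal sketch of its computation is given below. The usage values are hypothetical, shown only against the 7302 labeled ground-truth locations mentioned above.

```python
def mota(fn, fp, ids, num_gt):
    """Multiple Object Tracking Accuracy (MOTA), as defined for the MOT
    challenge: missed targets, false positives, and identity switches,
    normalized by the total number of ground-truth object locations."""
    return 1.0 - (fn + fp + ids) / num_gt

# hypothetical error counts against the 7302 labeled ground-truth locations
score = mota(fn=900, fp=400, ids=30, num_gt=7302)
```

Because FN, FP, and IDS share one denominator, a method that trades a lower FN for higher FP and IDS (as the CNN-based tracker discussed above does) can still end up with a worse overall MOTA.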
A typical example of the performance comparison is shown in Figures 10 and 11, both of which show five consecutive frames (175 to 179) from UW campus sequence 1. Figure 10 shows the tracking results of the method in [33], which uses a constant number of frames to estimate the ground plane. Figure 11 shows the tracking results of the proposed method, which takes advantage of the yaw angle of the camera pose to estimate the ground plane adaptively. From Figure 10, we can see that the camera mounted on the driving vehicle starts to change direction in frame 175; in frame 177, the distance of the vehicle to the camera changes sharply from 10.31 to 7.98 and then back to 8.31 in frame 179, while the estimated ground plane remains the same even though the vehicle is turning. Figure 11 shows the tracking performance of the proposed method using adaptive ground plane estimation: the distance of the vehicle gradually reduces from 10.51 to 8.44, and the ground plane keeps changing with the direction of the vehicle adaptively. It can be observed that the proposed method tracks objects more continuously and effectively by using the adaptive ground plane estimation. Several object tracking results with the estimated ground plane are shown in Figures 12-14, which present the tracking results on UW campus sequence 2, the Kitti datasets, and the ETHMS dataset, respectively. The results show the favorable performance of the proposed system, which not only tracks objects successively but also estimates a reliable ground plane adaptively. The implementation is written in C/C++, and the experimental settings are as follows: in the structure from motion phase, the proposed system uses the Harris corner detector to extract 1000 initial features, which are tracked by a KLT tracker, and these corresponding feature points are used to estimate the camera pose.
In object detection, pretrained YOLOv3 detectors are used in the proposed system to detect objects such as humans and vehicles. In the depth CMK tracking, a depth map is first constructed to describe the relative 3D locations of all tracked objects; then the histogram of each object is built in the HSV color space with a roof kernel, and the K-L distance is used for all similarity-related measurements. Table 6 shows the running time of the proposed system on different datasets with different image resolutions.
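The histogram similarity used in the depth CMK tracking can be sketched as follows. Note that the symmetrized form of the K-L divergence is an assumption on our part, since the text does not specify which K-L variant is used.

```python
import numpy as np

def kl_distance(p, q, eps=1e-10):
    """Symmetrized Kullback-Leibler distance between two color
    histograms (e.g., kernel-weighted HSV histograms of the target and
    a candidate).  eps avoids log(0) on empty bins; the symmetrization
    is our choice, not necessarily the paper's exact variant."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()                     # renormalize after smoothing
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# identical histograms give distance 0; dissimilar ones a positive value
h1 = np.array([0.2, 0.3, 0.5])
h2 = np.array([0.5, 0.3, 0.2])
```

The distance is zero only for identical histograms, so thresholding it gives a direct accept/reject test for a kernel's candidate position.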
6.5. Discussion. In this paper, we proposed a tracking system based on an adaptive ground plane estimation algorithm. Existing ground plane estimation methods require significant assumptions, such as that the ground plane is the largest plane in the scene or that it is constant in color or texture. These assumptions are not practical in cluttered or dynamic environments and are especially unsuitable for driving environments. Our method can robustly estimate the ground plane from a moving camera under a nonrestrictive assumption: the camera is mounted at a fixed height on the vehicle.
Combining the adaptive ground plane estimation, object detection, the Kalman filter framework, and the efficient depth CMK tracking technique, the proposed tracking system can not only track objects effectively but also robustly handle occlusion during tracking. Nevertheless, several limitations still exist. First, the proposed approach adopts the tracking-by-detection scheme to detect and then track objects, which implies that the method relies heavily on the detection results. If the quality of the video sequences is not sufficient for the object detectors, the proposed tracking system cannot perform well on the poor detections; more specifically, only a positive detection of a target can trigger the tracking of a specific object. In other words, the proposed method may not work well at night or in other cases of insufficient lighting. Second, the proposed method effectively estimates ground planes from video frames when the vehicle moves on flat roads, but if the roads are severely bumpy, it produces less reliable estimates, resulting in larger object back-projection errors and reduced accuracy of the reprojected 3D information. Hence, the proposed method is not reliable for unmanned aerial vehicles, because their height changes dynamically and thus yields unreliable 3D information about objects.
In the future, we will focus on improving the performance of the algorithm by enhancing the accuracy of the object detectors. In addition, we will test our algorithms on video sequences with higher outdoor complexity and more objects visible in the scene.

Conclusion
We propose a robust object tracking system that simultaneously estimates the ground plane from a dashcam mounted on a free-moving vehicle. The proposed system effectively integrates object detection, ground plane estimation, CMK tracking, and the Kalman filter framework to relocate objects in 3D space, and the estimated camera yaw angle is adopted in the adaptive ground plane estimation. With the depth CMK tracking, the 3D positions of the detected targets are updated on the more reliable ground plane, and occlusion is also handled in the tracking system.
The experimental results show that the proposed method greatly improves the tracking performance. Such a tracking system can be regarded as a key component of higher-level applications, such as video analysis in a large-scale mobile network. Besides, the proposed framework can also be further applied to advanced driver assistance systems (ADASs).

Data Availability
The Kitti dataset used to support the findings of this study may be released upon application to the KITTI Vision Benchmark Suite, a project of the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago. The dataset can be downloaded for free at http://www.cvlibs.net/datasets/kitti/raw_data.php. The ETHMS dataset can be downloaded at https://data.vision.ee.ethz.ch/cvl/aess/dataset/#pami09. Requests for the self-recorded UW data, 6/12 months after the publication of this article, will be considered by the corresponding author.