Hybrid Video Stabilization for Mobile Vehicle Detection on SURF in Aerial Surveillance

Detection of moving vehicles in aerial video sequences is of great importance, with many promising applications in surveillance, intelligent transportation, and public services such as emergency evacuation and police security. However, vehicle detection is a challenging task due to global camera motion, low resolution of vehicles, and low contrast between vehicles and background. In this paper, we present a hybrid method to efficiently detect moving vehicles in aerial videos. Firstly, local feature extraction and matching were performed to estimate the global motion. It was demonstrated that the Speeded Up Robust Feature (SURF) key points were more suitable for the stabilization task. Then, a list of dynamic pixels was obtained and grouped for different moving vehicles by comparing the different optical flow normals. To enhance the precision of detection, some preprocessing methods were applied to the surveillance system, such as road extraction and other features. A quantitative evaluation on real video sequences indicated that the proposed method improved the detection performance significantly.


Introduction
In recent years, analysis of aerial videos has become an important topic [1] with various applications such as intelligence, surveillance, and reconnaissance (ISR), intelligent transportation, and military fields [2,3]. As an excellent supplement to ground-plane surveillance systems, airborne surveillance is more suitable for monitoring fast-moving targets and covers a larger area [4]. Mobile vehicles in aerial videos need to be detected for event observation, summarization, indexing, and high-level aerial video understanding [5]. This paper focuses on vehicle detection from a low-altitude aerial platform (about 120 m above ground).
Detection of objects has traditionally been a very important research topic in classical computer vision [6,7]. However, there are still some challenges related to detection in low-resolution aerial videos. Firstly, vehicles in aerial video have small size and low resolution. Lack of color, low contrast between vehicles and backgrounds, and small and variable vehicle sizes (400∼550 pixels) make the appearance and size of a vehicle too indistinct to establish reliable correspondences. Secondly, frame differencing and background modeling usually assume a static background and consistent global illumination. However, in practice, changes of background and global illumination are common in aerial videos due to the global motion of the camera. Moreover, UAV video analysis requires real-time processing. Therefore, a fast and robust detection algorithm is strongly desired. So far, detection of moving vehicles is still a big challenge.
In this work, a vehicle detection method was proposed based on the VSAM method of Cohen and Medioni [8]. The similarities and differences of these two methods were discussed in detail. We used the Speeded Up Robust Feature (SURF) for video stabilization and demonstrated its validity. Scene context such as roads was introduced into mobile vehicle detection, and good results were obtained. Also, complementary features such as shape were used to achieve good detection performance.
This paper is organized as follows. Section 2 enumerates related work on vehicle detection from aerial videos. Section 3 describes the details of the proposed approach. Section 4 presents our experimental results. Conclusions of this work are summarized in Section 5.

Related Work
In the literature, some approaches have been proposed to deal with vehicle detection in airborne videos. However, they mostly tackle stationary camera scenarios [9-11]. Recently, there has been an increasing interest in studying mobile vehicle detection from moving cameras [12]. Background subtraction is one of the most successful approaches to extract moving objects [13,14]. However, it is only applicable to stationary cameras with fixed fields of view. Detection of moving objects with moving cameras has been researched to overcome this limitation.
For video captured by a moving camera, the most typical method for detecting moving objects is an extension of the background subtraction method [15,16]. In these methods, panoramic background models are constructed by applying various image registration techniques [17] to the input frames, and the position of the current frame in the panorama is found by image matching algorithms. Then, moving objects are segmented in a similar way to the fixed camera case. Cucchiara et al. [15] built a background mosaic considering the internal parameters of the cameras. However, camera internal parameters are not always available. Shastry and Schowengerdt [18] proposed a frame-by-frame video registration technique using a feature tracker to automatically determine control-point correspondences. This converts the spatiotemporal video into temporal information, thereby correcting for airborne platform motion and attitude errors. However, a digital elevation map (DEM) is not always available. Although these works use different types of motion model, none of them considers the registration error caused by the parallax effect.
The second method to detect moving objects with a moving camera is optical flow [2,19,20]. The main concept proposed in [2] is to create an artificial optical flow field by estimating the camera motion between two subsequent video frames. Then, this artificial flow is compared with the real optical flow directly calculated from the video feed. Finally, a list of dynamic pixels is obtained and then grouped into dynamic objects. Yalcin et al. [19] propose a Bayesian framework for detecting and segmenting moving objects from the background, based on statistical analysis of optical flow. In [20] the authors obtain the motion model of the background by computing the optical flow between two adjacent frames in order to get motion information for each pixel. Optical flow methods first need to calculate the optical flow field, which is sensitive to noise and does not give a precise result; moreover, they are not suitable for detecting moving vehicles in real time.
Recently, appearance feature based classification has been widely used in vehicle detection [3,4]. Shi et al. [3] proposed a moving vehicle detection method based on a cascade of support vector machine (SVM) classifiers. Shape and histogram of oriented gradient (HOG) features are fused to train the SVM for classifying vehicles and nonvehicles. Cheng et al. [4] proposed a pixelwise feature classification method for vehicle detection using a dynamic Bayesian network (DBN). These approaches are promising. However, their effectiveness depends on the selected features. For example, the color feature of each pixel in [4] is extracted by the new color transformation of [21]. However, this color transformation only considers the difference between vehicle color and road color and does not take into account the similar colors among vehicles, buildings, and roads (Figures 9(a2) and 9(b1)). Moreover, the need to collect a number of positive and negative training samples to train the SVM for vehicle classification is another concern.
In this paper, we designed a new vehicle detection framework that preserves the advantages of the existing works and avoids their drawbacks. The modules of the proposed system framework are illustrated in Figure 1. It is a two-stage object detection scheme: initial vehicle detection, followed by refined vehicle detection with scene context and complementary features. The whole framework can be roughly divided into three parts, which are video stabilization, initial vehicle detection, and refined vehicle detection. Video stabilization is used to eliminate camera vibration and noise with SURF feature extraction. Initial vehicle detection is used to find the candidate motion regions with the optical flow normal. Performing background color removal not only reduces false alarms and speeds up the detection process but also facilitates the road extraction. The initial vehicle detections are refined by using the road context and complementary features such as the size of the candidate region. The whole process runs online and iteratively.

Hybrid Method for Moving Vehicle Detection
Here, we elaborate each module of the proposed system framework in detail. We compensated the ego motion of the airborne vehicle by SURF [22] feature point based image alignment on consecutive frames and then applied an optical flow normal method to detect the pixels with motion. Pixels with a high optical flow normal value were grouped as candidates of mobile vehicles. Meanwhile, features such as size were used to improve the detection accuracy.

SURF Based Video Stabilization.
Registration is the process of establishing correspondences between images, so that the images are in a common reference frame. Aerial images are acquired from a moving airborne platform, and large camera motion exists between consecutive frames; thus sequence stabilization is essential for motion detection. Global camera motion is eliminated or reduced by the process of image registration. For registration, descriptors such as SURF or SIFT (scale invariant feature transform) [23] can be used. In particular, SURF features were used because of their efficiency.

SURF
Once the integral image has been computed, it takes three additions to calculate the sum of the intensities over any upright rectangular area. Let $A = (x_1, y_1)$, $B = (x_2, y_2)$, $C = (x_3, y_3)$, and $D = (x_4, y_4)$ be the four corner points of the rectangular area shown in Figure 2, with $A$ and $D$ the diagonally opposite top-left and bottom-right corners. The sum of all pixels in the black rectangular area can then be expressed by $I_\Sigma(A) + I_\Sigma(D) - (I_\Sigma(B) + I_\Sigma(C))$. The calculation time is independent of the size of the area, which is important in the SURF algorithm.
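The box-sum property described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the experiments; the function names `integral_image` and `box_sum` are hypothetical, and the table is padded with a zero row and column (an assumption made here to avoid boundary checks).

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended,
    so that ii[r, c] equals the sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from four table lookups:
    the two diagonal corners are added, the other two subtracted."""
    return ii[r1, c1] + ii[r0, c0] - ii[r0, c1] - ii[r1, c0]

img = np.arange(20).reshape(4, 5)
ii = integral_image(img)
total = box_sum(ii, 1, 2, 3, 5)  # same value as img[1:3, 2:5].sum()
```

The cost of `box_sum` is the same for a 3 × 3 box and for a 300 × 300 box, which is what makes large box filters cheap.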
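Then SURF uses the Hessian matrix to detect feature points. For a point $\mathbf{x} = (x, y)$ in the image $I$, the Hessian matrix $H(\mathbf{x}, \sigma)$ at scale $\sigma$ is defined as

$$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}. \tag{2}$$

In formula (2), $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the Gaussian second-order partial derivative $\partial^2 g(\sigma)/\partial x^2$ with the image $I$ at the point $\mathbf{x}$, and $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$ are calculated similarly.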
In order to reduce the workload of calculation, SURF uses box filters to approximate the Gaussian second-order derivatives: $L_{xx}$, $L_{yy}$, and $L_{xy}$ are replaced, respectively, by $D_{xx}$, $D_{yy}$, and $D_{xy}$, the convolutions of the box filters with the original input image. The calculations are shown in Figure 3 and formula (3). In Figure 3, the weight of a black pixel is −2 and that of a white pixel is 1. $D_{xx}$, $D_{yy}$, and $D_{xy}$ are computed from the integral image as given in formula (3), where $(r, c)$ are the row and column of the pixel in the image, respectively, $l$ is 1/3 of the size of the box filter, and $b = [l/2]$ ($[\cdot]$ is the rounding operation).
The determinant of the approximated Hessian matrix, $\det(H_{\text{approx}})$, can be written as

$$\det(H_{\text{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2. \tag{4}$$

By using nonmaxima suppression in a neighborhood, the image feature points can be found at different scales. After feature extraction, it is necessary to match feature points between two successive frames. For this step, we adopted the matching process proposed by Lowe [23], which finds matches between the features of two consecutive images using the Euclidean distance: the Euclidean distance between SURF descriptors is used to determine the initial corresponding feature point pairs in different images. We then used RANSAC to filter outliers that come from the imprecision of the SURF model. An example is shown in Figures 4(c) and 4(d).
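As an illustration of Euclidean-distance matching between descriptor sets, the following sketch performs nearest-neighbour matching with a distance-ratio test in the spirit of Lowe [23]. It is a sketch under assumptions: the function name `match_descriptors` and the threshold `ratio=0.8` are illustrative choices, the descriptors are synthetic, and the RANSAC outlier filtering step is not included.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where desc_b[j] is the nearest neighbour of
    desc_a[i] in Euclidean distance and is clearly closer than the
    second-nearest neighbour (ratio test). desc_b needs >= 2 rows."""
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # shape (len(a), len(b))
    order = np.argsort(dist, axis=1)
    matches = []
    for i in range(dist.shape[0]):
        best, second = order[i, 0], order[i, 1]
        if dist[i, best] < ratio * dist[i, second]:
            matches.append((i, int(best)))
    return matches

# Synthetic 64-dimensional descriptors: desc_a holds noisy copies of
# rows 2 and 0 of desc_b, so those are the expected matches.
rng = np.random.default_rng(0)
desc_b = rng.normal(size=(5, 64))
desc_a = desc_b[[2, 0]] + 0.01 * rng.normal(size=(2, 64))
matches = match_descriptors(desc_a, desc_b)
```

The ratio test discards ambiguous matches before any geometric verification, which reduces the number of outliers a subsequent RANSAC stage has to reject.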
This means that an image taken at time $t + \tau$ is considered to be shifted from the earlier image by $\mathbf{d} = (\xi, \eta)$, called the displacement in time $\tau$. If the pixel is obscured by noise, or if there is an abnormal intensity change due to light reflection by objects, (5) can be redefined as

$$I_{t+\tau}(x, y) = I_t(x - \xi, y - \eta) + n(x, y), \tag{6}$$

where $n(x, y)$ is a noise term. Using feature matching, we can get the geometric transformation between $I_{t+\tau}$ and $I_t$. Indeed, let $\Gamma_{t,t+\tau}$ denote the warping of the image $t$ to the reference frame $t + \tau$. The stabilized image sequence is then defined by $I^{s}_{t+\tau} = I_t(\Gamma_{t,t+\tau})$. The parameter estimation of the geometric transform is done by the minimum mean square error criterion:

$$E = \sum_{(x, y)} \big( I_{t+\tau}(x, y) - I_t(\Gamma_{t,t+\tau}(x, y)) \big)^2. \tag{7}$$

Generally, the geometric transformation between two images can be described by a 2D or 3D homography model. We adopted a four-parameter 2D affine motion model to describe the geometric transformation between two consecutive frames. If $P(x, y)$ is a point in frame $t$ and $P'(x', y')$ is the same point in the successive frame, then the transformation from $P$ to $P'$ can be represented as

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}, \tag{8}$$

or, compactly, in matrix form. The affine matrix can accurately describe pure rotation, panning, and small translations of the camera in a scene with small relative depth variations and zooming effects. Here $s$ is the scaling factor, $\theta$ is the rotation, and $t_x$ and $t_y$ are the translations in the horizontal and vertical directions, respectively. Corresponding pairs of feature points were used to determine the transform matrix in (8) from two consecutive image frames. Since four unknowns exist in (8) and each pair of corresponding points provides two equations, at least two pairs are needed to determine a unique solution.
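Nevertheless, more matches can be added and the parameters estimated under the least-squares criterion to make the results more robust:

$$\hat{\mathbf{p}} = (A^{T} A)^{-1} A^{T} \mathbf{b}, \tag{9}$$

where $\mathbf{p}$ is the vector of unknown transform parameters and $A$ and $\mathbf{b}$ stack the linear equations contributed by all matched pairs. Then we can compensate the current frame to obtain stable images; the compensation is calculated directly using a warping operation. An example is shown in Figure 4(e).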
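The least-squares estimation of a four-parameter motion model from matched points can be sketched as follows. The sketch assumes the model takes the usual similarity form $x' = s(x\cos\theta - y\sin\theta) + t_x$, $y' = s(x\sin\theta + y\cos\theta) + t_y$, and linearizes it with the substitution $a = s\cos\theta$, $b = s\sin\theta$; the function name `estimate_similarity` and the sample points are hypothetical.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares fit of (s, theta, tx, ty) mapping src points to dst.
    Unknowns are [a, b, tx, ty] with a = s*cos(theta), b = s*sin(theta):
        x' = a*x - b*y + tx
        y' = b*x + a*y + ty
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    x, y = src[:, 0], src[:, 1]
    A = np.zeros((2 * n, 4))
    rhs = np.zeros(2 * n)
    A[0::2] = np.column_stack([x, -y, np.ones(n), np.zeros(n)])
    A[1::2] = np.column_stack([y, x, np.zeros(n), np.ones(n)])
    rhs[0::2] = dst[:, 0]
    rhs[1::2] = dst[:, 1]
    (a, b, tx, ty), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return np.hypot(a, b), np.arctan2(b, a), tx, ty

# Synthetic check: warp four points with known parameters and recover them.
src = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 50.0], [80.0, 60.0]])
s_true, th_true, tx_true, ty_true = 1.1, 0.05, 3.0, -2.0
R = s_true * np.array([[np.cos(th_true), -np.sin(th_true)],
                       [np.sin(th_true),  np.cos(th_true)]])
dst = src @ R.T + np.array([tx_true, ty_true])
est = estimate_similarity(src, dst)
```

Each matched pair contributes two rows to the system, so any additional pairs over-determine the four unknowns and average out localization noise.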

Vehicle Detection.
After removing the undesired motion of camera, the first step of mobile vehicle detection was the initial vehicle detection, which produces the vehicle candidates, including many false alarms.

Normal Flow.
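The reference frame and the warped one do not, in general, have the same metric since, in most cases, the mapping function $\Gamma_{t,t+\tau}$ is not a translation but a 2D affine transform. This change in metric can be incorporated into the optical flow equation associated with the image sequence $I_t$ in order to detect candidate mobile vehicle regions more accurately. From the image brightness constancy assumption [24,25], the gradient constraint equation selected by Horn and Schunck [24] is

$$\frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} = 0, \tag{10}$$

where $u$ and $v$ are the optical flow velocity components and $\partial I/\partial x$, $\partial I/\partial y$, and $\partial I/\partial t$ are the spatial gradients and temporal gradient of image intensity. Equation (10) is written in matrix form:

$$\frac{\partial I}{\partial t} + \mathbf{w}^{T} \nabla I = 0, \quad \mathbf{w} = (u, v)^{T}. \tag{11}$$

The optical flow associated with the stabilized image sequence $I^{s}_{t+\tau} = I_t(\Gamma_{t,t+\tau})$ is

$$\frac{\partial I^{s}_{t+\tau}}{\partial t} + \mathbf{w}^{T} \nabla I^{s}_{t+\tau} = 0. \tag{12}$$

Expanding the previous equation we obtain

$$\frac{\partial I^{s}_{t+\tau}}{\partial t} + \mathbf{w}^{T} \nabla \big[ I_t(\Gamma_{t,t+\tau}) \big] = 0. \tag{13}$$

According to composite function derivation rules,

$$\nabla \big[ I_t(\Gamma_{t,t+\tau}) \big] = \nabla \Gamma_{t,t+\tau} \, \nabla I_t(\Gamma_{t,t+\tau}). \tag{14}$$

Expanding (13) gives

$$\frac{\partial I^{s}_{t+\tau}}{\partial t} + \mathbf{w}^{T} \, \nabla \Gamma_{t,t+\tau} \, \nabla I_t(\Gamma_{t,t+\tau}) = 0, \tag{15}$$

and therefore the normal flow $\mathbf{w}_{\perp}$, the component of $\mathbf{w}$ along the warped gradient, is

$$\mathbf{w}_{\perp} = - \frac{\partial I^{s}_{t+\tau} / \partial t}{\big\| \nabla \Gamma_{t,t+\tau} \, \nabla I_t(\Gamma_{t,t+\tau}) \big\|^{2}} \, \nabla \Gamma_{t,t+\tau} \, \nabla I_t(\Gamma_{t,t+\tau}). \tag{16}$$

Although $\mathbf{w}_{\perp}$ does not always characterize image motion, due to the aperture problem, it allows accurate detection of moving points. The amplitude of $\mathbf{w}_{\perp}$ is larger near moving regions and becomes null near stationary regions. The relation between normal flow and optical flow is shown in Figure 5(a), and the candidate mobile region detection is shown in Figure 5(b).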
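The behaviour of the normal flow amplitude (large near moving regions, null near stationary ones) can be illustrated with a small NumPy sketch. It is not the implementation used in the experiments: the function name `normal_flow_magnitude` is hypothetical, the temporal gradient is approximated by a simple frame difference, and the optional argument `jac` is assumed to be the 2 × 2 matrix written $\nabla\Gamma$ in the text.

```python
import numpy as np

def normal_flow_magnitude(prev, curr, jac=None, eps=1e-6):
    """Amplitude of the normal flow, |I_t| / ||g||, for two stabilized frames.
    g is the spatial gradient, optionally multiplied by a 2x2 warp Jacobian."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    it = curr - prev                      # temporal gradient (frame difference)
    gy, gx = np.gradient(curr)            # axis 0 is y (rows), axis 1 is x (cols)
    if jac is not None:
        gx, gy = (jac[0, 0] * gx + jac[0, 1] * gy,
                  jac[1, 0] * gx + jac[1, 1] * gy)
    norm = np.sqrt(gx ** 2 + gy ** 2)
    # Pixels without spatial gradient carry no normal-flow information.
    return np.where(norm > eps, np.abs(it) / np.maximum(norm, eps), 0.0)

# Synthetic pair: a textured (ramp) background with a bright block
# that moves two pixels to the right between the frames.
ramp = np.tile(np.arange(32, dtype=float), (32, 1))
prev = ramp.copy()
prev[10:14, 10:14] += 50
curr = ramp.copy()
curr[10:14, 12:16] += 50
mag = normal_flow_magnitude(prev, curr)
```

Thresholding `mag` gives a mask of candidate moving pixels, which can then be grouped into candidate regions.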

Context Extraction.
Context is especially useful in aerial video analysis, because most vehicles move within specific areas, and the road is effective context information for robust mobile vehicle detection. Many methods estimate the road network using scene classification, which requires complicated training and considerable advance preparation. Based on general human knowledge, we can instead give the following brief description of the road.
(i) A road has constant width along its entire length.
(ii) A road is always vertical or horizontal in the airborne videos.
(iii) A road has two distinct parallel edges.
(iv) A road is always a connected region.
Based on the above assumptions, we use Canny edge detection and the Hough transform to extract the road area. The results are shown in Figure 6.
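The line-voting idea behind the Hough transform can be sketched with NumPy alone. The sketch assumes a binary edge map is already available (for example, from a Canny detector, which is not implemented here); the function name `hough_accumulator` and the synthetic two-edge image are illustrative only.

```python
import numpy as np

def hough_accumulator(edges, n_theta=180):
    """Vote in (rho, theta) space for every edge pixel, using the line
    parameterization rho = x*cos(theta) + y*sin(theta)."""
    h, w = edges.shape
    thetas = np.deg2rad(np.arange(n_theta) * (180.0 / n_theta))
    diag = int(np.ceil(np.hypot(h, w)))   # rho lies in [-diag, diag]
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int64)
    ys, xs = np.nonzero(edges)
    for t_idx, theta in enumerate(thetas):
        rho = np.rint(xs * np.cos(theta) + ys * np.sin(theta)).astype(int) + diag
        np.add.at(acc, (rho, t_idx), 1)
    return acc, diag, thetas

# Synthetic edge map: two parallel vertical edges, like the borders of a road.
edges = np.zeros((40, 40), dtype=bool)
edges[:, 10] = True
edges[:, 30] = True
acc, diag, thetas = hough_accumulator(edges)

# The two strongest accumulator cells give (rho, angle in degrees) of the borders.
top = np.argsort(acc, axis=None)[-2:]
rho_idx, theta_idx = np.unravel_index(top, acc.shape)
borders = sorted((int(r) - diag, float(np.rad2deg(thetas[t])))
                 for r, t in zip(rho_idx, theta_idx))
```

Two strong peaks with the same angle and a constant offset in rho correspond to the "two distinct parallel edges" assumption; the region between them is a candidate road area.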

Complementary Features.
Initial vehicle detection produces candidate mobile vehicle regions, including many false alarms, as shown in Figure 7. We use the shape (size) [3] of the candidate motion regions to improve the detection performance. The size feature is a four-dimensional vector, given by (17), where $l$ and $w$ denote the length and width of the object, respectively.
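A very simple way to use size for rejecting false alarms is to keep only candidates whose area is plausible for a vehicle. The sketch below is an assumption-laden illustration, not the four-dimensional size feature of the paper: it filters on the area $l \times w$ only, and the default bounds reuse the 400∼550 pixel vehicle size quoted in the Introduction. The function name `filter_by_size` and the candidate boxes are hypothetical.

```python
def filter_by_size(regions, min_area=400, max_area=550):
    """Keep candidate regions (x, y, l, w) whose area l*w is vehicle-sized."""
    return [r for r in regions if min_area <= r[2] * r[3] <= max_area]

candidates = [
    (12, 40, 25, 18),    # area 450: plausible vehicle
    (200, 90, 6, 4),     # area 24: too small, likely noise
    (310, 15, 60, 45),   # area 2700: too large, likely a building edge
]
vehicles = filter_by_size(candidates)
```

In practice the bounds would depend on flight altitude and camera resolution, so they are best treated as tunable parameters.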

Experiment Results and Analysis
We tested our method on three surveillance videos. The first two, named 2.avi and gs.avi, were obtained from our own hardware platform and are shown in Figure 9(a). The third, named TucsonBlvd origin.avi, is from Shastry and Schowengerdt's paper [18] and is shown in Figure 9(b). The first two were taken at 25 frames per second with a resolution of 720 × 576 pixels from an airship at a height of 120 m above the ground, flying at 30 km/h; the hardware is shown in Figure 8.
Judging from the number of vehicles and the background complexity in Figure 9, (a1) contains the fewest vehicles, and its background is simple: it includes no buildings and therefore causes no visual errors. (a2) contains more vehicles, and its background includes buildings, which cause visual errors. The most complex video is (b), which not only includes more vehicles and buildings but also has the lowest resolution. The hardware platform of the simulation has a 2.1 GHz CPU and 2 GB of RAM. The software used in the experiments is OpenCV 1.0 and VC++ 6.0.

Image Stabilization Comparison between SURF and SIFT.
Our first experiment compares our video stabilization system with that of [5], which is based on SIFT feature extraction. We demonstrated that the Speeded Up Robust Feature (SURF) key points are more suitable for the stabilization task. Figure 10 shows five frames of the unstable input sequence, corresponding to frames 1, 2, 5, 10, and 15 of 2.avi. Next, we compute the global motion vector $\Gamma_{1,n}$, shown in Table 1. Table 1 shows that the platform moves mostly in the vertical direction and that the accuracy of $\Gamma_{1,n}$ is almost the same for the two video stabilization methods.
Figures 11(a) and 11(b) show the stabilization results of SIFT and SURF, respectively. Subjectively, our video stabilization system gives the same results as SIFT.
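Then, we used the Peak Signal-to-Noise Ratio (PSNR), an error measure, to evaluate the quality of the video stabilization. The PSNR between frame 1 and stabilized frame $n$ is defined as

$$\mathrm{PSNR}(I^{s}_{n}, I_1) = 10 \log_{10} \frac{255^2}{\mathrm{MSE}(I^{s}_{n}, I_1)}, \tag{18}$$

where $\mathrm{MSE}(I^{s}_{n}, I_1)$ is the mean square error between the two frames and $M$ and $N$ are the frame dimensions:

$$\mathrm{MSE}(I^{s}_{n}, I_1) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big( I^{s}_{n}(i, j) - I_1(i, j) \big)^2. \tag{19}$$

We found that our stabilization system using SURF features works well compared to the stabilization system using SIFT features, as shown in Figure 12. Because of the parallax effect of the warping operation and the presence of multiple moving vehicles, the PSNR is low; for this reason, we use the normal optical flow in the mobile vehicle detection. Objectively, according to Table 1 and Figure 12, our video stabilization system gives better results than SIFT.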
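The PSNR measure can be computed in a few lines. This sketch assumes 8-bit frames (peak value 255) stored as NumPy arrays of equal size; the function name `psnr` and the test frames are illustrative.

```python
import numpy as np

def psnr(reference, frame, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized frames."""
    mse = np.mean((reference.astype(float) - frame.astype(float)) ** 2)
    if mse == 0:
        return float("inf")      # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

frame_1 = np.full((576, 720), 100, dtype=np.uint8)
stabilized = frame_1 + 1         # every pixel differs by one grey level, MSE = 1
value = psnr(frame_1, stabilized)
```

A higher PSNR between the reference frame and a stabilized frame indicates better alignment; residual parallax and independently moving vehicles both lower the value.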
Last, we compare the performance of the two video stabilization methods, as shown in Figure 13.
These experiments show that the image stabilization accuracy of the two methods is the same in both subjective and objective evaluation, while the efficiency of image stabilization with SURF is better than with SIFT. We conclude that our stabilization system works well.

Mobile Vehicle Detection Comparison between Proposed Method and Existing Methods.
To evaluate the performance of mobile vehicle detection, our tests were run on a number of real aerial video sequences with various contents, including cars and buildings. Figure 14 shows the results under different conditions. Each detected mobile vehicle is marked with a red rectangle. From the results, we can see that moving objects can be successfully detected against different backgrounds. However, we also found a failure case in the detection process.
To evaluate the performance of this method quantitatively, we used two metrics, the detection ratio (DR) and the false alarm ratio (FAR), defined in (20). Table 2 and Figures 14 and 15 illustrate the performance of our system. Because the resolution and complexity of the videos differ, the detection performance differs as well. Our system has the highest DR and the lowest FAR.
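For illustration, the two metrics can be computed from counts of true positives (TP), false positives (FP), and false negatives (FN). The definitions used below, DR = TP / (TP + FN) and FAR = FP / (TP + FP), are an assumption: they are a common convention in vehicle detection evaluation, but the exact form of (20) is not recoverable from the text. The function name `detection_metrics` and the counts are hypothetical.

```python
def detection_metrics(tp, fp, fn):
    """Detection ratio and false alarm ratio under the assumed definitions
    DR = TP / (TP + FN) and FAR = FP / (TP + FP)."""
    dr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (tp + fp) if (tp + fp) else 0.0
    return dr, far

# Hypothetical counts: 90 vehicles found, 10 missed, 5 false detections.
dr, far = detection_metrics(tp=90, fp=5, fn=10)
```

With these definitions a good detector has DR close to 1 and FAR close to 0.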

Conclusion
In this paper, we present a hybrid method to detect mobile vehicles efficiently in aerial videos. We also demonstrate that, compared with SIFT, SURF features are robust for video stabilization and mobile vehicle detection. A quantitative evaluation on real video sequences demonstrates that the proposed method improves the detection performance. Our future work will focus on the following aspects to improve our method.
(i) To increase the accuracy of mobile vehicle detection, more local and global features, such as color information and gradient distribution, can be applied in the method.
(ii) We have to balance processing speed against algorithm complexity and robustness.

Figure 1: Overview of the proposed method.

3.1.2. Feature Point Detection.
The integral images allow for fast computation of box filters. The entry of an integral image $I_\Sigma(\mathbf{x})$ at a location $\mathbf{x} = (x, y)$ represents the sum of all pixels in the input image $I$ within the rectangular region formed by the origin and $\mathbf{x}$:

$$I_\Sigma(\mathbf{x}) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j). \tag{1}$$

3.1.3. Feature Point Description.
In order to be invariant to image rotation, a dominant orientation for each key point is identified first in feature point description. For a key point, Haar wavelet responses in the $x$ and $y$ directions are calculated within a circular neighborhood of radius $6s$ around it, where $s$ is the scale of the detected key point. The Haar wavelet responses can be computed using Haar wavelet filters and integral images. The wavelet responses are then weighted with a Gaussian ($\sigma = 2s$) centered at the key point. The dominant orientation can be estimated by rotating a sliding fan-shaped window of size $\pi/3$. At each position, the horizontal and vertical responses within the sliding window are summed and used to form a new vector. The longest such vector over all windows is assigned as the orientation of the key point. Then, the SURF descriptor is generated in a $20s$ square region centered at the key point and oriented along its dominant orientation. The region is divided into 4 × 4 square subregions. For each subregion, Haar wavelet responses $d_x$ in the horizontal direction and $d_y$ in the vertical direction are computed from 5 × 5 sample points. Then the wavelet responses $d_x$ and $d_y$ are weighted with a Gaussian ($\sigma = 3.3s$) centered at the key point. The responses and their absolute values are summed up over each subregion and form a 4D feature vector $(\sum d_x, \sum d_y, \sum |d_x|, \sum |d_y|)$. Thus, for each key point, this results in a descriptor vector of length 4 × 4 × 4 = 64. Finally, the SURF descriptor is normalized to make it invariant to illumination changes.
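The final assembly step of the descriptor (summing responses over 4 × 4 subregions and normalizing) can be sketched as follows. The sketch assumes that the Haar responses have already been sampled on the oriented 20 × 20 grid and Gaussian weighted, so it covers only the aggregation; the function name `surf_descriptor` and the random input responses are illustrative.

```python
import numpy as np

def surf_descriptor(dx, dy):
    """Build a 64-D descriptor from 20x20 grids of Haar responses dx, dy.
    Each 5x5 subregion contributes (sum dx, sum dy, sum |dx|, sum |dy|)."""
    if dx.shape != (20, 20) or dy.shape != (20, 20):
        raise ValueError("expected 20x20 response grids")
    feats = []
    for r in range(0, 20, 5):
        for c in range(0, 20, 5):
            bx = dx[r:r + 5, c:c + 5]
            by = dy[r:r + 5, c:c + 5]
            feats += [bx.sum(), by.sum(), np.abs(bx).sum(), np.abs(by).sum()]
    v = np.array(feats, dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v   # unit length gives contrast invariance

rng = np.random.default_rng(1)
dx = rng.normal(size=(20, 20))
dy = rng.normal(size=(20, 20))
desc = surf_descriptor(dx, dy)
```

Because of the normalization, scaling all responses by a constant factor (a change in image contrast) leaves the descriptor unchanged.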

Figure 4: Image registration with SURF key points and image warping.

3.1.4. Motion Detection and Compensation.
The temporally and spatially changing video can be modeled as a function $I_t(x, y)$, where $(x, y)$ is the spatial location of a pixel and $t$ is the temporal index within the sequence. The function $I$ can be thought of as representing the pixel intensity at location $(x, y)$ and time $t$. This function satisfies the following property:

$$I_{t+\tau}(x, y) = I_t(x - \xi, y - \eta). \tag{5}$$

Figure 8: Hardware used in the experiments.

Figure 11: The 1st frame warped with the 2nd frame, 5th frame, 10th frame, and 15th frame, respectively, using SIFT features (a) and SURF features (b).

Figure 12: Graph of the Peak Signal-to-Noise Ratio of the original video and the stabilized video.

Figure 14: Vehicle detection results of the proposed method.

Figure 15: Vehicle detection results of GMM, LK, and proposed method.

Table 1: Global motion parameter comparison of SURF and SIFT using 2D affine transform.

Table 2: Quantitative analysis of detection with other methods.