Car Detection from Low-Altitude UAV Imagery with the Faster R-CNN

UAV based traffic monitoring holds distinct advantages over traditional traffic sensors, such as loop detectors, as UAVs have higher mobility, wider field of view, and less impact on the observed traffic. For traffic monitoring from UAV images, the essential but challenging task is vehicle detection. This paper extends the framework of Faster R-CNN for car detection from low-altitude UAV imagery captured over signalized intersections. Experimental results show that Faster R-CNN can achieve promising car detection results compared with other methods. Our tests further demonstrate that Faster R-CNN is robust to illumination changes and cars’ in-plane rotation. Besides, the detection speed of Faster R-CNN is insensitive to the detection load, that is, the number of detected cars in a frame; therefore, the detection speed is almost constant for each frame. In addition, our tests show that Faster R-CNN holds great potential for parking lot car detection. This paper tries to guide the readers to choose the best vehicle detection framework according to their applications. Future research will be focusing on expanding the current framework to detect other transportation modes such as buses, trucks, motorcycles, and bicycles.


Introduction
Unmanned aerial vehicles (UAVs) hold promise of great value for transportation research, particularly for traffic data collection (e.g., [1][2][3][4][5]). UAVs have many advantages over ground based traffic sensors [2]: great maneuverability and mobility, wide field of view, and zero impact on ground traffic. Due to the high cost and challenges of image processing, UAVs have not been extensively exploited for transportation research. However, with the recent price drop of off-the-shelf UAV products and the wide application of surveillance video technologies, UAVs are becoming more prominent in transportation safety, planning, engineering, and operations.
For UAV based applications in traffic monitoring, one essential task is vehicle detection. This task is challenging for the following reasons: varying illumination conditions, background motions due to UAV movements, complicated scenes, and different traffic conditions (congested or noncongested). Many traditional techniques, such as background subtraction [6], frame difference [7], and optical flow [8], can only achieve low accuracy, and some methods, such as frame difference and optical flow, can only detect moving vehicles. In order to improve detection accuracy and efficiency, many object detection schemes have been applied for vehicle detection from UAV images, including the Viola-Jones (V-J) object detection scheme [9], the linear support vector machine (SVM) with histogram of oriented gradient (HOG) features [10] (SVM + HOG), and Discriminatively Trained Part Based Models (DPM) [11]. Generally, these object detection schemes are less sensitive to image noise and complex scenarios and are therefore more robust and efficient for vehicle detection. However, most of these methods are sensitive to objects' in-plane rotation; that is, only objects in one particular orientation can be detected. Furthermore, many methods, like V-J, are sensitive to illumination changes.
In recent years, the convolutional neural network (CNN) has shown impressive performance on object classification and detection. The structure of CNN was first proposed by LeCun et al. [12]. As a feature learning architecture, CNN contains convolution and max-pooling layers. Each convolutional layer of CNN generates feature maps using several different convolution kernels on the local receptive fields from the preceding layer. The output layer in the CNN combines the extracted features for classification. By applying downpooling, the size of the feature maps can be decreased and the extracted features become more complex and global. Many studies [13][14][15] have shown that CNN can achieve promising performance in object detection and classification.
However, directly combining CNN with a sliding window strategy makes it difficult to localize objects precisely [16,17]. To address these issues, region-based CNNs, that is, R-CNN [18], SPPnet [19], and Fast R-CNN [20], have been proposed to improve object detection performance. However, the region proposal generation step consumes too much computation time. Therefore, Ren et al. further improved Fast R-CNN [20] and developed the Faster R-CNN [21], which achieves state-of-the-art object detection accuracy with real-time detection speed. Inspired by the success of Faster R-CNN [21] in object detection, this research aims to apply Faster R-CNN [21] for vehicle detection from UAV imagery.
The rest of the paper is organized as follows: Section 2 briefly reviews some related work about vehicle detection with CNN from UAV images, followed by the methodological details of the Faster R-CNN [21] in Section 3. Section 4 presents a comprehensive evaluation of the Faster R-CNN for car detection. Section 5 presents a discussion on some key characteristics of Faster R-CNN. Finally, Section 6 concludes this paper with some remarks.

Related Work
A large amount of research has been performed on vehicle detection over the years. Here we only focus on vehicle detection with CNN from UAV images and review the most closely related work.
Pérez et al. [22] developed a traditional object detection framework based on the sliding window strategy with a classifier. They designed a simple CNN instead of using traditional classifiers (SVM, Boosted Trees, etc.). As the sliding window strategy is time-consuming when handling multiscale object detection, the framework of [22] is time-consuming for vehicle detection from UAV images.
Ammour et al. [23] proposed a two-stage car detection method, consisting of a candidate region extraction stage and a classification stage. In the candidate region extraction stage, the authors employed the mean-shift algorithm [24] to segment images. Then a fine-tuned VGG16 model [25] was used to extract region features. Finally, an SVM was used to classify the features into "car" and "non-car" objects. The framework of [23] is similar to R-CNN [18], which is time-consuming when generating region proposals. Besides, different models must be trained for the three separate stages, which increases the complexity of [23].
Chen et al. [15] proposed a hybrid deep convolutional neural network (HDNN) for vehicle detection in satellite images to handle large-scale variance of vehicles. However, when applying HDNN for vehicle detection from satellite images, it takes about 7-8 seconds to process one image even when using a Graphics Processing Unit (GPU). Inspired by the success of Faster R-CNN in both detection accuracy and detection speed, this work proposes a car detection method based on Faster R-CNN [21] to detect cars from low-altitude UAV imagery. The details of the proposed method are presented in the following section.

Car Detection with Faster R-CNN
Faster R-CNN [21] has achieved state-of-the-art performance for multiclass object detection in many fields (e.g., [19]). However, no direct application of Faster R-CNN to car detection from low-altitude UAV imagery, particularly in urban environments, has been reported so far. This paper aims to fill this gap by proposing a framework for car detection from UAV images using Faster R-CNN, as shown in Figure 1.
3.1. Architecture of Faster R-CNN. The Faster R-CNN consists of two modules: the Region Proposal Network (RPN) and the Fast R-CNN detector (see Figure 2). The RPN is a fully convolutional network for efficiently generating region proposals with a wide range of scales and aspect ratios, which are fed into the second module. Region proposals are rectangular regions which may or may not contain candidate objects. The Fast R-CNN detector, the second module, is used to refine the proposals. The RPN and the Fast R-CNN detector share the same convolutional layers, allowing for joint training. The Faster R-CNN runs through the CNN only once for the entire input image and then refines object proposals. Due to the sharing of convolutional layers, it is possible to use a very deep network (e.g., VGG16 [25]) for generating high-quality object proposals. The entire architecture is a single, unified network for object detection (see Figure 2).

Fast R-CNN Detector.
The Fast R-CNN detector takes multiple regions of interest (RoIs) as input. For each RoI (see Figure 2), a fixed-length feature vector is extracted by the RoI pooling layer from the convolutional layer. Each feature vector is fed into a sequence of fully connected (FC) layers. The final outputs of the detector through the softmax layer and the bounding-box regressor layer include (1) softmax probabilities estimated over $K$ object classes plus the "background" class and (2) the related bounding-box (bbox) values. In this research, the value of $K$ is 1; that is, the object classes contain only one object, "passenger car", plus the "background" class.

Region Proposal Networks and Joint Training.
When using the RPN to predict car proposals from UAV images, the RPN takes a UAV image as input and outputs a set of rectangular car proposals (i.e., bounding boxes), each with an objectness score. In this paper, the VGG-16 model [25], which has 13 shareable convolutional layers, was used as the Faster R-CNN convolutional backend.
The RPN utilizes sliding windows over the convolutional feature map output by the last shared convolutional layer to generate rectangular region proposals for each position (see Figure 3(a)). An $n \times n$ spatial window (filter) is convolved with the input convolutional feature map. Each sliding window is then projected to a lower-dimensional feature (512-d for VGG-16), which is fed into two sibling 1 × 1 convolutional layers: a box-regression layer (reg) and a box-classification layer (cls). For each sliding window location, $k$ possible proposals (i.e., anchors in [21]) are generated. The reg layer produces $4k$ outputs that encode the coordinates of the $k$ bounding boxes, and the cls layer outputs $2k$ objectness scores that estimate the probability of each proposal containing a car or a non-car object (see Figure 3(b)).
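The anchor mechanism can be illustrated with a minimal sketch. The function names `make_anchors` and `shift_anchors` are hypothetical, and the numeric settings (three scales, three aspect ratios, a feature stride of 16) are assumptions taken from the default configuration described in [21]; this paper does not state which values were used.

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centred on the origin.

    Each anchor is (x1, y1, x2, y2). The defaults are assumed values
    following [21], not settings reported in this paper.
    """
    anchors = []
    for ratio in ratios:                      # ratio = height / width
        for scale in scales:
            area = float(base_size * scale) ** 2
            w = np.sqrt(area / ratio)         # keep the area fixed per scale
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Replicate the k anchors at every sliding-window location."""
    xs = np.arange(feat_w) * stride
    ys = np.arange(feat_h) * stride
    sx, sy = np.meshgrid(xs, ys)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
    # Broadcast: (locations, 1, 4) + (1, k, 4) -> (locations * k, 4)
    return (shifts[:, None, :] + anchors[None, :, :]).reshape(-1, 4)

base = make_anchors()
all_anchors = shift_anchors(base, feat_h=2, feat_w=3)
print(base.shape, all_anchors.shape)
```

With these assumed settings, each location contributes k = 9 anchors, so the reg layer would emit 36 values and the cls layer 18 scores per location.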
As many proposals highly overlap with each other, nonmaximum suppression (NMS) is applied to merge proposals that have high intersection-over-union (IoU). After NMS, the remaining proposals are ranked based on the object probability score, and only the top $N$ proposals are used for detection.
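A minimal sketch of this step is given here, using the common greedy form of NMS, which keeps the highest-scoring proposal and suppresses its overlapping neighbours. The function names are hypothetical, the IoU threshold of 0.7 is an assumed value following [21], and the default of 300 retained proposals matches the number of RPN proposals used in the experiments of this paper.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms_top_n(boxes, scores, iou_thresh=0.7, top_n=300):
    """Greedy NMS; returns indices of at most top_n boxes, best score first."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0 and len(keep) < top_n:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]  # drop heavily overlapping boxes
    return keep

# Hypothetical proposals: the first two overlap strongly, the third is separate.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms_top_n(boxes, scores, iou_thresh=0.5))
```

Because the number of proposals passed to the detector is capped, the downstream cost per frame is bounded regardless of how many cars are present.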
For training RPNs, each proposal is assigned a binary class label which indicates whether the proposal is an object (i.e., car) or just background. A positive training example is designated if the proposal overlaps with a ground-truth box with an IoU of more than a predefined threshold (0.7 in [21]), or if it has the highest IoU with a ground-truth box.
A proposal is assigned as a negative example if its maximum IoU is lower than the predefined threshold (0.3 in [21]) for all ground-truth boxes. Following the multitask loss in the Fast R-CNN network [20], the RPN is trained by a multitask loss, which is defined as

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{\text{cls}}} \sum_i L_{\text{cls}}(p_i, p_i^*) + \lambda \frac{1}{N_{\text{reg}}} \sum_i p_i^* L_{\text{reg}}(t_i, t_i^*), \tag{1}$$

where $i$ is the index of an anchor and $p_i$ is the predicted probability of anchor $i$ being an object. The ground-truth label $p_i^*$ is 1 if the anchor is positive and 0 if the anchor is negative. The multitask loss has two parts, a classification component $L_{\text{cls}}$ and a regression component $L_{\text{reg}}$. In (1), $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is the vector of the ground-truth box associated with a positive anchor. These two terms are normalized by $N_{\text{cls}}$ and $N_{\text{reg}}$ and weighted by a balancing parameter $\lambda$. In the released code [26], the cls term in (1) is normalized by the minibatch size (i.e., $N_{\text{cls}} = 256$), the reg term is normalized by the number of anchor locations (i.e., $N_{\text{reg}} \sim 2{,}400$), and $\lambda$ is set as 10.
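The structure of the multitask loss can be sketched numerically. This is an illustration only: it assumes, following [20] and [21], that the classification component is a two-class log loss and the regression component is the smooth L1 loss, and the function names are hypothetical. The normalization constants and the balancing weight default to the values quoted above.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss (assumed form of the regression component, after [20])."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """Multitask RPN loss over a minibatch of anchors.

    p      : (A,)   predicted objectness probabilities
    p_star : (A,)   labels, 1 for positive anchors and 0 for negative ones
    t      : (A, 4) predicted parameterized box coordinates
    t_star : (A, 4) ground-truth parameterized box coordinates
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)            # avoid log(0)
    l_cls = -(p_star * np.log(p) + (1 - p_star) * np.log(1 - p))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # The regression term is active only for positive anchors (p_star = 1).
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

# Hypothetical minibatch with one positive and one negative anchor.
p = np.array([0.5, 0.5])
p_star = np.array([1.0, 0.0])
t = np.array([[2.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
t_star = np.zeros((2, 4))
print(rpn_loss(p, p_star, t, t_star))
```

In this example the large regression error of the second anchor contributes nothing, since that anchor is labeled as background.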
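Bounding-box regression is used to find the best nearby ground-truth box of an anchor box. The parameterization of the 4 coordinates of an anchor is described as follows:

$$\begin{aligned} t_x &= (x - x_a)/w_a, & t_y &= (y - y_a)/h_a, \\ t_w &= \log(w/w_a), & t_h &= \log(h/h_a), \\ t_x^* &= (x^* - x_a)/w_a, & t_y^* &= (y^* - y_a)/h_a, \\ t_w^* &= \log(w^*/w_a), & t_h^* &= \log(h^*/h_a), \end{aligned} \tag{2}$$

where $x$, $y$, $w$, and $h$ denote the bounding box's center coordinates, width, and height, respectively. $x$, $x_a$, and $x^*$ are for the predicted box, anchor box, and ground-truth box, respectively. Similar definitions apply for $y$, $w$, and $h$.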
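As a worked illustration of this parameterization, the sketch below encodes a box relative to an anchor (center offsets normalized by the anchor size, width and height as log ratios, as in [21]) and inverts the mapping. The function names and the example boxes are hypothetical.

```python
import numpy as np

def encode(box, anchor):
    """Parameterize a box relative to an anchor.

    Both boxes are (x_center, y_center, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Invert encode(): recover (x_center, y_center, width, height)."""
    xa, ya, wa, ha = anchor
    return np.array([t[0] * wa + xa, t[1] * ha + ya,
                     wa * np.exp(t[2]), ha * np.exp(t[3])])

# Hypothetical anchor and ground-truth box, in pixels.
anchor = (100.0, 100.0, 40.0, 20.0)
gt_box = (110.0, 95.0, 80.0, 20.0)
t_star = encode(gt_box, anchor)
print(t_star)                    # offsets of 0.25 and -0.25, log(2), and 0
print(decode(t_star, anchor))    # recovers gt_box
```

Normalizing by the anchor size makes the regression targets comparable across anchors of different scales.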
The bounding-box regression is achieved by using features with the same spatial size on the feature maps. A set of $k$ bounding-box regressors is trained to adapt to varying sizes.
Since the RPN and the Fast R-CNN detector can share the same convolutional layers, these two networks can be trained jointly to learn a unified network through the following 4-step training algorithm: first, training the RPN as described above; second, training the detector network using proposals generated by the RPN trained in the first step; third, initializing RPN training with the detector network but training only the RPN-specific layers; and finally, training the detector network using the new RPN's proposals. Figure 4 shows two screenshots of car detection with the Faster R-CNN.

Experiments
4.1. Data Set Descriptions. The airborne platform used in this research is a DJI Phantom 2 quadcopter integrated with a 3-axis stabilized gimbal (see Figure 5).
Videos are collected by a GoPro Hero Black Edition 3 camera mounted on the UAV. The resolution of the videos is 1920 × 1080 and the frame rate is 24 frames per second (f/s). The stabilized gimbal is used to stabilize the videos and eliminate video jitter caused by the UAV, thereby greatly reducing the impact of external factors, such as wind. In addition, an On-Screen Display (OSD), an image transmission module, and a video monitor are installed in the system for data transmission and for monitoring and controlling the flying status.
A UAV image dataset is built for training and testing the proposed car detection framework. For training video collection, we followed two key suggestions: (1) collecting videos with cars of different orientations; (2) collecting videos with cars of a wide range of scales and aspect ratios. To collect videos with cars of different orientations, UAV videos were recorded at signalized intersections, since cars at intersections take different orientations while turning. To collect videos covering cars of a wide range of scales and aspect ratios, UAV videos were recorded at different flight heights, ranging from 100 m to 150 m. In this work, UAV videos were collected from two different signalized intersections. For each intersection, a 1-hour video was captured. In total, two hours of video were collected for building the training and testing datasets.
Particularly, we applied the VGG-16 model [25]. For the RPN of the Faster R-CNN, 300 RPN proposals were used. The source code of Faster R-CNN was from [26]. A GPU was used during the training. The main configurations of the computer used in this research are (i) CPU: Intel Core i7 hexa-core 5930K @ 3.5 GHz, 32 GB DDR4; (ii) Graphics card: Nvidia TITAN X, 12 GB GDDR5; (iii) Operating system: Linux (Ubuntu 14.04).
The training and detection implementation in this paper is all performed on the open source code released by the authors of Faster R-CNN [21].The inputs for training and testing are images with the original size (1920 × 1080) without any preprocessing steps.

Performance Evaluation
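The three accuracy indicators are defined as

$$\text{Correctness} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad \text{Completeness} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad \text{Quality} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}}, \tag{3}$$

where TP is the number of "true" detected cars; FP is the number of "false" detected objects which are non-car objects; and FN is the number of cars missed. In particular, Quality is considered the strictest criterion, as it accounts for both possible detection errors (false positives and false negatives).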
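The three accuracy indicators can be computed as in the following sketch, which assumes the standard definitions Correctness = TP/(TP + FP), Completeness = TP/(TP + FN), and Quality = TP/(TP + FP + FN). The function names and the raw counts in the first example are hypothetical. The second helper follows algebraically from these definitions and reproduces the reported Quality of Faster R-CNN from its reported Correctness and Completeness.

```python
def detection_metrics(tp, fp, fn):
    """Return (correctness, completeness, quality) from raw counts."""
    correctness = tp / (tp + fp)        # share of detections that are cars
    completeness = tp / (tp + fn)       # share of cars that are detected
    quality = tp / (tp + fp + fn)       # penalizes both error types
    return correctness, completeness, quality

def quality_from_rates(correctness, completeness):
    """Quality implied by the other two indicators.

    From the definitions: 1/Quality = 1/Correctness + 1/Completeness - 1.
    """
    return 1.0 / (1.0 / correctness + 1.0 / completeness - 1.0)

# Hypothetical counts, for illustration only.
print(detection_metrics(tp=90, fp=10, fn=10))

# Reported Faster R-CNN values: Correctness 98.43%, Completeness 96.40%.
print(round(quality_from_rates(0.9843, 0.9640), 4))   # 0.9494
```

Because Quality is never larger than either of the other two indicators, it is the most conservative single summary of detection accuracy.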
As ViBe [6] and frame difference [7] are sensitive to background motions, image registration [28] is applied first to compensate for UAV motion and remove UAV video jitter. The time for image registration is included in the detection time for these two methods. The performance indicators are calculated on the same 100 images of the testing dataset. Note that, for ViBe and Frame Difference, the postprocessing of blob segmentation results is very important for the final car detection accuracy, as blob segmentation using ViBe and Frame Difference may yield segmentation errors. In this work, two rules are designed to screen out segmentation errors: (1) the area of a detected blob is too large (more than 2 times that of a normal passenger car) or too small (smaller than 1/2 of that of a normal passenger car); (2) the aspect ratio of the minimum enclosing rectangle of a detected blob is larger than 2. The area of a normal passenger car was measured manually. If either of the two rules is met, the detected blob is screened out as a segmentation error. The V-J [9] and HOG + SVM [10] methods are trained on 12,240 positive samples and 30,000 negative samples. These 12,240 positive samples only contain cars oriented in the horizontal direction. Besides, all positive samples are normalized to a compressed size of 40 × 20. The performance evaluations of Faster R-CNN, V-J, and HOG + SVM are run on our testing dataset (100 images, 3,115 testing samples).
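The two screening rules for ViBe and Frame Difference blobs can be sketched as a simple predicate. The function name is hypothetical, and two interpretations are assumptions: "too large" is read as an area above twice the reference car area, and the aspect ratio is taken as the longer side of the enclosing rectangle divided by the shorter side.

```python
def is_segmentation_error(blob_area, rect_w, rect_h, car_area):
    """Return True if a detected blob should be screened out.

    blob_area      : area of the detected blob, in pixels
    rect_w, rect_h : sides of the blob's minimum enclosing rectangle
    car_area       : manually measured area of a normal passenger car
    """
    # Rule 1: area more than 2x, or less than 1/2 of, a normal car.
    too_large = blob_area > 2.0 * car_area
    too_small = blob_area < 0.5 * car_area
    # Rule 2: aspect ratio (assumed long side / short side) larger than 2.
    long_side = max(rect_w, rect_h)
    short_side = min(rect_w, rect_h)
    bad_aspect = short_side <= 0 or long_side / short_side > 2.0
    return too_large or too_small or bad_aspect

# Hypothetical values: a 40 x 20 pixel car footprint as the reference.
car_area = 40 * 20
print(is_segmentation_error(800, 40, 20, car_area))    # False: plausible car
print(is_segmentation_error(2000, 50, 40, car_area))   # True: too large
print(is_segmentation_error(800, 60, 20, car_area))    # True: too elongated
```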

Experiment Results.
The testing results of the five methods are presented in Table 1. The detection speed is the average over the 100 tested images. To comprehensively evaluate the performance of different algorithms on both CPU and GPU architectures, detection speeds for V-J, HOG + SVM, and Faster R-CNN were tested on the i7 CPU and the high-end GPU, respectively.
The results show that Faster R-CNN achieved the best Quality (94.94%) compared with the other four methods. ViBe and Frame Difference achieved fast detection speed under CPU mode but with very low Completeness. The reason is that many stopped cars (such as cars waiting at the traffic light) are recognized as background objects, which generates many false negatives and leads to a low Completeness. Those stopped cars can only be detected once they move again. As many moving non-car objects (such as tricycles and moving pedestrians) lead to false positives, the Correctness of these two methods is low (76.64% and 78.17%, respectively).
Although the two object detection schemes V-J and HOG + SVM are insensitive to image background motions compared with ViBe and Frame Difference, the Completeness of these two methods is also as low as 41.61% and 42.89%, respectively, which is only slightly higher than that of ViBe and Frame Difference. The reason, as mentioned in Section 1, is that both V-J and HOG + SVM are sensitive to objects' in-plane rotation. Only cars in the same orientation as the positive training samples can be detected. In this paper, only cars in the horizontal direction can be detected by these two methods. A sensitivity analysis of the impact of cars' in-plane rotation is provided in the Discussion.
Faster R-CNN achieved the best performance (Quality, 94.94%) among all five methods. As Faster R-CNN can learn the information of orientation, aspect ratio, and scale during training, this method is not sensitive to cars' in-plane rotation and scale variations. Therefore, Faster R-CNN achieves high Correctness (98.43%) and Completeness (96.40%).
Though Faster R-CNN achieved 2.1 f/s under GPU mode, which is slower than other methods, 2.1 f/s can still satisfy real-time applications.

Robustness to Illumination Changing Condition.
For car detection from UAV videos, one of the most challenging issues is changing illumination. Our testing dataset (100 images, 3,115 testing samples) does not contain such scenes, for example, cars traveling from an illuminated (or shadowed) area to a shadowed (or illuminated) area. Therefore, we further conducted an experiment using a 10 min long video captured under illumination changing condition to evaluate the performance of the Faster R-CNN (see Figure 6).
The testing results are highlighted in Table 2. The results show that Faster R-CNN achieved a Completeness of 94.16%, which is slightly lower than that in Table 1 (96.40%), due to the oversaturation of the image sensor under strong illumination. The Correctness of Faster R-CNN is 98.26%. The results shown in Table 2 confirm that illumination changing condition has little impact on the accuracy of vehicle detection using Faster R-CNN. ViBe and Frame Difference achieved higher Quality than that shown in Table 1. That is because this test scene is an arterial road (see Figure 6), where most cars were running fast along the road; therefore these moving cars can be easily detected by ViBe and Frame Difference. However, many black cars whose color is similar to that of the road surface, as well as cars under strong illumination, could not be detected; therefore, the Completeness of ViBe and Frame Difference is still low (67.90% and 64.69%, respectively). The V-J and HOG + SVM methods achieved higher Completeness (81.36% and 82.38%, respectively) than those shown in Table 1 (41.61% and 42.89%, respectively), because most of the cars in this testing scene (see Figure 6) are oriented in the horizontal direction and can thus be successfully detected by V-J and HOG + SVM. However, the Completeness of these two methods is significantly lower than that of the Faster R-CNN. As argued by some research [29], methods like the V-J method are sensitive to lighting conditions.

Sensitivity to Vehicles' In-Plane Rotation.
As mentioned in Section 1, methods like V-J and HOG + SVM are sensitive to vehicles' in-plane rotation. As the vehicle orientations are generally unknown in UAV images, the detection rates (Completeness) of different methods may be affected significantly by the vehicles' in-plane rotation.
To analyze the sensitivity of different methods to vehicles' in-plane rotation, experiments are conducted on a dataset which contains vehicles oriented in different directions (see Figure 7). The dataset contains 5 groups of images; each group contains 19 images in which the vehicles are oriented at 0°, 5°, 10°, ..., 85°, 90°, at an interval of 5°.
From Figure 8 we can see that the Completeness of V-J degrades significantly as the vehicles' orientation exceeds 10 degrees. Compared to V-J, HOG + SVM is less sensitive to vehicles' in-plane rotation, but the Completeness of HOG + SVM still degrades significantly when the vehicles' orientation exceeds about 45 degrees. Compared with V-J and HOG + SVM, Faster R-CNN is insensitive to vehicles' in-plane rotation (the red curve in Figure 8). The reason is that the Faster R-CNN can automatically learn the information of orientation, aspect ratio, and scale of vehicles from the vehicle training samples during training.

Sensitivity of Detection Speed to Different Detection Load.
Detection speed is crucial for real-time applications. Detection speed can be easily affected by many factors, such as the detection load (i.e., the number of detected vehicles in one image), hardware configuration, and video resolution. Among these factors, the most important is detection load.
To comprehensively explore the speed characteristics of Faster R-CNN, experiments on images which contain different numbers of detected vehicles have been conducted (see Figure 9). The other four methods are also included for comparison. To fairly evaluate the detection speed of different algorithms on different architectures, the speed tests are performed on the i7 CPU and the high-end GPU, respectively. We explored the detection speed on the i7 CPU for all five methods (see Figure 9) and the detection speed on the GPU for V-J, HOG + SVM, and Faster R-CNN (see Figure 10).
From Figure 9 we can see that the detection speeds of V-J and HOG + SVM decrease monotonically as the number of detected vehicles increases. The V-J method decreases faster than HOG + SVM as the number of detected vehicles increases. The speed curves of ViBe and Frame Difference are not smooth, but we can see that the increase of the number of detected vehicles has little influence on the detection speed of these two methods.
The detection speed of Faster R-CNN was very slow under CPU mode (see Figure 9). Under GPU mode (see Figure 10), the detection speed of Faster R-CNN was about 2 f/s. From Figures 9 and 10, we can find that the Faster R-CNN has a speed characteristic similar to that of ViBe and Frame Difference but with a smooth speed curve. The detection load has almost no influence on the detection speed of Faster R-CNN. The reason is that when detecting vehicles using Faster R-CNN, the method is applied on the entire image. In the proposal region generation stage, 2000 RPN proposals are generated from the original image [21]. The top-300 ranked proposal regions are fed into the Fast R-CNN [20] to check whether each proposal region contains a car. The computational cost is almost the same for each frame; therefore, the detection speed of Faster R-CNN is nearly insensitive to detection load.
As shown in Table 3, training the AdaBoost method using Haar-like features (V-J) on 12,240 positive samples and 30,000 negative samples takes about 6.8 days. The training procedure was only run on the CPU without parallel computing or other acceleration schemes. The linear SVM classifier with HOG features (HOG + SVM) has the fastest training speed among the three methods. It only takes 5 minutes on the same training set as the V-J method. Although HOG + SVM has the fastest training speed, its detection performance is significantly lower than that of Faster R-CNN (see Table 1). The training of Faster R-CNN takes about 21 hours to complete. For practical applications, 21 hours is acceptable, as the annotation of training samples may take several days. For example, in this paper, the annotation of the whole dataset (12,240 training samples and 3,115 testing samples, 500 images in total) using the tool LabelImg [27] took two research fellows 4 days.

Concluding Remarks
Inspired by the impressive performance achieved by Faster R-CNN on object detection, this research applied this method for passenger car detection from low-altitude UAV imagery. The experimental results demonstrate that Faster R-CNN achieves the highest Completeness (96.40%) and Correctness (98.43%) with real-time detection speed (2.10 f/s), compared with four other popular vehicle detection methods.
Our tests further demonstrate that Faster R-CNN is robust to illumination changes and cars' in-plane rotation; therefore, Faster R-CNN can be applied for vehicle detection from both static and moving UAV platforms. Besides, the detection speed of Faster R-CNN is insensitive to the detection load (i.e., the number of detected vehicles). The training of the Faster R-CNN network takes about 21 hours, which is acceptable for practical applications.
It should be emphasized that this research provides a rich comparison of different vehicle detection techniques, covering many aspects of object detection that are usually only partially covered in object detection papers: detection rate without in-plane rotation, sensitivity to in-plane rotation, detection speed, sensitivity to the number of vehicles in the image, and training cost. This paper tries to guide the readers to choose the best framework according to their applications.
However, due to the lack of enough training samples, this research only tested the Faster R-CNN network for passenger cars. Future research will expand this method to the detection of other transportation modes such as buses, trucks, motorcycles, and bicycles.

Figure 1: Car detection framework with the Faster R-CNN.

4.3.1. Evaluation Indicator. The performance of car detection by Faster R-CNN is evaluated by four typical indicators: detection speed (frames per second, f/s), Correctness, Completeness, and Quality, as defined in (3).

Figure 6: Car detection under illumination changing condition using Faster R-CNN.

Figure 9: Sensitivity of detection speed to different detection load (tested on i7 CPU).

Figure 10: Sensitivity of detection speed to different detection load (tested on GPU).

Table 1: Car detection results.

Table 2: Vehicle detection under illumination changing condition.
Comparison. When applying the Faster R-CNN for vehicle detection, one important issue that should be considered is the computational cost of the training procedure. As the training samples may change, it is necessary to efficiently update the Faster R-CNN model to satisfy the requirements of vehicle detection. The training costs of three different methods are shown in Table 3. Because the open source code of Faster R-CNN only supports training under GPU mode, only the training time under GPU mode is provided. For V-J and HOG + SVM, as the open source code only supports CPU mode, only the training time under CPU mode is provided.