DAR-Net: Dense Attentional Residual Network for Vehicle Detection in Aerial Images

With the rapid development of deep learning and the wide use of Unmanned Aerial Vehicles (UAVs), CNN-based algorithms for vehicle detection in aerial images have been widely studied in the past several years. Vehicle detection in aerial images is a downstream task of general object detection, but it differs from general object detection in ground view images in several respects, e.g., larger image areas, smaller target sizes, and more complex backgrounds. In this paper, to improve the performance of this task, a Dense Attentional Residual Network (DAR-Net) is proposed. The proposed network employs a novel dense waterfall residual block (DW res-block) to preserve spatial information and extract high-level semantic information at the same time. A multiscale receptive field attention (MRFA) module is also designed to select the informative features from the feature maps and enhance the ability of multiscale perception. Based on the DW res-block and the MRFA module, and to protect the spatial information, the proposed framework adopts a new backbone that downsamples the feature map only 3 times; i.e., the total downsampling ratio of the proposed backbone is 8. These designs alleviate the degradation problem, improve the information flow, and strengthen feature reuse. In addition, downprojection units are used to reduce the impact of information loss caused by downsampling operations, and identity mapping is applied to each stage of the proposed backbone to further improve the information flow. The proposed DAR-Net is evaluated on the VEDAI, UCAS-AOD, and DOTA datasets. The experimental results demonstrate that the proposed framework outperforms other state-of-the-art algorithms.


Introduction
Object detection, as an important topic in computer vision, aims to precisely localize the targets in given images and classify each target. This topic is of broad interest for potential applications such as face detection, pedestrian counting, automatic driving, and vehicle detection [1].
Before the emergence of deep learning, most traditional object detection algorithms, which are based on hand-crafted features, could be roughly divided into three steps: region selection, feature vector extraction, and region classification. In the region selection step, the input image is usually scanned by multiscale sliding windows to find the locations that may contain targets. The locations selected in this step are called candidate regions. During the feature extraction step, low-level visual descriptors such as SIFT [2], HOG [3], or SURF [4] are used to extract and encode semantic information from each candidate region. In the final step, the encoded feature vectors of each candidate region are classified by classifiers such as SVM [5]. Although object detection algorithms based on traditional hand-crafted features have made some breakthroughs in detection accuracy, two limitations cannot be ignored. Firstly, they inevitably generate many redundant candidate regions during the region selection step, which leads to an imbalanced class distribution during the region classification step. Secondly, hand-crafted feature extraction algorithms are not capable of capturing high-level semantic information, and the low-level information they extract is not sufficient for complex localization and classification problems. Because of these limitations, traditional object detection algorithms are generally time-consuming and inaccurate.
Recently, with the development of computer science and hardware technologies, as well as abundant data resources, more and more researchers give their top research priority to deep learning based object detection algorithms. Generally speaking, deep learning based object detection algorithms can be classified into two categories: two-stage detection algorithms, such as R-CNN families [6][7][8][9], FPNs [10][11][12], and their variants [13][14][15], and one-stage detection algorithms, such as YOLO families [16][17][18][19], SSD families [20,21], and their variants [22][23][24]. Similar to traditional object detection algorithms, two-stage detection algorithms also contain region selection and region classification steps. However, different from traditional detection algorithms, two-stage detection algorithms utilize convolutional neural networks (CNN) to generate hierarchical feature maps including high-level semantic information in the feature extraction step. Without the region selection step, one-stage algorithms directly detect targets in different locations by bounding box regression processes and classification processes. Usually, two-stage algorithms achieve better results on benchmark detection tasks, while one-stage algorithms can achieve faster processing speed.
Powered by advanced remote sensing technologies and the wide use of UAVs, vehicle detection in aerial images, as a downstream research direction of object detection, has become indispensable in many important applications such as disaster relief, population density estimation, parking lot planning, and traffic monitoring. However, the algorithms that detect targets in ground view images cannot be directly used for vehicle detection in aerial images because of the following differences between ground view images and aerial images:
(1) Generally speaking, aerial images cover much larger areas than ground view images, while the targets in aerial images are much smaller than the targets in ground view images. This makes it very difficult to extract information, especially spatial information, about targets from aerial images, which in turn makes it difficult to localize the targets correctly.
(2) Due to the vertical shooting angle of aerial images, the texture of vehicles in aerial images is relatively simple. As a result, background objects such as buildings are easily confused with vehicles. In addition, the vehicles in aerial images usually appear with arbitrary orientations. These special characteristics make it difficult to classify the targets correctly.
(3) The number of targets in an aerial image is normally larger than the number of targets in a ground view image, which also makes detection more difficult.
Considering the difficulties described above, a new one-stage vehicle detection framework for aerial images is proposed in this paper. The main contributions of this paper are listed as follows:
(1) A novel residual block, named the dense waterfall residual block (DW res-block), is proposed. In each DW res-block, partial information from each convolutional layer is transmitted to all subsequent layers; in other words, each layer can obtain partial information from all preceding layers. Because of this design, the proposed framework can preserve low-level spatial information and extract high-level information simultaneously during the feature extraction stage.
(2) A multiscale receptive field attention (MRFA) module is proposed and plugged into the DW res-block. The MRFA module generates multiscale receptive field feature maps by using dilated convolution and fuses these feature maps with the attention-weighted feature maps. The attention module selects the informative features from the feature maps and enhances the multiscale perception ability of the proposed framework.
(3) Utilizing the DW res-block and the MRFA module, a backbone for vehicle detection in aerial images is proposed. The backbone extracts semantic information from high-resolution feature maps to alleviate the loss of spatial information. Besides, downprojection units and transition layers are employed in the backbone to reduce the impact of information loss caused by downsampling and to improve the information flow of the proposed framework, respectively.

Related Works
Due to the rapid improvement of deep learning and the wide utilization of UAVs in recent years, many CNN-based detection algorithms for aerial images have been proposed. Some typical algorithms are reviewed in this section.
In 2017, to address the limitations of directly using Faster R-CNN for vehicle detection in aerial images, T. Tang et al. proposed an improved detection framework based on Faster R-CNN [25]. In that paper, T. Tang [27]. The DFL-CNN framework combined the low-level features and high-level features by using skip-connection. Besides, Yang MY et al. also applied focal loss to DFL-CNN to solve the problem caused by imbalanced numbers of positive and negative targets. Rotation Dense Feature Pyramid Networks (R-DFPN) were proposed in [28] [29]. The proposed algorithm also used rotation region proposals to improve the localization accuracy of object detection in aerial images.
In 2019, Pang J. et al. proposed a remote sensing region-based convolutional neural network (R2-CNN) [30] for small object detection in aerial images. In that paper, a new residual structure called Tiny-Net containing a global attention block was designed to suppress false positives caused by objects belonging to the background. Li C. et al. proposed learning an object-wise semantic representation for object detection in aerial images in [31]. In the proposed algorithm, proposal detection was guided by a semantic segmentation module. Mandal et al. proposed a one-stage vehicle detection network (AVDNet) for small vehicle detection in aerial images in [32]. In the AVDNet paper, ConvRes residual blocks were designed to retain fine-grained features in deep convolutional layers.
In 2020, Wu et al. proposed a novel geospatial object detection framework, called Fourier-based rotation-invariant feature boosting (FRIFB) [33]. In that paper, the rotation-invariant FourierHOG, ACF, FPGM, and boosting learning were integrated to achieve an effective and robust framework. The corresponding rotation-invariant channel maps were obtained by the FourierHOG algorithm and subsequently refined by ACF against object rotation and shift. Extensive experiments led to the conclusion that the proposed method is robust to object rotation. Shen et al. proposed a lightweight deep convolutional network for vehicle detection in aerial images [34]. In that paper, a new aerial vehicle image dataset was also published. Zhou et al. proposed an anchor-free polar remote sensing object detector (P-RSDet) [35]. In that study, the authors used a polar coordinate system rather than Cartesian coordinates for arbitrary-oriented object detection. To make algorithms focus on the targets in an image, Chen et al. proposed a novel multiscale spatial and channel-wise attention (MSCA) mechanism [36]. MSCA paid more attention to the spatial areas and the feature channels related to the foreground. Furthermore, MSCA can be easily plugged into classic deep learning based detection frameworks. Wang B. et al. proposed an Improved FBPN Based Detection Network for small object detection in aerial images. In that paper, an improved feature-balanced pyramid network (FBPN) [37] was designed to balance the high-level and low-level feature maps.
In 2021, Yi et al. proposed an oriented keypoint-based detection framework to solve the class imbalance problem of anchor-based detection algorithms [38]. In that paper, the horizontal keypoint-based detection algorithm was extended to an oriented keypoint-based object detection framework, and box boundary-aware vectors (BBAVectors) were proposed to describe the oriented bounding box. The experiments showed that BBAVectors can achieve better object detection performance in aerial images. Li et al. proposed an efficient detection framework called simple convolutional neural networks (simple-CNNs) in [39], which can be directly applied to real-world applications. In that paper, a new loss function, namely, the change-IOU Loss (CI-Loss), was designed to improve the detection performance with the target position information.

Methods
To deal with complicated vision-based applications, researchers tend to increase the depth of convolutional neural networks to obtain stronger information perception and learning ability. However, before residual blocks were proposed, the maximum depth of mainstream convolutional neural networks was restricted to relatively small numbers because of the degradation problem. For example, Alex-Net [40] [43] which stack multiple residual blocks and construct the identity mapping by shortcuts.
This design can effectively solve the degradation problem, which leads to much deeper networks (e.g., 53 or 101 3 × 3 convolutional layers) [18,43] compared with other kinds of convolutional neural networks. Driven by various residual blocks, deep learning based vision algorithms have developed rapidly, especially in the field of object detection. However, for vehicle detection in aerial images, the state-of-the-art residual networks have their limitations. Because the targets are significantly smaller and the background areas are larger and more complicated than in general object detection, high-level semantic information and low-level spatial information are equally important for vehicle detection in aerial images. Although deeper residual networks have the advantage of exploring deeper features that contain rich semantic information, the spatial information contained in shallower features is easily corrupted and lost during processing. As a result, the state-of-the-art general object detectors usually do not perform well in this specific application. To protect low-level spatial features and explore high-level semantic features at the same time, a novel framework designed for vehicle detection in aerial images is proposed in this paper. The overall architecture of the proposed framework is shown in Figure 1. Details of the proposed framework are introduced as follows.
Section 3.1 introduces the dense waterfall residual block (DW res-block); Section 3.2 introduces the proposed multiscale receptive field attention (MRFA) module; in Section 3.3, the DW res-block is applied with MRFA module to compose an attentional dense waterfall residual block; the backbone of the proposed framework is introduced in Section 3.4.

Dense Waterfall Residual Block.
ResNet strengthens the learning ability of the network by increasing the depth of the network. It combines earlier layers with later layers by elementwise summation, which may lead to information contamination during the process of information flow. Different from ResNet, DenseNet [44] proposed a new connection strategy. DenseNet directly connects each layer to all preceding layers of the same dense block by concatenation. According to [44], this connection strategy greatly strengthens feature reuse, improves the information flow, substantially reduces memory usage, and makes the network easier to train. Inspired by DenseNet [44], a new residual block named dense waterfall residual block (DW res-block) is proposed in this section. The proposed DW res-block keeps the identity mapping of ResNet. In each block, the subsets of feature maps generated by each layer are densely concatenated with subsets of feature maps generated by preceding layers and then fed into the subsequent layers. As a result, each convolutional layer within the proposed block can obtain subsets of all feature maps generated by the preceding layers of the same block. For example, the last layer in each block can obtain subsets of the outputs of the 3 preceding convolutional layers in the same block. Thus, the proposed DW res-block can preserve the shallow spatial information and extract high-level semantic information simultaneously, which is important for object detection in aerial images. Because the inner features are connected densely and the overall architecture of the proposed block looks like a waterfall, the proposed block is named the dense waterfall residual block (DW res-block). The structure of the DW res-block is shown in Figure 2. The supplementary description can be found in the box in the top-left corner.
Computational Intelligence and Neuroscience
As shown in Figure 2, the rectangular boxes represent the feature maps, and the width of each rectangular box depends on the channel number of the corresponding feature map. The marks 1 × 1 and 3 × 3 near the connecting lines represent the convolutional layers. Each convolutional layer is followed by a rectified linear unit (ReLU) function. $S^i_j$ denotes a subset of the corresponding feature map, which is split along the channel axis: $i$ denotes the number of times the subset has been processed by 3 × 3 convolutional layers, and $j$ denotes the subset number within the corresponding feature map. The dimension of the output of the proposed block is kept the same as that of the input feature map. The proposed block first decreases the channel dimension of the input feature map to 1/4 of the original by a 1 × 1 convolutional layer. Then, the output feature map $S^0_1$, which has only one subset, is processed by a 3 × 3 convolutional layer, and the output feature map $S^1$ is immediately split into two subsets $S^1_1$ and $S^1_2$ along the channel axis. Following this, $S^1$ is concatenated with $S^0_1$ along the channel axis, and the concatenated feature map is processed by a 3 × 3 convolutional layer to obtain the output $S^2$. $S^2$ is split into three subsets $S^2_1$, $S^2_2$, and $S^2_3$. After that, $S^0_1$, $S^1_1$, and $S^2$ are concatenated together and processed by a 3 × 3 convolutional layer to obtain the feature map $S^3$. Then $S^0_1$, $S^1_1$, $S^2_1$, and $S^3$ are concatenated together, and the channel dimension is reduced to that of the input feature map of the block by a 1 × 1 convolutional layer. Finally, the output of the 1 × 1 convolutional layer is combined with the input of the block by an elementwise summation to obtain the output of the proposed block.
As shown in Figure 2, the computation process of the proposed DW res-block can be expressed as follows:
$$ y = f(x, \{W_i\}) + x \tag{1} $$
where $x$ and $y$ denote the input and output feature maps of the proposed residual block, respectively, $W_i$ denotes the weight matrices, and $f$ represents the proposed residual mapping process.
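The wiring of the block can be traced with a short channel-bookkeeping sketch. It performs no convolution and only counts channels at each concatenation. The text fixes the 1/4 bottleneck and the number of subsets per split, but not the widths of the 3 × 3 outputs or the subset sizes, so the equal-width choices below are assumptions made for illustration, and the function name `dw_res_block_channels` is hypothetical.

```python
def dw_res_block_channels(c_in):
    """Trace channel counts through one DW res-block, following the text.

    Assumptions (not fixed by the text): every 3x3 convolution outputs
    c_in // 4 channels, and every split produces equal-sized subsets.
    """
    if c_in % 24 != 0:
        raise ValueError("c_in must be a multiple of 24 so that c_in // 4 "
                         "splits evenly into 2 and 3 subsets")
    mid = c_in // 4                 # 1x1 conv reduces channels to 1/4
    s0_1 = mid                      # S^0_1: reduced map, a single subset
    s1 = mid                        # S^1: output of the first 3x3 conv
    s1_1 = s1 // 2                  # S^1_1: one of the two subsets of S^1
    conv2_in = s1 + s0_1            # concat(S^1, S^0_1)
    s2 = mid                        # S^2: output of the second 3x3 conv
    s2_1 = s2 // 3                  # S^2_1: one of the three subsets of S^2
    conv3_in = s0_1 + s1_1 + s2     # concat(S^0_1, S^1_1, S^2)
    s3 = mid                        # S^3: output of the third 3x3 conv
    fuse_in = s0_1 + s1_1 + s2_1 + s3   # concat fed to the final 1x1 conv
    return {
        "bottleneck": mid,
        "conv3x3_2_in": conv2_in,
        "conv3x3_3_in": conv3_in,
        "conv1x1_out_in": fuse_in,
        "block_out": c_in,          # final 1x1 conv restores c_in, then + input
    }


if __name__ == "__main__":
    print(dw_res_block_channels(96))
```

Under these assumptions, the input width of each 3 × 3 layer grows as more preceding subsets are concatenated, while the output width of the block always matches its input so that the elementwise summation with the identity branch is well defined.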

Multiscale Receptive Field Attention Module.
For human visual perception, the attention mechanism represents the process of human eyes concentrating on 'what' or 'where' in a given scene. For computer vision tasks, the attention mechanism relates to the process of channel or spatial selection on a given feature map, corresponding to the 'what' and 'where' of its human visual counterpart. Due to the complex backgrounds and relatively small foreground objects, it is difficult to distinguish the foreground and background regions in an aerial image. Therefore, inspired by SE-block [45], CBAM [46], and MSCA [36], a multiscale receptive field attention (MRFA) module is proposed in this section to provide more information about the categories and positions of foreground objects in an input aerial image. Different from MSCA [36], the MRFA module generates the spatial attention map from the interspatial relationship of multiscale receptive field feature maps by using average-pooling and max-pooling operations along the channel axis. In addition, to strengthen the ability of multiscale perception, the multiscale receptive field feature maps are concatenated with the attention-weighted feature in the MRFA module. The detailed structure of MRFA is shown in Figure 3. The MRFA module consists of two branches: the channel attention module, which is shown in the blue box, and the spatial attention module, which is shown in the orange box. Some supplementary description is shown in the bottom-left corner of Figure 3.

Channel Attention Module.
In the channel attention module, as shown in the blue box in Figure 3, global max-pooling (GMP) and global average-pooling (GAP) operations are both used to extract channel attention information. The GMP and GAP features are then processed using a multilayer perceptron (MLP) module. Finally, the output attention features are combined by elementwise summation. The process can be generally expressed as the following equation:
$$ \omega_c = \sigma\big(\mathrm{MLP}(\mathrm{GMP}(X)) + \mathrm{MLP}(\mathrm{GAP}(X))\big) \tag{2} $$
where $X \in \mathbb{R}^{C \times H \times W}$ denotes the input feature of the attention module, and $C$, $H$, and $W$ represent the number of channels, height, and width of the feature map, respectively. $\omega_c \in \mathbb{R}^{C \times 1 \times 1}$ represents the output attention weights of the channel attention module. The GMP operation can be computed using the following equation:
$$ \mathrm{GMP}(X)_c = \max_{1 \le h \le H,\ 1 \le w \le W} X_{c,h,w} \tag{3} $$
and the GAP operation is computed as follows:
$$ \mathrm{GAP}(X)_c = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} X_{c,h,w} \tag{4} $$
After expanding the process of the MLP, the whole computation process of the channel attention module can be expressed as
$$ \omega_c = \sigma\big(W^0_1\,\mathrm{ReLU}(W^0_0\,\mathrm{GMP}(X)) + W^1_1\,\mathrm{ReLU}(W^1_0\,\mathrm{GAP}(X))\big) \tag{5} $$
where $\omega_c$ represents the channel attention weights, $\sigma$ denotes the sigmoid activation function, and $W^0_0 \in \mathbb{R}^{C/r \times C}$, $W^0_1 \in \mathbb{R}^{C \times C/r}$, $W^1_0 \in \mathbb{R}^{C/r \times C}$, and $W^1_1 \in \mathbb{R}^{C \times C/r}$ denote the weights of the MLP. The parameter $1/r$ denotes the reduction ratio in the bottleneck of the MLP. Here, $r$ is set to 16. ReLU denotes the rectified linear unit function.
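The channel attention branch can be sketched in a few lines of dependency-free Python on nested lists. This is an illustration rather than the authors' implementation: it assumes a separate two-layer MLP per pooling branch (the pairing of weight matrices with the GMP and GAP branches is an assumption), uses no bias terms, and all function names are hypothetical.

```python
import math


def global_max_pool(x):
    """x: list of C channels, each an H x W nested list -> list of C values."""
    return [max(max(row) for row in ch) for ch in x]


def global_avg_pool(x):
    """Mean over the spatial positions of every channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def mlp(v, w_down, w_up):
    """Bottleneck MLP: (C/r x C) projection, ReLU, then (C x C/r) projection."""
    hidden = [max(0.0, h) for h in matvec(w_down, v)]
    return matvec(w_up, hidden)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(x, w_max, w_avg):
    """Return one weight in (0, 1) per channel.

    w_max and w_avg are (w_down, w_up) pairs for the GMP and GAP branches.
    """
    a = mlp(global_max_pool(x), *w_max)
    b = mlp(global_avg_pool(x), *w_avg)
    return [sigmoid(p + q) for p, q in zip(a, b)]


if __name__ == "__main__":
    # Toy input with C = 2, H = W = 2 and reduction r = 2 (the text uses r = 16).
    x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
    weights = ([[0.5, 0.5]], [[1.0], [-1.0]])
    print(channel_attention(x, weights, weights))
```

The resulting per-channel weights would then be broadcast over the spatial dimensions when they are multiplied with the feature map.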

Spatial Attention Module.
Different from channel attention, the spatial attention module focuses on extracting useful position information. For vehicle detection in aerial images, each target occupies only a few pixels in the input image. Besides, the background can be more complex and confusing in aerial images. Thus, effectively applying a spatial attention mechanism is both difficult and important for vehicle detection in aerial images. To select the useful information from complex background areas of the input image, a multiscale receptive field spatial attention module is proposed. In the proposed module, dilated convolution is used for extracting multiscale receptive field information. Generally, dilated convolution can effectively expand the receptive field of a network without increasing the computation cost relative to a standard convolution with the same kernel size. Besides, this strategy can effectively avoid the extraction of redundant information from the original feature map. More details are given as follows.
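The claim that dilation enlarges the receptive field at no extra cost can be checked with the standard formula for the spatial extent of a dilated kernel, k + (k - 1)(d - 1). The sketch below is illustrative: the helper names are hypothetical, and the dilation rates 1, 2, and 3 are example values only, since the rates used by the module are not listed here.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size, in_channels):
    """Number of weights in one 2D filter; independent of the dilation rate."""
    return kernel_size * kernel_size * in_channels


if __name__ == "__main__":
    for d in (1, 2, 3):  # example rates, chosen for illustration
        print(d, effective_kernel_size(3, d), weights_per_filter(3, 64))
```

A 3 × 3 kernel thus covers a 3 × 3, 5 × 5, or 7 × 7 window for these rates while keeping the same nine weights per input channel.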
In the spatial attention module, multiscale receptive field information is extracted by using dilated convolution with different dilation rates. Then, the output feature maps are processed by max-pooling and average-pooling along the channel axis, respectively. Finally, the pooled feature maps are concatenated together and processed using a 3 × 3 convolution layer to generate the spatial attention weights. The calculation process is given by equation (6), where $\omega_s \in \mathbb{R}^{1 \times H \times W}$ represents the spatial attention weights, $\sigma$ denotes the sigmoid activation function, and $\mathrm{DC}^{3 \times 3}_{1}$ represents the dilated convolution operation (its superscript denotes the size of the convolution kernel and its subscript denotes the dilation rate). $[\cdot, \cdot]$ represents the concatenation operation. Here, the output channel dimension of each DC operation is reduced to $1/r$ of that of the input feature map, where $r$ is set to 16, the same as in the channel attention module. $C^{3 \times 3}$ represents the convolution operation, and its superscript also denotes the size of the convolution kernel. Each convolution operation and dilated convolution operation is followed by a rectified linear unit (ReLU) function. S-pool (spatial pooling) represents max-pooling and average-pooling operations along the channel axis followed by a concatenation operation. The process of S-pool can be expressed as the following equation:
$$ \text{S-pool}(X) = [\mathrm{MaxPool}_c(X), \mathrm{AvgPool}_c(X)] \tag{7} $$
where $\mathrm{MaxPool}_c$ and $\mathrm{AvgPool}_c$ represent the max-pooling and the average-pooling operations along the channel axis, and the subscript $c$ denotes the channel dimension.
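One reading of the spatial attention computation that is consistent with the description above can be written with $K$ dilated branches. The number of branches $K$ and the rates $d_1, \dots, d_K$ are placeholders, because the rates actually used are not listed in the text, and applying S-pool after concatenating the branches (rather than to each branch separately) is likewise an assumption.

```latex
\omega_s = \sigma\!\left( C^{3\times 3}\!\left( \text{S-pool}\!\left(
  \left[ \mathrm{DC}^{3\times 3}_{d_1}(X), \ldots, \mathrm{DC}^{3\times 3}_{d_K}(X) \right]
\right) \right) \right)
```

Under this reading, each dilated branch outputs $C/r$ channels, the concatenation has $KC/r$ channels, S-pool reduces it to a $2 \times H \times W$ map, and the final 3 × 3 convolution followed by the sigmoid yields $\omega_s \in \mathbb{R}^{1 \times H \times W}$.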

The Fusion of Channel Attention and Spatial Attention Module.
In the proposed framework, to protect the spatial information of shallow features, the proposed backbone contains only 3 downsampling operations; in other words, the total downsampling ratio of the proposed backbone is 8 (more details are given in Section 3.4). However, a small downsampling ratio leads to a small receptive field, which may not be sufficient for distinguishing vehicles from complex backgrounds in aerial images. For this reason, the multiscale receptive field feature maps are concatenated with the attention-weighted feature map to collect multiscale receptive field information. The process is given by equation (8), where $Y \in \mathbb{R}^{C \times H \times W}$ represents the output feature map, and $X \in \mathbb{R}^{C \times H \times W}$ represents the input feature map. $\omega_c \in \mathbb{R}^{C \times 1 \times 1}$ represents the channel attention weights, and $\omega_s \in \mathbb{R}^{1 \times H \times W}$ represents the spatial attention weights. $\odot$ denotes the elementwise product operation. $C^{1 \times 1}$ represents the 1 × 1 convolution operation that decreases the channel number of the output feature map to the channel number of the input feature map. For each input feature map, the proposed MRFA module produces an attention-weighted feature map with the same size as the input feature map.
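A form of the fusion step that agrees with the symbol definitions above is sketched below. It is a hedged reading rather than the published equation: the dilation rates $d_1, \dots, d_K$ are placeholders, and applying the channel and spatial weights jointly to $X$ before the concatenation is an assumption.

```latex
Y = C^{1\times 1}\!\left( \left[\, \omega_c \odot \omega_s \odot X,\;
  \mathrm{DC}^{3\times 3}_{d_1}(X), \ldots, \mathrm{DC}^{3\times 3}_{d_K}(X) \,\right] \right)
```

Here $\omega_c$ is broadcast over the spatial dimensions and $\omega_s$ over the channel dimension, so the weighted term keeps the shape $C \times H \times W$. The concatenation then has $C + KC/r$ channels, which the 1 × 1 convolution maps back to $C$.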

Attentional Dense Waterfall Residual Block.
The proposed MRFA module can be fused with the DW res-block introduced in Section 3.1 to compose an attentional dense waterfall residual block, which can select the informative pixels from complex background areas of the input image and strengthen the ability of multiscale perception. The general arrangement of the MRFA module and the DW res-block is shown in Figure 4. The MRFA module is placed between the residual mapping module and the identity mapping module of the proposed dense waterfall residual block. The process of the attentional dense waterfall residual block can be expressed as the following equation:
$$ y = \mathrm{MRFA}(f(x, \{W_i\})) + x \tag{9} $$
where MRFA represents the process of equation (8) and $f$ denotes the residual mapping introduced in equation (1).

The Backbone of Proposed Framework.
To enlarge the receptive field, backbones such as VGG-Net, Google-Net, and ResNet involve 5 downsampling layers; as a result, the resolution of the output feature map is reduced by a factor of 32 relative to the resolution of the input image. This backbone design is beneficial for extracting high-level semantic information under limited memory and computation resources. However, a downsampling ratio of 32 leads to the loss of spatial information, which is harmful for object localization, especially for relatively small objects in aerial images. To address this problem, algorithms such as YOLOv2 [17] or FPN [10] keep shallow spatial information by skip-connection or feature fusion. These methods can only alleviate the problem; they cannot solve it. For these reasons, based on the proposed attentional dense waterfall residual block, a backbone designed for vehicle detection in aerial images is proposed. The proposed backbone preserves the spatial information in the following ways. Firstly, the inner features of the proposed residual block are connected densely, which improves the information flow between shallower and deeper layers. In addition, identity mapping is adopted between the stages of the proposed backbone through the transition layer to further improve the information flow. The structure of the transition layer is diagrammed at the bottom of Figure 1. Secondly, the new backbone involves only 3 downsampling operations and therefore keeps the feature map at high resolution. Thanks to this design, the proposed backbone can extract high-level semantic information from high-resolution feature maps while alleviating the loss of spatial information. Thirdly, the downprojection unit, which was originally used in super-resolution reconstruction [47], is applied to reduce the impact of information contamination caused by the downsampling operation. The general structure of the downprojection unit is shown in the bottom-left of Figure 1. The structure of the proposed backbone is shown in Figure 1, and details of the proposed backbone structure are listed in Table 1.
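The effect of the downsampling ratio on small targets can be made concrete with a little arithmetic. The sketch below assumes that every downsampling operation halves the resolution (which matches the ratios 8 and 32 quoted above) and uses a 640 × 640 input with a 30-pixel-wide target as example values; the helper names are hypothetical.

```python
def feature_map_size(input_size, num_downsamples, factor=2):
    """Return (total stride, output side length) after repeated downsampling."""
    stride = factor ** num_downsamples
    return stride, input_size // stride


def target_extent_in_cells(target_pixels, stride):
    """Side length of a target measured in output feature-map cells."""
    return target_pixels / stride


if __name__ == "__main__":
    for n in (3, 5):
        stride, size = feature_map_size(640, n)
        print(n, stride, size, target_extent_in_cells(30, stride))
```

With 3 downsampling operations, a 30-pixel vehicle still spans several cells of an 80 × 80 feature map, whereas with 5 operations it falls within a single cell of a 20 × 20 map.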

Experiments and Results
The framework proposed in this paper is evaluated on three popular public aerial datasets: VEDAI [48], UCAS-AOD [49], and DOTA [50]. In this section, these three datasets and the evaluation metrics are introduced first. Then, the training details of the proposed framework are described. Finally, the evaluation results of the proposed framework and the efficacy of its components are analyzed and compared with other state-of-the-art detection algorithms.

Datasets.
Deep learning based vision algorithms require large-scale labeled training data. Ground view object detection algorithms such as Faster R-CNN, YOLO, and SSD are usually trained on MS COCO [51] and PASCAL VOC [52], which contain images taken from the ground. As object detection algorithms for aerial images have become widely studied, there is an increasing need for aerial image datasets. As a result, public datasets such as NWPU VHR-10 [53], RSOD [54], VEDAI, UCAS-AOD, and DOTA have been released recently. Among these datasets, VEDAI, UCAS-AOD, and DOTA are the most commonly used to evaluate vehicle detection algorithms for aerial images [55][56][57][58][59][60][61][62][63][64][65][66][67][68][69][70][71][72]. To allow better comparisons with state-of-the-art algorithms, these three datasets are also used in this section to evaluate the proposed framework and its components. Some details of the three datasets are introduced as follows.

VEDAI Dataset.
The VEDAI (Vehicle Detection in Aerial Imagery) dataset was published for the task of small vehicle detection in aerial images. Images in the VEDAI dataset are taken from realistic and unconstrained environments. There are 4 different versions of the VEDAI dataset: LCIs (large-size color images), SCIs (small-size color images), LIIs (large-size infrared images), and SIIs (small-size infrared images). The resolutions of images in the large and small versions are 1024 × 1024 and 512 × 512, respectively. The Ground Sampling Distances (GSD) of the large and small versions are 12.5 cm and 25 cm, respectively. The VEDAI dataset contains various backgrounds such as trees, buildings, roads, cities, and so on. The vehicles contained in VEDAI belong to 9 categories, namely, the "plane", "boat", "camping car", "car", "pick-up", "tractor", "truck", "van", and "other" categories. Since most of the targets in VEDAI are labeled as "small land vehicle", i.e., "car", "pick-up", "tractor", and "van", all the targets labeled as "small land vehicle" are used to evaluate the proposed framework in this section.

UCAS-AOD Dataset.
The UCAS-AOD dataset was proposed by the Patterns and Intelligent System Development Laboratory at the University of Chinese Academy of Sciences. The dataset contains targets from only two categories: "car" and "airplane". It contains 7482 planes in 1000 images and 7114 cars in 510 images. In this paper, targets labeled as "car" are used to evaluate the proposed framework.

DOTA Dataset.
The DOTA dataset is a large-scale dataset proposed for object detection in aerial images. The resolution of images in the DOTA dataset ranges from about 800 × 800 to about 4000 × 4000. The dataset contains targets from 15 categories, namely, "ship", "plane", "baseball diamond", "storage tank", "tennis court", "swimming pool", "ground track field", "harbor", "bridge", "large vehicle", "small vehicle", "helicopter", "roundabout", "soccer ball field", and "basketball court". Because the targets from most categories in the DOTA dataset are either too large or irrelevant to the theme of this paper, only the targets labeled as "small vehicle" are employed in the evaluation of this paper.

Image Shooting Angle and Target Scales.
Because the images contained in the VEDAI, UCAS-AOD, and DOTA datasets are all taken by satellites and UAVs from high altitudes, the shooting angle of these images is fixed. On the other hand, the scales of the targets used in the evaluation vary from about 30 × 30 pixels to 90 × 90 pixels. The varied scales make it possible to evaluate the performance of the proposed framework on multiscale targets.

Evaluation Metrics.
In this paper, the quantitative evaluation metrics (precision, recall, mean Average Precision (mAP), and F1-measure) are used to verify the proposed framework.
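Precision is the ratio of the number of correctly detected targets to the total number of predicted examples. It is used to measure the accuracy of the proposed algorithm. It is defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

where TP (true positive) represents the number of correctly detected targets and FP (false positive) represents the number of predicted examples that do not correspond to a target. Recall is the ratio of the number of correctly detected targets to the total number of actual targets. It is defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN},$$

where FN (false negative) represents the number of positive examples which are not correctly detected. Recall and precision generally trade off against each other: raising one tends to lower the other. Considering this negative correlation, a comprehensive evaluation metric is necessary. The F1-measure is an important metric for measuring the performance of detection algorithms, which weights recall and precision equally. The definition of the F1-measure is as follows:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

Although the F1-measure weights recall and precision equally, it only reflects the performance at a single operating point. To solve this problem, the mAP, which reflects the global performance, is used. It is defined by the following equations:

$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,$$

where P(R) denotes the precision as a function of the recall, AP_i denotes the AP of the i-th category, and N denotes the number of categories. AP measures the global performance on a single category, and mAP measures the global performance over all categories.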
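The metrics described above can be computed as in the following sketch. It is a minimal illustration rather than the evaluation code used in this paper: the function names are hypothetical, the matching of detections to ground truth (which yields the true/false positive flags) is assumed to be done beforehand, and the area under the P-R curve is approximated with the all-point interpolation, which is one common convention among several.

```python
from typing import List, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from detection counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(is_tp: List[bool], num_targets: int) -> float:
    """Area under the P-R curve for detections sorted by descending score.

    `is_tp[k]` says whether the k-th ranked detection is a true positive.
    Assumed convention: all-point interpolation with a precision envelope.
    """
    if num_targets == 0 or not is_tp:
        return 0.0
    recalls, precisions = [], []
    tp = 0
    for rank, hit in enumerate(is_tp, start=1):
        tp += int(hit)
        recalls.append(tp / num_targets)
        precisions.append(tp / rank)
    # Precision envelope: each value becomes the maximum precision at any
    # equal or higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_category: List[float]) -> float:
    """mAP as the plain mean of per-category AP values."""
    return sum(ap_per_category) / len(ap_per_category)


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
ap = average_precision([True, False, True], num_targets=2)
print(p, r, f1, ap)
```

When only one category is evaluated, as with the single vehicle class used for each dataset here, mAP reduces to the AP of that category.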

Implementation Details.
The images are processed to a resolution of 640 × 640 by sliding-window cropping and padding for both the training and testing stages. The experiments are performed on an NVIDIA GeForce RTX 2080Ti GPU with TensorFlow 2.0. The weights of the proposed framework are initialized with Xavier uniform initialization [73]. The Adam [74] optimizer is used to train the proposed framework for 100 epochs with an initial learning rate of 1 × 10⁻⁴. The learning rate decays from the beginning to the end of training, following the cosine learning rate decay policy with the default settings of TensorFlow 2.0.
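The learning rate schedule can be sketched in plain Python as follows. This is an illustration of the standard cosine decay formula, assuming a floor of zero (`alpha = 0`) and a decay spread over the whole run; the function name `cosine_decay_lr` and the per-epoch granularity are assumptions for the example, since the paper does not state whether the rate is updated per step or per epoch.

```python
import math


def cosine_decay_lr(step: int, total_steps: int,
                    initial_lr: float = 1e-4, alpha: float = 0.0) -> float:
    """Learning rate at `step` out of `total_steps` under cosine decay.

    `alpha` is the final learning rate expressed as a fraction of `initial_lr`.
    """
    step = min(step, total_steps)  # hold the final value once decay is complete
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)


# Learning rate at the start, middle, and end of a 100-epoch run.
for epoch in (0, 50, 100):
    print(epoch, cosine_decay_lr(epoch, total_steps=100))
```

Under these assumptions, the rate starts at 1 × 10⁻⁴, passes through half that value at the midpoint, and reaches zero at the end of training.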

Comparison with Other Algorithms.
The comparison results between the proposed framework and other state-of-the-art algorithms on the VEDAI dataset are summarized in Table 2. It can be seen that the proposed framework outperforms the existing state-of-the-art detection algorithms on the VEDAI dataset. As shown in Table 2, the proposed framework achieves an mAP of 95.13%, which is roughly 2.59% higher than L-RCNN 2020 [40] and 3.86% higher than Improved FBPN Based Detection Network [37]. The detailed results of recall, precision, F1-measure, and test time are also shown in Table 3. The P-R curves of the proposed framework are shown in Figure 5. As shown in Table 3, the proposed Dense Attentional Residual Network (Baseline + DW res-block + MRFA) achieves state-of-the-art performance on the VEDAI dataset: 89.91% for recall, 93.08% for precision, 95.13% for mAP, and 91.47% for F1-measure. To evaluate the experimental results qualitatively, some detection examples generated by the proposed framework are shown in Figure 6.
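As a consistency check on the figures reported above, the F1-measure can be recomputed as the harmonic mean of the reported precision and recall. The short sketch below does this; the function name `f1_from_percent` is hypothetical.

```python
def f1_from_percent(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


# Reported VEDAI results: 93.08% precision and 89.91% recall.
f1 = f1_from_percent(93.08, 89.91)
print(round(f1, 2))  # 91.47
```

The recomputed value agrees with the reported F1-measure of 91.47%.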

Efficacies of the Proposed Components.
To demonstrate the efficacies of the proposed DW res-block and MRFA module, in addition to the proposed DAR-Net (noted as Baseline + DW res-block + MRFA in Table 3) evaluated in the previous section, two other algorithms are evaluated using the VEDAI dataset. Firstly, a Baseline algorithm (noted as Baseline in Table 3) is implemented by keeping the backbone of DAR-Net and utilizing the regular residual block proposed in [43]. Secondly, based on the Baseline algorithm, another algorithm (noted as Baseline + DW res-block in Table 3) is implemented utilizing the proposed DW res-block instead of the regular residual block. Since the only difference between Baseline and Baseline + DW res-block is DW res-block, and the only difference between Baseline + DW res-block and Baseline + DW res-block + MRFA is MRFA, the efficacies of the proposed modules can be demonstrated by comparing the evaluation results of these three algorithms.
The P-R curves of these three algorithms are shown in Figure 5. As shown in Table 3, the DW res-block contributes a 2.7% improvement in mAP with an increase of only 2M parameters, and the MRFA module contributes an improvement of almost 2.8% in mAP with an increase of 8M parameters. The two proposed modules also increase the processing time by 0.077 s and 0.255 s, respectively, which reflects the computational complexity they add to the algorithm.

Parameter Analysis.
According to the experiments, the performances of the proposed DW res-block and MRFA module are both relatively sensitive to their dimension reduction ratios. Therefore, to find an effective trade-off between the number of model parameters and detection accuracy, the dimension reduction ratios of the two proposed modules are selected as hyperparameters to evaluate the parameter sensitivity of the proposed framework.

Results on the UCAS-AOD Dataset.
The experimental results of the proposed framework on the UCAS-AOD dataset are as follows. Table 6 shows the comparison of performance between the proposed framework and other existing algorithms. As shown in Table 6, the proposed framework achieves an mAP of 96.78%, which outperforms the existing state-of-the-art detection algorithms on the UCAS-AOD dataset.
The detailed results are shown in Table 7. It can be seen that the proposed framework achieves state-of-the-art performance on the UCAS-AOD dataset: 91.67% for recall, 94.05% for precision, 96.78% for mAP, and 92.85% for F1-measure. The P-R curve of the proposed framework on the UCAS-AOD dataset is shown in Figure 8. Some detection examples generated by the proposed framework from the UCAS-AOD dataset are shown in Figure 9.

Results on the DOTA Dataset.
The evaluation results of the proposed framework on the DOTA dataset are as follows.

Conclusions
In this paper, a novel framework named Dense Attentional Residual Network (DAR-Net) is proposed for vehicle detection in aerial images. To effectively preserve the spatial information and extract high-level semantic information at the same time, a novel residual block named dense waterfall residual block (DW res-block) is implemented in the proposed DAR-Net. To select the informative features from the feature maps and solve the problem of the small receptive field of the proposed backbone, the multiscale receptive field attention (MRFA) module is plugged into the proposed DW res-block. Based on the DW res-block and MRFA module, a backbone designed for vehicle detection in aerial images is proposed. The proposed backbone involves only 3 downsampling operations and extracts the semantic information from high-resolution feature maps to further preserve the spatial information. Downprojection units and transition layers are also used to reduce the impact of information loss caused by downsampling and to improve the information flow, respectively. According to the experimental results, the proposed framework achieves state-of-the-art performance on the VEDAI, UCAS-AOD, and DOTA datasets. The evaluation also demonstrates the efficacies of the DW res-block and the MRFA module. On the downside, object rotation still has negative effects on vehicle detection in aerial images. In the future, to improve the rotation invariance of the proposed framework, methods such as FourierHOG will be applied to it, and to reduce the number of parameters without harming the performance, solutions such as depthwise separable convolution will also be implemented. Additionally, as recent research on vehicle detection in aerial images and in ground view images has developed largely independently, more generalized algorithms that handle both aerial and ground view images are worth further study for potential applications.

Data Availability
The data supporting this study were taken from previously reported studies and datasets, which have been cited. The processed data are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.