Aircraft Detection for Remote Sensing Image Based on Bidirectional and Dense Feature Fusion

Aircraft, as an indispensable means of transport, play an important role in military activities, so locating aircraft in remote sensing images is a significant task. However, current object detection methods suffer from low detection accuracy and a high missed-detection rate when applied to aircraft detection in remote sensing images. To address these problems, an object detection method for remote sensing images based on bidirectional and dense feature fusion is proposed to detect aircraft targets in complex environments. Built on the YOLOv3 detection framework, the method adds a feature fusion module that enriches the details of the feature map by combining shallow features with deep features. Experimental results on the RSOD-DataSet and NWPU-DataSet indicate that the proposed method improves detection accuracy and reduces missed detections. The AP for aircraft increases by 1.57% compared with YOLOv3.


Introduction
Object detection in remote sensing images, a central problem in remote sensing information processing, plays an important role in the aviation and transportation fields. With advances in acquisition technology for remote sensing images, clearer and higher-resolution images can be obtained. These high-resolution images provide more detailed information, which helps to identify and locate target objects. However, in most cases, the scene around an aircraft in a remote sensing image is complex. Because background and objects are unevenly distributed, recognition is easily affected by background noise. Moreover, the object detection algorithms in common use for remote sensing images suffer from low detection accuracy and a high missed-detection rate. Therefore, a new object detection method is needed. The early object detection pipeline generally consists of three parts: candidate region generation, feature extraction, and classification. First, a sliding window strategy is used to generate candidate regions of interest. Then, manually designed features, such as Haar [1], HOG [2], and DPM [3], are used to extract features from the candidate regions. Finally, previously trained classifiers, such as SVM [4] and Adaboost [5], are used to identify the candidate regions. Traditional object detection methods, however, rely on the sliding window for region selection. The sliding window strategy is usually blind and requires a variety of window sizes to be designed ahead of time. To ensure that objects of different sizes are covered by candidates, it produces a great number of redundant candidate windows, which increases the amount of computation.
In addition, manually designed features are usually simple geometric features. Feature extraction based on them is usually not robust for objects in complex scenes and cannot cope well with noise interference caused by a changing environment.
With the progress of deep learning, object detection methods based on convolutional neural networks have been proposed and widely used in a variety of complex scenes. These methods usually fall into two categories, namely, one-stage methods and two-stage methods. One-stage methods, for instance, YOLO [6], SSD [7], and RetinaNet [8], treat object detection as a classification and regression problem without the need to generate candidate boxes in advance. Instead, images are directly input into the detection network for feature extraction, and then object classification and prediction box regression are carried out simultaneously. Two-stage methods, for instance, R-CNN [9], SPP-NET [10], Fast R-CNN [11], and Faster R-CNN [12], first generate candidate boxes through a predesigned region proposal step and then extract features from the candidate boxes. Then, the target object is classified and the prediction box is regressed. Because one-stage methods do not need to generate candidate boxes, they require much less computation and run much faster than two-stage methods. With classification and regression carried out simultaneously, they can also be trained end to end. Although two-stage methods are slower, the guidance of the candidate boxes gives them higher detection accuracy than one-stage methods.
Many algorithms for aircraft detection have been developed and applied to different scenes. Wang et al. [13] addressed the problem that aircraft objects appear at different scales in remote sensing images; they established a small dataset and proposed a multiscale aircraft detection algorithm. Hou et al. [14] addressed the problem that infrared aircraft targets are blurred and their detection is easily disturbed by noise, and proposed an improved detection method for microinfrared aircraft targets. Zhiyong et al. [15] addressed the problem that LCCD with VHR optimal images performed poorly because of high intraclass variation and low interclass variance; they presented an overview of the development of LCCD with VHR remote sensing images and discussed the future challenges and opportunities in applying VHR remote sensing images in LCCD. Zhiyong et al. [16] addressed the limited capability of existing approaches to capture objects of varying shapes and sizes in an area affected by a landslide, and developed an algorithm based on automatic adaptive region extension using very-high-resolution remote sensing images. Wang et al. [13] addressed the problem that spaceborne optical remote sensing images are difficult and costly to obtain, and proposed an aircraft detection algorithm that can detect aircraft objects with small samples. Li et al. [17] aimed at detecting the keypoints of aircraft and proposed a category-aware landmark detection network (CALDN) with two streams: a classification stream for size categorization and a localization stream for landmark detection. Zhao et al. [18] addressed the problem that accurately detecting aircraft in SAR images remains challenging because of the special structures of aircraft and the complexity of the SAR imaging mechanism, and proposed a novel network called the pyramid attention dilated network (PADN).
Lin and Chen [19] examined whether directly employing a large number of instances with great variation leads to good performance and proposed a you-only-look-once-v3-based detection process for automatic aircraft detection. Yan et al. [20] observed that, while many advanced works with powerful learning algorithms had been developed for natural images, an effective method for detecting aircraft precisely in remote sensing images was still lacking, especially in complicated conditions; they proposed a novel method named aircraft detection using Center-based Proposal regions and Invariant Features (CPIF). Wu et al. [21] addressed the problem that aircraft targets are usually small and manual annotation is very costly, and proposed a simple yet efficient aircraft detection algorithm called Weakly Supervised Learning in AlexNet (AlexNet-WSL). Luo et al. [22] addressed several major challenges in aircraft detection from synthetic aperture radar (SAR) images, such as the shattered features of the aircraft, size heterogeneity, and the interference of a complex background, and proposed an Efficient Bidirectional Path Aggregation Attention Network (EBPA2N). Heiselberg and Heiselberg [23] addressed the problem that detecting aircraft in satellite images is challenging when the background consists of strongly reflective clouds with varying transparency, and proposed a fast and effective detection algorithm that could find almost all aircraft above and between clouds in Sentinel-2 multispectral images. Shi et al. [24] addressed the problem that remote sensing detection remains challenging because of complex backgrounds and multiscale characteristics, and proposed a two-stage aircraft detection method based on deep neural networks, which integrates a Deconvolution operation with a Position Attention mechanism (DPANet). Xu et al. [25] addressed the problem that the aircraft to be detected is very small, external environmental factors are easily fused, and interference from other objects strongly affects the aircraft characteristics in remote sensing images, and proposed a remote sensing aircraft detection method based on deep learning. Zhou et al. [26] addressed the problem that recent algorithms miss some small-scale aircraft when applied to remote sensing images, and proposed the Multiscale Detection Network (MSDN), which introduces a multiscale detection architecture to detect small-scale aircraft. Although many methods for aircraft detection have been proposed, many problems remain when they are applied to remote sensing images, so a more suitable method for aircraft detection in remote sensing images is needed.
To address the problems of low detection accuracy and a high missed-detection rate, this paper builds on the YOLOv3 object detection algorithm [27], analyzes the FPN [28] feature fusion module in YOLOv3, and finds that the feature fusion module only fuses the shallow features. Because shallow features and deep features are not combined effectively, some details are lost in the process of detection. Therefore, this paper proposes a bidirectional and dense feature fusion detection network. The network fuses the feature maps extracted from different detection layers, combining the detailed information of the shallow features with the semantic information of the deep features, so as to decrease the rates of missed detection and false detection. The main contributions of this paper are as follows: (1) To address the problem of low detection accuracy, this paper proposes the Bidirectional Feature Fusion Detection Network (BFFDN), which transmits not only the shallow layers' detailed information to the deep layers but also the deep layers' semantic information to the shallow layers, making the feature fusion more sufficient and increasing the detection accuracy.

YOLOv3 Detection Framework.
The YOLOv3 object detection algorithm adopts the Darknet-53 network structure as the backbone network. The backbone network borrows the residual connection used in the ResNet [29] network, so that the vanishing gradient problem can be avoided while the network is deepened. To eliminate the negative effects of pooling, convolutions with a stride of 2 replace the pooling operations. The backbone network includes 5 subsampling operations, so an input image passes through 5 subsampling operations and the output feature map is 1/32 of the original size. To improve the prediction of small objects, YOLOv3 uses feature maps at three different scales for target prediction and, borrowing the idea of FPN feature fusion, splices the feature maps of different scales together by way of upsampling. The YOLOv3 network structure is shown in Figure 1. As the figure shows, if the size of the input image is 416 × 416, the backbone network produces Predict1 at the 82nd layer after several convolutional layers, that is, the 13 × 13 detection scale. The feature map at the 82nd layer is subsampled by a factor of 32 and is suited to detecting large objects because of its large receptive field. The backbone network upsamples the 79th layer's feature map and fuses it with the 61st layer's feature map to obtain the feature map at the 91st layer. The backbone network produces Predict2 at the 94th layer after several convolutional layers, that is, the 26 × 26 detection scale. The feature map at the 94th layer is subsampled by a factor of 16 and is suited to detecting medium-sized objects because of its medium receptive field.
The backbone network upsamples the 91st layer's feature map and fuses it with the 36th layer's feature map to obtain the feature map at the 103rd layer. The backbone network produces Predict3 at the 106th layer after several convolutional layers, that is, the 52 × 52 detection scale. The feature map at the 106th layer is subsampled by a factor of 8 and is suited to detecting small objects because of its small receptive field. The prediction of the boundary box is shown in Figure 2, where the dark blue box stands for the predicted boundary box and the light blue box stands for the prior box. The purpose of the boundary box prediction is to estimate the boundary box's position from the prior box so that the predicted position is closer to that of the real box. The prediction formulas of the boundary box are shown in the following equations:
$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h},$$
where $t_x, t_y$ represent the predicted center offsets of the prediction box relative to the cell, $t_w, t_h$ represent the prediction box's length and width relative to the prior box, $\sigma$ represents the sigmoid activation function, $\sigma(t_x), \sigma(t_y)$ represent the offsets of the box center from the upper-left corner of the cell, $p_w, p_h$ represent the length and width of the corresponding prior box, $c_x, c_y$ represent the coordinates of the upper-left corner of the cell, and $b_x, b_y, b_w, b_h$ stand for the prediction box's center position, length, and width. The loss of YOLOv3 is composed of three parts: coordinate loss, confidence loss, and category loss.
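The boundary box prediction described above can be illustrated with a short sketch. It applies the standard YOLOv3 decoding (sigmoid offsets added to the cell corner, exponential scaling of the prior box) and then converts from grid units to pixels using the 13 × 13 grid of a 416 × 416 input. The function names, the example values, and the choice to express the prior box in grid units are assumptions made for illustration and are not taken from the paper's implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used to keep the center offset inside the cell."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw network outputs into a box (center x, center y, width, height).

    All quantities are in grid-cell units (an assumption of this sketch).
    """
    b_x = sigmoid(t_x) + c_x      # center x: offset from the cell's upper-left corner
    b_y = sigmoid(t_y) + c_y      # center y
    b_w = p_w * math.exp(t_w)     # width: prior width scaled by exp(t_w)
    b_h = p_h * math.exp(t_h)     # height: prior height scaled by exp(t_h)
    return b_x, b_y, b_w, b_h


def to_pixels(box, input_size=416, grid_size=13):
    """Convert a box from grid units to pixels; the stride is input_size / grid_size."""
    stride = input_size / grid_size
    return tuple(v * stride for v in box)


# Hypothetical example: zero raw outputs in cell (6, 6) with a 3 x 2 prior box.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=6, c_y=6, p_w=3.0, p_h=2.0)
print(box)             # (6.5, 6.5, 3.0, 2.0)
print(to_pixels(box))  # (208.0, 208.0, 96.0, 64.0)
```

With zero raw outputs, the predicted box sits at the center of its cell and keeps the prior box's size, which is why the prior box acts as the starting point for regression.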
The formulas of coordinate loss, confidence loss, and category loss are shown in the following equations, respectively: where $\lambda_{coord}$ stands for the weight of the coordinate error, $\lambda_{noobj}$ represents the weight of the no-target error, $\lambda_{obj}$ represents the weight of the target error, and $\lambda_{class}$ represents the weight of the classification error; $S$ represents the grid size, and for 416 × 416 images the three grid sizes are 13, 26, and 52, respectively; $B$ stands for the number of bounding boxes; the ground-truth terms are the true box's center coordinates, width, height, confidence, and category probability, respectively, and $x, y, w, h, c, p(c)$ represent the bounding box's center coordinates, width, height, confidence, and category probability, respectively; $2 - w \times h$ represents the scale factor: the smaller the target object is, the larger the regression loss is, and the stronger the emphasis on detecting small objects.
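The effect of the scale factor 2 − w × h can be checked numerically. The sketch below assumes that w and h are the box width and height normalized to the range [0, 1] by the image size, which the text does not state explicitly; the function name and the example boxes are hypothetical.

```python
def coord_scale(w: float, h: float) -> float:
    """Scale factor 2 - w*h applied to the coordinate loss of one box.

    w and h are assumed to be normalized to [0, 1] (an assumption of this sketch).
    """
    if not (0.0 <= w <= 1.0 and 0.0 <= h <= 1.0):
        raise ValueError("w and h are expected to be normalized to [0, 1]")
    return 2.0 - w * h


small_box = coord_scale(0.05, 0.05)  # a box covering 0.25% of the image
large_box = coord_scale(0.80, 0.80)  # a box covering 64% of the image
print(small_box > large_box)  # True: the small box receives the larger weight
```

Under this assumption the factor stays between 1 and 2, so small targets contribute up to roughly twice as much coordinate loss as a box that fills the image.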

FPN Feature Fusion Module.
The shallow layers' features include more detailed information, and the deep layers' features include more semantic information. As downsampling proceeds, the feature map contains more and more semantic information but less and less detailed information. However, most object detection algorithms focus on the deep layers' features only and ignore the shallow layers' features, leading to inaccurate target positioning.
To solve this problem, the FPN Feature Fusion Module is proposed. By introducing the feature pyramid structure, the FPN Feature Fusion Module integrates the shallow layers' features with the deep layers' features, so that the fused feature map has both the shallow layers' detailed information and the deep layers' semantic information. The FPN Feature Fusion Module is shown in Figure 3, in which the operation downsampling × 0.5 represents 2 times downsampling; Conv1, Conv2, and Conv3 represent the 2 times, 4 times, and 8 times downsampling, respectively; the operation 1 × 1 represents a 1 × 1 convolution that adjusts the number of channels; the operation 3 × 3 represents a 3 × 3 convolution that eliminates the aliasing effect caused by upsampling; and the operation upsampling × 2 represents 2 times upsampling. Mix2 is formed by splicing Mix3, after 2 times upsampling, with Conv2 (4 times downsampling), and Mix1 is formed by splicing Mix2, after 2 times upsampling, with Conv1 (2 times downsampling). In this way, the spliced Mix1 and Mix2 contain both more detailed information from the shallow layers and more semantic information from the deep layers. The three feature maps of different scales, namely, Mix1, Mix2, and Mix3, are then used to make predictions.
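The splicing order described above can be traced with simple shape bookkeeping. The sketch below tracks only (channels, height, width) tuples; it assumes that Mix3 is taken directly from Conv3, that splicing is channel concatenation, and it omits the 1 × 1 and 3 × 3 convolutions. The channel counts and the 64 × 64 input are hypothetical.

```python
Shape = tuple[int, int, int]  # (channels, height, width)


def upsample2(shape: Shape) -> Shape:
    """2 times upsampling doubles the height and width and keeps the channels."""
    c, h, w = shape
    return (c, h * 2, w * 2)


def splice(a: Shape, b: Shape) -> Shape:
    """Channel concatenation; both maps must have the same spatial size."""
    if a[1:] != b[1:]:
        raise ValueError(f"spatial sizes differ: {a[1:]} vs {b[1:]}")
    return (a[0] + b[0], a[1], a[2])


def fpn_shapes(conv1: Shape, conv2: Shape, conv3: Shape):
    """Top-down fusion: Mix3 -> Mix2 -> Mix1."""
    mix3 = conv3                            # deepest map, used as is (assumption)
    mix2 = splice(upsample2(mix3), conv2)   # Mix3 upsampled x2, spliced with Conv2
    mix1 = splice(upsample2(mix2), conv1)   # Mix2 upsampled x2, spliced with Conv1
    return mix1, mix2, mix3


# Hypothetical 64 x 64 input: Conv1, Conv2, Conv3 are 2x, 4x, 8x downsampled.
print(fpn_shapes((64, 32, 32), (128, 16, 16), (256, 8, 8)))
# ((448, 32, 32), (384, 16, 16), (256, 8, 8))
```

The growing channel count of Mix1 in this sketch is the reason a 1 × 1 convolution is normally used to adjust the number of channels before each splice.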

Dense Feature Fusion Module.
The FPN Feature Fusion Module transmits the deep features' semantic information to the shallow features by way of upsampling, so that the shallow features also carry semantic information. However, in the process of feature transmission, some features and details are lost because the upsampling operation is applied several times. To reduce this loss of features, the Dense Feature Fusion Module is proposed, drawing on DenseNet [30].
The Dense Feature Fusion Module is shown in Figure 5.
The Dense Feature Fusion Module, based on the FPN Feature Fusion Module, passes the features of Mix3 to Conv2 by upsampling to form Mix2 and passes the features of Mix2 to Conv1 by upsampling to form Mix1. In addition, it splices Mix3, after 4 times upsampling, with Conv1, so that the feature information of Mix3 is transmitted directly to Conv1. This operation shortens the transfer path from Mix3 to Mix2 to Mix1, so that the feature loss caused by repeated upsampling is alleviated and the feature information of Mix1 is enriched. The three feature maps of different scales, namely, Mix1, Mix2, and Mix3, are then used to make predictions. This paper introduces the Dense Feature Fusion Module into the YOLOv3 method and names the new method the Dense Feature Fusion Detection Network (DFFDN).
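The extra direct path from Mix3 to Mix1 can be sketched on tiny feature maps stored as nested lists. The sketch assumes nearest-neighbour upsampling and channel concatenation for splicing, and it leaves out all convolutions; the helper names and the constant-valued example maps are hypothetical and serve only to make the origin of each channel visible.

```python
FeatureMap = list[list[list[float]]]  # indexed as [channel][row][column]


def upsample(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling: repeat each value `factor` times per axis."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(factor)]
            rows.extend(list(wide) for _ in range(factor))
        out.append(rows)
    return out


def splice(*maps: FeatureMap) -> FeatureMap:
    """Concatenate feature maps along the channel axis."""
    height, width = len(maps[0][0]), len(maps[0][0][0])
    for m in maps:
        if len(m[0]) != height or len(m[0][0]) != width:
            raise ValueError("spatial sizes must match before splicing")
    return [channel for m in maps for channel in m]


def dense_fusion(conv1: FeatureMap, conv2: FeatureMap, mix3: FeatureMap):
    """FPN path plus the dense shortcut that sends Mix3 straight to Mix1."""
    mix2 = splice(upsample(mix3, 2), conv2)
    mix1 = splice(upsample(mix2, 2), conv1, upsample(mix3, 4))  # dense shortcut
    return mix1, mix2, mix3


def constant_map(channels: int, size: int, value: float) -> FeatureMap:
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


conv1 = constant_map(1, 4, 1.0)   # shallow map, 4 x 4
conv2 = constant_map(1, 2, 2.0)   # middle map, 2 x 2
mix3 = constant_map(1, 1, 3.0)    # deepest map, 1 x 1
mix1, mix2, _ = dense_fusion(conv1, conv2, mix3)
print(len(mix1), len(mix1[0]), len(mix1[0][0]))  # 4 4 4
```

In this toy setting the last channel of Mix1 comes from Mix3 after a single 4 times upsampling, while the first channel reaches Mix1 through two successive 2 times upsampling steps.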

Experimental Environments.
The operating system used in this paper is Ubuntu16.4.0, the processor is an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz, and the graphics cards are two Quadro P4000 GPUs. The dataset adopted is the RSOD-DataSet annotated by Wuhan University [31,32]. Some examples from the RSOD-DataSet are shown in Figure 7. RSOD includes four kinds of objects (aircraft, oil tank, playground, and overpass) and contains a total of 976 pictures, each of which is about 1100 × 900 pixels in size. The aircraft class contains 446 pictures with a total of 4,993 targets, the oil tank class contains 165 pictures with a total of 1,586 targets, and the playground class contains 189 pictures.
In the experiment, learning rate decay is adopted to adjust the learning rate. The initial learning rate is 0.001, the momentum is 0.9, the weight decay is 0.0005, and the number of iterations is 40200. When training reaches 32000 and 36000 iterations, the learning rate is adjusted to 0.1 and 0.01 of the initial learning rate, respectively. In this way, the convergence speed of the loss can be adjusted.
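The learning rate adjustment described above amounts to a piecewise-constant schedule, sketched below. The function name is hypothetical, and the choice to switch exactly at iterations 32000 and 36000 (rather than one iteration later) is an assumption.

```python
def learning_rate(iteration: int,
                  base_lr: float = 0.001,
                  steps=(32000, 36000),
                  scales=(0.1, 0.01)) -> float:
    """Return the learning rate at a given iteration.

    Once `iteration` reaches a step, the rate becomes base_lr times the
    matching scale, so the scales are relative to the initial rate.
    """
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr = base_lr * scale
    return lr


for it in (0, 31999, 32000, 36000, 40200):
    print(it, learning_rate(it))
```

Under these assumptions the rate is 0.001 up to iteration 31999, about 0.0001 from iteration 32000, and about 0.00001 from iteration 36000 until training ends at iteration 40200.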

Experimental Results.
The loss curve is one of the performance indicators used to evaluate an object detection algorithm. Generally speaking, the smaller the loss value of a model is, the better the model is trained. The loss comparison of different models is shown in Figure 8, where the horizontal axis stands for the loss value and the vertical axis stands for the iteration number. It can be clearly seen from the figure that the loss value of every model decreases rapidly between 0 and 2000 iterations. After 3000 iterations, the loss values gradually stabilize and fluctuate within a small range. Between 35000 and 40200 iterations, the BDFFDN algorithm has the lowest loss value, which means that, compared with the YOLOv3, BFFDN, and DFFDN algorithms, the BDFFDN algorithm has a better training effect.
The IOU curve is one of the performance indicators used to evaluate an object detection algorithm. IOU measures the overlap between the predicted boundary box and the labeled real box, that is, the ratio of their intersection to their union. The closer the value is to 1, the greater the overlap between the predicted bounding box and the labeled real box, and the closer the predicted bounding box is to the labeled real box. The calculation formula of IOU is shown in the following equation:
$$\mathrm{IOU} = \frac{\mathrm{area}(\mathrm{Predict} \cap \mathrm{GroundTruth})}{\mathrm{area}(\mathrm{Predict} \cup \mathrm{GroundTruth})},$$
where Predict represents the predicted bounding box calculated by the network model and GroundTruth represents the labeled real box. The IOU curve comparison of different models is shown in Figure 9, where the horizontal axis represents the IOU value and the vertical axis represents the iteration number. As can be seen from the figure, the IOU value fluctuates greatly at the beginning of training. As the number of iterations increases, the IOU value gradually stabilizes and fluctuates within a certain range. Compared with the YOLOv3, BFFDN, and DFFDN algorithms, the IOU of the BDFFDN algorithm fluctuates less and tends to 0.8.
The P-R curve comparison of different models is shown in Figure 10, where the horizontal axis represents the Precision and the vertical axis represents the Recall. The area under the P-R curve represents the performance of the model. Generally speaking, the larger this area is, the better the performance of the model is. It can be clearly seen from the figure that the YOLOv3 algorithm occupies the smallest area, followed by the BFFDN algorithm, then the DFFDN algorithm, and finally the BDFFDN algorithm. The BDFFDN algorithm occupies the largest area, which means that it has the best performance.
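The IOU defined in this section, the ratio of the intersection of the predicted box and the labeled real box to their union, can be computed as in the following sketch. The corner-based box format (x_min, y_min, x_max, y_max) and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


predict = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(iou(predict, ground_truth))  # intersection 1, union 7
```

Two identical boxes give a value of 1, and two boxes that do not overlap give 0, which matches the interpretation that values closer to 1 indicate a better fit.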
In addition to Precision, Recall, and IOU, the F1-score, AP, and FPS can also be used as indicators to evaluate object detection algorithms. For aircraft object detection, the positive example is aircraft, and the negative example is any object other than aircraft (see Figure 11). The calculation formulas of the performance indexes Precision, Recall, F1-Score, and AP are shown in the following equations, respectively:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad AP = \int_0^1 P(R)\,dR,$$
where $TP$ is the number of positive examples detected correctly, $FP$ is the number of negative examples detected as positive, $FN$ is the number of positive examples that are missed, and $P(R)$ is the Precision as a function of the Recall.
The comparative results on the RSOD-DataSet are presented in Table 2. Table 2 shows that the mAP of BDFFDN is 91.41%, which is higher by 18.44%, 16.34%, 15.55%, 15.43%, 14.83%, 13.62%, 3.65%, 3.49%, 3.08%, 2.68%, and 1.99% than that of SSD, DSSD, FFSSD, ESSD, DC-SPP-YOLO, UAV-YOLO, FRCN, DConvNet, MRFF-YOLO, Improved-YOLOv3, and SigNMS, respectively. The results demonstrate that the proposed method outperforms the other algorithms. Although its FPS is not very high compared with other algorithms and its AP for overpass is not the highest among the algorithms, it still meets the basic demand for aircraft detection.
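The Precision, Recall, and F1-Score indicators named above can be computed from detection counts as in the following sketch. It uses the standard definitions in terms of true positives (aircraft detected correctly), false positives (non-aircraft detected as aircraft), and false negatives (missed aircraft); the function name and the counts in the example are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) from detection counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed aircraft.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
```

A high missed-detection rate shows up as a large number of false negatives and therefore as a low Recall, while false alarms lower the Precision; the F1-Score balances the two.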
Figure 12 shows 20 images that compare the detection results of YOLOv3 with those of BDFFDN. Among these images, the 1st and 2nd columns are the detection results of YOLOv3, and the 3rd and 4th columns are the detection results of BDFFDN. Figure 12 shows that the objects in the images are mostly small and densely distributed, which increases the difficulty of detection. The results indicate that YOLOv3 has shortcomings in detecting small objects and misses some of them, while the proposed method detects the objects missed by YOLOv3. This demonstrates that the proposed method, BDFFDN, performs better than YOLOv3 when detecting small objects in remote sensing images.

Extended Experiments.
To demonstrate the generality and generalization ability of the algorithm, experiments are carried out on the NWPU-DataSet in addition to the RSOD-DataSet. The NWPU-DataSet [42-44] is a remote sensing dataset used for object detection, which consists of ten kinds of objects: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. It contains a total of 800 pictures, of which 650 pictures are used for the positive image set and 150 pictures for the negative image set. The ratio of the training set to the test set is 4 to 1. The total number of iterations is 40200, the learning rate is initialized to 0.001, and it is decayed by a constant factor. When training reaches 32000 iterations, the learning rate is 0.0001, and when it reaches 36000 iterations, the learning rate is 0.00001. The comparative results on the NWPU-DataSet are presented in Table 3. From Table 3 … 6.39% compared with EB-V-F-BR, AD-FCN, CPISNet*, RICNN, FRCN, R-P-FRCN, NEOON, DConvNet, and DNN, respectively. The results demonstrate that the proposed method outperforms the other algorithms. Although its AP for ship, storage tank, baseball diamond, basketball court, and vehicle is not the highest among the algorithms, it is still acceptable.
Figure 13 shows 20 images that compare the detection results of YOLOv3 with those of BDFFDN. Among these images, the 1st and 2nd columns are the detection results of YOLOv3, and the 3rd and 4th columns are the detection results of BDFFDN. The comparison shows that the proposed method detects the objects that YOLOv3 missed, which demonstrates that it performs better than YOLOv3.

Conclusions
This paper focuses on the problems of low detection accuracy and a high missed-detection rate. By studying the FPN Feature Fusion Module, it finds that the module fuses shallow layers and deep layers insufficiently, which leads to an insufficient combination of the shallow features' detailed information and the deep features' semantic information and thus to inaccurate positioning of small targets. To address this, the paper proposes the Bidirectional and Dense Feature Fusion Detection Network and carries out experiments on the RSOD-DataSet and NWPU-DataSet. Experimental data show that the proposed network is significantly better than the YOLOv3 object detection algorithm in Precision, Recall, F1-score, IOU, AP, and other performance indicators, and that it detects various small targets that YOLOv3 cannot detect. With the increase in detection accuracy and the decrease in missed detections, the computation cost of the method has increased and the detection speed has decreased. Future work will study how to decrease the computation cost while increasing the detection accuracy.

Data Availability
The research data come from publicly available online datasets.

Conflicts of Interest
The authors declare that they have no conflicts of interest.