OAB-YOLOv5: One-Anchor-Based YOLOv5 for Rotated Object Detection in Remote Sensing Images



Introduction
Object detection plays an important role in computer vision. Remote sensing images offer high resolution and a selectable observation range, and object detection in such images is of significant value in military and national defense applications. In recent years, object detection has been dominated by anchor-based detectors, which can be broadly categorized into one-stage [1][2][3][4][5][6] and two-stage [7][8][9][10] methods. One-stage methods usually place numerous preset anchor boxes on the image. Generally, anchor boxes with different proportions are preset by clustering, and the coordinates and category of each anchor box are refined several times. Finally, the anchor boxes that survive screening are taken as the detection results. Compared with one-stage methods, two-stage methods refine the anchor boxes further and achieve higher accuracy, while one-stage methods maintain faster detection speed. With the emergence of feature pyramid networks (FPN) [11], the accuracy gap between one-stage and two-stage methods has narrowed to some extent.
Because anchor-based detectors rely on numerous preset anchor boxes, research has gradually shifted from anchor-based to anchor-free detectors. One approach locates several predefined or self-learned keypoints that bound the spatial extent of the object; this is called the keypoint method [12]. Another approach defines positive samples using the center point or center region of the object and predicts the four distances (up, down, left, and right) from each positive sample to the object boundary; this type of anchor-free detection is called the center-based method [13]. It eliminates the hyperparameters related to anchors and has good generalization ability.
However, the performance of anchor-free detectors cannot yet catch up with that of anchor-based ones. There are two main differences between the two families, which we illustrate with RetinaNet [14], YOLOv3 [6], and FCOS [13]. (1) Allocation strategy of positive samples: RetinaNet uses an IoU filter and takes an anchor as a positive sample if the IoU between the default anchor and the ground truth (GT) box is greater than a threshold. YOLOv3 compares the width and height of the anchors with those of the GT box; if the ratio is within the preset width and height ratio hyperparameter, the anchor is a positive sample. FCOS takes all points inside a GT bounding box as positive samples. (2) Regression targets: RetinaNet and YOLOv3 regress the offset of the box relative to the anchors, while FCOS regresses the distances of the upper-left and lower-right corner points relative to the anchor point. For the moment, anchor-based detectors achieve higher performance.
Most anchor-based detectors densely preset anchors of three different scales at each location of the feature map. For arbitrarily oriented objects, additional anchors are set at different angle intervals. Such numerous preset anchors lead to an extreme imbalance between positive and negative samples. The most common solution is to control the candidate ratio through a specific sampling strategy [15,16], but both of these strategies still suffer from uneven positive and negative samples. Several studies have addressed this problem. For example, ATSS [17] and dynamic R-CNN [18] adaptively select high-quality positive samples. However, these studies only consider the noise of positive samples and ignore the potential localization ability of numerous negative samples and the credibility of IoU. HAMBox [19] shows that low-quality negative samples can achieve high-quality localization. ATSS [17], DAL [20], and FCOS [13] show that adding high-quality positive sample anchors significantly accelerates convergence.
In aerial image scenes, images are generally captured from a top-view angle, and the objects of interest, such as cars, planes, and ships, are usually relatively small and occupy only a few pixels of the image. According to DOTA [21], remote sensing images pose the following challenges: (1) complex background: aerial images usually contain complex scenes, and the target is easily overwhelmed by its surroundings, resulting in missed or false detections; (2) huge scale variations: the scale of the target varies greatly; (3) dense arrangement: the detected objects are sometimes densely or sparsely arranged; and (4) small objects: following the MS COCO [11] definition of large, medium, and small targets, approximately 60% of the targets in DOTA are smaller than 50 pixels.
Due to the complex background and the huge variation in the orientation, scale, and appearance of object instances in remote sensing images, it is difficult to apply horizontal detection algorithms to rotated object detection. To predict the location and orientation of rotated objects in remote sensing images, previous rotation detection algorithms [22][23][24][25][26][27][28][29] use preset rotated anchors and additional angle prediction. Because orientation varies, numerous anchors must be preset on the feature map to align them spatially with the GT boxes. Other methods use horizontal anchors to detect rotated objects. For example, RoI transformer [23] uses horizontal anchors but learns rotated RoIs through spatial transformations, which reduces the number of predefined anchors to some extent. Rotate-YOLOv5 [29] uses CIoU as the bounding-box loss function and mosaic data enhancement to improve detection accuracy while maintaining detection speed. R3Det [30] uses cascaded regression and re-encodes the refined boxes to achieve high performance; however, it must lay numerous anchor frames on the feature map, and there is significant redundancy in the distribution of anchor frames in the rotation scenario.
In this study, we discuss the proposed method on the DOTA dataset, which is representative and challenging; the problem discussed is universal in detection algorithms. Inspired by FCOS [13], YOLO [4], ATSS [17], and Rotate-YOLOv5 [29], we analyze both the positive and negative sampling strategies of the existing mainstream algorithms and the advantages of anchor-based and anchor-free methods. On this basis, we propose a remote sensing object detection algorithm based on a one-anchor-based method. It alleviates the problems of the IoU and shape-matching strategies and reduces the number of hyperparameters to design. Experiments were performed on the DOTA dataset to support the analysis and conclusions. The main contributions of this study are as follows:
(i) The characteristics of the matching strategies based on IoU and shape are analyzed, showing that it is not necessary to set anchor frames with multiple proportions on the same anchor point.
(ii) Combining the ideas of anchor-based and anchor-free methods, a screening strategy for positive and negative samples based on the one-anchor-based (OAB) method is proposed.
(iii) The self-attention mechanism of the vision transformer is introduced to weaken the complex background information in remote sensing scenes, strengthen the extraction of useful information, and increase the overall detection performance.

2. Proposed Methods
2.1. Network. Object detection in remote sensing images must consider both efficiency and accuracy, and the algorithm should be portable. As an improved version of YOLOv3 [6] and YOLOv4 [1], YOLOv5 has a similar basic architecture and good portability. YOLOv5 was therefore chosen as the baseline to balance detection performance and speed. The pipeline of the network structure is illustrated in Figure 1. We used cross-stage partial connections (CSP) [1] as the backbone.
At the top of the backbone network, we added a vision transformer (ViT) [31] module to connect to the top of the neck. This allows the network to focus on key information and better learn specific target features. For the detection head part of the network, we added a point-spacing branch to each layer to suppress the regression box far from the center point of GT and improve the detection accuracy. This is similar to the centerness of FCOS [13].
2.2. One-Anchor-Based Method for YOLOv5. An important part of an anchor-based detector is the sampling strategy for positive and negative samples. Currently, there are two mainstream sampling strategies: one based on IoU and the other based on shape. The IoU-based strategy sets an IoU threshold and combines it with a sampling step. When the IoU between an anchor and a GT box is greater than the threshold, the anchor is collected as a positive sample. The sampling step controls the number of anchors: the smaller the step, the more anchors are generated and the more positive samples are matched, but also the more redundant negative samples are collected. Conversely, a larger step yields fewer positive and negative samples, so the sampling threshold for IoU matching needs to be set carefully. This is hard to do and easily leads to the loss of small targets, and an imbalance of positive and negative samples exists, especially in remote sensing images. The sampling strategy based on shape matching is relatively simple, more flexible, and has fewer hyperparameters. With unreasonable anchor settings, the IoU-based strategy may leave a GT box without any corresponding anchor, so that it becomes an ignored region; this allocation scheme therefore leads to relatively few positive samples. The shape-based strategy guarantees that each GT box has a unique anchor. The threshold is not fully considered: by comparing the anchor aspect ratio with the threshold, the anchor with the maximum IoU is taken as positive, and even if the maximum IoU is less than the ignore threshold, the prediction box remains a positive sample. Otherwise, it is negative. However, more anchor frames need to be preset to match targets of different scales.
Because targets in real environments differ in size, a large number of anchor aspect ratios must be set in advance to fit them, which greatly increases the amount of computation and lowers efficiency. In this section, we analyze the differences between the IoU-based and shape-based label collection methods. Subsequently, we solve their problems using the OAB method. Finally, we introduce the self-attention mechanism of ViT [31] to enhance the global reasoning ability of the network over the feature map and improve detection accuracy.

Label Assignment Based on IoU and Shape Strategy
(1) Based on the IoU Strategy. As shown in Figure 2, red represents the GT box, yellow represents the grid of the feature map divided according to the sampling stride, and stride denotes the sampling stride. The FPN generates feature maps of large, medium, and small scales; each scale feature map predicts targets of the corresponding scale. In the sampling process, the sampling step of the anchor frame grows as the resolution of the feature map decreases. Generally, for feature maps of large, medium, and small targets, the sampling step size is set to 8
(2) Based on the Shape Strategy. As shown in Figure 3, two GT boxes with a large scale difference are used to illustrate the problems of the shape-based matching strategy. Red represents the GT box. In the shape matching strategy, the ratio between the width and height of the preset-anchor frame and those of the GT box is calculated. Subsequently, a hyperparameter threshold (anchor_ratio_thres) is applied to this ratio to divide the positive and negative samples. If the ratio between the preset-anchor frame and the GT box lies within (1/anchor_ratio_thres, anchor_ratio_thres), the sample is positive. The GT box in the upper left corner is a small target, whereas the one in the lower right corner is a large target. Red represents the default anchor frame. The aspect ratio of the default anchor frame is very different from that of the red GT box in the upper left corner. Therefore, such small targets are likely to be ignored, leaving no positive sample to predict them, while the aircraft in the lower right corner is well matched. The shape-based matching strategy matches more positive samples by setting a larger range of aspect ratios. Compared with the IoU-based matching strategy, this method is more flexible and has fewer hyperparameters. However, more anchor frames need to be preset to match targets of different scales.
In the real world, especially in aerial images, the target scale varies significantly, and there are targets that are very large or small. Therefore, once the range of the aspect ratio is set improperly, some objects lose positive samples, resulting in poor detection performance of the corresponding categories.
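The contrast between the two assignment rules can be illustrated with a minimal sketch. It assumes that boxes are compared by width and height only (as if they shared a center); the three anchor sizes, the IoU threshold of 0.5, and the default `anchor_ratio_thres` of 4.0 are illustrative values, not settings taken from this paper, and the function names are hypothetical.

```python
def iou_wh(anchor_wh, gt_wh):
    """IoU of two boxes assumed to share the same center, from (w, h) only."""
    inter = min(anchor_wh[0], gt_wh[0]) * min(anchor_wh[1], gt_wh[1])
    union = anchor_wh[0] * anchor_wh[1] + gt_wh[0] * gt_wh[1] - inter
    return inter / union


def iou_match(anchor_wh, gt_wh, iou_thres=0.5):
    """IoU-based rule: positive if the IoU exceeds the threshold (assumed 0.5)."""
    return iou_wh(anchor_wh, gt_wh) > iou_thres


def shape_match(anchor_wh, gt_wh, anchor_ratio_thres=4.0):
    """Shape-based rule: positive if both the width ratio and the height ratio
    lie within (1/anchor_ratio_thres, anchor_ratio_thres)."""
    ratios = (gt_wh[0] / anchor_wh[0], gt_wh[1] / anchor_wh[1])
    return all(1.0 / anchor_ratio_thres < r < anchor_ratio_thres for r in ratios)


# Illustrative anchors and two GT sizes with a large scale difference.
anchors = [(10, 13), (16, 30), (33, 23)]
small_gt, large_gt = (2, 2), (20, 20)
small_hits = [shape_match(a, small_gt) for a in anchors]
large_hits = [shape_match(a, large_gt) for a in anchors]
```

With these illustrative numbers, the 2 × 2 target fails the ratio test against every anchor and would receive no positive sample, while the 20 × 20 target matches all three anchors.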

2.2.2. One-Anchor-Based Sampling Strategy. During data preprocessing, the coordinates of the GT boxes were normalized. We counted the distribution of the normalized coordinates and mapped them into grid cells of size 1 × 1. The results are shown in Figure 4(a). We found that most objects were located in the center of the grid. Based on this finding, we chose the grid intersections around each GT center point, instead of the center of each grid, as the centers of the positive samples to speed up the convergence of regression. As shown in Figure 4(b), the stride size of each layer was set to 1. The feature map of each layer of the FPN was divided into N × N grids, and the center point $(g_x, g_y)$ of each lattice point in the grid was calculated. For the center point $(c_x, c_y)$ of each real label, a rectangle with a fixed radius $r = 1$ was generated around it, which is defined as the grid box. If the location $(g_x, g_y)$ falls within the grid box, the location is regarded as a positive sample, and its category label is obj (foreground class). Otherwise, it is a negative sample and obj = 0 (background class). In addition to classification, there is a 5-dimensional real vector $t = (t_x, t_y, t_w, t_h, \mathrm{obj})$ as the regression target for this position. Notably, the coordinate regression range of the bounding box (bbox) in YOLOv5 is −0.5 to 1.5, which is used for sample expansion. In the proposed method, by changing the sampling method, the regression range of the $(x, y)$ coordinates becomes −1 to 1. As shown in Figure 5, the regression of width and height follows YOLOv3 [6]. If the cell is offset from the top-left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the inference regression targets for the location can be formulated as

2.3. Vision Transformer. Generally, the background of aerial datasets is complex, which reduces the localization ability of the model.
The self-attention mechanism of ViT [31] enables the network to perform global reasoning on the image and on the predicted specific target. The model can attend to other areas of the image to help determine the target in the bounding box. In contrast, traditional detection models can only predict each target in isolation. Therefore, we introduce ViT [31] to suppress background noise and strengthen the localization ability of the model.
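To illustrate what global reasoning means here, the following is a minimal single-head scaled dot-product self-attention sketch in pure Python. It is not the ViT module used in this paper: it assumes the feature map has already been flattened into a list of token vectors, omits positional encoding and multiple heads, and takes the projection matrices `wq`, `wk`, and `wv` as explicit (hypothetical) arguments.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(tokens, wq, wk, wv):
    """Each token (one feature-map location) is re-expressed as a weighted
    sum over all tokens, so every location can use image-wide context."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three toy tokens with identity projections (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn_weights = self_attention(tokens, identity, identity, identity)
```

Every row of `attn_weights` is strictly positive and sums to one, which is the sense in which each location draws on all other areas of the image.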

Regression Loss.
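In the DOTA [21] dataset, most targets are small and densely arranged. IoU evaluates the predicted box as a whole, but the traditional IoU only measures the overlap area between the predicted box and the GT box. For the DOTA [21] dataset, the overlap area, central point distance, and aspect ratio of the bounding boxes are considered comprehensively. Therefore, CIoU loss [32] is adopted to perform the regression of the bounding boxes, and, following [32], the loss function can be defined as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v,$$

where $b$ and $b^{gt}$ denote the center points of the predicted box and the GT box, $\rho(\cdot)$ is the Euclidean distance, $c$ is the diagonal length of the smallest box enclosing both boxes, and $v$ measures the consistency of the aspect ratio:

$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}.$$

And the trade-off parameter α is defined as

$$\alpha = \frac{v}{(1 - IoU) + v}.$$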
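A minimal sketch of the CIoU computation of [32] for axis-aligned boxes is given here for illustration. It assumes boxes in `(cx, cy, w, h)` format with positive width and height, and adds a small `eps` for numerical stability; both choices, and the function name, are assumptions rather than details taken from this paper.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (cx, cy, w, h)."""
    # Corner coordinates of both boxes.
    x1, y1 = box[0] - box[2] / 2, box[1] - box[3] / 2
    x2, y2 = box[0] + box[2] / 2, box[1] + box[3] / 2
    gx1, gy1 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2
    gx2, gy2 = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2

    # Overlap area term (plain IoU).
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = box[2] * box[3] + gt[2] * gt[3] - inter
    iou = inter / (union + eps)

    # Central point distance, normalized by the enclosing-box diagonal.
    enclose_w = max(x2, gx2) - min(x1, gx1)
    enclose_h = max(y2, gy2) - min(y1, gy1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps
    rho2 = (box[0] - gt[0]) ** 2 + (box[1] - gt[1]) ** 2

    # Aspect-ratio consistency v and trade-off parameter alpha.
    v = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3]) - math.atan(box[2] / box[3])) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes the loss is approximately zero, whereas for disjoint boxes the distance term keeps it above one, so the loss still provides a gradient when the boxes do not overlap.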

Angle Loss.
Angle regression is a difficult problem in rotation tasks. Therefore, we introduce CSL [33] as the angle regression method and apply it to both the baseline YOLOv5 and the proposed method. CSL [33] transforms the angle prediction task from a regression problem into a classification problem to solve the discontinuous boundary problem in rotated detectors. Please refer to [33] for further details. Finally, the expression of the angle regression is as follows: Variables $\theta'$, $\theta_a$, and $t_{\theta'}$ are the ground truth angle, anchor angle, and predicted angle, respectively.
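The idea of turning angle regression into classification can be sketched as follows. This is a minimal illustration in the spirit of CSL [33], not the exact configuration used in this paper: the 180 one-degree bins, the Gaussian window, the window radius of 6, and `sigma = 2.0` are all assumed values, and the function name is hypothetical.

```python
import math


def circular_smooth_label(angle_bin, num_bins=180, radius=6, sigma=2.0):
    """Soft classification target for an angle bin.

    Bins near the true angle get a Gaussian weight, and the distance is
    measured circularly so that bin 179 and bin 0 are neighbors."""
    label = []
    for i in range(num_bins):
        diff = abs(i - angle_bin)
        dist = min(diff, num_bins - diff)  # circular distance between bins
        if dist < radius:
            label.append(math.exp(-dist * dist / (2 * sigma * sigma)))
        else:
            label.append(0.0)
    return label


# An angle at the boundary of the range: the label wraps around to bin 0.
boundary_label = circular_smooth_label(179)
```

Because the window wraps around, an angle at the boundary of the range assigns the same weight to bin 0 as to bin 178, which is how the discontinuous boundary problem is avoided.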

Experiment Results and Discussion

DOTA Dataset and Parameter Settings
3.1.1. DOTA-v1.5. DOTA [21] is a large-scale dataset for object detection in remote sensing images. DOTA-v1.0 contains 2806 large aerial images ranging in size from 800 × 800 to 4000 × 4000 pixels and 188,282 instances across 15 common categories. DOTA-v1.5 uses the same images as DOTA-v1.0 but additionally annotates extremely small instances (less than 10 pixels). Moreover, a new category, "container crane," is added. DOTA-v1.5 contains 403,318 instances in total and is thus more challenging than DOTA-v1.0. The version of the DOTA dataset used in this experiment is DOTA-v1.5. The training, validation, and test sets account for 1/2, 1/6, and 1/3 of DOTA-v1.5, respectively. We crop a series of 1024 × 1024 patches from the original images with an overlap of 200 pixels using the DOTA development kit. Subimages that do not contain targets are ignored.
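The cropping step can be sketched along one image axis as follows. This is an illustrative sliding-window computation, not the DOTA development kit itself: it assumes a stride of patch size minus overlap (1024 − 200 = 824) and assumes that the last window is shifted back so that it ends at the image border; the function name is hypothetical.

```python
def patch_origins(length, patch=1024, overlap=200):
    """Start offsets of sliding windows along one image axis."""
    stride = patch - overlap
    if length <= patch:
        return [0]  # the image is no larger than one patch
    origins = list(range(0, length - patch, stride))
    origins.append(length - patch)  # last window aligned to the border
    return origins


# Offsets along one side of a 4000-pixel image.
origins_4000 = patch_origins(4000)
```

For a 4000-pixel side this yields the offsets 0, 824, 1648, 2472, and 2976, so a 4000 × 4000 image would produce 5 × 5 candidate patches before empty subimages are discarded.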

Implementation Details.
The model is trained on DOTA-v1.5 for 120 epochs in total, with YOLOv5m as the pretrained model. The initial learning rate is 0.01, and a cosine annealing schedule is used to update the learning rate. The weight decay is set to 0.0005, and the SGD momentum is set to 0.937. In addition, a warm-up strategy is adopted so that a suitable learning rate is reached by the third epoch of training. Other experimental parameters are set as shown in Table 1. The training and test patches are 1024 × 1024. During inference, we first feed the patches (with the same settings as in training) to the network to obtain the detection results before merging, then map the detected results from patch coordinates to the original image coordinates, and perform nonmaximum suppression (NMS) on these results in the original image coordinates. Referring to benchmarks [34,35], we set a different NMS threshold for each class: "roundabout" 0.1, "tennis-court" 0.3, "swimming-pool" 0.1, "storage-tank" 0.2, "soccer-ball-field" 0.3, "small-vehicle" 0.2, "ship" 0.2, "plane" 0.3, "large-vehicle" 0.1, "helicopter" 0.2, "harbor" 0.0001, "ground-track-field" 0.3, "bridge" 0.0001, "basketball-court" 0.3, "baseball-diamond" 0.3, and "container-crane" 0.05. The maximum number of predicted targets is limited to 1000. After inference, the detection results are submitted to the DOTA official website at https://captain-whu.github.io/DOTA/evaluation.html for online evaluation on the test set and compared with the mainstream SOTA methods. The evaluation index is the mean of the per-category average precision (mean average precision, mAP).
The expression of mAP is as follows:

$$mAP = \frac{1}{K} \sum_{i=1}^{K} AP_i,$$

where AP is the average precision of each category, obtained by integrating the precision-recall curve PR(r), $AP_i$ represents the average precision of class $i$, and $K$ represents the number of classes.

We compare against ten peer techniques on DOTA-v1.5: DCL [36], RSDet [23], GWD [37], KLD [38], R2-CNN [39], Rotate-YOLOv5 [29], RetinaNet [14], MR [40], CMR [41], and FR OBB [21]. Specifically, Zhuang et al. [29] proposed Rotate-YOLOv5, which is the work most relevant to ours. We propose the one-anchor-based method as a new sampling strategy that better balances the positive and negative samples of small targets, and we add a ViT between the backbone and neck to reduce background interference and increase the focus on the target. In contrast, they used mosaic data enhancement to enrich the dataset and improve the detection accuracy of small targets. They then used the long-edge definition method based on circular smooth labels to obtain rotatable bounding boxes, which removes the effect of angle periodicity on training by converting the regression problem into a classification problem. Finally, they used the CIoU loss as the bounding-box loss function to improve detection accuracy while maintaining detection speed.

Experiment
In this work, we focus on rotated object detection on the DOTA-v1.5 dataset, and all results are evaluated on the official website to ensure the reliability of the experimental results.
The results in Table 2 show the effectiveness and superiority of the proposed method. The CC class has the fewest instances in DOTA-v1.5, and the AP of the "CC" class is very low, close to 0%, for all compared methods except KLD. We believe that this is because of the imbalance of positive and negative samples caused by the sampling strategy. The OAB method proposed in this study ensures that the sample ratio of each label is stable and does not let the positive and negative sample ratio fluctuate with the IoU threshold or the object size, thus mitigating the long-tail effect caused by imbalanced sample sizes between categories. As shown in Table 2, OAB-YOLOv5 achieves the highest AP on the "CC" class, 28.15%. The experimental results show that OAB-YOLOv5 performs very well in oriented object detection.

Ablation Study.
To further demonstrate the effectiveness of the proposed sampling strategy and the influence of the ViT [31] module on the overall performance, we first evaluated the proposed sampling method alone. Without the ViT [31] module, the sampling method in this study achieved performance comparable to the baseline and a 5.45% improvement for category "CC." Overall, the mAP over the 16 categories was the same. In addition, we reduced the number of parameters by 1 M and did not need to set additional hyperparameters during the sampling phase. Therefore, we achieved results similar to those of the baseline method with less complexity, which proves the effectiveness of the proposed sampling method. Finally, we introduced the ViT [31] to reduce the interference of background factors in remote sensing images, enable the network to reason over the whole image, and strengthen the ability of the model. The precision of the algorithm is further tested to prove the effectiveness of the overall approach. Table 3 shows the experimental results.
Finally, we evaluated the speed of the baseline method and the proposed method. The test resolution was 1024 × 1024, and the batch size was 8. The results, in terms of inference time and NMS time, are shown in Table 4. Because the method in this study reduces the number of designed anchors, the time spent on NMS is reduced by approximately 11% compared with the baseline method. In conclusion, the proposed method achieves a better balance between speed and accuracy than the baseline method.

Detection Effect and Analysis
The detection results of some categories are visualized at the end of this study. The detection confidence and IoU thresholds are set to 0.1 and 0.6, respectively. The results are shown in Figures 6(a) and 6(b). The results of the proposed method, in the right column, are cleaner and better than those of the baseline method. In the results obtained using the baseline method, there are many disorderly boxes because each label is matched to preset anchors of several different scales, which produces redundant detection boxes. In contrast, the method in this study is simpler and more efficient.

Conclusions
In this study, we proposed a screening strategy based on a single anchor frame to achieve high-performance arbitrarily oriented object detection in remote sensing images. Specifically, the characteristics of the two matching methods based on IoU and shape were analyzed, and their shortcomings were identified; the analysis shows that it is unnecessary to preset multiple anchors. The proposed one-anchor-based (OAB) method presets a single anchor by combining the ideas of anchor-based and anchor-free detection and adopts the central point method for sampling. To obtain high-quality samples, the grid points around each real label were used as the sampling benchmark, which reduced the hyperparameter design of the matching part and ensured that each GT had a corresponding positive sample for prediction. The validity of this idea was verified on the challenging DOTA dataset.
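The sampling rule summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the GT center is assumed to be expressed in grid units on an N × N feature map, candidate locations are assumed to be the integer grid intersections, the grid box test is assumed to be a strict inequality with radius r = 1, and the function names are hypothetical.

```python
import math


def oab_positive_points(cx, cy, n, r=1.0):
    """Grid intersections (gx, gy) that fall strictly inside the grid box of
    radius r around the GT center (cx, cy), on an n x n grid."""
    points = []
    for gx in range(max(0, math.ceil(cx - r)), min(n, math.floor(cx + r)) + 1):
        for gy in range(max(0, math.ceil(cy - r)), min(n, math.floor(cy + r)) + 1):
            if abs(gx - cx) < r and abs(gy - cy) < r:
                points.append((gx, gy))
    return points


def xy_targets(cx, cy, points):
    """Center offsets to regress from each positive location."""
    return [(cx - gx, cy - gy) for gx, gy in points]


# A GT center inside cell (2, 4) of an 8 x 8 grid.
positives = oab_positive_points(2.3, 4.7, 8)
offsets = xy_targets(2.3, 4.7, positives)
```

Under these assumptions, a GT center inside a cell is assigned to the four surrounding intersections, and every offset lies in the open interval (−1, 1), which is consistent with the regression range stated in Section 2.2.2.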