Research Article Scale Adaptive Feature Pyramid Networks for 2D Object Detection


Most of these methods use a DCNN such as VGG [1] or ResNet [12] as their "backbone" network to extract features, to detect the object area (bounding box), and to classify the object in the bounding box. Some object detection algorithms, such as the faster R-CNN, use a feature map at a fixed resolution level for the task. Others, such as RetinaNet, use a hierarchy of feature maps having different resolutions. The single-resolution feature map of the faster R-CNN (e.g., C5 in Figure 1(a)) has high semantic content induced by the later CNN layers of the backbone network. However, its resolution is low, so a small object in the input image may be missed (in Figure 1, thin and thick borders of the feature maps indicate their low and high semantic content, respectively). One could use earlier layers of the backbone CNN as feature maps for object area detection and classification. However, these earlier feature maps (e.g., C1 or C2 in Figure 1(a)) have lower semantic content while having higher resolution. This leads to inaccurate object areas (bounding boxes) and inaccurate classification labels for objects in those areas.
In recent years, multilayer feature maps have been proposed to deal with this issue, significantly improving the accuracy of small object detection (e.g., SSD, FPN, and RetinaNet). For example, SSD combines a multiscale set of feature maps to localize and identify objects. However, its ability to detect smaller objects is still not satisfactory. It is also computationally expensive, due in part to its choice of base network, VGG.
Lin et al. [11] added the feature pyramid network (FPN) to the faster R-CNN for efficient yet accurate detection of objects having varying scales. The FPN, illustrated in Figure 1(b), consists of a pair of multiscale pyramids of feature maps. The bottom-up pathway is a pyramid induced by the backbone CNN, starting from the highest-resolution yet least semantic input image. The bottom-up pyramid produces a highly semantic yet lowest-resolution feature map (C5 in Figure 1(b)) at the top of the pyramid. The top-down pyramid starts with the low-resolution yet highly semantic image handed over from the bottom-up pyramid (P5 in Figure 1(b)) and is generated by repeated up-sampling. The top-down pyramid feature maps are enriched by fusing in information passed on from feature maps of the bottom-up pathway via the lateral connections. The feature maps of the top-down pyramid, which simultaneously have high semantic content as well as high resolution, are used for object area detection and classification.
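The top-down fusion described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: it assumes nearest-neighbor 2× up-sampling, omits the 1 × 1 lateral and 3 × 3 smoothing convolutions of the real FPN, and assumes all maps already share the same depth.

```python
import numpy as np

def upsample2x(p):
    """Nearest-neighbor 2x up-sampling of an (H, W, D) feature map."""
    return p.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(c_maps):
    """Build top-down maps [P2, P3, P4, P5] from bottom-up maps [C2, C3, C4, C5].

    Each C_i is an (H_i, W_i, D) array with H_{i+1} = H_i / 2.
    """
    p = c_maps[-1]                      # P5 comes from C5 (1x1 conv omitted here)
    p_maps = [p]
    for c in reversed(c_maps[:-1]):     # C4, C3, C2
        # Fixed-weight fusion of the original FPN: equal 0.5 / 0.5 weights.
        p = 0.5 * c + 0.5 * upsample2x(p)
        p_maps.append(p)
    return list(reversed(p_maps))       # [P2, P3, P4, P5]
```

Each returned map has the resolution of its bottom-up counterpart while carrying semantic information propagated down from the coarsest level.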
Several object detection network architectures that have appeared since incorporate the FPN. For example, the single-stage detector RetinaNet combines the FPN with a modern backbone network and a loss function called focal loss to achieve high speed as well as high accuracy. The accuracy of RetinaNet is comparable to that of the two-stage detector faster R-CNN. Focal loss is introduced to the RetinaNet training to alleviate positive/negative sample imbalance.
In the FPN, integration of information from the bottom-up pathway via lateral connections and information from the top-down pathway is done by using fixed-weight summation, as illustrated in Figure 1(b). However, the optimal ratio of the fusion depends on the size of objects in the input image. Obviously, the sizes of objects vary from image to image, and the distribution of object sizes depends on the dataset. The fixed weights used in the FPN to integrate bottom-up and top-down feature maps, which amount to an even-weight average of the two, are most likely suboptimal.

Figure 1: (a) The faster R-CNN [5] predicts object bounding boxes based on a highly semantic yet low-resolution feature map. Due to the low resolution of the feature map, small objects tend to be missed. (b) The feature pyramid network (FPN) [11] predicts object bounding boxes based on multiresolution, highly semantic feature maps formed by fusing feature maps of the top-down and bottom-up pathways. Adjacent-scale feature maps are integrated by using fixed weights of 0.5. RetinaNet [10] also uses this FPN. (c) The proposed scale adaptive FPN (SAFPN) integrates feature maps by using weights computed, at each resolution level, by the scale attention module (SAM) to suit the scales of objects in each input image.
In this paper, to improve both object localization and classification accuracy, we propose integrating feature maps of the bottom-up pathway and top-down pathway of the FPN using scale adaptive weights computed per input image. In the proposed approach, called scale adaptive FPN (SAFPN), the weights for feature map integration are learned from the images in the training dataset and computed per input image at inference time. The SAFPN may be viewed as an attention mechanism over image object scales and semantic levels. We evaluate our proposed approach's effectiveness by applying the SAFPN to two object detection architectures: the two-stage detector faster R-CNN and the one-stage detector RetinaNet.
Our contributions can be summarized as follows: (1) Proposal of the scale adaptive feature pyramid network (SAFPN): a novel method to fuse feature maps of the bottom-up pyramid and top-down pyramid of the FPN based on the input image. The weights for fusion are computed per input image at each resolution level by the scale attention module (SAM). The SAM is trained end-to-end along with the other parts of the object detection network.
(2) Experimental evaluation of the SAFPN: experimental evaluation of the proposed SAFPN using two representative object detection networks, a one-stage detector and a two-stage detector. The evaluation using PASCAL VOC 2007 and PASCAL VOC 2007 + 2012 as training datasets showed that the proposed SAFPN significantly improves object detection accuracy for both types of detectors.

The rest of this paper is organized as follows. In the next section, we review relevant work. In Section 3, we describe the proposed method, followed by its experimental evaluation in Section 4. Conclusion and future work are presented in Section 5.

Object Detection.
We review some of the previous methods of object detection in this section. In 2013, Girshick et al. proposed the R-CNN [2], which employs region proposals. While it uses a CNN, the R-CNN's overall architecture is influenced by traditional shallow object detection approaches. A set of object region proposals is extracted by the classical selective search approach. Then, each region proposal is rescaled to a prescribed size and fed to a CNN to extract features for the region. The R-CNN used AlexNet by Krizhevsky et al. [13], trained on the ImageNet dataset, for feature extraction. The feature extracted by the CNN is passed on to a linear support vector machine (SVM) to determine the object class.
In 2014, He et al. proposed spatial pyramid pooling networks (SPP-nets) [3] to handle objects in images having arbitrary size/scale. The SPP layer is located after the last convolution layer, right before the fully connected layers of a standard CNN pipeline. The SPP-net reduces the cost of detecting large objects by using pyramidal pooling.
In 2015, Girshick proposed the fast R-CNN [4], which is an improvement over the R-CNN and SPP-net. The fast R-CNN employs a single CNN pipeline for both object bounding box regression and object classification. The CNN is trained simultaneously for both the box regression and object classification objectives. This and other improvements brought significant speedup over the predecessor R-CNN. Note that region of interest (ROI) pooling in the fast R-CNN can be considered a special case of SPP.
Also in 2015, shortly after the fast R-CNN, Ren et al. proposed the faster R-CNN [5]. The faster R-CNN is the first DNN-based object detector trained end-to-end. It is also the first DNN-based object detector to perform at near real-time speed. The most important innovation of the faster R-CNN is its region proposal network (RPN), which proposes bounding boxes having high objectness. By sharing most of its processing with the main object detection network, the faster R-CNN is much more efficient than the fast R-CNN. The faster R-CNN uses a single-resolution feature map for region proposal and object classification (Figure 1(a)). As the feature map of a later layer of a standard CNN is highly semantic yet of low resolution, bounding box regression and object classification accuracy are limited.
Lin et al. proposed, in 2017, the feature pyramid network (FPN) [11] as an addition to the faster R-CNN. The idea of the feature pyramid had been popular in the pre-DNN era due to its ability to perform multiscale processing of images, e.g., for object recognition. However, it went out of favor in the early years of the DNN for its computational cost. The FPN couples a bottom-up pyramid inherent in a CNN with a top-down pyramid that performs up-sampling and deconvolution. Semantic information trickles down the top-down pathway from the small (low-resolution) yet highly semantic feature map to the high-resolution and highly semantic feature maps. The two pyramids are connected by lateral connections to pass on high-resolution information from the bottom-up network to the top-down network. The FPN has been adopted by other object detection networks, most notably the one-stage detector RetinaNet [10] by Lin et al.
Traditionally, a two-stage detector held an advantage in accuracy over a one-stage detector. However, RetinaNet, despite being a one-stage detector, achieved accuracy comparable to the two-stage detector faster R-CNN. RetinaNet combines the FPN and an improved backbone network with a new loss function called focal loss. Focal loss alleviates issues associated with the unbalanced number of samples between the object region (foreground) and its background.
The method proposed in this paper improves upon the FPN by adding a scale-space attention mechanism to adaptively compute, for each input image, weights for feature map integration. Given an input image, a trained scale attention module in the top-down pathway adaptively weights feature maps based on the scales of objects contained in the input image. For example, given an image containing smaller objects, higher-resolution feature maps are emphasized more. The scale attention module is trained along with the other parts of the object detection network using the training image dataset.

Attention Mechanisms.
Various forms of attention mechanisms have been observed in biological systems, i.e., human beings, as well as used in recent neural networks. The human visual system is immediately attracted to salient locations in the visual field. This behavior indicates that the human visual system assigns different importance to different locations of an image to perform the task at hand. The spatial attention mechanism has been applied in computer vision. For example, in [14], Pinheiro and Collobert proposed a CNN trained to put higher weight on pixels important in classifying an image. The CNN learns to perform segmentation tasks based on per-image class labels in a weakly supervised setting.
Another well-known example of attention is the adaptive weighting of channels in feature maps of neural networks. Fu et al. in [15] employed channel attention to selectively emphasize channels in feature maps, as well as spatial attention, for scene segmentation.
This paper applies the idea of attention to scale space, adaptively weighting the multiple-scale feature maps of the bottom-up pyramid and top-down pyramid of the feature pyramid network for fusion.

Overview.
Our proposed method, scale adaptive feature pyramid networks (SAFPNs), is an improvement over the original FPN. Feature pyramid networks (FPNs) [11], depicted in Figure 1(b), try to create a multiresolution feature pyramid in which feature maps at all resolution levels are highly semantic and, at the same time, have a high resolution for accurate object localization and classification.
This is achieved by a top-down pathway using up-sampling and by lateral connections linking the top-down pathway with the bottom-up pathway inherent in a CNN. The FPN uses fixed, equal weights to integrate information coming from the two pathways. The fixed weights used for the integration, however, are not optimal for every image.
Our SAFPN, illustrated in Figure 1(c), computes weights for the integration adaptively, per scale and per input image, so that the feature map at a scale concordant with the scale of objects in the input image is weighted more than the others. For example, for an image containing smaller objects, a higher-resolution feature map would be weighted more than the other feature maps. For an image containing larger-scale objects, on the other hand, a lower-resolution feature map would be weighted more than the others. The adaptive weighting is learned by the scale attention module (SAM), which is trained end-to-end with the other parts of the object detection network. The proposed SAFPN is versatile in that it applies to many different object detection architectures, including both 1-stage and 2-stage networks. We will later evaluate the SAFPN on both the faster R-CNN [5], a 2-stage method, and RetinaNet [10], a 1-stage method.

Adaptive Multiscale Feature Integration.
The scale adaptive feature pyramid network (SAFPN) is illustrated in Figure 1(c). The original FPN, illustrated in Figure 1(b), uses a fixed weight of 0.5 for both the feature maps C_i of the bottom-up pathway and P_i of the top-down pathway. Our SAFPN employs the scale attention module to compute the weights for integration adaptively for each input image. This is done by the SAM observing the strengths of the responses of feature maps at various scales in the bottom-up pathway induced by the backbone CNN.
Let us assume the SAFPN contains five scale levels, C = {C1, C2, C3, C4, C5}, in its bottom-up pathway induced by the convolutional layers of a CNN. A feature map C_i has height H, width W, and depth D. Of these multiple levels of feature maps in the pyramid, C1 is the highest-resolution yet least semantic feature map, while C5 is the lowest-resolution yet most semantic feature map. Similarly, let P = {P2, P3, P4, P5} be the set of feature maps generated by the top-down pathway formed by up-sampling. As shown in Figure 1(c), the lowest-resolution feature map at the top of the top-down pathway, P5, is obtained by reducing the number of channels of C5 using a 1 × 1 convolution. Each of the other feature maps P_i, where i = 2 ∼ 4, is computed by using the following equation:

P_i = w_{C_i} · C_i + w_{P_i} · Up(P_{i+1}),   (1)

where C_i and P_{i+1} are weighted by the (scalar) weights w_{C_i} and w_{P_i} = (1 − w_{C_i}), respectively, and Up(·) denotes 2× up-sampling. The weights are determined from the strengths of the responses of the feature maps C2, ..., C5 in the bottom-up pathway and are normalized using the softmax function, as in equation (2) and Figure 2:

w_{C_i} = exp(s_i) / Σ_{j=2..5} exp(s_j).   (2)

The strength of response s_i of the feature map C_i is computed as the L_p-norm of all the values in the feature map:

s_i = ‖C_i‖_p = (Σ_{h,w,d} |C_i(h, w, d)|^p)^{1/p}.   (3)

In the experiments below, we tried several different values of p; the L_p-norm corresponds to the L1-norm, L2-norm, and L∞-norm for p = 1, p = 2, and p = ∞, respectively. We also tried the L0.5-norm and the square of the L2-norm.

Note that only the feature maps of resolution levels i ∈ {2, 3, 4, 5} are used; the highest-resolution feature map C1 and its counterpart P1 are not used in our implementation, in part due to their large memory footprint.
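The weight computation described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the trained scale attention module itself: it assumes the softmax normalization of equation (2) runs across the four per-level response strengths, with no learned parameters.

```python
import numpy as np

def lp_strength(c, p=2.0):
    """Strength of response of feature map c: the L_p-norm over all
    of its values, as in equation (3)."""
    return float((np.abs(c) ** p).sum() ** (1.0 / p))

def scale_attention_weights(c_maps, p=2.0):
    """Per-level fusion weights for the maps [C2, C3, C4, C5].

    Assumption: softmax across the four resolution levels (eq. (2)),
    so a level whose bottom-up map responds more strongly receives a
    larger bottom-up weight w_C; the top-down weight is w_P = 1 - w_C.
    """
    s = np.array([lp_strength(c, p) for c in c_maps])
    s = s - s.max()                      # subtract max for numerical stability
    w_c = np.exp(s) / np.exp(s).sum()    # softmax over the four levels
    return w_c, 1.0 - w_c
```

For an input whose small objects make the high-resolution bottom-up maps respond strongly, the corresponding w_{C_i} grows, matching the qualitative behavior described in the introduction.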

Experimental Settings.
To evaluate the proposed scale adaptive feature pyramid network (SAFPN), we conduct a set of experiments over the following five variations of networks: (1) FRCNN: the original faster R-CNN [5] that uses a single-resolution feature map (i.e., no FPN). (2) FRCNN-FPN: the faster R-CNN with the multiresolution feature map FPN [11] added. (3) FRCNN-SAFPN: the faster R-CNN with the proposed SAFPN. (4) RetinaNet: the original RetinaNet [10] with the FPN. (5) RetinaNet-SAFPN: RetinaNet with the proposed SAFPN.

The original FPN [11] and RetinaNet [10] use fixed weights (w_{C_i}, w_{P_i}) = (0.5, 0.5) for integration. We also experimented with different values of fixed weights to see how they affect object detection accuracy. The two networks using the proposed approach, the FRCNN-SAFPN (3) and the RetinaNet-SAFPN (5), are compared against the others, (1), (2), and (4). Recall that the original RetinaNet (4) uses the FPN with fixed weights.

We added our SAFPN to the faster R-CNN and RetinaNet reimplementations by a group called UCAS-Det, downloaded from [16, 17]. UCAS-Det makes these networks available with several different backbone networks, e.g., ResNet-50 and ResNet-101. For both the faster R-CNN and RetinaNet, we chose ResNet_v1_101 [12], pretrained using the ILSVRC-2012-CLS image classification dataset, as the backbone. Both networks are then trained using either the Pascal VOC 2007 trainval or the Pascal VOC 2007 + 2012 joint trainval dataset. The pixel resolution of images in these datasets varies, but the majority of images are either 500 × 375 (landscape) or 375 × 500 (portrait).

The training used the Adam [18] optimizer with momentum 0.9 and minibatch size 1. The small minibatch size of 1 is due to the memory limitation (5 GByte) of the GPU we used, an Nvidia Tesla K20. Training is done for 100,000 epochs for the Pascal VOC 2007 trainval dataset and 150,000 epochs for the Pascal VOC 2007 + 2012 joint trainval dataset. The learning rate is manually scheduled, starting at 1 × 10^-3 and reduced to 1 × 10^-4 at the 50,000th epoch and to 1 × 10^-5 at the 70,000th epoch.
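The manual learning-rate schedule above amounts to a simple step function; the following sketch treats it as step-wise constant, which the text implies but does not state outright:

```python
def learning_rate(epoch):
    """Manually scheduled learning rate described in the text:
    1e-3 up to the 50,000th epoch, 1e-4 up to the 70,000th epoch,
    and 1e-5 afterwards."""
    if epoch < 50_000:
        return 1e-3
    if epoch < 70_000:
        return 1e-4
    return 1e-5
```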
Table 1 shows the number of images and the number of objects in the Pascal VOC 2007 and Pascal VOC 2012 datasets. Both datasets consist of 20 categories. Tables 2 and 3 compare accuracies in mAP [%] of the five cases listed above. They also compare the L1-norm, L∞-norm, and L2-norm for computing the strength of response of a feature map C_i in determining the weights for feature map fusion. The results in both Tables 2 and 3 show that, for both the 2-stage network faster R-CNN [5] and the 1-stage network RetinaNet [10], the proposed SAFPN produces the highest accuracy among those compared.

Experimental Results.
In Table 2, networks are trained using the Pascal VOC 2007 trainval dataset. In the table, the FRCNN-SAFPN produced an mAP of 78.3%, which is significantly better than the 74.6% of the original FRCNN (without FPN) and the 76.1% of the FRCNN-FPN that uses fixed weights (w_{C_i}, w_{P_i}) = (0.5, 0.5). In the case of RetinaNet, mAP improved from 73.3% using the original FPN to 74.2% using the proposed SAFPN. Of the various norms tried for pooling the responses, the square of the L2-norm, (L2)^2, shows the highest accuracy, very closely followed by the L2-norm.
In Table 3, networks are trained using the Pascal VOC 2007 + 2012 joint trainval dataset. Accuracy under fixed-weight integration is also compared against the cases using the SAFPN, which appear as horizontal lines. We tried several values of (w_{C_i}, w_{P_i}) and plotted accuracy against them. Accuracy varies depending on (w_{C_i}, w_{P_i}), but the proposed adaptive SAFPN performs equal to or better than the hand-tuned fixed-weight integration.
In the case of hand-tuned integration, for both RetinaNet and the faster R-CNN, weighting the bottom-up pyramid feature maps C_i more than the top-down pyramid feature maps P_i seems to produce higher accuracy.

Figure 5 shows examples of object detection using the FRCNN-FPN and FRCNN-SAFPN. As indicated in Figure 5, the FRCNN-FPN uses (w_{C_i}, w_{P_i}) = (0.5, 0.5) for every resolution level of its FPN. The SAFPN, on the other hand, uses adaptively computed values of w_{C_i} determined from each input image for each of the four resolution levels. Using fixed weights, a small cow in the background (A), as well as a small cow close to the center (C), is not detected. Using the SAFPN, however, such small-sized objects are detected, as in (C) and (D). Note that, for the images having small objects, i.e., (B) and (D), the SAFPN weighs the higher-resolution feature maps more heavily.

Conclusion
In this paper, we tackled the issue of correctly detecting objects having varying scales, especially those having small scales, from images.
With a single low-resolution feature map, such as the one found in the faster R-CNN [5], localizing and classifying small objects is difficult. The feature pyramid network (FPN) [11] significantly improved object detection accuracy by using pyramids of multiple-resolution feature maps that provide high semantic content at multiple resolution levels. However, its ability to detect small-scale objects is still limited. We conjectured that the limitation is in part due to the fixed weights used in the integration of feature maps between the bottom-up pyramid and the top-down pyramid.
This paper proposed an improvement to the FPN [11] called the scale adaptive feature pyramid network (SAFPN), which adaptively determines the weights for feature map fusion per input image.

We performed a set of experiments using both the 2-stage network faster R-CNN [5] and the 1-stage network RetinaNet [10], both modified with the SAFPN. The set of experiments has shown that the proposed SAFPN significantly improves object detection accuracy over the FPN. Accuracy measured in mean average precision (mAP) for the original faster R-CNN and the faster R-CNN with the (fixed-weight) FPN is 74.6% and 76.1%, respectively, when trained using the Pascal VOC 2007 dataset. The faster R-CNN with the proposed SAFPN improved the mAP value to 78.3%. For RetinaNet, replacing its (fixed-weight) FPN with the proposed SAFPN improved accuracy in mAP from 73.3% to 74.2%.
Future work includes combining the proposed scale-space attention mechanism with some form of spatial attention mechanism to further improve accuracy.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.