Object Detection Algorithm Based on Improved Feature Pyramid

Feature pyramid networks are widely used in advanced object detection. By simply changing the network connections, the performance of small object detection can be greatly improved without increasing the amount of computation of the original model. However, the algorithm still has some shortcomings. Therefore, a new attention feature pyramid network (attention FPN) is proposed. Firstly, an improved receptive field module is added, which can make full use of global and local context information. Secondly, the connection mode of the pyramid is further optimized: deconvolution is used to replace nearest neighbor interpolation in top-down up-sampling, and a channel attention module is added to the horizontal connection to highlight important context information. Finally, adaptive fusion and spatial feature balance are used for each feature pyramid layer, so that the network can learn the weights of different feature layers and each pyramid layer contains more discriminative information. Attention FPN is tested on the Pascal VOC and MS COCO datasets, respectively. The experimental results reveal that the proposed method has better performance than the original algorithm. Therefore, attention FPN is an effective algorithm.


Introduction
In recent years, object detection algorithms based on deep learning have been widely used in various fields, such as face detection, vehicle-pedestrian detection, dangerous goods detection [1], transmission line defect identification [2], and so on. These object detection algorithms can be divided into two main categories. The first type is a two-stage algorithm based on a region proposal network, including Faster R-CNN [3], R-FCN [4], and others. The other type is a regression-based one-stage algorithm, such as YOLO [5], SSD [6], and others. However, no matter which type of algorithm it is, it faces the difficult problem of poor detection of small targets. In order to solve the problem of small target detection, the Feature Pyramid Network (FPN) was proposed in literature [7], which has been widely used in two-stage detection algorithms (such as Mask R-CNN [8]) and one-stage detection algorithms (such as RetinaNet [9]).
Based on the structure of a ConvNet, the FPN adopts top-down and horizontal connections to transfer top-layer semantic information to the lower layers. However, some top-layer information is lost in the process. Features of different scales contain information from different abstraction levels, so direct addition introduces large semantic differences, and the pyramids formed at each layer are not fused any further. Therefore, various variants of the Feature Pyramid Network have been proposed.
In literature [10], all information of a multilevel structure was used to generate a multilevel contextual feature pyramid with multiple scales. In literature [11], the authors proposed a global information extractor and a local information extractor as a framework for single-stage object detection. Yang et al. [12] used a multilayer feature map stacking method to fuse semantic information and detailed features. PANet [13] added an additional bottom-up path on the basis of FPN to transfer low-layer location information to the high layers. Libra R-CNN [14] designed a balanced feature pyramid in which each layer of the pyramid has the information of all layers. NAS-FPN [15] used neural architecture search to find an irregular feature network topology and then repeatedly applied the same block, achieving outstanding results. EfficientDet [16] used EfficientNet as the feature extraction network to design BiFPN, which has efficient two-way cross-scale connections and weighted feature fusion with a better precision-efficiency trade-off.
This paper proposes an improved feature pyramid network based on the attention mechanism and receptive field. The improved receptive field module (ARFB) is added on top of the feature extraction layer to obtain global and local context information. Next, we improve the connection method between feature layers of different resolutions. For top-down up-sampling, deconvolution is used to replace nearest neighbor interpolation to reduce information loss. In the horizontal connection, a channel attention mechanism is added to the feature layer of the backbone network before the 1 × 1 convolution, which suppresses secondary channels to emphasize important channel information. Finally, a spatial adaptive fusion and balance mechanism is adopted for each layer of the pyramid, so that the weights of the fused features at all layers can be learned and the information on each layer of the feature pyramid is fully enhanced. The principal contributions of this paper are summarized as follows: (1) The top of the feature extraction layer adopts the ARFB module to obtain ample global and local context information. (2) Deconvolution is used for up-sampling to reduce the loss of feature information, and an SE module is added at the horizontal connection of the feature layer of the backbone network to extract effective feature information and reduce the interference of noise information. (3) A spatial adaptive fusion and balance mechanism is applied to further refine the feature information of each feature layer, which greatly improves the detection ability of the network.

Fundamental Knowledge
FPN has two paths: the bottom-up path and the top-down path. The bottom-up path is the feed-forward computation of the backbone network, such as ResNet [17]. The outputs of its stages are C1, C2, C3, C4, and C5; the width and height are halved in turn, and the numbers of channels are {64, 256, 512, 1024, 2048}.
In the top-down path, the output C5 has its channels reduced to 256 by a 1 × 1 convolution, forming P5. Subsequently, up-sampling is performed by nearest neighbor interpolation, and C4, which has the same resolution after its own 1 × 1 convolution, is added to it to form P4. Repeating the above operations, a top-down path composed of {P5, P4, P3, P2} is formed, and each merged map passes through a further 3 × 3 convolution to avoid the aliasing effect caused by up-sampling. In addition, P5 undergoes a max pooling operation to obtain P6. When FPN is combined with Faster R-CNN, a multiscale detection method is adopted: the RPN searches for foreground proposal regions in {P2, P3, P4, P5, P6}, and according to the size of each RoI, a feature map from {P2, P3, P4, P5} is selected to perform the Fast R-CNN operation to obtain the specific target category and a more precise location.
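The top-down construction described above can be sketched in PyTorch (a minimal illustration, not the paper's implementation; the channel sizes follow the C2-C5 description):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN top-down path over backbone outputs C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions compress each C_i to the common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convolutions smooth each merged map to reduce aliasing.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        laterals = [l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Top-down: upsample (nearest neighbor) and add the lateral map.
        for i in range(3, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        p2, p3, p4, p5 = (s(x) for s, x in zip(self.smooth, laterals))
        # P6 comes from max pooling P5 with stride 2.
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)
        return p2, p3, p4, p5, p6
```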
However, FPN has several drawbacks: (1) C5 is located at the top layer. As the network deepens, the features extracted by deep convolutions have low spatial resolution and imprecise position information. P5 is formed by channel compression, and P5 is then pooled to form P6, which causes further loss of feature information. (2) The features of different layers contain complex feature information. In the horizontal connection, the direct addition of features at different layers introduces a lot of background noise, which results in great semantic ambiguity. (3) The pyramids formed at each layer are not further fused effectively.
A novel neural network, called attention FPN, is proposed in this section; its structure is shown in Figure 1.
There are three main improvements in the proposed method: (1) An improved receptive field module is added between C5 and P5 to obtain more context information and supplement P5 and P6. (2) A channel attention mechanism is added to the horizontal connection to enhance the effective position information in the feature maps along the path and suppress irrelevant background information; at the same time, we also optimize the up-sampling method. (3) An adaptive spatial fusion and balance mechanism is added to the pyramid at all layers: the weight of each layer is learned by spatial attention, so the information at all layers of the pyramid is balanced and enhanced more reasonably.

Improved Receptive Field Enhancement Module.
Dilated convolution comes from DeepLab [18]; it can increase the receptive field without reducing the scale of the feature map and performs well in semantic segmentation. Literature [19] designed a receptive field block (RFB) in the SSD network, a structure that adopts the multibranch idea of Inception-ResNet [20]. There are different convolution kernels and dilation coefficients on different branches, corresponding to various target sizes and receptive fields; their combination can make full use of context information. As shown in Figure 2, the RFB module can only obtain local context information, so a global context module (GCM) is introduced into it as a branch, allowing global and local context information to be fused well; the improved module is named ARFB (augmented receptive field block). The specific structure of GCM is shown in Figure 3. First of all, the feature map is globally average-pooled to obtain a global feature vector. Then, up-sampling is performed to restore the original size and reduce the number of channels so it can be concatenated with the other branches. In order to prevent information loss, we add residual connections between the input and output feature maps. C5 is connected to P5 through the ARFB module, which gives P5 and P6 rich context information and overcomes the deficiency of FPN in large target recognition.
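The GCM branch as described — global average pooling to a global vector, channel reduction, and up-sampling back to the input size so it can be concatenated with the RFB branches, plus a residual connection — can be sketched as follows. The branch widths and kernel/dilation choices here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCM(nn.Module):
    """Global context branch: global average pooling, 1x1 channel
    reduction, then up-sampling back to the input resolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        g = F.adaptive_avg_pool2d(x, 1)          # global feature vector
        g = self.reduce(g)                        # match branch channels
        return F.interpolate(g, size=x.shape[-2:], mode="nearest")

class ARFB(nn.Module):
    """RFB-style multibranch block (two dilated-conv branches shown)
    plus the GCM branch; outputs are concatenated, fused by a 1x1 conv,
    and a residual connection preserves the input."""
    def __init__(self, channels, branch=64):
        super().__init__()
        self.b1 = nn.Sequential(
            nn.Conv2d(channels, branch, 1),
            nn.Conv2d(branch, branch, 3, padding=1, dilation=1))
        self.b2 = nn.Sequential(
            nn.Conv2d(channels, branch, 1),
            nn.Conv2d(branch, branch, 3, padding=3, dilation=3))
        self.gcm = GCM(channels, branch)
        self.fuse = nn.Conv2d(3 * branch, channels, 1)

    def forward(self, x):
        out = torch.cat([self.b1(x), self.b2(x), self.gcm(x)], dim=1)
        return x + self.fuse(out)   # residual connection
```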

Improved Top-Down Path and Horizontal Connection.
As shown in Figure 4, for the top-down path, the original interpolation method lacks flexibility and learning ability, resulting in poor performance. Therefore, deconvolution is used to replace nearest neighbor interpolation when increasing the resolution. Moreover, deconvolution has trainable parameters, so it can better express the relationship between features of different layers.
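A minimal comparison of the two up-sampling choices: a 2× transposed convolution doubles the resolution just as nearest neighbor interpolation does, but its kernel weights are learned. The kernel/stride/padding configuration here is a common choice, assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

# Nearest neighbor interpolation is fixed, while a 2x deconvolution
# (transposed convolution) has learnable weights, so the network can
# adapt the up-sampling to the statistics of adjacent feature levels.
deconv = nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 256, 16, 16)
y = deconv(x)   # spatial size doubles: (15 * 2) - 2 + 4 = 32
```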
In the horizontal connection, the channel attention SE (Squeeze-and-Excitation) [21] module is introduced, as shown in Figure 5. On the bottom-up path, the SE module is added to enhance the important information in the channels, with each feature layer reduced to 256 channels. The SE module mainly contains two parts. The input feature map is U = [u_1, u_2, ..., u_C], where each channel u_c ∈ R^{H×W}. First of all, global average pooling is used to obtain the global vector Z, whose components are calculated as follows:

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j).

Then, Z is encoded by a fully connected network to obtain the dependency between channels and complete the channel compression:

s = σ(W_1 δ(W_0 Z)),

where W_0 ∈ R^{C×C/2}, W_1 ∈ R^{C/2×C}, δ is the ReLU activation function, and σ is the sigmoid activation function, which adjusts the value range to [0, 1]. The result is multiplied by the input feature map to obtain the features after channel excitation:

ũ_c = s_c · u_c.

Here, s_c can be regarded as the importance of each channel. When the network is learning, this module adaptively tunes itself to enhance important channels and suppress irrelevant channels, and it merges well with the feature layers of the top-down path. Because the channel attention module involves only a small amount of computation, the increased time cost is negligible.
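The squeeze-and-excitation computation above maps directly onto a small module (a sketch with reduction ratio 2, matching the W_0 ∈ R^{C×C/2}, W_1 ∈ R^{C/2×C} dimensions in the text):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention, reduction ratio 2."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W_0 (squeeze)
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W_1 (excite)
            nn.Sigmoid())                                # sigma -> [0, 1]

    def forward(self, u):
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))            # global average pooling -> Z
        s = self.fc(z).view(b, c, 1, 1)   # per-channel importance s_c
        return u * s                      # channel-excited features
```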

Adaptive Spatial Feature Fusion and Balance Mechanism.
Different feature pyramid layers contribute differently to the recognition of various targets. Inspired by AugFPN [22] and Libra R-CNN, we propose a spatial adaptive fusion and balance mechanism, shown in Figure 6. First, {P2, P3, P4, P5, P6} are rescaled to the same size by up-sampling and down-sampling. They are then merged into one branch through a concatenation operation.
The other branch remains unchanged. The combined features have their channels reduced by a 1 × 1 convolution, followed by a 3 × 3 convolution for feature extraction. Then, a sigmoid is used to obtain the weight of each channel, and the channels are split. The weight of each feature layer is multiplied by the corresponding feature layer to extract effective features. Finally, the resulting feature layers are concatenated and reduced in dimension through a 1 × 1 convolution. The subsequent steps are the same as in Libra R-CNN: the fused features are scaled back to the original sizes, and the corresponding original features are added to avoid the loss of detailed information. The enhanced feature pyramid {P2′, P3′, P4′, P5′, P6′} is formed. The location information and semantic information of each feature layer are fully enhanced, which is conducive to the detection of multisize targets.
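The fuse-reweight-rebalance steps above can be sketched as follows. For brevity, this version predicts one spatial weight map per pyramid level rather than the full per-channel split described in the text, so it is a simplified illustration under those assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Resize all pyramid levels to one scale, concatenate, predict a
    weight map per level via 1x1/3x3 convs and sigmoid, fuse, then add
    the rescaled fused map back to each level as a residual."""
    def __init__(self, channels=256, levels=5):
        super().__init__()
        self.reduce = nn.Conv2d(levels * channels, channels, 1)
        self.extract = nn.Conv2d(channels, channels, 3, padding=1)
        self.weights = nn.Conv2d(channels, levels, 1)
        self.levels = levels

    def forward(self, feats):
        size = feats[2].shape[-2:]   # middle level as the common scale
        resized = [F.interpolate(f, size=size, mode="nearest") for f in feats]
        x = self.extract(self.reduce(torch.cat(resized, dim=1)))
        w = torch.sigmoid(self.weights(x))   # one weight map per level
        fused = sum(w[:, i:i + 1] * resized[i] for i in range(self.levels))
        # Scale the fused map back to each level and add the original
        # features to avoid losing detailed information.
        return [f + F.interpolate(fused, size=f.shape[-2:], mode="nearest")
                for f in feats]
```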

Experimental Results and Analysis
The environment configuration of the experiments is as follows: the CPU is an Intel Silver 4210, and the GPUs are four NVIDIA RTX 2080 Ti cards. The operating system is Ubuntu 16.04; the acceleration libraries are CUDA 10.0 and cuDNN 7.6.5. Based on the deep learning framework PyTorch 1.4 and Python 3.7, experiments are carried out separately on the PASCAL VOC dataset and the MS COCO dataset.

PASCAL VOC Experiment Results.
The PASCAL VOC datasets have a total of 20 target categories. We use the combined training-validation sets of PASCAL VOC 2007 and PASCAL VOC 2012, comprising 16551 pictures, for training, and the PASCAL VOC 2007 testing set of 4952 pictures for testing. Stochastic gradient descent (SGD) is used for training; the momentum is set to 0.9, the input image is resized to 1000 × 600, the batch size is set to 16, the weight decay coefficient is set to 0.0005, the initial learning rate is set to 0.02, and the maximum number of epochs is 5. At epoch 3, the learning rate is multiplied by 0.1. The results of our algorithm on the VOC 2007 testing set are shown in Table 1 (* indicates the reimplemented PyTorch version). We use Faster R-CNN + FPN as the baseline. It can be seen from the results that, when ResNet-50 and ResNet-101 are used as feature extraction networks, replacing FPN with attention FPN (abbreviated as AFPN in the table) increases mAP by 2.8% and 1.9%, respectively, with only a small speed loss. The accuracy with ResNet-50 exceeds that of most algorithms in the table, while the recognition accuracy with ResNet-101 is the highest among all algorithms. Table 2 shows the specific results for the 20 categories. Our model obtains the best accuracy in multiple categories. Compared with FPN, attention FPN has an obvious improvement on small targets (such as monitors, water cups, bottles, cows, sheep, and so on).
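The VOC training schedule described above (SGD with momentum 0.9, weight decay 0.0005, initial learning rate 0.02, decayed by 0.1 at epoch 3 of 5) corresponds to a standard PyTorch optimizer plus `MultiStepLR` setup; the model below is a tiny stand-in for the detector, not the paper's network:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(3, 16, 3)        # stand-in for the detector
optimizer = SGD(model.parameters(), lr=0.02,
                momentum=0.9, weight_decay=0.0005)
# Multiply the learning rate by 0.1 at epoch 3 of the 5-epoch schedule.
scheduler = MultiStepLR(optimizer, milestones=[3], gamma=0.1)

lrs = []
for epoch in range(5):
    # ... the training loop over the VOC batches would go here ...
    optimizer.step()   # placeholder step so the scheduler order is valid
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
```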
In order to further study the influence of each added module on the detection performance of the algorithm, we conduct an ablation experiment. As shown in Table 3, all experiments were carried out on FPN (ResNet-50). The mAP of the original FPN is 79.4%. With the RFB module embedded in the benchmark network for training and testing, an mAP of 80.7% is obtained. When RFB is replaced by the proposed ARFB, the mAP reaches 80.9%, which proves that ARFB further expands the receptive field and improves detection ability. The SE module is then added to the network with the embedded RFB; the mAP improves by 0.6% compared to the embedded RFB alone, showing that the SE module can extract effective features and reduce the impact of background information. Next, the spatial adaptive fusion and balance mechanism is added to the network, and the mAP reaches 81.8%. The experimental results demonstrate that the spatial adaptive fusion and balance mechanism can significantly strengthen the connections between different feature layers and enhance detection ability. Then, the deconvolution operation is used to replace nearest neighbor up-sampling; the improvement is not obvious. A possible reason for this is that deconvolution blurs the input features, so the gain it brings is limited. Lastly, the enhanced receptive field module ARFB is used to replace RFB, and the mAP is further increased by 0.3% to 82.2%, which also proves the effectiveness of global context information.

Figure 5: Squeeze-and-Excitation Networks.

Figure 7 shows the qualitative results of the comparative experiment. The left side of each picture is the detection result of FPN (ResNet-50), and the right side is the detection result of attention FPN (ResNet-50). It can be seen that the improved network produces fewer redundant boxes on large targets, because the expanded receptive field of the ARFB module supplements P5 and P6 with effective global information, so the detection of large targets is more accurate.
In terms of small targets, our method has a higher recall rate, thanks to the attention mechanism filtering out irrelevant information and giving the feature fusion mechanism more detailed information and semantic information at the bottom layers.
In order to directly reflect the advantages of the proposed method, the visualization results of some feature layers are displayed in Figure 8. Panel (e) is the visualization result of the P5 layer after the spatial adaptive fusion and balance mechanism, and panel (f) is the visualization result of our proposed method. We can observe that the proposed method leads the network to pay more attention to object features while ignoring the influence of background noise.

MS COCO Experiment Results.
The MS COCO 2017 dataset is an authoritative dataset currently used to evaluate the performance of target detection algorithms. The data include a total of 80 target classes; the training set contains 118287 pictures, the validation set contains 5000 pictures, and the testing set contains 20288 pictures. In COCO, there are more small objects than large objects: about 41% are small objects (area < 32²), 34% are medium objects (32² < area < 96²), and 24% are large objects (area > 96²), where the measured area is the number of pixels in the segmentation mask. The corresponding evaluation indicators are AP_S, AP_M, and AP_L. AP_0.5 and AP_0.75, respectively, represent the average detection accuracy over all categories at IoU thresholds of 0.5 and 0.75. AP averages over 10 IoU thresholds (0.5 to 0.95) and all 80 categories and is considered the most important indicator. We train the model on the COCO 2017 training set and test the experimental results on val2017 and test-dev2017, respectively. Stochastic gradient descent (SGD) is used for training with momentum set to 0.9; the weight decay coefficient is set to 0.0005, the input image is resized to 1333 × 800, each GPU is assigned two images, and the batch size is set to 8. The maximum number of epochs is 12, and the initial learning rate is set to 0.01, which is multiplied by 0.1 at epochs 8 and 11.
As shown in Table 4, on val2017, we choose three backbones, ResNet-50, ResNet-101, and the more powerful ResNeXt-101 (32 × 4d) [28], and compare them with the baseline. With the same backbone, Faster R-CNN combined with FPN only has an advantage in small target recognition compared with the original Faster R-CNN, while its detection ability on medium and large targets drops considerably. However, the detection accuracy of attention FPN improves on objects of all sizes. As seen from Table 5, on test-dev2017, the results of the proposed algorithm are also better than the baseline. The AP values of attention FPN with the three feature extraction networks increase by 1.5%, 0.7%, and 0.2%, respectively, reaching 37.7%, 39.5%, and 40.6%. It is superior to algorithms such as Mask R-CNN and CoupleNet, which also use ResNet-101.

Conclusion
This paper proposes a target detection algorithm based on an improved feature pyramid network, called attention FPN. In attention FPN, an improved receptive field module is used to obtain global and local context information, and a channel attention module is added to the horizontal connection to enhance the channels that contribute most to the key information. We also use deconvolution instead of nearest neighbor interpolation to reduce information loss in the top-down up-sampling path. Finally, a spatial attention style is used to weight the fusion of the various feature layers. The mAP of our improved algorithm on the PASCAL VOC 2007 test set reaches 83.5%, and the AP on COCO test-dev2017 reaches 40.6%; the results show that our algorithm is better than the original algorithm and some mainstream algorithms while retaining considerable speed.
Nevertheless, our proposed algorithm has certain limitations. Compared to the original algorithm, its number of parameters has increased by 10%. In addition, a small amount of background noise still remains, which affects detection performance. In the future, we will reduce the number of network parameters and study a more effective mechanism for identifying target features. Furthermore, we will combine the proposed algorithm with practical applications.

Conflicts of Interest
The authors declare that they have no conflicts of interest.