A Novel Pyramid Network with Feature Fusion and Disentanglement for Object Detection

In order to alleviate the scale variation problem in object detection, many feature pyramid networks have been developed. In this paper, we rethink the issues existing in current methods and design a more effective module for feature fusion, called the multiflow feature fusion module (MF3M). We first construct gate modules and multiple information flows in MF3M to avoid information redundancy and to enhance the completeness and accuracy of information transfer between feature maps. Furthermore, in order to reduce the discrepancy between classification and regression in object detection, a modified deformable convolution, termed task-adaptive convolution (TaConv), is proposed in this study. Different offsets and masks are predicted to disentangle the features for classification and regression in TaConv. By integrating the above two designs, we build a novel feature pyramid network with feature fusion and disentanglement (FFAD) which can mitigate scale misalignment and task misalignment simultaneously. Experimental results show that FFAD boosts the performance of most models.


Introduction
Object detection is one of the most important and challenging tasks in the field of computer vision. This task widely benefits image/video retrieval, intelligent surveillance, and autonomous driving. Although the performance of object detectors has grown rapidly with the development of deep convolutional neural networks, existing detectors still suffer from problems caused by scale variation across object instances. To resolve this issue, the image pyramid method [1] takes pictures of different resolutions as input to improve the robustness of the model to small objects. However, this strategy greatly increases the amount of memory and computation. In SSD [2], the authors propose a method to detect objects of different sizes on feature maps at different levels. Compared with the solution that uses an image pyramid, this method has lower memory and computational cost. Unfortunately, the performance of small object detection is still poor, since the features in the low layers of a convolutional network always contain more geometric information and less semantic information. To alleviate this problem, FPN [3] creates a top-down architecture with lateral connections for building high-level semantic feature maps at all scales. Recently, the assistance that the geometric information in shallow layers gives to large object detection has been noticed. Several methods such as PANet [4] and BiFPN [5] add an extra bottom-up information flow path based on FPN to enhance the deep-layer features with the accurate localization signals existing in the low levels. Several methods like Libra RCNN [6] and M2det [7] first gather multilayer features into one layer and finally split it into a feature pyramid to integrate geometric and semantic information.
Despite the performance gained by the above pyramidal architectures, they still have some intrinsic limitations. Most feature pyramid networks are constructed by simply aggregating the features of different levels, which ignores the intrinsic properties between features of different levels. SPEC [8] shows that the similarity between adjacent feature maps is high, while feature maps that are far apart are dissimilar. In this paper, we observe two critical drawbacks in most previous fusion methods. First, the information redundancy caused by directly summing or concatenating feature maps hinders detection performance. Second, it is difficult to accurately transfer information between feature maps, especially those that are far apart, which leads to the loss of some targets. Figure 1 shows heatmap visualizations of multilevel features produced by various feature pyramid networks. We can observe the following: (1) Only a few features are captured by conventional FPN, and it has no response to large-scale objects. (2) The second method has larger activation regions at deep layers, but it contains some inaccurate information. (3) Although the third method performs better on both large and small objects, it still misses several targets and has some unnecessary noise. Further, ignoring the spatial misalignment between the classification and localization functions, the output of most pyramidal networks is shared by the downstream head of the detector. Some studies [9][10][11] have revealed that the spatial sensitivities of classification and localization on the feature maps are different, which can limit detection performance. However, previous solutions to this problem can be deemed to disentangle the information by adding a new branch, which essentially increases the parameters of the head.
The conflict between the two tasks is still not eliminated, since the feature map extracted by the backbone is still shared by the two branches, which motivates us to explore a feature pyramid architecture with spatial disentanglement.
In this paper, we aim to propose a novel feature pyramid network to break the above bottleneck restrictions. As shown in Figure 2, we first construct two subnetworks for top-down information flow and down-top information flow. Then, following the attention mechanism applied in these works [12][13][14][15] and the feature selection method on high-dimensional data [16], we set several gate modules to help the network focus on important features as well as suppress unnecessary ones. Moreover, we add an extra fusion path in each direction to strengthen communication and prevent the loss of important information. Finally, we gather up the fusion outputs of the two subnetworks. It is worth noting that there are five information flow paths in our module: one is horizontal, and the others are vertical. In order to alleviate the inherent conflict between classification and regression in the feature pyramid, a modified deformable convolution is proposed for feature decoupling, called task-adaptive convolution (TaConv). By predicting two sets of offsets and masks, respectively, TaConv outputs two feature maps, one for classification and one for regression, at each level of the feature pyramid. Our method brings significant performance improvement compared with state-of-the-art one-stage object detectors. The contributions of this study are as follows: (1) We rethink the limitations existing in previous feature fusion strategies and design a more effective module to avoid these issues. (2) We further propose a method (TaConv) for feature decoupling in one-stage detectors to alleviate the discrepancy between classification and regression.
(3) We construct a novel feature pyramid network with feature fusion and decoupling and validate the effectiveness of our approach on the standard MS-COCO benchmark. The proposed network can boost the performance of most single-shot detectors (by about 1∼2.5 AP).

Related Work

Object Detection.
There are mainly two streams of methods in object detection. The first stream is two-stage. Methods in this stream, including the RCNN family [17][18][19], R-FCN [20], and Mask RCNN [21], consist of a separate region proposal network and a region-wise prediction network.
They first predict region proposals and then classify and fine-tune each of them. Methods in the other stream are one-stage. This type of detector directly predicts object categories and coordinates at each pixel of the feature map; thus, the efficiency of such methods is higher than that of two-stage ones. However, early one-stage detectors such as SSD [2] and the YOLO family [22][23][24] lagged behind two-stage detectors in performance. With the advent of focal loss [25], the category imbalance problem in single-stage detectors was greatly alleviated. Since then, follow-up works [26][27][28] have further improved their performance by designing more elaborate heads. At present, single-stage detectors can achieve performance very close to that of two-stage ones.

Feature Fusion.
Due to the deepening and downsampling operations of convolutional networks, the features of small objects are easily lost. To tackle this problem, two strategies have been proposed in the literature. The first is the image pyramid method, such as SNIP [1] and SNIPER [29]. These methods take pictures of different resolutions as input, perform detection separately, and combine the predictions to give the final results. The other strategy is the feature pyramid. Methods like SSD [2] and MS-CNN [30] conduct small object detection directly on shallow feature maps and perform large object detection on deep feature maps. Compared with the first strategy, the additional memory and computational cost required by the second strategy are greatly reduced, so it can be deployed during both the training and testing phases of a real-time network. Moreover, low-level features generally lack semantic information but are rich in geometric detail, while high-level features are the opposite. Therefore, an effective feature fusion strategy plays a crucial role in processing features of objects with various scales. FPN [3], the milestone of pyramidal networks, propagates high-level semantic information to shallow levels by building a top-down architecture. Since then, the feature pyramid has been widely used in the object detection task. Recently, considering the lack of geometric information in deep-layer features, several bidirectional models such as PANet [4] and BiFPN [5] add a down-top path for low-level feature map aggregation on top of FPN. Libra-RCNN [6] first fuses the features of all layers and then disentangles them into the pyramid. M2Det [7] stacks several U-shaped modules to fuse multilayer features before generating the feature pyramid. Moreover, different from the above methods, some other approaches fuse features by concatenating features from different layers in the forward propagation of the backbone.
For instance, Hourglass Network [31] concatenates features with those of previous layers in repeated bottom-up and top-down processes. HRNet [32] gradually adds low-resolution subnetworks to the high-resolution main network in parallel.

Feature Disentanglement.
Most object detectors share the features extracted by the backbone for both classification and bounding box regression; thus, there is a lack of understanding between the two tasks. There has been some work on the conflict between the classification and regression tasks. Zhang and Wang [33] point out that the directions of the two tasks' gradients are inconsistent, implying potential conflicts between them. IoU-Net [9] alleviates this discrepancy by adding an extra head to predict localization confidence, which is then aggregated with the classification confidence to form the final score. Double-Head RCNN [10] disentangles the sibling head into two specific branches for classification and localization. TSD [11] shows that the classification task pays more attention to features in the salient areas of objects, while features around the boundary are beneficial for bounding box regression. The authors ease this issue by generating two disentangled proposals for classification and localization, respectively. Despite the satisfactory performance obtained by such detection head disentanglement, the conflict between the two tasks still remains, since the inputs to the two heads are still shared. In this paper, we propose a novel feature pyramid network with feature fusion and disentanglement called FFAD, which can alleviate scale misalignment and task misalignment simultaneously. To the best of our knowledge, there is currently no work exploring spatial decoupling of feature pyramids.

Proposed Method
FFAD contains two submodules, namely, MF3M and TaConv. Compared with most current methods, MF3M aggregates features more effectively. Then the output feature maps of MF3M are disentangled by TaConv to alleviate the inherent conflict between the classification and regression tasks. The prediction of classical pyramidal networks can be written as

P_c = H_c({F_i | i = 1, ..., L}),  P_r = H_r({F_i | i = 1, ..., L}),

where P_c and P_r denote the classification and regression results, respectively; H_c and H_r are the heads transforming features into the category and localization of an object; F_i denotes the feature map of the i-th level in the feature pyramid; and L denotes the number of layers of the feature pyramid. Unlike conventional pyramidal networks, FFAD produces two feature maps, one per task, at each level of the feature pyramid:

P_c = H_c({F_i^c | i = 1, ..., L}),  P_r = H_r({F_i^r | i = 1, ..., L}),

where F_i^c and F_i^r denote the feature maps for classification and regression at the i-th layer of FFAD, respectively.
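The interface difference between a conventional pyramid and FFAD can be sketched in a few lines of NumPy. Everything here is a stand-in for illustration: the heads are hypothetical 1×1 channel projections, and the ±0.1 shifts merely mimic the existence of two task-specific maps per level (they are not the actual TaConv outputs).

```python
import numpy as np

rng = np.random.default_rng(0)

def head(feat, w):
    # Stand-in detection head: a 1x1 convolution expressed as a channel projection.
    # feat: (c_in, h, w), w: (c_out, c_in) -> (c_out, h, w)
    return np.tensordot(w, feat, axes=([1], [0]))

levels, C = 3, 4
feats = [rng.standard_normal((C, 8 // 2**i, 8 // 2**i)) for i in range(levels)]

w_cls = rng.standard_normal((2, C))   # hypothetical 2-category classification head
w_reg = rng.standard_normal((4, C))   # 4 bounding-box coordinates

# Conventional pyramid: both heads share the same per-level feature F_i.
shared_cls = [head(f, w_cls) for f in feats]
shared_reg = [head(f, w_reg) for f in feats]

# FFAD: each level provides a task-specific pair (F_i^c, F_i^r).
feats_c = [f + 0.1 for f in feats]  # placeholder for the disentangled maps
feats_r = [f - 0.1 for f in feats]
ffad_cls = [head(f, w_cls) for f in feats_c]
ffad_reg = [head(f, w_reg) for f in feats_r]
```

The point of the sketch is purely structural: the heads H_c and H_r are unchanged; only their inputs differ between the two formulations.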

Multiflow Feature Fusion
Module. We conclude that there are roughly three styles of feature pyramid networks: (1) conventional FPNs, which are single-directional pyramid networks (as shown in Figure 3(a)); (2) bidirectional pyramid networks (as shown in Figure 3(b)); and (3) encoder-decoder FPNs (as shown in Figure 3(c)). As shown in Figure 3(d), the parts in the red- and yellow-dotted boxes represent two subnetworks in different directions that share inputs. There are three feature nodes at each level of each subnetwork. Further, we propose information augmentation for enhancing the signal transmitted between feature nodes, especially those that are far apart. As seen from Figure 3(e), in the top-down subnetwork, both the second and third nodes of each layer are fused with the shallower features, except for the shallowest layer. Meanwhile, in the down-top subnetwork, the second and third nodes of each layer are fused with the deeper features, except for the deepest layer. At the same time, in order to simplify the network, we remove the shallowest second node in the top-down network and the deepest second node in the down-top network, so that each has only one input edge. It is worth noting that there are two information flow paths in each subnetwork. Finally, we gather up the outputs of the two subnetworks to form the fifth information flow. Let x_i be the i-th input of MF3M and y_i the i-th output of MF3M. Then the output of MF3M is

y_i = conv(C(F_t−d(x_i), F_d−t(x_i))),

where conv(·) denotes the convolution operation, C(·) denotes the concatenation operation, and F_t−d(·) and F_d−t(·) are the outputs of the top-down and down-top subnetworks, respectively. Within the subnetworks, M(·), the max-pooling layer, performs downsampling, and U(·), the bilinear upsampling layer, performs upsampling to match the resolutions of adjacent levels.
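The final fusion step of the module can be sketched as follows. This is a minimal NumPy illustration in which conv(·) is reduced to a hypothetical 1×1 channel mixing and the two subnetwork outputs are random placeholders; the real module's convolution and subnetwork recursions are richer than this.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # x: (c_in, h, w), w: (c_out, c_in) -- a 1x1 convolution as channel mixing
    return np.tensordot(w, x, axes=([1], [0]))

def mf3m_output(f_td, f_dt, w):
    # y_i = conv(C(F_t-d(x_i), F_d-t(x_i))): channel-wise concat, then conv.
    fused = np.concatenate([f_td, f_dt], axis=0)  # (2c, h, w)
    return conv1x1(fused, w)                      # back to (c, h, w)

c, h, wd = 4, 8, 8
f_td = rng.standard_normal((c, h, wd))   # top-down subnetwork output at level i
f_dt = rng.standard_normal((c, h, wd))   # down-top subnetwork output at level i
w = rng.standard_normal((c, 2 * c))      # reduces 2c concatenated channels to c

y = mf3m_output(f_td, f_dt, w)
```

Concatenation followed by a channel-reducing convolution (rather than plain summation) lets the fusion learn how much each direction contributes per channel.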
EfficientDet [5] has already shown that feature maps of different scales should contribute differently to the output and proposes adding a weight for each input feature, while most previous methods treat all input features equally without distinction. Inspired by the spatial attention mechanism and the intrinsic connections between feature maps, we design a simple gate module for controlling the intensity of information flow. Thus, each fusion input in the top-down and down-top subnetworks is weighted by a gate g(·), which can be written as

g(x) = σ(conv(x)) ⊗ x,

where x represents the input, σ denotes the sigmoid function, and ⊗ denotes pixel-wise multiplication.
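A minimal sketch of such a gate follows, assuming the gate is a 1×1 convolution that collapses channels to a single spatial map, passed through a sigmoid and multiplied pixel-wise onto the input (the exact convolutional form inside g(·) is our assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x, w):
    # g(x) = sigmoid(conv(x)) (pixel-wise *) x
    # The 1x1 conv (weights w) collapses channels into one spatial gate map.
    attn = sigmoid(np.tensordot(w, x, axes=([1], [0])))  # (1, h, w), values in (0, 1)
    return attn * x  # broadcast pixel-wise multiplication over channels

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))  # an incoming feature flow
w = rng.standard_normal((1, 4))     # hypothetical gate weights
y = gate(x, w)
```

Because the sigmoid output lies in (0, 1), the gate can only attenuate an incoming flow, which is exactly the "suppress unnecessary features" behavior described above.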
Deformable convolution is often embedded in the backbone as well as in the last layer of the detector towers to further improve the performance of object detectors. In order to further improve the feature pyramid network, we use DCN [34] to adjust the results after fusion with other layers' features in the pyramid network. To avoid, as far as possible, the extra computing cost caused by deformable convolution, we only embed it in the nodes of each layer after the first fusion with other layers. In this way, the outputs of the top-down and down-top subnetworks, F_t−d(·) and F_d−t(·), are obtained by applying dconv(·), the deformable convolution operation, to the corresponding fusion results.

Task-Adaptive Convolution.
To mitigate the misalignment between classification and localization existing in classical feature pyramids, we propose task-adaptive convolution. It is essentially a modified modulated deformable convolution. We borrow the idea of DCN [34] to distinguish features suitable for classification from those suitable for regression, due to its superior ability to capture the key information of objects. As shown in Figure 4, for the features of each level in the feature pyramid, TaConv first predicts two groups of offsets and modulations. Then the two groups of offsets are added to the coordinates of each sampling point of the convolution kernel, respectively, and the two modulations are multiplied by the value at each sampling point of the convolution kernel. Finally, TaConv generates two independent feature maps: one sensitive to the classification task, the other sensitive to the localization task. Let x represent the pixel value of the feature map; the outputs of TaConv can be formulated as

F^c = Σ_{k=1}^{K} w_k · m_c · x(p_x + Δp_x^c, p_y + Δp_y^c),
F^r = Σ_{k=1}^{K} w_k · m_r · x(p_x + Δp_x^r, p_y + Δp_y^r),

where K denotes the size of the convolution kernel; w_k denotes the k-th point of the kernel; p_x and p_y denote the horizontal and vertical coordinates of a sampling point; Δp_x^c and Δp_y^c represent the deviations of the classification task on the X-axis and Y-axis, respectively; Δp_x^r and Δp_y^r denote the deviations of the regression task on the X-axis and Y-axis, respectively; and m_c and m_r are the modulations multiplied by the convolution kernel parameters.
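To make the sampling mechanism concrete, the sketch below computes a single output pixel of a 3×3 TaConv-style kernel for two offset/mask groups. It is a pure-NumPy, single-channel reimplementation for illustration only, not the actual operator; offsets and masks are random stand-ins for the values a real layer would predict.

```python
import numpy as np

def bilinear(x, py, px):
    # Bilinear sample of a 2D map x at fractional location (py, px), zero-padded.
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    dy, dx = py - y0, px - x0
    def at(r, c):
        return x[r, c] if 0 <= r < h and 0 <= c < w else 0.0
    return ((1 - dy) * (1 - dx) * at(y0, x0) + (1 - dy) * dx * at(y0, x0 + 1)
            + dy * (1 - dx) * at(y0 + 1, x0) + dy * dx * at(y0 + 1, x0 + 1))

def taconv_point(x, weights, grid, offsets, masks):
    # One output pixel of a 3x3 task-adaptive convolution at the map centre:
    # each kernel tap k is shifted by its own offset and scaled by its own mask.
    h, w = x.shape
    cy, cx = h // 2, w // 2
    out = 0.0
    for k, (ky, kx) in enumerate(grid):
        dy, dx = offsets[k]
        out += weights[k] * masks[k] * bilinear(x, cy + ky + dy, cx + kx + dx)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7))
grid = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
w = rng.standard_normal(9)  # shared kernel weights w_k
# Two independent offset/mask groups -> two task-specific responses.
off_c, m_c = rng.uniform(-0.5, 0.5, (9, 2)), rng.uniform(0, 1, 9)
off_r, m_r = rng.uniform(-0.5, 0.5, (9, 2)), rng.uniform(0, 1, 9)
y_cls = taconv_point(x, w, grid, off_c, m_c)
y_reg = taconv_point(x, w, grid, off_r, m_r)
```

Note the design choice mirrored here: the kernel weights w_k are shared between the two tasks, so the disentanglement comes entirely from the per-task offsets and modulations.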

Experimental Evaluation
We perform our experiments on the challenging 80-category MS-COCO [35] benchmark. Following the standard protocol [36], we train on the training set (consisting of around 118k images) and then report results on the minival set (consisting of 5k images) for ablation studies. To compare the accuracy of our algorithm with those of state-of-the-art single-shot detectors, we also report results on the test-dev set (consisting of around 20k images), which has no public labels and requires the use of the evaluation server.

Implementation Details.
In our study, we embed our method into several recent state-of-the-art single-stage detectors, including RetinaNet [25], FCOS [26], and ATSS [28]. For fair comparison with the above detectors, the hyperparameter configuration used in our experiments is set the same as in the literature. Specifically, we use ImageNet [37] pretrained models such as ResNet-50 [38], followed by an FPN structure, as the backbone. We use the Stochastic Gradient Descent (SGD) algorithm to optimize the training loss for 180k iterations with 0.9 momentum, 0.0001 weight decay, and a mini-batch of 8 images. The initial learning rate is set to 0.05, and we reduce the learning rate by a factor of 10 at iterations 120k and 160k, respectively. Unless otherwise stated, the input images are resized to have their shorter side equal to 800 and their longer side less than or equal to 1333. We do not use any noise reduction method, and no data augmentation except standard horizontal flipping is used. During the inference stage, we resize the input image in the same way as in the training stage and postprocess the predicted bounding boxes, with predicted classes obtained by forwarding images through the network, using the same hyperparameters as the above detectors.
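The learning-rate schedule described above can be written down directly (the values come from the text; the function name is ours):

```python
def step_lr(iteration, base_lr=0.05, milestones=(120_000, 160_000), gamma=0.1):
    # Step schedule from the setup above: decay by 10x at 120k and 160k iterations.
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr
```

So training runs at 0.05 for the first 120k iterations, 0.005 until 160k, and 0.0005 for the final 20k of the 180k total.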

Ablation Study.
To demonstrate that our proposed MF3M can capture the features of objects of different sizes more effectively, we compare MF3M with other common feature fusion modules on FCOS. The results are shown in Table 1. Compared with the baseline, which uses a single-directional FPN (37.1 AP), the encoder-decoder FPN obtains a higher score (37.3 AP), with a notable increase of 0.4 AP for medium targets. Meanwhile, the bidirectional FPN gives the best performance among these three common FPN styles (37.6 AP), and its large-target detection improves by 0.6 AP. With the three-information-flow structure detailed in Figure 3(d), the detector based on FCOS is promoted to 37.8 AP.
This result verifies that splitting the serial bidirectional structure into two unidirectional subnetworks yields better performance. By adding an additional information flow in each subnetwork, the performance of the detector is further improved by 0.5 AP. After fine-tuning the features with DConv, our MF3M achieves 39.2 AP, outperforming most current feature fusion methods by a large margin. In particular, the accuracy of detecting small objects (up 2.0 AP over the baseline) and large objects (up 3.4 AP over the baseline) improves markedly. This shows that our method can effectively fuse the features of cross-scale objects. In order to observe the feature fusion ability of this method more intuitively, we visualize the feature activation values of FPN, bidirectional FPN, and MF3M. As shown in Figure 5, the first method loses some features of small objects and cannot detect large objects at all. Although the second and third methods can capture the features of large objects and make progress in detecting small objects, several objects are still missed. Meanwhile, our approach almost never misses features of either large or small targets.
As explained above, the core of FFAD is composed of MF3M and task-adaptive convolution. MF3M is responsible for computing feature maps that contain rich features, while the task-adaptive convolution decouples the features to make them task-sensitive. Table 2 reports detailed ablations on both to demonstrate their effectiveness. From the experimental results, we can see that this method alleviates the conflict between the classification and regression tasks to a certain extent. To better interpret what task-adaptive convolution learns, we visualize the learned features on examples. As shown in Figure 6, the features of the classification branch are distributed more in the central area of the objects, while the features of the regression branch are more sensitive to the edge area of the objects.

Analysis of the Performance in Different DCN's Positions.
We have exhibited the effectiveness of MF3M for feature fusion, and the deformable convolution plays a significant role in the adjustment of features. In this section, we further discuss the performance of MF3M with different positions of the deformable convolution. Figure 7 shows the structures where the deformable convolution is placed after the first, second, and third nodes of each layer in each subnet, respectively. Table 3 shows that the P2 scheme, which uses DCN to fine-tune the nodes after the first feature fusion, is the most effective. We believe that better results could be achieved by fine-tuning all nodes after feature fusion with DCN. However, excessive use of DCN brings greater computational effort, so we choose the most cost-effective scheme.

Compatibility with Other Single-Stage Detectors.
Since FFAD has demonstrated its outstanding performance on FCOS with ResNet-50, we also show that it remains effective when applied to other single-stage detectors. We directly conduct several experiments with different detectors, including RetinaNet, FCOS, and ATSS, on MS-COCO minival. All evaluation was performed on one Nvidia 1080Ti GPU. We set the batch size to 8 and used the mean of the last 300 iterations to compute speed. The results of the proposed FFAD and the original baselines are compared in Table 4. According to the first two columns of the table, it is obvious that FFAD steadily improves performance by 1.8∼2.6 AP, while testing time increases by only 3%∼11%.

Comparison with Other Feature Pyramids.
With regard to various feature pyramidal models, we compare our FFAD with other state-of-the-art feature pyramid structures on FCOS. Table 5 reports our experimental results. It is obvious that FFAD provides a dramatic performance increase compared to other advanced feature pyramid models, including PANet [4], HRNet [32], Libra [6], and NAS-FPN [39]. Moreover, FFAD also incurs nearly the smallest FLOPs increment among the feature pyramidal models.

Comparison with Other State-of-the-Art Detectors.
In this section, we evaluate our proposed method on the MS-COCO test-dev set and compare it with other state-of-the-art methods. For convenience, we only report FCOS equipped with our proposed FFAD. As shown in Table 6, FFAD boosts the original baselines by a significant margin and achieves a state-of-the-art 49.5 AP using the ResNeXt-101 backbone. We further evaluate Mask RCNN [21], RetinaNet [25], FCOS [26], EfficientDet [5], and YOLOv5 [56] on the GWHD dataset. We separate out a fifth of the training set as a validation set and evaluate the results on it. We set the input size to 1024 × 1024 and the batch size to 4 to train these models for 10 epochs. As shown in Table 7, even in the face of such dense and overlapping scenes, FFAD can still give satisfactory improvements.

Conclusion and Future Work
In this paper, we point out that several bottlenecks exist in current feature pyramid networks, which considerably limit the performance of detectors. Motivated by this, we look into these issues and propose a novel feature pyramid network with feature fusion and disentanglement (FFAD) to alleviate them. In particular, FFAD first splits the conventional bidirectional feature pyramid into two independent subnetworks, adds an additional information flow to each of them to strengthen the communication between feature maps, and finally fuses the outputs of the two subnetworks. Furthermore, we propose the task-adaptive convolution to mitigate the inherent task conflict in the feature pyramid. By predicting two groups of different offsets and modulations in the task-adaptive convolution, FFAD generates specific feature representations for classification and localization, respectively. Being compatible with most single-stage object detectors, our FFAD can easily enhance detection performance by about 1∼2.6 AP. Our future work will aim to simplify the feature fusion module without losing mAP and to further enlarge the performance margin between the disentangled and the shared features in pyramidal models.

Conflicts of Interest
The authors declare that there are no conflicts of interest related to the publication of this work.