A Small Object Detection Network Based on Multiple Feature Enhancement and Feature Fusion

Due to their small size, low resolution, and complex backgrounds, small objects are a difficult point in computer vision detection. Making full use of high-resolution features and reducing information loss during propagation is of great significance for improving small object detection. To achieve these two goals, this work proposes MFEFNet, a small object detection network built on RetinaNet that combines multiple feature enhancement and feature fusion. First, this work designs densely connected dilated convolutions to adequately extract high-resolution features from C2. Then, this work utilizes subpixel convolution to avoid the loss of channel information caused by channel dimension reduction in the lateral connection. Finally, this article introduces a bidirectional fusion feature pyramid structure to shorten the propagation path of high-resolution features and reduce their loss. Experiments show that the proposed MFEFNet achieves stable performance gains on the object detection task. Specifically, the improved method raises RetinaNet from 34.4 AP to 36.2 AP on the challenging MS COCO dataset, and in particular achieves excellent results in small object detection with an improvement of 2.9%.


Introduction
As a fundamental problem in the field of computer vision, object detection is the basis for many tasks such as image segmentation, object tracking, and image description. With the development of the convolutional neural network [1], many two-stage detectors [2][3][4][5] and one-stage detectors [6][7][8][9] with remarkable performance have been developed in recent years. Two-stage object detection algorithms are developing rapidly, such as R-CNN [2], Faster R-CNN [4], and Mask R-CNN [5]; their detection accuracy is constantly improving, but their architecture limits the detection speed. The one-stage detection algorithms were proposed later than the two-stage ones; due to their relatively simple structure and superior detection speed, they have also attracted the attention of many researchers. Representative algorithms include YOLO and its variants [6][7][8][9], SSD and its variants [10][11][12], RetinaNet [13], and EfficientDet [14].
Although one-stage object detectors are significantly faster than two-stage ones, their accuracy has not been comparable. Some one-stage detection algorithms have improved detection by borrowing components from two-stage pipelines, such as the Feature Pyramid Network (FPN) [15], or by changing the backbone network. FSSD [12] reconstructs the pyramid feature map to fuse features of different scales, which is beneficial to small object detection. EfficientDet [14] uses a weighted bidirectional feature pyramid network for feature fusion and scales the model through compound scaling. Lin et al. argued that the real reason for the lower accuracy of one-stage detectors is the extreme imbalance between foreground and background samples in the image, and proposed RetinaNet [13]. RetinaNet addresses the sample imbalance problem by introducing focal loss, which greatly improves the detection effect. However, its detection of small objects (objects below 32 pixels × 32 pixels [16]) is not competitive with two-stage detection algorithms.
We analyzed RetinaNet [13] and found the following three points that are not conducive to small object detection. First, it does not fully utilize the shallow feature layer C2. Due to the low resolution and little visual information of small targets, their information may be lost during network down-sampling, making it difficult for deep feature layers to extract discriminative features. However, the shallow feature layer C2 has a smaller receptive field, higher spatial resolution, and more accurate position information, all of which benefit small object detection. Meanwhile, FPN-based methods generally use a simple convolution to extract the shallow feature C2, as in [5, 17]. Due to the limited receptive field size and network depth, it is difficult to extract shallow features fully.
Second, to reduce computation, FPN-based methods adopt 1 × 1 convolutional layers to reduce the channel dimensions of the output feature maps C_i from the backbone. The high-level feature maps generally have thousands of channels; in particular, C4 and C5 have large channel dimensions and contain rich semantic information that benefits object detection. The drastic channel dimension reduction (e.g., 2048 to 256) loses a large amount of channel information, which negatively impacts small object detection. Existing methods [14, 18] mainly process the channel-reduced maps by adding extra modules, acting on features with fewer channels through more complex network connections to achieve better accuracy. Although [19] makes full use of C_i, it does not fully mine the contextual information of the transformed features.
Finally, RetinaNet [13] introduces the top-down feature pyramid structure and performs multiscale feature fusion to improve the detection of targets at different scales. Notably, low-level features are critical for detecting small objects because they help with more accurate localization. However, due to the limitation of the FPN structure, the path between high-level and low-level features is long (tens or even hundreds of network layers in backbones such as ResNet-50 and ResNet-101), so few low-level features survive at the top of the pyramid, making small object detection worse than expected.
Combining the above analysis and inspired by BFE-Net [20], we believe that improving the utilization of high-resolution features and reducing feature loss during network propagation are of great significance for small object detection. For one thing, to improve the utilization of high-resolution features, we reuse the shallow features C2. Inspired by DenseNet [21], we designed the multiscale context extraction module to fully extract shallow features. To balance accuracy and computational load, this work combines the dense connection mechanism with dilated convolution, which effectively expands the receptive field and increases the depth of the feature extraction network to some extent, extracting richer semantic and location features while effectively realizing feature reuse.
For another, to reduce information loss, this work utilizes subpixel convolution and a bidirectional feature pyramid structure. First, inspired by [19, 22], this work designs a subpixel convolution enhancement module to reduce the information loss caused by channel reduction. Specifically, subpixel convolution converts low-resolution feature maps into high-resolution feature maps in the lateral connection of the top-down path, making full use of channel information and reducing the loss of information during lateral connection. At the same time, a spatial attention mechanism is applied to the transformed features to obtain richer contextual information. Second, to reduce the loss of shallow information along the propagation path and inspired by PANet [17], this work introduces a bidirectional fusion feature pyramid structure. The bidirectionally connected pyramid greatly shortens the propagation path of shallow features, reducing feature loss and better retaining shallow information. At the same time, it further strengthens multiscale feature fusion, which greatly enriches the shallow multiscale context information.
Based on the above analysis and strategies, the detection method proposed in this article has the following advantages over standard RetinaNet: (1) To improve the utilization of shallow features, this article designs a multiscale context extraction module (MCEM) consisting of densely connected dilated convolutions, which use convolutional layers with different dilation rates to obtain more effective receptive fields. (2) To make full use of channel information in the lateral connection and reduce channel information loss, this article designs a subpixel convolution enhancement module (SCEM), which uses subpixel convolution to convert low-resolution features into high-resolution features, avoiding the information loss caused by channel dimension reduction in the lateral connection. (3) To reduce the loss of low-level features during propagation, this article designs a bidirectional fusion feature pyramid structure (BidiFPN), which shortens the propagation path of shallow features, reducing shallow feature loss in the propagation process.

Object Detectors.
At present, there are two types of mainstream deep learning object detection algorithms: two-stage detection based on region proposals and one-stage detection based on regression analysis.

Two-stage Detectors.
The two-stage detection algorithm generally uses selective search or a region proposal network to extract candidate regions from the image and then refines these candidates to obtain the detection results. R-CNN [2] introduces a convolutional neural network combined with candidate region proposals to achieve detection. SPP-Net takes the entire image as input and extracts features for regions of any scale, reducing computation. Faster R-CNN [4] proposes a region proposal network to extract candidate regions, which improves detection efficiency. Mask R-CNN [5] uses the RoI Align layer to reduce the deviation of the feature map from the original image. Cascade R-CNN [23] introduces multilevel refinement into Faster R-CNN to achieve more accurate location prediction. The two-stage detection algorithms are developing rapidly and their accuracy keeps improving, but their architecture limits detection speed, so they cannot meet downstream tasks with strict real-time requirements.

One-stage Detectors.
Compared with two-stage detectors, one-stage detection algorithms do not require classification of candidate regions, and the training process is relatively simple. YOLOv1 [6], proposed by Redmon et al., is the first one-stage detector in the deep learning era; its biggest advantage is speed. Building on YOLOv1 [6], later work proposed YOLO9000 [7], YOLOv3 [8], and YOLOv4 [9]. SSD [10], proposed in 2015, combines the fast detection speed of YOLO and the accurate localization of Faster R-CNN. DSSD [11] adopts ResNet-101 as the backbone and adds a deconvolution module to improve small object detection. FSSD [12] reconstructs the pyramid feature map to fuse features of different scales, enhancing small object detection. Although one-stage detectors are significantly faster than two-stage detectors based on candidate region proposals, their accuracy has not been comparable. RetinaNet [13] addresses the instance sample imbalance problem by introducing focal loss and realizes a detection framework whose accuracy is comparable to two-stage detectors. However, RetinaNet's [13] detection of small objects still has room for improvement compared to two-stage algorithms. In addition, EfficientDet [14] uses a weighted bidirectional feature pyramid network for feature fusion, and YOLOF [24] designs a dilated encoder and a balanced matching strategy to improve detection performance.

Feature Augmentation.
As the number of network layers increases, the semantic and location information of the target is lost layer by layer. Multiscale feature fusion and contextual feature enhancement are effective ways to compensate for this loss.

Multiscale Feature Fusion.
To make full use of the features extracted by different feature layers, many researchers optimize the detector architecture to achieve multiscale feature fusion. Most detectors utilize the FPN [15] to detect objects of different sizes: it extracts features from the bottom up, performs top-down feature fusion, and finally sends the results to the prediction module. PANet [17] connects the lowest-layer features of the model with the highest-layer features, shortening the information path between the top and bottom layers and further strengthening the connections between the feature maps of each layer. EfficientDet [14] proposes a weighted bidirectional feature pyramid network, BiFPN, for more efficient multiscale feature fusion. AugFPN [25] utilizes consistency supervision to close the semantic gap before feature fusion and employs residual features to reduce information loss during convolution pooling, better utilizing multiscale features. NAS-FPN [26] uses neural architecture search to automatically discover a feature pyramid architecture.

Context Feature Enhancement.
The detected target has an inseparable relationship with surrounding objects and the environment. To improve detection accuracy by exploring contextual information, CoupleNet [27] introduces the global and semantic information of the proposal, combining local and global information. DetectoRS [28] proposes the Recursive Feature Pyramid (RFP) and incorporates extra feedback connections from the feature pyramid network to the bottom-up backbone layers. Lim et al. [29] improved the detection accuracy of small objects by fusing multiscale features and using additional features at different levels as contextual information. The nonlocal network [30] proposed a strategy to obtain the dependencies between any two locations, addressing the limited receptive field of per-layer convolution operations.

Methods
This section introduces the small object detection network based on multiple feature enhancement, which reduces the loss of high-resolution information and compensates for the information loss during propagation and lateral connection. As shown in Figure 1, MFEFNet comprises three components: the multiscale context extraction module (MCEM), the subpixel convolution enhancement module (SCEM), and the bidirectional fusion feature pyramid structure (BidiFPN). We describe them in detail below.

Multiscale Context Extraction Module.
Small objects have fewer pixels available than normal-sized objects, and their features are difficult to extract. As the network deepens, continuous down-sampling and feature extraction cause the feature and location information of small objects to be lost layer by layer. The shallow layers of a convolutional neural network contain much small object information thanks to their small receptive field, high resolution, and rich location information. Therefore, making full use of the shallow feature layer can improve small object detection to a certain extent. RetinaNet [13] does not use the high-resolution pyramid level P2. We designed the multiscale context extraction module (shown in Figure 2) to fully extract the features of the high-resolution feature layer C2 through densely connected dilated convolutions. Although the shallow feature layer contains rich small object information, its ability to express semantic information is weak. Inspired by [21], we perform feature extraction through dilated convolutional layers with different dilation rates, which enriches semantic information while preserving rich spatial information, enhancing the high-level semantics of the shallow features.
First, we divide the feature map C2 into three branches for dilated convolution. Since each dilated convolutional layer has a different dilation rate, three feature maps with different receptive field sizes are obtained:

$$C_2^{(1)} = F_{d=3}(C_2), \quad C_2^{(2)} = F_{d=5}\big(C_2 \oplus C_2^{(1)}\big), \quad C_2^{(3)} = F_{d=9}\big(C_2 \oplus C_2^{(1)} \oplus C_2^{(2)}\big),$$

where F_d(·) represents the dilated convolution operation with a 3 × 3 kernel and dilation rates of 3, 5, and 9, respectively, and the symbol ⊕ denotes feature fusion by addition. Then, the three output feature maps containing multiscale context information and C2 after a 1 × 1 convolution are fused by concatenation, and D2 is obtained through a 1 × 1 convolution layer for channel dimension reduction:

$$D_2 = \mathrm{Conv}_{1\times 1}\big(F_{\mathrm{concat}}\big(\mathrm{Conv}_{1\times 1}(C_2), C_2^{(1)}, C_2^{(2)}, C_2^{(3)}\big)\big),$$

where F_concat(·) represents feature concatenation.
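The following is a minimal PyTorch sketch of MCEM under the equations above; the 256-channel width, module name, and exact dense wiring are our assumptions reconstructed from the description, not the authors' released code:

```python
import torch
import torch.nn as nn

class MCEM(nn.Module):
    """Sketch of the multiscale context extraction module: densely
    connected 3x3 dilated convolutions (rates 3, 5, 9) applied to C2."""
    def __init__(self, in_ch=256, out_ch=256):
        super().__init__()
        # padding = dilation keeps the spatial size for a 3x3 kernel
        self.d3 = nn.Conv2d(in_ch, in_ch, 3, padding=3, dilation=3)
        self.d5 = nn.Conv2d(in_ch, in_ch, 3, padding=5, dilation=5)
        self.d9 = nn.Conv2d(in_ch, in_ch, 3, padding=9, dilation=9)
        self.proj = nn.Conv2d(in_ch, in_ch, 1)         # 1x1 on C2 before concat
        self.reduce = nn.Conv2d(in_ch * 4, out_ch, 1)  # channel reduction after concat

    def forward(self, c2):
        b1 = self.d3(c2)             # branch with dilation 3
        b2 = self.d5(c2 + b1)        # dense connection: fuse earlier outputs by addition
        b3 = self.d9(c2 + b1 + b2)
        fused = torch.cat([self.proj(c2), b1, b2, b3], dim=1)
        return self.reduce(fused)    # D2
```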

Subpixel Convolution Enhancement Module.
As the number of convolutional layers increases, the network can extract more effective features. In RetinaNet [13], as the backbone deepens, feature layers with large channel dimensions are generated along the bottom-up path, especially the high-level features C4 and C5 with 1024 and 2048 channels, respectively. These high-level features are rich in semantic information. However, to reduce network complexity and improve speed, a 1 × 1 convolutional layer is used for dimension reduction in the lateral connection; for example, the dimension of C5 is reduced from 2048 to 256. This drastic reduction leads to a large loss of semantic information.
The loss of semantic information during top-down propagation further harms the detection results, and the loss of small object features in particular becomes more and more serious. To reduce the loss in the lateral connection and make full use of the rich channel information of high-level feature maps, inspired by [19], we use subpixel convolution to achieve channel dimension reduction while fully fusing the information of adjacent feature layers, and design the subpixel convolution enhancement module (shown in Figure 3(a)). Subpixel convolution [22] implements the up-sampling reconstruction from low-resolution images to high-resolution images. It rearranges the pixels on different channels of the feature map into the same channel space, increasing width and height by transforming the channel size. Considering that C4 and C5 have 1024 and 2048 channels, respectively, subpixel convolution is performed directly without expanding the channel size. The pixel shuffle operator rearranges a feature of shape H × W × C · r² into rH × rW × C, which can be formulated as follows:

$$\mathrm{PS}(F)_{x,y,c} = F_{\lfloor x/r \rfloor,\, \lfloor y/r \rfloor,\; C \cdot r \cdot (y \bmod r) + C \cdot (x \bmod r) + c},$$

where r denotes the up-scaling factor (in this work, r = 2), F is the input feature (C_{i+1} in this article, as shown in Figure 3(a)), and PS(F)_{x,y,c} denotes the output feature pixel at coordinates x, y, and c. The indices x, y, and c start from 0 and represent coordinates in the high-resolution feature map. M_i is the output obtained by element-wise addition of the low-resolution feature map C_{i+1} after subpixel convolution and the high-resolution feature map C_i:
$$M_i = \mathrm{Conv}_{1\times 1}(C_i) \oplus \mathrm{PS}(C_{i+1}), \qquad D_i = \mathrm{GE}(M_i),$$

where the symbol ⊕ denotes feature fusion by addition, Conv_{1×1}(·) represents a 1 × 1 convolution layer for channel dimension reduction, and GE(·) represents the processing of the GE block. The standard RetinaNet [13] introduces the feature pyramid network to detect objects of different scales through multiscale representation, enriching the semantic information of shallow features to make them more effective for small object detection. However, a convolutional neural network can only obtain a local receptive field; although the receptive field can be expanded through deeper layers, global information cannot be obtained. Context information means that in an image a single pixel or a single target does not exist alone but is related to the surrounding pixels and targets. Mining and utilizing the contextual information between objects benefits object detection, especially for small objects that rely heavily on context. Inspired by [30, 31], we design the GE block to model the global context through a self-attention mechanism, effectively capturing long-distance feature dependencies. Through this global information interaction, the feature map contains richer semantic information, thereby enhancing the feature response of small objects.
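A hedged PyTorch sketch of SCEM under the equations above; the 1 × 1 projections that bring both branches to 256 channels are our assumption, and GEBlock is sketched in the GE Block subsection below:

```python
import torch.nn as nn

class SCEM(nn.Module):
    """Sketch of the subpixel convolution enhancement module. With r = 2,
    pixel shuffle turns C5 (2048 ch, H x W) into a 512-channel map at C4
    resolution (2H x 2W); both branches are then projected to out_ch."""
    def __init__(self, lat_ch=1024, low_ch=2048, out_ch=256, r=2):
        super().__init__()
        self.ps = nn.PixelShuffle(r)                        # (H, W, C*r^2) -> (rH, rW, C)
        self.align = nn.Conv2d(low_ch // (r * r), out_ch, 1)
        self.lateral = nn.Conv2d(lat_ch, out_ch, 1)         # Conv_1x1 on C_i
        self.ge = GEBlock(out_ch)                           # GE block, see next subsection

    def forward(self, c_i, c_ip1):
        m_i = self.lateral(c_i) + self.align(self.ps(c_ip1))  # M_i
        return self.ge(m_i)                                    # D_i = GE(M_i)
```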

GE Block.
To enhance the information fusion between the high-resolution and low-resolution feature layers, we designed a global feature enhancement block (shown in Figure 3(b)) in SCEM, which utilizes a self-attention mechanism to enhance the feature representation by learning global dependencies. It encodes broader contextual information into local features, thereby enhancing their representational power. The processing steps of GE(·) are as follows.
M_i is redefined as X and used as the input of this block, and {Q, K, V} are obtained through three convolutional layers, respectively. Then, a matrix transpose is performed on Q to get Q^T. We multiply the reshaped K by Q^T to obtain the spatial attention map W. Next, we multiply the reshaped V by W to weight the spatial information and perform an element-wise addition with M to obtain the final output D of SCEM. We formulate this procedure as follows:

$$q_i = f_q(X_i), \quad k_j = f_k(X_j), \quad v_j = f_v(X_j),$$
$$W_{i,j} = F_{\mathrm{nom}}\big(F_{\mathrm{sim}}(q_i, k_j)\big),$$
$$D_i = X_i \oplus \sum_{j} F_{\mathrm{mul}}(W_{i,j}, v_j),$$

where q_i is the i-th query; k_j and v_j are the j-th key/value pair; f_q(·), f_k(·), and f_v(·) denote the query, key, and value transformation functions [31, 32], respectively, which are matrix operations applying the mapping matrices of q, k, and v to the input features; X_i and X_j are the i-th and j-th feature positions in X; F_sim(·) is the dot-product similarity function; F_nom(·) is the softmax normalizing function; F_mul(·) is the matrix-multiplication weight aggregation; and D_i is the i-th feature position in the output feature map D, with the subscript i corresponding to the input feature M_i of GE(·).
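A minimal sketch of the GE block as a nonlocal-style self-attention layer with a residual addition; the reduced inner dimension is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEBlock(nn.Module):
    """Sketch of the global enhancement block: dot-product similarity,
    softmax normalization, matrix-multiplication aggregation, residual add."""
    def __init__(self, ch, inner=None):
        super().__init__()
        inner = inner or ch // 2          # reduced inner dim (assumption)
        self.f_q = nn.Conv2d(ch, inner, 1)   # query transform f_q
        self.f_k = nn.Conv2d(ch, inner, 1)   # key transform f_k
        self.f_v = nn.Conv2d(ch, ch, 1)      # value transform f_v

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.f_q(x).flatten(2).transpose(1, 2)   # (B, HW, inner)
        k = self.f_k(x).flatten(2)                   # (B, inner, HW)
        v = self.f_v(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        w_att = F.softmax(torch.bmm(q, k), dim=-1)   # (B, HW, HW) attention map W
        out = torch.bmm(w_att, v)                    # weighted aggregation
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return x + out                               # D = M + attention(M)
```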

Bidirectional Fusion Feature Pyramid Structure.
Multiscale feature fusion integrates low-level and high-level features through top-down lateral connections and constructs a feature representation with fine-grained detail and rich semantic information. The fused features have stronger expressive ability, which is conducive to detecting small objects. The standard RetinaNet [13] uses a top-down fusion feature pyramid with levels P3 to P7, where P3 to P5 are computed from the corresponding ResNet residual stage outputs (C3 through C5) using top-down and lateral connections, just as in [15].
Although the feature pyramid structure adopted by RetinaNet [13] (see Figure 4(a)) can fully integrate multiscale features, the low-level features must pass through hundreds of backbone convolution layers, losing a large amount of low-level information that is conducive to small object detection. Inspired by PANet [17] (see Figure 4(b)), we designed a bidirectional fusion feature pyramid structure (see Figure 4(c)). The structure adds a bottom-up path enhancement module built with a small number of convolutional layers, which ensures that high-level and low-level features are more fully integrated while retaining as much low-level information as possible. As in [17], all pyramid levels have C = 256 channels.
In the bottom-up backbone network, we keep the C3 through C6 layers of standard RetinaNet [13] while making full use of C2, which contains rich low-level features.

Top-Down Path.
The top-down path includes the features N2 through N4. N4 is the output feature of SCEM with C4 and C5 as input features.
N3 is composed of the up-sampled N4 and the output feature D3 of SCEM (Section 3.2) with C3 and C4 as the input features. The two parts are fused by addition (see Figure 5(b)), which is quite different from [17] (see Figure 5(a)). Similarly, N2 is obtained by fusing N3 after up-sampling with the output feature D2 of MCEM:

$$N_3 = F_{\mathrm{up}}(N_4) \oplus D_3, \qquad N_2 = F_{\mathrm{up}}(N_3) \oplus D_2,$$

where ⊕ is the feature fusion operation and F_up(·) is the up-sampling operation that matches the resolution of the lower-layer feature map to be fused.

Bottom-Up Enhancement Path.
The bottom-up enhancement path includes the features P2 through P6. P2 through P4 are generated just as in PANet [17]:

$$P_2 = N_2, \qquad P_3 = N_3 \oplus F_{\mathrm{down}}(P_2), \qquad P_4 = N_4 \oplus F_{\mathrm{down}}(P_3),$$
$$P_5 = \mathrm{Conv}_{1\times 1}(C_5) \oplus F_{\mathrm{down}}(P_4), \qquad P_6 = \mathrm{Conv}_{1\times 1}(C_6) \oplus F_{\mathrm{down}}(P_5),$$

where ⊕ is the feature fusion operation, Conv_{1×1}(·) represents a 1 × 1 convolution, and F_down(·) is the down-sampling operation that matches the resolution of the upper-layer feature map to be fused. P5 is obtained by fusing a 1 × 1 convolution on C5 with down-sampled P4, and P6 by fusing a 1 × 1 convolution on C6 with down-sampled P5.
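A compact sketch of both BidiFPN paths under the equations above. Max-pooling stands in for F_down and the post-fusion smoothing convolutions are omitted, both assumptions on our part (PANet uses 3 × 3 stride-2 convolutions for down-sampling):

```python
import torch.nn.functional as F

def bidifpn(d2, d3, n4, c5, c6, conv1x1_p5, conv1x1_p6):
    """Sketch of the bidirectional fusion feature pyramid.
    d2, d3: MCEM/SCEM outputs; n4: SCEM output on (C4, C5);
    conv1x1_p5 / conv1x1_p6: assumed 1x1 projection modules for C5 / C6."""
    # top-down path: upsample and fuse the enhanced lateral features by addition
    n3 = F.interpolate(n4, scale_factor=2, mode="nearest") + d3   # N3 = up(N4) + D3
    n2 = F.interpolate(n3, scale_factor=2, mode="nearest") + d2   # N2 = up(N3) + D2
    # bottom-up enhancement path, PANet style
    p2 = n2
    p3 = n3 + F.max_pool2d(p2, kernel_size=2)
    p4 = n4 + F.max_pool2d(p3, kernel_size=2)
    p5 = conv1x1_p5(c5) + F.max_pool2d(p4, kernel_size=2)
    p6 = conv1x1_p6(c6) + F.max_pool2d(p5, kernel_size=2)
    return p2, p3, p4, p5, p6
```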

Dataset and Evaluation Metrics.
We perform all experiments on the MS COCO detection dataset with 80 categories, in which objects smaller than 32 × 32 pixels are considered small. MS COCO contains a large number of small objects, accounting for 41.43% of instances [16]. We train models on train2017 and report the ablation study results on val2017.
The final results are reported on test-dev. The COCO-style average precision (AP) is chosen as the evaluation metric. AP50 and AP75 denote the average precision at IoU thresholds of 0.5 and 0.75, respectively, and AP_S, AP_M, and AP_L denote the average precision for small, medium, and large objects, respectively.
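For reference, these metrics can be computed with the standard pycocotools evaluation API; the file paths below are placeholders:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# ground truth and detection results in the standard COCO JSON formats
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections_val2017.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()   # prints AP, AP50, AP75, AP_S, AP_M, AP_L
```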

Implementation Details.
To demonstrate the effectiveness of the proposed MFEFNet, we conducted a series of experiments on the MS COCO dataset. For all experiments in this section, we trained our models with the SGD optimizer on a machine with an Intel i7-9700k CPU, 32 GB RAM, and NVIDIA GeForce GTX TITAN X GPUs; the CUDA version is 10.1 and the deep learning framework is PyTorch 1.7.1. We initialize the learning rate to 0.01 and decrease it to 0.001 and 0.0001 at the 8th and 11th epochs. The momentum is set to 0.9 and the weight decay to 0.0001. The classical networks ResNet-50 and ResNet-101 are adopted as backbones for comparative experiments. The original RetinaNet settings, such as the anchor hyperparameters and focal loss, are followed for fair comparison. For all studies we use an image scale of 500 pixels for training and testing unless noted.
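A minimal sketch of this training schedule in PyTorch; `model`, `train_loader`, and the 12-epoch total are assumptions, not details from the paper:

```python
import torch

# SGD with the schedule described above: lr 0.01, decayed x10 at epochs 8 and 11
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[8, 11], gamma=0.1)

for epoch in range(12):                    # 12-epoch schedule is an assumption
    for images, targets in train_loader:   # train_loader is an assumed data pipeline
        loss = model(images, targets)      # focal loss + box regression, summed
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```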

Main Results.
In this section, we evaluate MFEFNet on COCO test-dev and compare it with other state-of-the-art one-stage and two-stage detectors. Implementation details and evaluation metrics are set as above. All the results are shown in Table 1. Analyzing the experimental results, it can be found that with ResNet-101 as the backbone, the standard RetinaNet [13] performs well on large and medium targets, reaching 38.5% and 49.1% and remaining competitive with the two-stage detectors. However, on small objects it reaches only 14.7%, which is 0.9% and 3.5% lower than the two-stage detectors Faster R-CNN+++ [4] and Faster R-CNN w FPN [15], respectively. Faster R-CNN+++ refers to R-FCN + ResNet-101. In addition, it is 3.6% lower than the one-stage detector YOLOv3 [8], leaving much room for improvement. Notably, the MFEFNet proposed in this article achieves excellent results on both large and small objects, with AP_S reaching 17.6%, an improvement of 2.9% and 1.0% over standard RetinaNet [13] and Faster R-CNN+++ [4], respectively. Combining the above analysis and experimental data, the proposed model greatly improves detection for targets of all sizes, especially small objects.

Figure 6 shows a visual comparison of features through the convolution layers. Specifically, we use Grad-CAM to compute and visualize the output of the model's last convolution layer, combining the network structure with the trained weights. Column (a) is the original image, and column (b) is the feature visualization of RetinaNet [13]. The heat map does not cover small objects well, which shows that RetinaNet [13] is not sensitive to them. The improved network in this article improves feature utilization and reduces feature loss; as shown in column (c), the feature heat map of MFEFNet better covers object boundaries and attends to more of the small objects. This shows that the improved network effectively enriches the features used for small-scale detection, making the network pay attention to previously neglected small objects.
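A hook-based Grad-CAM sketch of the kind used for such visualizations; it assumes `model(image)` returns per-class scores (a detector needs a specific detection score instead), and `register_full_backward_hook` requires PyTorch ≥ 1.8:

```python
import torch

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM sketch: capture activations and gradients of one
    convolution layer, then weight channels by globally pooled gradients."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(image)[0, class_idx]   # assumed per-class score output
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)   # GAP over the gradient map
    cam = torch.relu((weights * feats[0]).sum(dim=1))   # weighted sum of channels
    return cam / (cam.max() + 1e-8)                     # normalized heat map
```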

Ablation Study.
In this section, we conduct extensive ablation experiments on COCO val2017 to analyze the effect of each proposed component of MFEFNet. The purpose of this study is as follows.
To analyze the importance of each component in MFEFNet, we gradually applied the multiscale context extraction module, the subpixel convolution enhancement module, and the bidirectional fusion feature pyramid structure to the model to verify their effectiveness. Meanwhile, the improvements brought by combinations of different components are also presented to demonstrate that these components complement each other. The baseline for all ablation studies is RetinaNet with a ResNet-50 backbone. All results are shown in Table 2.
Analyzing the experimental data, it can be found that compared with the standard RetinaNet [13], each of the three proposed structures improves the detection AP of targets at different scales. Adding BidiFPN to the standard RetinaNet [13] increases AP by 1.2%, and the small object average precision (AP_S) reaches 14.9%, an increase of 1.0%. After adding MCEM and SCEM, AP_S increases by 1.1% and 0.8%, respectively, which indicates that the shallow features fully extracted by MCEM and the high-level channel information are very helpful for small object detection. Besides the large improvement on small objects, the average precision for large and medium objects also improves to varying degrees. The improved model raises AP from 32.5% to 34.8%; in particular, AP_S achieves a very meaningful improvement, from 13.9% to 16.8%, an increase of 2.9%.
To verify the effectiveness of the densely connected dilated convolutions with different dilation rates in MCEM, we conducted the following ablation experiments with three feature extraction variants: Par-dilated connects the three dilated convolutional layers only in parallel to extract features from the shallow feature C2; Ser-dilated connects them only in series, in increasing order of dilation rate; and Den-dilated is the MCEM used in this article. The experimental results are shown in Table 3, and the three connection modes are illustrated in Figure 7.
Analyzing the data, shallow feature extraction with Den-dilated is the most conducive to small object detection: the average precision for small objects reaches 16.8%, an increase of 1.4%. When features are extracted with Den-dilated, the receptive field is fully expanded and the information fusion between different feature layers is strengthened, extracting more sufficient location and semantic information. Although the other two methods also improve the detection results to different degrees, their effect is weaker than Den-dilated. In particular, Par-dilated outperforms Ser-dilated, especially on small object detection, where it is 0.4% higher. We believe that the parallel dilated convolutions greatly expand the receptive field and more fully extract the high-resolution features that benefit small object detection.
To verify the effectiveness of the GE block in SCEM, we conducted the following ablation experiments. SCEM can be divided into a channel dimension reduction part based on subpixel convolution and a nonlocal feature extraction part based on the GE block. The experimental results are shown in Table 4.
When only subpixel convolution is used for channel dimension reduction, the detection accuracy already improves considerably: the average precision increases from 34.1% to 34.6%, a gain of 0.5%, and small object detection accuracy improves by 0.6%. After adding the GE block, the accuracy for targets of all sizes improves further, and AP_S reaches 16.8%, an increase of 0.9%. This is because the GE block uses the spatial attention mechanism to fully capture spatial context, which is very helpful for small objects that rely heavily on contextual information.

Visualization of Results.
To demonstrate the effectiveness of the proposed model more intuitively, we visualize the detection results of the standard RetinaNet [13] and the proposed MFEFNet on the MS COCO dataset, as shown in Figure 8. The first column shows the original image, the second column the detection results of RetinaNet [13], and the last column the detection results of MFEFNet.
From the detection results, it can be found that compared with the standard RetinaNet [13], MFEFNet detects more small objects. In the first row, MFEFNet detects people that RetinaNet [13] misses. In the second row, RetinaNet [13] produces false detections and misses some objects: the white tent is mistakenly identified as a sheep, the grass as a cow, and the distant cow is not detected, all of which MFEFNet avoids. The results in the third and fourth rows show that MFEFNet can also accurately identify a larger number of small objects such as cows.
These experimental results show that the improved model further enhances the representation ability of the network and greatly reduces the missed and false detections of small objects.

Conclusions
This article deeply analyzes the key factors affecting small object detection and points out the shortcomings of the excellent one-stage detector RetinaNet in this setting. This work proposes a small object detection network based on multiple feature enhancement (MFEFNet), starting from improving high-resolution feature utilization and reducing information loss during propagation. First, it uses densely connected dilated convolutions to adequately extract the shallow layer C2, improving the utilization of high-resolution features. Second, it introduces a bidirectional feature pyramid structure to shorten the shallow feature propagation path. Finally, it makes full use of channel features containing rich semantic information through subpixel convolution, avoiding the channel information loss caused by dimension reduction in lateral connections. Sufficient experiments on the challenging MS COCO dataset show stable detection improvements: AP improves by 2.3% and AP_S by 2.9%, effectively improving small object detection. We believe this work can help future object detection research.

Conflicts of Interest
The authors declare that they have no conflicts of interest.