Research on Small Target Detection Technology Based on the MPH-SSD Algorithm

To address the problems of limited semantic information and low detection accuracy when the SSD (single shot multibox detector) algorithm detects small targets, this paper proposes an MPH-SSD (multiscale pyramid hybrid SSD) algorithm that integrates an attention mechanism with multiscale double pyramid feature enhancement. First, the SSD algorithm is used to extract the feature maps of small targets, and a shallow feature enhancement module is added to expand the receptive field of the shallow feature layer so as to enrich the semantic information in the feature layer for small targets and improve the expressive ability of shallow features. The processed shallow feature layer and the deep feature layer are fused at multiple scales, combining semantic information and location information to obtain a feature map with rich information. Second, a cascaded double pyramid structure transfers information from the deep layers to the shallow layers so that context information is passed effectively between different feature layers and the feature information is further strengthened. The hybrid attention mechanism retains more context information in the network, adaptively adjusts the feature map after addition and fusion, and reduces background interference. Experiments on the Pascal VOC and MS COCO datasets show that the mAP of the MPH-SSD algorithm is 87.7% and 51.1%, respectively. The results show that the MPH-SSD algorithm makes better use of the feature information in the shallow feature layer during small target detection and has better detection performance for small targets.


Introduction
Target detection is a computer vision technique for finding target objects in an image under test. It not only locates each target but also determines its category at the same time. As a key technology in the field of computer vision, small target detection has become a hot research topic and has a wide range of applications in remote sensing image processing [1,2], industry, and medicine [3][4][5][6][7][8].
Small targets are defined in two ways [9]. The first defines objects smaller than 32 × 32 pixels in the MS COCO dataset as small targets. The second compares the target with the image and defines the target as small if it accounts for no more than 10% of the image. Because small targets carry little feature information, feature extraction, recognition, and detection are extremely challenging [10]. In addition, the small area covered by a small target, its limited location information, and its weak feature representation all limit the performance of small target detection [11]. The emergence of multiscale feature fusion has provided new ideas for solving these problems, and multiscale fusion strategies have been applied to various image processing tasks, such as super-resolution image processing [12,13] and semantic segmentation [14,15].
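The two definitions above can be written as simple predicates. The following is a minimal sketch, assuming that both rules are applied to bounding-box area; the text does not say whether the 10% rule refers to area or to side length, so that choice, along with the function and parameter names, is an assumption made for illustration.

```python
def is_small_absolute(width_px, height_px, limit=32):
    """Absolute rule: the object covers fewer pixels than a limit x limit box."""
    return width_px * height_px < limit * limit


def is_small_relative(obj_w, obj_h, img_w, img_h, max_fraction=0.10):
    """Relative rule (assumed to be area based): the object covers no more
    than max_fraction of the image area."""
    return (obj_w * obj_h) / (img_w * img_h) <= max_fraction


# A 20 x 30 object is small under the absolute rule; a 40 x 40 object is not.
print(is_small_absolute(20, 30), is_small_absolute(40, 40))
# A 50 x 50 object in a 300 x 300 image covers about 2.8% of the image.
print(is_small_relative(50, 50, 300, 300))
```

The two rules can disagree: a 40 × 40 object in a 300 × 300 image is not small under the absolute rule but is small under the relative one.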
At present, mainstream target detection algorithms fall into two groups: one-stage and two-stage algorithms. One-stage algorithms, represented by the YOLO series and SSD, adopt a regression strategy to achieve target detection [16][17][18][19][20][21][22][23][24]. Two-stage algorithms, represented by classical convolutional neural network algorithms such as R-CNN and Fast R-CNN [25][26][27][28][29][30], first extract candidate regions from the original input image and then send the candidate-region images to a convolutional neural network for feature extraction so as to classify and detect targets. Existing results show that one-stage algorithms do not need to generate a large number of candidate regions, which reduces detection time and improves the real-time performance of target detection. They therefore have certain advantages in engineering applications and have made good research progress. For example, in 2016, to handle the different target scales encountered in target detection, Liu et al. proposed the SSD (single shot multibox detector) algorithm, which uses feature layers of different scales to detect targets of different scales [31]. However, the algorithm lacks information interaction between shallow and deep layers and does not use the feature information shared between different feature layers, which reduces the detection accuracy for small targets. To improve the detection accuracy for small targets, Fu et al. proposed the DSSD algorithm in 2017, which introduces a residual structure into the SSD algorithm, uses ResNet-101 as the backbone network, and forms a "wide-narrow-wide" hourglass structure in which deconvolution fuses high-level semantic information with low-level semantic information, enriching the multiscale feature maps used for box regression and classification and improving the detection performance for small targets [32]. Because the backbone network was replaced, however, the number of model parameters increased sharply, and the model could not meet the needs of real-time detection. In 2018, Cui et al. proposed the MD-SSD algorithm, which enhances the features of small targets by introducing cross-level connections between shallow and deep layers and improves the detection accuracy for small targets [33]. However, MD-SSD simply fuses the shallow and deep feature layers to increase the information exchange between them and does not process the shallow feature layers to mine their own small target feature information. In 2020, Zhai et al. proposed the DF-SSD algorithm, which replaces the backbone network with DenseNet-S-32-1 to enhance the feature extraction ability [34]. In 2021, Chen and Luo proposed a method that integrates multiscale semantic information to enhance shallow features, which improved the detection of small targets and dense targets [35]. The method in [35] performs only a simple convolution on the shallow feature layer, which loses part of the required feature information during convolution and does not effectively extract the feature information of small targets. Existing results show that fusing the feature information of the target on the basis of the SSD detection algorithm improves the robustness of the algorithm and the performance of small target detection. However, existing work does not make effective use of the feature information contained in the shallow feature layer, so the performance of these models in small target detection remains insufficient, and the detection accuracy still needs to be improved.
In order to improve the detection performance and accuracy of the algorithm model for small targets, this paper proposes an MPH-SSD algorithm that combines the attention mechanism and multiscale double pyramid feature enhancement. First, the shallow feature enhancement module is designed to expand the receptive field of the shallow feature layer, enrich the semantic information in the feature layer for small targets, and enhance the expressive ability of the shallow features. The multiscale information fusion module is used to select feature layers of different scales for fusion, so that the detail information in the shallow feature layer and the semantic information in the deep feature layer are fused. Second, the cascaded double pyramid structure is used to transfer information from the deep layers to the shallow layers so that the context information between different feature layers can be effectively transferred and the feature information can be further strengthened. The attention mechanism is used to enhance the feature response of the target to be detected and reduce the influence of the background on the detection performance of the algorithm.

Basic Model of the SSD Algorithm
The SSD algorithm is a single-stage target detection model based on regression. Using the idea of regression, the bounding box positions and class information of targets are regressed directly from the image. Using a prior (default) box mechanism, 4 to 6 default boxes with different aspect ratios are placed at each cell of each feature layer, giving 8732 default boxes in total. The shallow feature map has rich spatial information and small receptive fields, which benefits target localization, but it lacks semantic information, which harms classification. In contrast, the deep feature map has rich semantic information and large receptive fields, which allows targets to be classified accurately. The shallow feature map has many pixels and is suitable for small target detection, whereas the deep feature map has few pixels and is suitable for large target detection [36].
The SSD algorithm takes VGG16 as its backbone network, replaces the last two fully connected layers of the original VGG16 with convolution layers, appends further convolution layers with different kernels to extract more feature maps, and then uses feature maps of different sizes from these added layers and from the VGG16 network to predict the detected targets independently. The structure of the SSD algorithm framework is shown in Figure 1. As shown in the figure, the algorithm resizes the input image to a fixed size of 300 × 300 × 3 and sends it to the SSD network for processing. The SSD feature extraction network then produces six feature maps of different sizes: 38 × 38 × 512 from layer Conv4_3, 19 × 19 × 1024 from layer Fc7, 10 × 10 × 512 from layer Conv8_2, 5 × 5 × 256 from layer Conv9_2, 3 × 3 × 256 from layer Conv10_2, and 1 × 1 × 256 from layer Conv11_2. Feature extraction for targets of different scales is accomplished by different feature layers: the shallow feature layers mainly extract feature information for small targets, and the deep feature layers mainly focus on feature information for large targets. Finally, the SSD algorithm sets a different number of prior boxes on each feature map according to the chosen aspect ratios and obtains the final bounding boxes of objects after processing by the nonmaximum suppression (NMS) algorithm.
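The total of 8732 default boxes follows from the six feature map sizes listed above. The short sketch below reproduces the arithmetic. The per-layer box counts (4, 6, 6, 6, 4, 4) are the usual SSD300 configuration and are an assumption here, since the text only states that each cell carries 4 to 6 boxes.

```python
# (feature-map side length, default boxes per cell); the per-cell counts are
# the common SSD300 setting and are assumed, not taken from this paper.
SSD300_LAYERS = {
    "Conv4_3": (38, 4),
    "Fc7": (19, 6),
    "Conv8_2": (10, 6),
    "Conv9_2": (5, 6),
    "Conv10_2": (3, 4),
    "Conv11_2": (1, 4),
}


def count_default_boxes(layers):
    """Sum side * side * boxes_per_cell over all prediction layers."""
    return sum(side * side * per_cell for side, per_cell in layers.values())


total = count_default_boxes(SSD300_LAYERS)
print(total)  # 8732

# Share of boxes contributed by the shallowest layer, which handles small targets.
print(round(38 * 38 * 4 / total, 3))
```

Conv4_3 alone supplies 5776 of the 8732 boxes, which is why the quality of this shallow layer matters so much for small target detection.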

Analysis of the MPH-SSD Algorithm Model
The traditional SSD algorithm detects targets of different scales by extracting feature maps of different sizes. However, the shallow feature map lacks semantic information about small targets, so the traditional SSD algorithm performs poorly in small target detection. To solve this problem, this paper presents an MPH-SSD optimization algorithm that combines the attention mechanism with multiscale double pyramid feature enhancement. The overall structure of the algorithm is shown in Figure 2.
As the figure shows, the algorithm has four main modules: the shallow feature enhancement module (SEM), the multiscale feature fusion module (MFM), the double pyramid feature enhancement module (DPEM), and the hybrid attention module (HAM). The algorithm uses VGG16 as the backbone network and incorporates dilated convolution to enlarge the receptive field when shallow features are extracted from the backbone network so as to obtain richer context information. At the same time, because the deep feature layer is rich in semantic information, the cascaded multiscale double pyramid structure is used to enhance the deep feature layer and transfer its semantic information to the shallow feature layer. While transferring the semantic information from the deep feature layer to the shallow feature layer, it also ensures that the rich spatial information in the shallow feature layer can be transferred to the deep feature layer. Finally, feature layers of different scales are fused in the multiscale information fusion module, which further combines semantic information with location and spatial information to extract more context information, thus improving the detection of small targets.

Shallow Feature Enhancement Module.
When feature information is processed in the shallow feature layer, the semantic information of small targets is usually lost because of the small receptive field. To solve this problem, a shallow feature enhancement module is designed to give the model a larger receptive field and to extract feature information at a higher semantic level. The structure of the shallow feature enhancement module is shown in Figure 3. The SEM designed in this paper has two main parts: the feature enhancement part and the connection part. The feature enhancement part applies L2 normalization to the feature map to weaken the effect of large values of certain variables on the model and then performs three parallel dilated convolutions with dilation rates of 1, 2, and 3. The dilated convolutions with dilation rates of 1 and 2 extract the spatial and location information of the small target, and the dilated convolution with a dilation rate of 3 provides contextual information for the small target. Running three branches with different receptive fields in parallel effectively improves the continuity of features. The connection part preserves the feature information from the different dilated convolutional layers by a concatenate (stacking) operation. The concatenate operation is followed successively by a 3 × 3 convolution with the ReLU activation function and a sigmoid activation function to adjust the size and the number of channels of the feature map. At the same time, a BN layer is added to speed up training and convergence, to control gradient explosion, and to prevent overfitting. Finally, the result is multiplied element by element with the original, unprocessed input feature map. Connecting dilated convolutions with different dilation rates effectively solves the problem of voids that remain after convolution with individual dilation rates, thus avoiding feature loss and improving the extraction performance of the shallow feature layer for small target features.
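To make the effect of the three dilation rates concrete, the sketch below computes the effective kernel size of each branch and the input offsets each branch samples along one axis. It assumes 3 × 3 kernels in all three branches, which the text does not state explicitly; the function names are illustrative.

```python
def effective_kernel(kernel, dilation):
    """Side length of the input window covered by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def tap_offsets(kernel, dilation):
    """Input offsets (along one axis) sampled by a dilated kernel."""
    half = kernel // 2
    return [dilation * (i - half) for i in range(kernel)]


for rate in (1, 2, 3):
    print(rate, effective_kernel(3, rate), tap_offsets(3, rate))

# Offsets covered when the three branches are combined.
covered = sorted({o for rate in (1, 2, 3) for o in tap_offsets(3, rate)})
print(covered)
```

With rates 1, 2, and 3 the branches cover windows of side 3, 5, and 7. A single branch with rate 3 samples only offsets -3, 0, and 3, whereas the combined branches sample every offset from -3 to 3, which illustrates how combining rates fills the voids left by any one dilated convolution.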

Multiscale Feature Fusion Module.
In the SSD target detection algorithm, the feature information responsible for small target detection is mainly concentrated in the shallow feature layer. The shallow feature layer has higher resolution and richer detail information, but it contains less semantic information, which leads to mediocre detection performance for small targets.
To solve this problem, this paper designs a multiscale feature fusion module, whose structure is shown in Figure 4.
In MFM, three feature layers with different depths and scales are selected for fusion. The purpose is to fuse the location and spatial information contained in the shallow feature layer with the semantic information in the deep feature layer so as to improve the detection performance of MPH-SSD for small targets. The principle of multiscale feature fusion is

$$X_f = \phi_f\{T_1(X_1),\ T_2(X_2),\ T_3(X_3)\} \tag{1}$$

where $X_1$, $X_2$, and $X_3$ represent the feature layers from the VGG16 feature extraction network that are to be fused, and $T_1$, $T_2$, and $T_3$ represent the transformations applied before fusion, which bring the feature layers of different scales to the same scale for stacking. $\phi_f$ represents the feature fusion function, for which the multiscale information fusion module designed in this paper adopts the concatenate operation. $X_f$ is the fused feature obtained after the fusion operation. In the fusion process, the third and fourth layers processed by the shallow feature enhancement module and the ninth feature layer extracted by the VGG16 backbone network are used as the input of the multiscale fusion module. First, layers Conv3_3 and Conv4_3 are scaled to 64 × 64 × 512 by 3 × 3 convolution, and layer Conv9_1 is scaled to 64 × 64 × 512 by 1 × 1 deconvolution. Finally, concatenate stacking joins the three feature layers of different scales so that the rich detail information in the shallow feature layer and the rich semantic information in the deep feature layer are fused into one feature map. After likewise fusing layers Conv4_3, Fc7, and Conv10_1, and layers Conv5_3, Conv8_2, and Conv11_1, the fused features Fuse1, Fuse2, and Fuse3 are obtained at different scales.
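The concatenate fusion described above only works once the three inputs share a spatial size. The following shape-level sketch shows that constraint and the resulting channel count. It tracks shapes only and does not implement the convolution or deconvolution transforms; the helper name is hypothetical, and any channel reduction applied after concatenation is not described in the text and is therefore left out.

```python
def concat_channels(shapes):
    """Shape of the channel-wise concatenation of (height, width, channels) maps.

    Raises ValueError if the maps do not share the same spatial size, which is
    the condition the transforms T_1..T_3 are meant to establish.
    """
    if not shapes:
        raise ValueError("at least one feature map is required")
    spatial = {(h, w) for h, w, _ in shapes}
    if len(spatial) != 1:
        raise ValueError("all maps must share the same spatial size")
    height, width = spatial.pop()
    return (height, width, sum(c for _, _, c in shapes))


# Three inputs after their transforms, each 64 x 64 x 512 as stated in the text.
transformed = [(64, 64, 512), (64, 64, 512), (64, 64, 512)]
print(concat_channels(transformed))  # (64, 64, 1536)
```

Concatenation keeps every input channel, so the fused map has three times as many channels as each input, unlike element-wise addition, which would keep 512.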

Double Pyramid Feature Enhancement Module.
The feature pyramid is a top-down structure for multiscale fusion of feature information, which addresses the difficulty of handling multiscale features in target detection. Because the feature pyramid transmits feature information in one direction only, some detail information is easily lost during feature enhancement. Therefore, the improvement that the feature pyramid brings to the algorithm model is limited [37].
To solve this problem, a double pyramid feature enhancement module is designed, and its structure is shown in Figure 5(a). As the figure shows, a channel that connects directly to the output is added to the original pyramid structure, similar to the skip connections in the U-Net network [38]. Before DPEM, the deep features extracted by VGG16 are enhanced: the context information in the deep feature layers of different scales is extracted in parallel and fused with the semantic information in the deep feature layer. The three fused features generated by the multiscale fusion module and the enhanced deep features are used as the input of DPEM. The addition of ECA (Efficient Channel Attention) makes the deep features focus more on capturing cross-channel interaction information and prevents the deep network from losing information through dimensionality reduction. The structure is shown in Figure 5(b). Unlike the traditional pyramid structure, the double pyramid structure designed in this paper adds a bottom-up feature transmission channel and a channel between the same nodes that connects directly to the output, which enriches the feature information and allows it to be expressed more accurately in the network structure.
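As a rough illustration of the ECA step mentioned above, the sketch below follows the general ECA recipe: global average pooling per channel, a 1D convolution across neighbouring channels without dimensionality reduction, a sigmoid, and channel-wise rescaling. It uses plain Python lists, and the kernel weights are hypothetical placeholders rather than learned values; how ECA is wired into DPEM in this paper is shown only in Figure 5(b) and is not reproduced here.

```python
import math


def eca_weights(feature, kernel):
    """Channel weights for a feature map given as C channels of H x W lists.

    kernel is an odd-length list of 1D convolution weights (hypothetical
    values here; in a trained network they would be learned).
    """
    # Global average pooling: one descriptor per channel.
    desc = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # 1D convolution across channels with zero padding, so the channel
    # dimension is never reduced.
    pad = len(kernel) // 2
    padded = [0.0] * pad + desc + [0.0] * pad
    weights = []
    for c in range(len(desc)):
        z = sum(k * padded[c + j] for j, k in enumerate(kernel))
        weights.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid
    return weights


def apply_eca(feature, kernel):
    """Rescale every channel by its attention weight."""
    w = eca_weights(feature, kernel)
    return [[[v * w[c] for v in row] for row in ch] for c, ch in enumerate(feature)]


# Three 2 x 2 channels filled with 0, 1, and 2; an identity-like kernel.
feature = [[[float(c)] * 2 for _ in range(2)] for c in range(3)]
print(eca_weights(feature, [0.0, 1.0, 0.0]))
```

Because each weight depends only on a channel and its neighbours, the number of parameters equals the kernel length, which is the property that lets ECA avoid the dimensionality reduction used in heavier channel attention blocks.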

Hybrid Attention Module.
Since the feature maps have different scales and different receptive fields, their feature information differs, and common feature fusion methods do not effectively reflect the correlation and importance of channel and spatial features across scales. This produces overlap effects and position shifts during target detection and ultimately harms the detection task. To address this problem, this paper designs the hybrid attention module shown in Figure 6, with the upper part performing channel attention and the lower part performing spatial attention [39]. The CBAM attention mechanism is changed from its original serial connection to a parallel fusion to form an improved CBAM attention module, which focuses not only on the correlation between channels but also on the positional relationship of feature information, helping the network detect the target more accurately.
To further focus on the location information of the target in the image, the improved CBAM attention module in this paper is followed in series by a location attention module. The rich location information in the shallow feature map is used to extract the dependency between any two locations in the feature map, and the feature at a specific location is updated by a weighted aggregation of the features at all locations. The weight is the similarity between the two corresponding locations, so any two locations with similar features reinforce each other regardless of the distance between them. This reduces the impact of the background and of the negative information generated during feature fusion, retains more contextual information in the network, allocates attention preferentially to the key feature information that is most valuable for the detection task, and adaptively adjusts the feature map after summation fusion.
The channel features $E_i$ and spatial features $S_i$ are obtained by the channel attention module and the spatial attention module, respectively, and the two feature maps are summed element by element to generate the feature map $D_i$. Three parallel 1 × 1 convolutions with the ReLU activation function then generate three feature maps $D_{i1}$, $D_{i2}$, and $D_{i3}$. The feature map $D_{i1}$ is converted into an N × C matrix $D_{i1}^{T}$ by reshape and transpose operations, and the feature map $D_{i2}$ is converted into a C × N matrix $D_{i2}'$ by a reshape operation, where N = W × H. The matrices $D_{i1}^{T}$ and $D_{i2}'$ are then multiplied to obtain the correlation matrix $R$, which can be expressed as follows:

$$R = D_{i1}^{T} \times D_{i2}' = \mathrm{Tran}(\mathrm{Re}(\mathrm{Conv}(D_i))) \times \mathrm{Re}(\mathrm{Conv}(D_i)) \tag{2}$$

In formula (2), Conv() denotes the 1 × 1 convolution function with the ReLU activation layer, Tran() denotes the transpose function, and Re() denotes the reshape function. After the correlation matrix $R$ is obtained, it is reshaped into the feature map $R_R$, and average pooling and sigmoid activation are applied to $R_R$ to obtain the attention matrix $A$. Finally, the attention matrix $A$ and the feature map $D_{i3}$ are multiplied element by element, and the result is added element by element to the feature map $D_i$ to obtain the final feature map $P_i$, which contains the location information of the detection target. The process can be expressed as

$$P_i = (A \otimes D_{i3}) \oplus D_i \tag{3}$$

In formula (3), $\otimes$ denotes element-by-element multiplication and $\oplus$ denotes element-by-element summation. $P_i$ is the output of the location attention module and also the input of the prediction layer of the model in this paper. The hybrid attention module enhances the representation of the feature layer at key locations and highlights the importance of the target location.
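The location attention computation described above can be sketched with plain Python lists. The sketch takes the three convolved maps as given inputs, already flattened to C × N, and makes one loud assumption: the text does not specify how average pooling turns the reshaped correlation matrix into the attention matrix A, so here each row of R is averaged to give one weight per location. All names are illustrative.

```python
import math


def location_attention(d, d1, d2, d3):
    """Location attention on maps flattened to C x N lists (N = W * H).

    d  : fused input map D_i
    d1, d2, d3 : outputs of the three 1 x 1 convolutions (given, not computed)
    """
    channels, n = len(d), len(d[0])
    # Correlation matrix R (N x N): product of the N x C and C x N matrices.
    r = [
        [sum(d1[c][p] * d2[c][q] for c in range(channels)) for q in range(n)]
        for p in range(n)
    ]
    # ASSUMPTION: average pooling is taken over each row of R, followed by a
    # sigmoid, giving one attention weight per location.
    a = [1.0 / (1.0 + math.exp(-sum(row) / n)) for row in r]
    # Element-wise product with D_i3, then element-wise sum with D_i.
    return [[a[p] * d3[c][p] + d[c][p] for p in range(n)] for c in range(channels)]


d = [[1.0, 2.0], [3.0, 4.0]]      # C = 2 channels, N = 2 locations
zeros = [[0.0, 0.0], [0.0, 0.0]]  # zero correlation -> every weight is 0.5
d3 = [[2.0, 2.0], [2.0, 2.0]]
print(location_attention(d, zeros, zeros, d3))  # [[2.0, 3.0], [4.0, 5.0]]
```

The residual sum with the input map means the module can only add to, never erase, the fused features, so an uninformative attention matrix leaves the input close to unchanged.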

Experimental Set-Up.
In order to verify the effectiveness of the proposed method, two datasets, VOC07 + 12 (PASCAL VOC2007 plus PASCAL VOC2012) and MS COCO, were used to validate the algorithm. The PASCAL VOC dataset is a standardized dataset provided by the official PASCAL VOC Challenge. The VOC07 + 12 dataset contains 20 categories (21 classes including background), with 16,551 training images containing 40,058 targets, 8,333 validation images containing 20,148 targets, and 4,952 test images containing 12,032 targets. The MS COCO dataset selected in this experiment is COCO2017, which has 80 object categories in the target detection task and contains 118,287 training images, 5,000 validation images, and 40,670 test images. Most of its targets come from real-life images with rich scenes and a large number of small objects, which makes it suitable for evaluating target detection algorithms.
The model in this paper is implemented in the TensorFlow 2.0 framework with Python 3.7. Training was run on an NVIDIA Tesla V100s 32GB GPU, and the weights of the backbone network were obtained by pretraining on ImageNet. The SGD (stochastic gradient descent) algorithm was used in the experiments, and the initial learning rate was set to 0.002. Cosine annealing was used to adjust the learning rate during training, and the momentum was set to 0.937. To prevent overfitting and promote convergence, the weight decay was set to 0.0005. Two input image sizes were used. When the input image size is 300 × 300, the batch_size of the model is set to 32, and when the input image size is 512 × 512, the batch_size is set to 16. Figure 7 shows the training loss curves of the MPH-SSD model. As can be seen in Figure 7(a), the loss on the VOC dataset decreases rapidly from 17 and finally converges around 3.5, and Figure 7(b) shows that the loss on the COCO dataset decreases rapidly from 25 and finally converges around 4, indicating that the model achieves good results after sufficient training. When MPH-SSD was trained on the VOC dataset, the learning rate was decreased by a factor of 10 after the 125th and 175th iterations, and training converged after the 196th iteration to give the final network weights. When training on the COCO2017 dataset, the learning rate was decreased by a factor of 10 after the 200th and 250th iterations, and training converged after the 292nd iteration to give the final network weights.
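The stepwise learning rate reductions described above can be written as a small schedule function. This is a sketch of the step reductions only: the text also mentions cosine annealing but does not say how the two are combined, so that part is not modelled. The step counter is assumed to be zero-based, so that index 125 is the first step after the 125th has completed.

```python
def step_lr(step, base_lr=0.002, milestones=(125, 175), factor=0.1):
    """Learning rate after multiplying by `factor` at each passed milestone."""
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** drops


# VOC setting from the text: reductions after the 125th and 175th iterations.
for step in (0, 124, 125, 175, 195):
    print(step, step_lr(step))

# COCO setting from the text: reductions after the 200th and 250th iterations.
print(step_lr(260, milestones=(200, 250)))
```

With the stated initial rate of 0.002, the rate is 0.0002 between the two milestones and 0.00002 afterwards, up to floating-point rounding.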

Analysis of Results.
To verify the effectiveness of the proposed algorithm, this section compares the performance of MPH-SSD, the SSD optimization algorithm that integrates the attention mechanism and multiscale double pyramid feature enhancement, with recent CNN-based target detection algorithms. The comparison shows that the proposed algorithm performs better in small target detection. Table 1 lists the experimental results of the proposed algorithm and of recent convolutional neural network-based target detection algorithms on the PASCAL VOC dataset. The results of these algorithms were obtained using the training and validation sets of both VOC2007 and VOC2012 as the training set. The mAP (mean Average Precision) in Table 1 is the mean of the per-category average precision values computed at an intersection over union (IoU) threshold of 0.5 for separating positive and negative samples. FPS indicates the number of images that the algorithm can process per second, which depends not only on the algorithm model but also on the hardware configuration of the experiment.
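The IoU criterion used for mAP can be illustrated with a short function. This is a generic sketch of intersection over union for axis-aligned boxes, not the evaluation code used in the experiments; the box format (x_min, y_min, x_max, y_max) and the function names are assumptions, and whether the comparison with the threshold is strict varies between benchmarks.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(pred_box, gt_box, threshold=0.5):
    """A prediction counts as positive if its IoU reaches the threshold
    (non-strict comparison assumed here)."""
    return iou(pred_box, gt_box) >= threshold


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))           # overlap 1, union 7
print(matches((0, 0, 10, 10), (0, 0, 10, 8)))    # IoU 0.8
```

For small targets a shift of a few pixels changes the IoU sharply, which is one reason small objects are harder to score as correct detections at a fixed threshold.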

Performance Comparison on the PASCAL VOC Dataset.
From the results listed in Table 1, the MPH-SSD algorithm proposed in this paper achieves a detection accuracy of 82.1% at 53.5 FPS when the input image size is 300 × 300 and 87.7% at 24.6 FPS when the input image size is 512 × 512, which maintains real-time detection speed together with high detection accuracy. For an input size of 300 × 300, MPH-SSD improves detection accuracy by 1.6% over the RFB algorithm, by 4.9% over the classical SSD algorithm, and by 3.5%, 3.5%, 3.2%, 2.5%, 3.6%, and 3.3% over the SSD-based improved algorithms DSSD, MD-SSD, DF-SSD, SEFN, RSSD, and FSSD, respectively. Compared with the classical two-stage detectors R-CNN, Fast R-CNN, and Faster R-CNN, the improvements are 28.8%, 12.15%, and 8.9%, respectively, and compared with the popular one-stage algorithms YOLOv1, YOLOv2, and YOLOv3, they are 18.7%, 8.4%, and 2.5%. For an input size of 512 × 512, MPH-SSD improves detection accuracy by 5.5% over the RFB algorithm, by 9.2% over the classical SSD algorithm, and by 6.2%, 6.7%, 6.5%, 6.9%, and 6.8% over DSSD, MD-SSD, SEFN, RSSD, and FSSD, respectively. Compared with R-CNN, Fast R-CNN, and Faster R-CNN, the improvements are 34.4%, 17.7%, and 14.5%, respectively, and compared with YOLOv1, YOLOv2, and YOLOv3, they are 24.3%, 14%, and 9.1%, respectively.
Figure 8 shows the detection accuracy of the proposed algorithm and of several SSD-based optimization algorithms, namely SSD, DF-SSD, DSSD, and RSSD, for the 20 categories of the PASCAL VOC dataset. Figure 8 shows that the detection accuracy of MPH-SSD is greatly improved in nearly all 20 categories, with only a slight shortfall in a few individual classes, and that the improvement is particularly obvious for the small target categories.
Different network construction methods yield different accuracy and speed. To better demonstrate the trade-off between accuracy and speed, Figure 9 compares the proposed algorithm with other algorithms on these two metrics. Figure 9(a) shows mAP versus FPS for small input sizes, and Figure 9(b) shows mAP versus FPS for large input sizes. The figure shows that MPH-SSD strikes a good balance between accuracy and speed, without sacrificing one metric in excessive pursuit of the other, for both small and large input sizes. Table 2 lists the test results of different types of algorithms on the VOC07 + 12 dataset for four small target categories: boat, chair, plant, and bottle. The results for an input size of 300 × 300 show that the detection performance of MPH-SSD on the small target categories of the VOC dataset improves markedly compared with the other algorithms.
To verify the efectiveness of the MPH-SSD algorithm on small target detection, this paper conducts ablation experiments on the VOC 07 + 12 dataset by gradually adding a shallow feature enhancement module (SEM), a multiscale feature fusion module (MFM), a double pyramid feature enhancement module (DPEM), and a hybrid attention module (HAM) to the basic SSD model and by comparing the detection accuracy diferences to analyze the performance of each module of the MPH-SSD algorithm. Te results of the ablation experiments are shown in Table 3.
In order to verify the efectiveness of SEM, the ablation experiment is based on the traditional SSD algorithm, and SEM is added to the shallow feature layer part to enhance the shallow features to improve the model's perception of small target features. Te mAP of the model is improved by 2.8% compared with the SSD algorithm, which proves that SEM can provide more feature information to the model and is benefcial to the small target detection performance.
To verify the efectiveness of MFM, the ablation experiment adds three MFM modules to the traditional SSD algorithm and fuses feature maps at diferent scales through concatenate operation. After adding MFM, the model mAP is improved by 4.2% compared with the SSD algorithm, which proves that MFM can provide help for the model to fuse location space information and semantic information in diferent feature layers.
To verify the effectiveness of DPEM, the ablation experiment adds DPEM on top of SEM and MFM, and the mAP of the model improves by 3.8%, which shows that DPEM provides the model with contextual information at different scales and enables a more accurate representation of feature information in the network.
To verify the effectiveness of HAM, the ablation experiment adds HAM to the first three modules, and the mAP of the model improves by 1.7%, which shows that HAM helps reduce background interference and the impact of negative information generated during feature fusion, alleviating the information imbalance between feature maps in the fusion process.

Comparison of the Actual Detection Effect of Different Detection Methods.
Figure 10 compares the detection results of the proposed MPH-SSD algorithm with those of the classical SSD and DSSD algorithms on the PASCAL VOC dataset. The comparison shows that the SSD algorithm is clearly inadequate for small targets and often misses them. Although the DSSD algorithm, an optimization of SSD, reduces the missed detections of SSD, it still misses small targets and clustered targets. In contrast, the MPH-SSD algorithm, which incorporates the attention mechanism and multiscale double pyramid feature enhancement, effectively improves the detection of small targets, especially for large-size image input.

Performance Comparison on the MS COCO Dataset.
To further illustrate the performance advantage of the proposed algorithm for small target detection, it was also tested on the MS COCO dataset, and the results were compared with other methods from the literature, as shown in Table 4. Here, "IoU = 0.5:0.95" means that 10 thresholds are set in steps of 0.05 between 0.5 and 0.95 and the average precision obtained at each threshold is averaged, and "S, M, L" denote small, medium, and large targets, respectively. The results in Table 4 show that this method significantly improves detection accuracy and recall compared with Faster R-CNN, Mask R-CNN, YOLOv2, SSD512, DSSD513, DF-SSD, SEFN512, FSSD512, and RFB512, which demonstrates that the proposed algorithm performs well in small target detection.
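The threshold sweep behind the 0.5:0.95 metric can be sketched as follows. This is a minimal illustration rather than the evaluation code used for Table 4: the helper names `iou`, `coco_iou_thresholds`, and `average_over_thresholds` are hypothetical, boxes are assumed to be given as (x_min, y_min, x_max, y_max), and the per-threshold average precision values are assumed to have been computed elsewhere.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_iou_thresholds() -> List[float]:
    """The 10 thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_over_thresholds(ap_at: Dict[float, float]) -> float:
    """Average the per-threshold AP values over the 10 thresholds."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at[t] for t in thresholds) / len(thresholds)


# A prediction that overlaps its ground-truth box with IoU 0.81 counts as a
# match at the seven thresholds from 0.50 to 0.80, but not at 0.85 and above.
ground_truth: Box = (0.0, 0.0, 10.0, 10.0)
prediction: Box = (1.0, 1.0, 10.0, 10.0)
overlap = iou(ground_truth, prediction)
matched = [t for t in coco_iou_thresholds() if overlap >= t]
print(overlap, len(matched))  # prints: 0.81 7
```

Because the stricter thresholds reject loosely localized boxes, the averaged metric rewards precise localization, which is one reason it is demanding for small targets.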

Conclusion
In order to improve the detection performance of small targets during target detection, an MPH-SSD optimization algorithm that fuses the attention mechanism and multiscale double pyramid feature enhancement is proposed in this paper. Based on the SSD algorithm, four modules are designed: a shallow feature enhancement module, a multiscale feature fusion module, a double pyramid feature enhancement module, and a hybrid attention module. First, dilated convolution is used to enlarge the receptive field of the shallow feature layer and enrich its semantic information, so that the regions containing small targets in the shallow feature layer are enhanced by both the original spatial information and the semantic information. Then, the enhanced feature map is fused with the deep feature layer to improve the semantic information of the shallow feature layer and the detail information of the deep feature layer, while also improving the feature extraction and generalization ability of the model. Next, a double pyramid structure is used to construct a bottom-to-top feature transmission channel that further enriches the feature information of small targets. Finally, a hybrid attention mechanism is used to retain more contextual information in the network, adaptively adjusting the feature maps after summation and fusion to reduce background interference. The MPH-SSD algorithm is evaluated on the PASCAL VOC and MS COCO datasets, where its mAP is 87.7% and 51.1%, respectively. The results show that the MPH-SSD algorithm makes better use of the feature information in the shallow feature layer during small target detection and achieves better detection performance for small targets.
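As a brief illustration of why enlarging the convolution widens the receptive field, the sketch below assumes that the expansion refers to dilated convolution and applies the standard effective-kernel-size relation k_eff = k + (k - 1)(d - 1). The layer configurations are hypothetical examples and are not the settings of the shallow feature enhancement module.

```python
from typing import Iterable, Tuple


def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Span covered by a kernel whose taps are spaced `dilation` apart."""
    return kernel + (kernel - 1) * (dilation - 1)


def receptive_field(layers: Iterable[Tuple[int, int, int]]) -> int:
    """Receptive field of a stack of (kernel, stride, dilation) conv layers."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (effective_kernel_size(kernel, dilation) - 1) * jump
        jump *= stride
    return rf


# Hypothetical comparison: three 3x3, stride-1 layers with and without dilation.
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 3)]
print(receptive_field(plain), receptive_field(dilated))  # prints: 7 13
```

The dilated stack covers a wider context with the same number of weights, which is the property exploited when the receptive field of a shallow layer is enlarged without adding parameters.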
Although the proposed algorithm improves small target detection accuracy, the double pyramid feature enhancement module increases the network computation to a certain extent, which slightly reduces the speed of the algorithm. Reducing redundancy and optimizing the network structure while maintaining accuracy will be the main direction of future research.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.