Remote Sensing Data Detection Based on Multiscale Fusion and Attention Mechanism

Remote sensing images are often of low quality due to equipment limitations, resulting in poor image accuracy, and it is extremely difficult to identify a target object when it is blurred or small. The main challenge is that objects in remote sensing images occupy very few pixels. Traditional convolutional networks struggle to extract enough information through local convolution and are easily disturbed by noise points, so they usually perform poorly when classifying and detecting small targets. The current solution is to process feature map information at multiple scales, but this approach does not consider the supplementary effect of the feature map's context information on semantics. In this work, to enable CNNs to make full use of context information and improve their representation ability, we propose a residual attention feature fusion method, which improves the representation ability of feature maps by fusing contextual feature map information at different scales, and then propose a spatial attention mechanism based on a global pixel convolution response. This method compresses global pixels through convolution, weights the original feature map pixels, reduces noise interference, and improves the network's ability to grasp global critical pixel information. Ship recognition experiments on remote sensing image data sets show that the network structure improves the performance of small-target detection. The results on cifar10 and cifar100 demonstrate that the attention mechanism is general and practical.


Introduction
At present, sensor information is a hot research topic. The acquisition of sensor information and mobile computing are relatively mature [1].
There have been many outstanding achievements in the research of the basic data types of sensor information, and many successful algorithms have been proposed [2,3]. For image sensors, information processing faces difficulties: because of the limitations of some devices, the pictures obtained by the sensor are characterized by heavy noise, small targets, and blurred targets.
With the development of deep learning, target detection technology based on deep learning has made significant progress [4][5][6][7][8]. However, due to the weakness of image sensor information, small-target detection faces huge difficulties and challenges. First, it is easy to lose feature information during the convolution process, which affects the convergence and accuracy of the network. Second, traditional convolution mainly relies on stacking small convolution kernels to expand the receptive field. In the early stages of the network, the local convolution of a small kernel treats feature points and noise points equally, which is not conducive to the network's grasp of the feature points and affects network convergence.
To improve the detection performance of small targets, researchers have carried out much research on network structure, training strategy, data processing, etc. However, compared with large- and medium-target detection, there is still a significant gap in the performance of small-target detection. The target scale is one of the critical factors that affect detection performance. At present, the detection accuracy of small targets is far lower than that of large and medium-sized targets, both on open data sets and on real-world images, and there are often missed and false detections. Nevertheless, small-target detection has essential applications in many natural scenes.
In recent years, deep convolutional neural networks (CNNs) have made significant progress in target detection [9][10][11]. CNNs realize feature extraction, candidate region proposal, bounding box regression, and the discrimination of object categories. However, the CNN detector is not well suited to small-target detection because of the nature of the convolutional and pooling layers. These layers reduce the number of parameters in the network and the dimensionality of the image. The resolution of the feature map is therefore much lower than that of the original input image, which makes classification and bounding box regression very difficult. Therefore, whether with one-stage detectors [12] such as YOLO [13] and SSD [14] or the two-stage Faster R-CNN [15], the effect of small-target detection is not ideal. Since then, there have been some improved methods for small-target detection in deep learning, such as multiscale fusion and scale invariance. Feature Pyramid Networks (FPNs) [16,17] combine low-level location information with high-level semantic information by propagating the high-level features downward. The problem of small-target detection results in part from deep learning detection algorithms using only the top-level feature map for classification and prediction while ignoring the location information of low-level features; the pyramid structure of FPNs helps resolve this issue. Scale Normalization for Image Pyramids (SNIP) realizes multiscale image input, improves the precision of the preselector, and promotes the effect of small-target detection.
Since single-scale feature maps are not good at representing objects of different sizes and shapes, extracting relevant information from different layers naturally alleviates this contradiction. The multiscale deep convolutional neural network (MS-CNN) [18] extracts proposal regions from feature maps of different scales and uses deconvolution in place of upsampling the input image to improve speed and accuracy. The single-shot multibox detector (SSD) extends several additional convolution layers on top of a truncated Vgg-16 [19] as its backbone network and sets different default box sizes according to different receptive fields, so it can better predict targets of various scales.
This bottom-up pyramid hierarchy can detect objects of different sizes separately. However, although the use of low-level features is deliberately avoided, the shallow layers of a convolutional neural network cannot fully extract features, which still limits the performance of the detector on small-scale targets.
Recently, detection networks based on multiscale fusion features have been devised. HyperNet [20] is better than Fast R-CNN at processing small objects and generates higher-quality proposals thanks to the interaction between multilayer feature fusion and different sampling strategies. FPNs alleviate this contradiction through an additional top-down architecture, enhancing semantic information through upsampling and adding details through horizontal connections to construct a high-level semantic feature map. The deconvolutional single-shot detector (DSSD) [21] uses a deconvolution module to build a feature pyramid on the SSD benchmark network. Detection networks based on multiscale fusion features improve detection accuracy by injecting large-scale context information. However, the fusion between the corresponding layers of the bottom-up and top-down architectures is not effective enough, and it depends on the quality of the top-level features.
To further alleviate the problem that spatial features of small regions can be lost in a deep network, the proposed method combines two adjacent layers to enrich the context information. Compared with other complex fusion methods, the representation ability of the feature map fused by our method is not inferior, because two close-range features are highly complementary and related, whereas some fusions that seem beneficial to feature representation actually damage both sides of the features. A global pixel convolution (GPC) attention mechanism is introduced into the convolutional neural network so that the characteristics of different modules change adaptively as the network deepens. Experimental results show that the proposed model improves the accuracy of small-target recognition in remote sensing images. In summary, our main contributions are threefold: (i) We propose a multiscale feature fusion structure, which makes full use of the attention mechanism together with multiscale features to exploit the supplementary effect of context information on semantics. (ii) We propose a global pixel convolution attention mechanism, which helps the network learn global pixel information, overcomes to a certain extent the locality of traditional convolution, and better grasps the key features of the image. (iii) On the benchmark data sets cifar10 and cifar100, our GPC attention mechanism outperforms current attention mechanisms such as CBAM [22] and SENet [23] in accuracy and achieves the best SOTA results.

Network Structure.
The research of network structure cannot be ignored in deep learning. At present, the primary way to improve accuracy is to improve the network structure [24,25]. Since the successful introduction of CNNs, many studies have focused on improving network structure, and a variety of structures have been proposed.
The VGGNet model shows that as the depth of the network increases, the accuracy continues to improve, but as gradients propagate through ever deeper networks, gradient vanishing becomes more and more obvious, which hinders further optimization. ResNet [26] proposed identity-based skip connections to alleviate this problem; based on the ResNet structure, deepening the network became possible. To deal with the instability of gradient descent caused by large differences in pixel convolution values, batch normalization improves the stability of the learning process and makes gradient descent smooth. At present, networks are usually optimized in three aspects: depth, width, and cardinality. For example, WideResNet [27], which has more convolution filters and a smaller depth, broadens the width. ResNeXt [28] uses grouped convolution and multibranch convolution to avoid the parameter explosion caused by increasing depth and shows that cardinality can better improve classification accuracy. DenseNet [29] iteratively concatenates the input features with the output features so that each convolution can receive the original information, further mitigating the vanishing gradient problem. However, we focus on another area: the attention mechanism [30,31]. Attention is one of the most intriguing aspects of the human visual system. We fuse the attention mechanism with multiscale features and obtain better results.

Attention Mechanism.
For small-target detection, whether features can be captured is particularly important. The attention mechanism, which helps the network learn features useful to the task while suppressing features that are unimportant to it, has gradually stepped onto the stage in recent years. The attention mechanism differs from previous methods of enhancing a network: it does not increase the width, depth, or cardinality of the network but improves performance through selective, finer calibration and weighting of the existing feature maps.
Recently, the attention mechanism has been applied with considerable success in deep vision learning. The squeeze-and-excitation network (SENet) considers the relationship between feature channels and adds an attention mechanism to them. SENet automatically learns the importance of each feature channel and uses the learned importance to enhance useful features and suppress features unimportant to the current task. SENet focuses on channel attention, which significantly improves network performance, but it does not consider spatial attention. The convolutional block attention module (CBAM) combines channel attention and spatial attention. Like SENet, CBAM automatically learns the importance of each feature channel; in addition, the importance of each spatial location is learned in a similar way, and this importance is used to enhance useful features and suppress unimportant ones. The CBAM method of extracting spatial attention is as follows: after channel attention, the channel-weighted feature map is sent to the spatial attention module. Similar to the channel attention module, the spatial attention module pools the feature map channel-wise; the two pooling results are concatenated and then convolved into a 1 * w * h spatial weight map. Then, the dot product of the weight map and the input feature realizes the spatial attention mechanism. This kind of spatial attention improves network performance to a certain extent, but it is still a pooling-based connection mode, which takes little account of the overall spatial situation. Nonlocal neural networks [32], a self-attention model proposed by Wang Xiaolong at CVPR 2018, have an implementation similar to nonlocal means.
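The CBAM-style spatial attention recipe described above (channel-wise pooling, concatenation, convolution to a single-channel weight map, element-wise gating) can be sketched as follows. This is an illustrative NumPy sketch with assumed random weights, not the authors' code; the 7 × 7 convolution over the concatenated two-channel map is expressed as the equivalent sum of two single-channel convolutions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, k):
    """Naive 'same' 2D correlation of a single-channel map x with kernel k."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def cbam_spatial_attention(feat, k_max, k_avg):
    """feat: (C, H, W). Pool over channels, convolve, gate the input."""
    max_map = feat.max(axis=0)    # (H, W) channel-wise max pooling
    avg_map = feat.mean(axis=0)   # (H, W) channel-wise average pooling
    # Convolving the 2-channel concat [max; avg] with one 7x7x2 kernel
    # equals summing two single-channel 7x7 convolutions:
    weight = sigmoid(conv2d_same(max_map, k_max) + conv2d_same(avg_map, k_avg))
    return feat * weight[None, :, :]   # broadcast the (0,1) gate over channels

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 16))
k1 = rng.normal(size=(7, 7)) * 0.1
k2 = rng.normal(size=(7, 7)) * 0.1
y = cbam_spatial_attention(x, k1, k2)
```

Because the gate lies in (0, 1), the output never exceeds the input in magnitude; unimportant locations are attenuated rather than removed.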
Ordinary filtering moves a 3 × 3 convolution kernel over the whole picture, processing 3 × 3 local information, whereas the nonlocal-means operation combines a relatively large search range and carries out weighting. The proposed nonlocal operations directly capture long-range dependencies by calculating the interaction between any two locations instead of being limited to adjacent points. It is equivalent to constructing a convolution kernel of the same size as the feature map in order to retain more information. Although this method has proved effective, it involves matrix multiplication over a B × C × W × H feature map, and the computational cost cannot be ignored. Later, GCNet [33] optimized the computation, but the optimization still kept matrix multiplication at its core and made no substantial change. In our spatial attention mechanism of global pixel response, we use spatial attention based on an efficient architecture while maintaining a small number of parameters, and we verify the feasibility of this architecture empirically. In addition, our module proves effective on recognition tasks (cifar100 and cifar10). In particular, we achieve state-of-the-art performance on the cifar100 test set simply by placing the module on top of existing models.
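For contrast, the pairwise-affinity computation that makes nonlocal blocks expensive can be sketched as below. This is a simplified embedded-Gaussian variant with assumed random projection matrices and without the usual output projection and residual connection, shown only to expose the (H·W) × (H·W) matrix product the text criticizes.

```python
import numpy as np

def nonlocal_block(x, w_theta, w_phi, w_g):
    """x: (C, H, W). Each position aggregates all others via softmax affinities."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)              # (C, HW)
    theta = w_theta @ flat                  # (C', HW) query embedding
    phi = w_phi @ flat                      # (C', HW) key embedding
    g = w_g @ flat                          # (C', HW) value embedding
    aff = theta.T @ phi                     # (HW, HW): quadratic in pixel count
    aff = np.exp(aff - aff.max(axis=1, keepdims=True))
    aff /= aff.sum(axis=1, keepdims=True)   # softmax over all positions
    out = g @ aff.T                         # mix values from every position
    return out.reshape(-1, H, W)

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 10, 10))
wt, wp, wg = (rng.normal(size=(4, 8)) * 0.1 for _ in range(3))
y = nonlocal_block(x, wt, wp, wg)
```

The (HW, HW) affinity matrix is the cost the GPC mechanism later avoids: for a 100 × 100 feature map it already holds 10^8 entries.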

Network Structure.
To better capture the feature information of small-target feature maps, our proposed network architecture not only uses multiscale feature map information but also makes full use of the contextual information of the feature maps, fusing maps of different sizes. In addition, to better process the fused feature maps, the global pixel convolution attention mechanism is applied to them, which further improves network performance.
Low-level high-resolution feature maps are effective for detecting small targets, because the receptive field of a small target in the feature map is relatively small and the accurate low-level localization features undoubtedly help detection. In this paper, a feature extractor consistent with the original SSD is used: Vgg-16 generates a multiscale feature map at a scaling step of two (additional convolution layers are appended to the end of the truncated Vgg-16). The iterative process can be expressed as

f_i(x) = C_i(f_{i-1}(x)), with f_0(x) = x and predictions y_n = p_n(f_n(x)).
Among them, C_i is the ith convolution block of the backbone network and f_i(x) is the selected feature. As the index i grows, the feature layer becomes deeper. p_n is the prediction layer responsible for converting the feature map into classification confidences and bounding boxes. In this paper, the multiscale detector is further optimized by two interconnected modules, which are organically combined into a novel network. The network structure is shown in Figure 1.

Fusion Module.
Although most detectors use various multiscale structures to handle object diversity in images, progress on small objects is not satisfactory. Large targets get reliable detection results because their essential characteristics are not easily lost while propagating through the convolutional neural network. Unfortunately, detecting small objects is awkward because of the weakness of low-level features and the information defects of high-level features.
This paper proposes a method to obtain complementary information by combining two close-range feature layers, based on three observations: (1) because of the apparent difference between the two features, combining shallow features with much deeper ones reduces the reliability of deep predictions; (2) in-depth features with a large receptive field usually introduce a lot of useless background noise into the shallow layer; (3) close-range feature layers usually retain the most helpful information, and the convolution layers used for small-object detection are thereby enhanced.
Therefore, this paper focuses on the fusion between adjacent layers to capture their complementarity.
In the fusion module, as shown in Figure 2, a set of 1 × 1, 3 × 3, and 1 × 1 convolution kernels is used to process the shallow features of size 2W × 2H × D2 (W, H, and D2 represent the width, height, and number of channels of the feature map, resp.), while the adjacent deeper layer has size W × H × D1. D in Figure 2 represents the number of convolution kernels. In the feature map processing, 1 × 1 convolutions are added as buffers.
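A shape-level NumPy sketch of this fusion step follows. The text does not fully specify how the two branches are merged, so the stride-2 downsampling of the shallow branch, the 1 × 1 projection of the deeper layer, and the element-wise sum are assumptions for illustration.

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise channel mixing. x: (C_in, H, W); w: (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def conv3x3_stride2(x, w):
    """Naive 3x3 stride-2 convolution halving the spatial size.
    x: (C_in, H, W); w: (C_out, C_in, 3, 3)."""
    c_in, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    oh, ow = h // 2, wd // 2
    out = np.zeros((w.shape[0], oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = xp[:, 2 * i:2 * i + 3, 2 * j:2 * j + 3]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def fuse(shallow, deep, p):
    """shallow: (D2, 2H, 2W); deep: (D1, H, W) -> fused (D, H, W)."""
    s = conv1x1(shallow, p['buf_in'])      # 1x1 buffer
    s = conv3x3_stride2(s, p['down'])      # 3x3 aligns spatial size with deep layer
    s = conv1x1(s, p['buf_out'])           # 1x1 buffer to D channels
    d = conv1x1(deep, p['proj'])           # project deep layer to D channels
    return s + d                           # assumed element-wise fusion

rng = np.random.default_rng(2)
D2, D1, D = 16, 32, 24
params = {'buf_in': rng.normal(size=(D, D2)) * 0.1,
          'down': rng.normal(size=(D, D, 3, 3)) * 0.1,
          'buf_out': rng.normal(size=(D, D)) * 0.1,
          'proj': rng.normal(size=(D, D1)) * 0.1}
fused = fuse(rng.normal(size=(D2, 8, 8)), rng.normal(size=(D1, 4, 4)), params)
```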

Global Pixel Convolution Attention Mechanism.
The traditional convolution operation often uses a small convolution kernel to perform local convolution on all feature map channels, so its receptive field is small. It relies on continuously repeating the convolution operation to obtain a transitive receptive field.
This method comprehensively processes the channel information of the feature map, but because of local convolution it may extract local noise points, which hinders the extraction of key feature points and affects the convergence and accuracy of the network. The global pixel convolution operation proposed in this paper (Figure 3) convolves the global pixels within a specific channel, extracts useful information from them, enhances the attention of the network, helps the traditional convolutional network converge faster, and yields better accuracy.
To make full use of the integrity of image pixels, we propose a spatial attention mechanism based on the convolution response of global pixels. It reduces the number of parameters and improves performance significantly. To avoid the excessive parameters caused by nonlocal operations, we propose a sequence-change operation, using convolution to compress the pixels. Then, we realize global pixel convolution through three steps followed by weight multiplication with the original pixels: sequence arrangement, single-channel global pixel convolution, and sequence recovery, as shown in Figure 4.
The traditional convolution operation uses a small convolution kernel such as 3 × 3, a sliding-window-like calculation over local pixels in a multichannel feature map. By stacking convolution modules, the receptive field is enlarged.
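As a quick numeric illustration of how slowly stacked small kernels enlarge the view, the effective receptive field of a stack of stride-1 k × k convolutions grows only linearly with depth:

```python
def stacked_receptive_field(num_layers, k=3):
    """Receptive field of num_layers stacked stride-1 k x k convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += k - 1   # each layer adds k-1 pixels of context
    return rf

# Two 3x3 layers see a 5x5 window; reaching a 31x31 window takes 15 layers.
```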
However, the receptive field obtained this way is still limited; it cannot help the network grasp the key points of the global receptive field right away. This paper proposes a spatial attention module that processes all the pixels in a specific channel of the feature map by means of sequence change and convolution, as shown in Figure 5.
This method can be expressed as

Y = X ⊙ recovery(σ(f_ex^{3×1}(f_co^{3×1}(arrangement(X))))),

where arrangement and recovery are a pair of inverse operations whose main function is to adjust the dimensions. Before and after the dimension change, the position of each pixel remains unchanged, so the result can be multiplied element-wise with the original feature map pixels. X represents the input feature x ∈ R^{C×H×W}. For arrangement, x → U ∈ R^{(H×W)×C×1}; recovery is the inverse restore operation, x → U ∈ R^{C×H×W}. σ refers to the sigmoid function, and f^{3×1} represents a convolution operation with a filter size of 3 × 1. For f_co^{3×1}, x → U ∈ R^{((H×W)/r)×C×1}; for f_ex^{3×1}, x → U ∈ R^{(H×W)×C×1}, where r is the reduction ratio and n is the filter size. In the experiments, we set r to 16 and n to 3; these can be adjusted as in Table 1.
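A minimal NumPy sketch of the arrangement / compress / expand / recovery pipeline follows. The exact form of the learned f_co and f_ex convolutions is not fully specified in the text, so the stride-r compression and the nearest-neighbour re-expansion before f_ex are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_nx1(u, k, stride=1):
    """Convolve each channel of u: (L, C) along the pixel axis with kernel k: (n,)."""
    n = k.shape[0]
    pad = n // 2
    up = np.pad(u, ((pad, pad), (0, 0)))
    out_len = u.shape[0] // stride
    out = np.zeros((out_len, u.shape[1]))
    for i in range(out_len):
        out[i] = k @ up[i * stride:i * stride + n]
    return out

def gpc_attention(x, k_co, k_ex, r=16):
    """x: (C, H, W). Global pixel convolution spatial attention."""
    C, H, W = x.shape
    u = x.reshape(C, H * W).T              # arrangement: (H*W, C)
    z = conv_nx1(u, k_co, stride=r)        # f_co: compress pixels by ratio r
    z = np.repeat(z, r, axis=0)[:H * W]    # assumed re-expansion to H*W pixels
    a = sigmoid(conv_nx1(z, k_ex))         # f_ex + sigmoid: (H*W, C) weights
    a = a.T.reshape(C, H, W)               # recovery: back to (C, H, W)
    return x * a                           # weight the original pixels

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 16, 16))           # H*W = 256 is divisible by r = 16
y = gpc_attention(x, rng.normal(size=3) * 0.1, rng.normal(size=3) * 0.1, r=16)
```

Note that, unlike the nonlocal block, no (H·W) × (H·W) matrix is ever formed; the cost is linear in the number of pixels.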
The global pixel convolution attention mechanism proposed in this paper can be easily combined with almost any deep learning model, and the number of additional parameters is small, so it does not burden the model itself. As a spatial attention mechanism, it can also cooperate with other attention models. In the later cifar100 ablation experiments, our proposed attention mechanism, used alone, surpasses the current mainstream SENet and CBAM. Taking CBAM as the base model, the experimental results show that our global pixel convolution attention mechanism, as a spatial attention mechanism, has much higher accuracy than the spatial attention mechanism in CBAM. When our GPC attention replaces CBAM's spatial attention mechanism, we even obtain the best result on cifar100. Taking ResNet-50 as an example, we give a way of combining GPC attention for reference in Figure 6.

Target Detection on Airbus Ship Data Set.
In this paper, the experiments are carried out on a computer with two NVIDIA 2080ti GPUs using the MXNet framework. As with SSD, the pretrained Vgg-16 (on the ILSVRC CLS-LOC data set) is used to initialize the model for a fair comparison. We use a remote sensing ocean ship detection data set.
The remote sensing images have been preprocessed by denoising, smoothing, and filtering. The data set includes a training set and a test set; there are 1925526 images in the training set and 15606 images in the test set. A CSV file provides the run-length code for each training image, which is used to locate the ships and generate the mask and bounding box of the image. The coding information and statistical results of the training set images are shown in Table 2. The codes in Table 2 represent rectangular boxes used to frame the vessels in the image. If the code is NaN, there are no vessels in the image. The encoding string format is start point, length, start point, length, . . ., with each pair (start point, length) representing a pixel run of the given length from the start point. The starting position is not a two-dimensional coordinate but an index into a one-dimensional array, so the two-dimensional image is compressed into a one-dimensional pixel sequence. After decoding the run-length code, 1 in the array represents the mask and 0 represents the background. The mask information is overlaid on the corresponding image and visualized in a transparent color; the result is shown in Figure 7. Figure 8 is a statistical chart of the images in the training set. It can be seen that 78% of the images contain ships, and the number of ships across all images is 81723. Due to the class imbalance of the samples, images without vessels are downsampled to prevent excessive noise during model training. Among the images with ships, those with 1-2 ships account for the vast majority. Too few tiny targets mean little information about small targets, which may cause the trained model to pay more attention to other features. Therefore, to ensure the relative balance of sample types, this paper oversamples the samples containing a certain number of vessels.
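The run-length decoding described above can be sketched as follows. The 1-indexed starts and column-major ('F' order) flattening assumed here are the standard Kaggle Airbus ship encoding and may need adjusting for other data sets.

```python
import numpy as np

def rle_decode(rle, shape=(768, 768)):
    """Decode 'start length start length ...' into a binary mask.
    Assumes 1-indexed starts over a column-major flattened image."""
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    if isinstance(rle, str) and rle.strip():   # NaN / empty -> no vessels
        nums = [int(n) for n in rle.split()]
        for start, length in zip(nums[0::2], nums[1::2]):
            mask[start - 1:start - 1 + length] = 1   # 1 = ship, 0 = background
    return mask.reshape(shape, order='F')

# A run of 3 pixels starting at position 1 in a 2x2 image:
m = rle_decode("1 3", shape=(2, 2))
```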
Each model uses the same aerial remote sensing images as the detection data set. The processing pipeline of the different models remains the same: training and testing under different thresholds, recording the F2 score, and finally taking the average F2 score as the final F2 score of the model. Table 3 shows the evaluation results of the seven models. The evaluation includes recall rate, accuracy rate, F2 score, and evaluation time. Mask R-CNN effectively utilizes the mask information in the data set, so its indicators are better than those of the other comparison models, but its time consumption is slightly higher. Our model adds the GPC attention mechanism at multiple stages, and the proposed region anchor boxes are accurate, so its indicators are significantly better than those of the other models, though the evaluation time is also slightly longer. The experimental results show that combining the attention mechanism with SSD can effectively improve small-target detection.
As can be seen from Figure 9, SSD misses some detections, and our method is better than SSD. For small targets, with the GPC attention mechanism, the detection accuracy of our method is much higher than that of SSD.

Classification on cifar10.
We combine the proposed GPC attention mechanism with current CNN models and compare it with CNN-based SOTAs. We also compare it with the most widespread attention mechanisms and run ablation experiments. The experimental environment uses a single consumer GPU, a GeForce RTX 2080ti with 24 GB, and a 12-thread Intel® Core™ i7-8700k CPU. The performance test of the GPC attention mechanism proposed in this paper is mainly carried out on the benchmark data sets cifar10 and cifar100. For cifar10 and cifar100, the same data augmentation combination is adopted: random crop of size 32 * 32 with padding 4, random horizontal flip, and normalization [38].

Table 1: The three columns refer to ResNet-50, SE-ResNet-50 based on the ResNet-50 backbone network, and the corresponding GPC-ResNet-50. Inside the brackets is the general shape of the residual block, including the filter size and feature size, and the optimal position for inserting the attention mechanism. The number of stacked blocks at each stage is shown outside the brackets. #P indicates the number of network parameters.
For cifar10, the training parameters are set as follows: every training run uses a cosine annealing learning rate schedule with a half cycle. The initial learning rate is 0.1, momentum is 0.9, weight decay is 0.0005, and batch size is 100. The reduction ratio r is set to 16. The total number of epochs is 200. The main detection indicator is Top-1 accuracy, and parameter counts are compared to prove that our module does not place a huge computational burden on the original model. The structure of the reimplemented SOTAs follows the work at https://github.com/kuangliu/pytorch-cifar.
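The half-cycle cosine annealing schedule mentioned above follows the standard formula, with the base rate and epoch count taken from the text:

```python
import math

def cosine_lr(epoch, total_epochs=200, base_lr=0.1):
    """Half-cycle cosine annealing: decays base_lr smoothly to 0."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

# epoch 0 -> 0.1, epoch 100 -> 0.05, epoch 200 -> 0.0
```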
From Table 4, adding the GPC attention module improves the accuracy of the ResNet-50 baseline without significantly affecting the number of parameters. Through the GPC spatial attention mechanism, the network can capture the critical features of the feature map faster and converge faster, so better results can be obtained without very deep networks. For example, the results of ResNet-50 + GPC in the table are even better than those of ResNet-101 and DenseNet-201. After combining with the channel attention of CBAM, the best result on the cifar10 data set is obtained.

Classification on cifar100.
For cifar100, the training parameters are set as follows: the initial learning rate is 0.1 and drops every 60 epochs with gamma 0.2; momentum is 0.9. The primary detection indicators are Top-1 error and Top-5 error, and parameter counts are compared to prove that our module does not place a substantial computational burden on the original model. The structure of the reimplemented SOTAs follows the work at https://github.com/weiaicunzai/pytorch-cifar100.
From Table 5, the GPC spatial attention mechanism achieves better results on cifar100, surpassing most models. From Figure 10, after adding the GPC spatial attention mechanism, ResNet-50 outperforms the original model throughout the entire training process. Through the convolution response to global spatial pixels, the network focuses on the more valuable pixels, achieving faster convergence and a higher accuracy rate. It improves the accuracy of the original model and is more efficient than the currently popular SE and CBAM attention mechanisms. Without additional data sets or transfer learning, it obtains the highest SOTA results, with only a slight increase in parameters. In terms of accuracy, ResNet-50 + channel (CBAM) + GPC is the only model with an accuracy rate over 80% without too many parameters.

Ablation Experiment on cifar100
In this section, we carry out ablation experiments to understand the impact of different parameters and configurations on the GPC module. All ablation experiments are performed on the benchmark data set cifar100, using ResNet-50 as the backbone architecture. All training parameters are the same as in Section 4.3; only the GPC module is modified to study its impact on the results. The main comparison bases are Top-1 error, Top-5 error, and the number of parameters.

Reduction Ratio.
The reduction ratio r introduced in the GPC attention module is a compression ratio of the global pixels.
The value of r significantly affects the number of parameters and the computational complexity. We use ResNet-50-GPC for experiments and set a series of different r values. Table 6 shows that r = 16 achieves good results, but the practical effect does not change monotonically with r. It may simply be that r = 16 suits the cifar100 data set. The size of the feature map changes as the number of network layers changes; in ResNet-50, for example, the feature map keeps halving, so dynamically adjusting the r value may further improve the accuracy.

Convolution Kernel.
The default convolution kernel in the GPC attention module is 3 × 1: because the feature map has been flattened in the GPC module to facilitate convolution, the kernel can only be set to n × 1. We run a series of experiments over values of n. Table 7 shows that when n is small, such as 1 or 3, better results are obtained. Multiple channels improve the robustness of the network, but too many channels interfere with the results of global pixel convolution under different channels, leading to slow network convergence.

Comparison of Different Spatial Attention Methods.
We compare the GPC spatial attention mechanism with current mainstream attention mechanisms. Table 8 shows that using GPC alone as a spatial attention module is effective, and when GPC is combined with other channel attention modules, it produces even better results, which fully illustrates the efficiency and practicality of GPC as a spatial convolution attention mechanism.

Conclusion
This paper presented a global pixel convolution spatial attention mechanism that compresses the global pixel response within the same channel.
Thus, it improved the overall grasp of space and was integrated with the multiscale feature fusion module proposed in this paper. The paper also developed a feature pyramid framework with an adaptive attention mechanism. The framework considered the positive influence of rich context information on classification and localization, as well as the guidance of high-level semantic features on global features. Tasks on multiple data sets achieved the best performance. In addition, the GPC attention mechanism proposed in this paper is a new convolution method, which avoids the problem of too many parameters caused by the nonlocal matrix operation. The weight assignment of pixels is completed by convolution, and good results were obtained. We believe that this convolution method offers an alternative way to think about network architecture in the future.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
All the authors declare no conflicts of interest regarding the publication of this paper.