SACTNet: Spatial Attention Context Transformation Network for Cloud Removal

Optical remote sensing images have the advantages of fast information acquisition, short update cycles, and dynamic monitoring. They play an important role in many earth observation activities, such as ocean monitoring, meteorological observation, land planning, and crop yield investigation. However, during image acquisition, an optical remote sensing system is often disturbed by clouds, resulting in low image clarity or even loss of ground information, affecting the acquisition of feature information and subsequent applications. We propose a spatial attention recurrent neural network model combined with a context transformation network to overcome the challenge of cloud occlusion. This model can obtain the core information in remote sensing images while accounting for long-range dependencies in the network. Furthermore, the proposed network achieves excellent performance on the RICE1 and RICE2 datasets.


Introduction
An optical remote sensing system has been applied to many fields in recent years, such as earth resource survey, vegetation classification, and land use. If a remote sensing image is polluted by cloud and cloud shadow, the ground information becomes sparse or even completely obscured, resulting in the loss of information. Therefore, using auxiliary means to remove clouds from remote sensing images is a very meaningful research direction. According to the degree of cloud cover, the problem can be divided into thin cloud cover and thick cloud cover.
Lin et al. [1] provided a dataset of remote sensing images called RICE and studied the cloud removal of remote sensing images. This dataset contains two subdatasets RICE1 and RICE2. The RICE1 dataset is collected based on the cloud display function of Google Earth, with a resolution of 512 × 512. The RICE2 dataset is collected based on Landsat images, mainly using natural colour images and quality images.
Thin cloud removal can exploit prior knowledge obtained from typical mathematical models or from the spectral differences between bands [1][2][3][4][5]. Thick cloud is difficult to remove because it severely occludes the image and causes the total loss of ground information. The methods in [6][7][8][9][10] used generative adversarial networks (GANs), RNNs, inpainting methods, etc., to scan the pixels near the cloud shadow and then remove the cloud-contaminated area to generate a declouded image [11][12][13][14][15][16][17][18][19]. Although these methods are complementary, their effect is usually limited, especially for large-area thick cloud removal and complex scene reconstruction. How to remove thick cloud thus remains a valuable and meaningful problem.
There are two main challenges. First, it is difficult to remove thick clouds while preserving the details of the original image: the severe occlusion hides a great deal of ground knowledge, and the ground information beneath the clouds carries no valid signal. Second, existing networks lack the ability to model long-range dependencies, because in a CNN the number of operations required to relate two positions grows with the distance between them.
To address these problems, we propose a novel spatial attention context transformation network for image cloud removal named SACTNet. Specifically, it is divided into two basic networks: a backbone network used to remove clouds and a transformer network composed of texture content extractors and correlation embedding modules.
Generally speaking, it is very difficult to recover information occluded by thick cloud using the backbone network alone, mainly because the network adds all feature nodes into a holistic dependence and thus misses local key features. To capture the core spatial-domain features in remote sensing images with thick cloud, we propose a basic network based on the spatial attention mechanism. The attention mechanism performs a spatial transformation driven by the spatial-domain information in the image, so as to extract the key information. Second, existing deep learning cloud removal algorithms rarely consider long-range dependencies in the network, and learning such dependencies is a key challenge. A key factor affecting the ability to learn these dependencies is the path length that forward and backward signals must traverse in the network; shortening the path between any combination of positions in the input and output sequences makes learning long-distance dependencies easier. For this reason, we designed the transformer network, which contains contexture extractors and correlation embedding modules to strengthen the dependencies in the network; the result is shown in Figure 1. In general, the main contributions of this paper are as follows: (1) An innovative contexture transformer network for image cloud removal named SACTNet is designed to achieve cloud removal on two types of cloud-blocking datasets with thin and thick clouds. (2) SACTNet is the first to apply a transformer structure to remote sensing cloud removal. In particular, we design a contexture transformer network with three related modules. (3) The method in this paper achieves excellent results on the RICE dataset and outperforms existing state-of-the-art methods.

Methodology
The structure of SACTNet is shown in Figure 2. SACTNet consists of a contexture transformer and a backbone network. (1) The contexture transformer network is inspired by [20] and consists of a content extractor, a correlation embedding module, and a soft attention module. The content extractor obtains content features of the ground truth and the cloudy image. The correlation embedding module calculates the correlation between the ground truth and the input, so that the distance between the real image and the cloud-occluded image can be measured. Joint feature learning can discover deep features, which in turn provide accurate texture features.
(2) In the backbone network, feature information is first obtained by standard convolutions followed by bottleneck blocks. It then passes through spatial attention recurrent neural network (SARNN) modules, and the result is concatenated with the output of the transformer. The SARNN module improves the network's attention to cloud-specific regions. Its principle is to learn an attention mapping from the input features. The attention map is a two-dimensional matrix whose element values are continuous and represent how much attention should be assigned to each pixel. In the backbone, a GAN trains a generator and a discriminator: the generator learns to produce outputs that the discriminator cannot distinguish from the training samples, while the discriminator learns to distinguish real data from generated data; the two can be considered parallel networks in this article. The role of the transformer network is to provide more detailed information as guidance once the backbone is well trained.
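As a toy illustration of the two operations described above (NumPy, with illustrative shapes; `apply_attention` and `fuse_with_transformer` are hypothetical helper names, not code from the paper), attention-map gating and the backbone/transformer concatenation might look like:

```python
import numpy as np

def apply_attention(features, attention_map):
    """Gate (C, H, W) backbone features with a 2-D attention map whose
    continuous values in [0, 1] say how much attention each pixel gets."""
    return features * attention_map[None, :, :]  # broadcast over channels

def fuse_with_transformer(backbone_out, transformer_out):
    """Concatenate backbone and transformer-branch features channel-wise."""
    return np.concatenate([backbone_out, transformer_out], axis=0)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
att = rng.random((8, 8))                 # 2-D attention matrix
gated = apply_attention(feat, att)
fused = fuse_with_transformer(gated, rng.standard_normal((4, 8, 8)))
```

The key point is only that the attention map is a single 2-D matrix shared across channels, while fusion doubles the channel count before the final reconstruction layers.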

Contexture Transformer Network
(1) Content extractor: the content extractor of real images is critical in the task of image cloud removal. Declouded images can be produced with the aid of appropriate content texture details. As a learnable content extractor, this article uses the VGG19 pretrained classification model to extract semantic properties, with its parameters fine-tuned during end-to-end training. This enables the network to perform joint feature learning on real images and cloudy images and to capture more accurate texture features. In this paper, the PyTorch pretrained VGG19 model is used to extract image features and obtain image embeddings. Pretraining on natural scenes ourselves would be costly in time and computational resources, so we apply the pretrained model to our own task. Moreover, the VGG19 model pretrained on the ImageNet dataset generalizes very well and is suitable for images in many situations. Simply put, the function of VGG19 is only to extract the feature points of remote sensing images; it can be considered a tool. First, we choose the mean shift clustering algorithm as an iterative algorithm, taking the input as the starting point; then, we split the pretrained VGG19 model into three modules, each composed of layers of the pretrained model. Finally, the contexture extractor can be summarized as Q = CE(I_in), K = CE(I_ref), V = CE(I_ref), where CE(·) represents the output of the content extractor, I_in is the cloudy input, and I_ref is the real reference image. Q, K, and V indicate the three basic units of the attention structure in a transformer and will continue to be used in the next relevance embedding module; they are simply a way to construct the attention potential function.
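A minimal sketch of this step (NumPy; a single fixed convolution stands in for the pretrained VGG19 features, which in PyTorch would come from splitting `torchvision.models.vgg19` `features` into three stages):

```python
import numpy as np

def content_extractor(img, kernel):
    """Toy stand-in for the content extractor CE(.): one valid 2-D
    convolution with a fixed kernel over a single-channel image."""
    h, w = img.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Q comes from the cloudy input; K and V come from the reference (real)
# image, all through the same extractor so they share one feature space.
rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))
cloudy, reference = rng.random((8, 8)), rng.random((8, 8))
Q = content_extractor(cloudy, kernel)
K = content_extractor(reference, kernel)
V = content_extractor(reference, kernel)
```

Because K and V pass through the same extractor as Q, correlations between them are comparable, which is what the correlation embedding module relies on next.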
(2) Correlation embedding: correlation embedding embeds the correlation between the real image and the cloudy image by estimating the similarity between Q and K. The module unfolds both Q and K into patches, denoted q_i and k_j. The correlation R_{i,j} between the two is calculated by the normalized inner product: R_{i,j} = ⟨ q_i / ‖q_i‖, k_j / ‖k_j‖ ⟩. (3) Soft attention: this paper presents a soft attention module that uses real images to synthesize texture elements, improving the quality of the related textures throughout the synthesis. We compute the soft attention score at position i as the largest relevance over all key patches, S_i = max_j R_{i,j}. The soft attention map S multiplies the fused features element-wise and is added back to the backbone features to obtain the texture transformer's final output: H_out = H_1 + Conv(Concat(H_1, T)) ⊙ S, where H_out represents the output features, H_1 represents the output of the backbone network, T denotes the texture features transferred from V, and the ⊙ operator represents element-wise multiplication between feature maps.
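The correlation embedding and soft attention steps can be sketched as follows (NumPy; the 3×3 patch size and feature shapes are illustrative, and function names are ours):

```python
import numpy as np

def unfold(feat, k=3):
    """Extract k*k patches from a (C, H, W) feature map (zero padding),
    returning a (H*W, C*k*k) matrix of flattened patches."""
    c, h, w = feat.shape
    pad = k // 2
    fp = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty((h * w, c * k * k))
    idx = 0
    for i in range(h):
        for j in range(w):
            out[idx] = fp[:, i:i + k, j:j + k].ravel()
            idx += 1
    return out

def correlation_embedding(Q, K):
    """Normalized inner product R[i, j] between query patch q_i and key patch k_j."""
    q = unfold(Q)
    k = unfold(K)
    qn = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
    return qn @ kn.T          # (H*W, H*W) relevance matrix

def soft_attention(R):
    """Hard index (most relevant key per query) and soft score (its relevance)."""
    return R.argmax(axis=1), R.max(axis=1)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 4))
R = correlation_embedding(Q, Q)   # self-correlation as a sanity check
hard, soft = R.argmax(axis=1), R.max(axis=1)
```

With Q correlated against itself, every query patch is most relevant to itself with score 1, which is a quick way to sanity-check the normalization.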
Spatial Attention Mechanism and SARNN

SARNN divides the RNN into two rounds. The first round of the RNN spreads data throughout the whole picture for aggregation, resulting in more coherent semantic information. This module enforces semantic consistency and at the same time obtains the context of distant feature values. It gathers local background features in order to obtain globally perceived feature values, which are critical in cloud removal; the technique may also be used to track clouds. To create the global perceptual feature map, the second round of the RNN accumulates additional nonlocal background knowledge. Direction is the key information when looking for significant clues between shadow and nonshadow parts. This two-round, four-direction RNN architecture is used to detect shadow regions. The module calculates the feature H_{i,j} at pixel (i, j) by aggregating along the four directions of the recurrent neural network.
The spatial attention model is built on this two-round, four-direction recurrent neural network structure. The recurrent neural network performs descending projection in four key directions. In Figure 3, the spatial context information is obtained by another branch network to highlight the expected shadow feature information. SARNN can efficiently detect areas impacted by clouds in the input cloudy image. First, three standard residual blocks are used to obtain features; then, this feature information assists the following three attention residual blocks in learning negative residuals to eliminate shadows. Finally, multiple residual block units reestablish the generated feature map into a cloud-removed image.
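One round of the four-direction recurrent sweep can be sketched as below (NumPy, with a scalar recurrent weight and a single-channel map; the real module learns its weights and operates on multi-channel features, so this is only a shape-level illustration):

```python
import numpy as np

def directional_pass(feat, direction, w=0.5):
    """One recurrent sweep over a (H, W) map in a single direction:
    h[i, j] = max(0, x[i, j] + w * h_previous_along_the_direction)."""
    h = feat.copy()
    H, W = h.shape
    if direction == "left":            # information flows left -> right
        for j in range(1, W):
            h[:, j] = np.maximum(0.0, h[:, j] + w * h[:, j - 1])
    elif direction == "right":         # right -> left
        for j in range(W - 2, -1, -1):
            h[:, j] = np.maximum(0.0, h[:, j] + w * h[:, j + 1])
    elif direction == "up":            # top -> bottom
        for i in range(1, H):
            h[i] = np.maximum(0.0, h[i] + w * h[i - 1])
    elif direction == "down":          # bottom -> top
        for i in range(H - 2, -1, -1):
            h[i] = np.maximum(0.0, h[i] + w * h[i + 1])
    return h

def sarnn_round(feat):
    """One round of the four-direction RNN: sweep all four directions and
    sum, so each pixel aggregates context along its whole row and column."""
    return sum(directional_pass(feat, d) for d in ("left", "right", "up", "down"))

impulse = np.zeros((3, 3))
impulse[1, 1] = 1.0
out = sarnn_round(impulse)   # a single bright pixel spreads to its row/column
```

Running the sweep twice (a second round on the output) is what gives the module its nonlocal, global receptive field.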

Loss Function
In this section, the five terms composing the loss function are introduced: L_CGAN, L_1, L_perceptual, L_style, and L_attention. To start with, L_CGAN is defined as L_CGAN = E_{x,y}[log D(x, y)] + E_{x}[log(1 − D(x, G(x)))], where G is the generator, D is the discriminator, x is the cloudy input, and y is the ground truth. With the aim of measuring the accuracy of each reconstructed pixel, the standard L_1 loss is L_1 = (1 / (C · H · W)) Σ_C θ_C ‖δ(M_input) − M_output‖_1.
In this definition, C, H, and W represent the number of channels, the height of the image, and the width of the image; θ_C is the weight of each channel; δ denotes the value calculated by this model; M_input is the input value; and M_output is the output value.
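Under this definition, the channel-weighted L_1 can be sketched as (NumPy; the function name is ours, and `pred` stands for δ(M_input)):

```python
import numpy as np

def weighted_l1(pred, target, channel_weights):
    """Per-channel weighted L1 loss:
    (1 / (C*H*W)) * sum_c theta_c * ||pred_c - target_c||_1."""
    C, H, W = pred.shape
    diffs = np.abs(pred - target).reshape(C, -1).sum(axis=1)  # per-channel L1
    return float((channel_weights * diffs).sum() / (C * H * W))

# Constant unit error with channel weight 2 gives a loss of exactly 2.
pred = np.zeros((1, 2, 2))
target = np.ones((1, 2, 2))
loss = weighted_l1(pred, target, np.array([2.0]))
```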
Attention loss L attention is the third item of the whole loss, which can be obtained by the attention map module. The   (7).
Furthermore, to promote the performance of this novel network, we add the perceptual loss L_perceptual [21] and the style loss L_style [22], defined as L_perceptual = Σ_i ‖δ_i(I_out) − δ_i(I_gt)‖_1 and L_style = Σ_j ‖G_δ(δ_j(I_out)) − G_δ(δ_j(I_gt))‖_1.
In the former equation, δ_i is the feature map of the i-th layer of the pretrained model. In this research, δ_i is the feature map of the first layer of each ReLU block in the ImageNet-pretrained VGGNet-19 model. G_δ is the Gram matrix in the style loss, computed from the activation map δ_j. Finally, L_CGAN, L_1, L_perceptual, L_style, and L_attention are weighted and combined into the total loss: L_total = θ_adv L_adv + θ_dis L_dis + L_1 + θ_p L_perceptual + θ_s L_style + θ_att L_attention, where L_adv and L_dis are the generator and discriminator terms of L_CGAN.
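The Gram matrix underlying L_style can be sketched as follows (NumPy; the normalization by C·H·W is one common convention, and function names are ours):

```python
import numpy as np

def gram(feat):
    """Gram matrix of a (C, H, W) activation map: G = F F^T / (C*H*W),
    where F is the activation flattened to shape (C, H*W)."""
    c, h, w = feat.shape
    F = feat.reshape(c, -1)
    return F @ F.T / (c * h * w)

def style_loss(feat_out, feat_gt):
    """L1 distance between the Gram matrices of two activation maps."""
    return float(np.abs(gram(feat_out) - gram(feat_gt)).mean())
```

Because the Gram matrix discards spatial layout and keeps only channel co-activations, this term matches texture statistics rather than exact pixel positions.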
where θ_dis = 1, θ_adv = 1, θ_p = 0.1, θ_s = 250, and θ_att = 1.

Experiments

[23] provides an open source dataset of remote sensing images, including RICE1 and RICE2, which can be used for research on cloud removal from remote sensing images. In the model training stage, the main experimental parameter settings are shown in Table 1, and the hardware environment configuration is shown in Table 2.

Evaluation Metrics.
We used PSNR and SSIM to evaluate the results. Although cloud removal result images can intuitively show the effect of the model, quantitative analysis is better carried out with these indicators. PSNR and SSIM are defined as PSNR = 10 · log10(MAX² / MSE) and SSIM(x, y) = ((2 α_x α_y + c_1)(2 β_xy + c_2)) / ((α_x² + α_y² + c_1)(β_x + β_y + c_2)), where α_x and α_y are the means of x and y; β_x, β_y, and β_xy represent the variances and covariance of x and y, respectively; and c_1 and c_2 are constants used to maintain stability.
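These two metrics can be sketched directly from the definitions (NumPy; the single-window SSIM below is a simplification of the usual implementation, which averages the same expression over local Gaussian windows):

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """PSNR = 10 * log10(MAX^2 / MSE) between images x and y."""
    mse = np.mean((x.astype(float) - y.astype(float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0):
    """SSIM over a single global window, using the standard constants
    c1 = (0.01*MAX)^2 and c2 = (0.03*MAX)^2 for numerical stability."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()            # the means (alpha_x, alpha_y)
    vx, vy = x.var(), y.var()              # the variances (beta_x, beta_y)
    cxy = ((x - mx) * (y - my)).mean()     # the covariance (beta_xy)
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

For production use, the locally windowed `structural_similarity` from scikit-image is the usual reference implementation.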

Results and Comparison.
For quantitative analysis of cloud removal performance, SSIM, PSNR, and RMSE are used. Through these three metrics together with the generated images, the cloud removal effect can be evaluated.

Wireless Communications and Mobile Computing
We compare SACTNet with existing methods, including Cloud-GAN [24], CGAN [25], Cycle-GAN [26], SpA-GAN [27], RSCNet [28], GCANet [29], and MLA-GAN [30]. To make a fair comparison with existing work, we mainly evaluate on the RICE dataset; the results are shown in Table 3. SACTNet outperforms the existing methods, and the images it generates show better consistency and balance. In particular, Figure 4 gives an intuitive comparison of the images generated by the various networks; the image generated by our proposed network is closer to the ground truth than those of existing networks.

Ablation Study.
In order to further study the performance of the contexture transformer network, we conducted ablation experiments. The contexture transformer contains three parts: a learnable content extractor (CE), a correlation embedding module, and a soft attention module. The ablation results are shown in Table 4. Starting from the complete backbone network, we progressively add the content extractor and soft attention; for comparison, we use a fixed VGG network rather than the learnable CE network. After soft attention is added, PSNR effectively increases: important texture features are enhanced and less relevant texture features are weakened. PSNR improves to 33.251 after adding the content extractor, demonstrating that joint feature embedding in the content extractor is superior. We also visually assess the qualitative results on RICE. For complicated remote sensing images, the combination of these parts restores information quickly and naturally and achieves practical results.

Quantitative Study.
We follow the related research of [31] to set the parameters of the perceptual loss and style loss, so the focus is on L_att, L_adv, and L_dis and their corresponding parameters. To further study the influence of these parameters, this paper performs a quantitative analysis; the results are shown in Table 5. Through comparison, it can be seen that when θ_att, θ_adv, and θ_dis are set to 0.1, 1, and 1, respectively, the corresponding PSNR and SSIM results are the highest. This shows that detailed information such as rivers and mountains in the remote sensing image is more complete, and the cloud removal effect is optimal. At the same time, we conducted an ablation study on the three loss terms. After adding L_adv, L_dis, and L_att, the PSNR increases by 0.02, 0.05, and 0.027, respectively, and the SSIM increases by 0.01, 0.01, and 0.05, as shown in Figure 5. A single loss term alone does not greatly improve the effect, and some information is still missing.
Discussion and Future Work.
Although our method has achieved excellent results, the network can still be improved in subsequent research, for example by further improving cloud removal performance. By adjusting the structure of the soft attention module, more important texture features can be obtained, and the information of complex remote sensing images can be quickly restored.
At the same time, there are other low-level vision tasks similar to cloud occlusion in remote sensing images, such as dust occlusion, snow occlusion, and aerosol occlusion. Therefore, the generalization ability of the model can be continuously enhanced in the future. By using the same network to remove multiple kinds of cover in remote sensing images, practical performance can be improved and related costs can be reduced. The model proposed in this paper can be used not only for cloud removal from remote sensing images but also in many other scenarios, such as image highlight removal, artifact removal, deblurring, and reflection removal. For example, the model can detect and remove highlights in natural images, remove reflections in flickering images, and deblur regions in surveillance camera footage.

Conclusions
This paper proposes a spatial attention context transformation network for image cloud removal named SACTNet. We introduce the spatial attention mechanism into the generative adversarial network, which allows the encoder to learn prior information from real images and enhances its reasoning ability. In addition, we design a transformer network that contains content extractors and correlation embedding modules to strengthen the dependencies in the network; by shortening the path between any combination of positions in the input and output sequences, learning long-distance dependencies becomes easier. In the experimental part, a large number of comparative tests and ablation tests were completed. Quantitative and qualitative comparisons with excellent algorithms at the present stage prove the superiority of the proposed method for cloud removal tasks.
By combining the L_adv, L_dis, and L_att loss terms into a new loss function, our model achieves a good cloud removal effect. In quantitative comparison with existing research, our model achieves the best results on PSNR, SSIM, RMSE, and other indicators. Furthermore, our proposed SACTNet achieves excellent results and is superior to the existing state-of-the-art methods.

Data Availability
The RICE datasets that support the findings of this study are openly available in GitHub at https://github.com/BUPTLdy/ RICE_DATASET [14].