Remote Sensing Image Change Detection Network Based on Twin High-Resolution Representation



Introduction
Change detection on remote sensing images of the same geographical area acquired at different times is an important part of many practical applications such as land use analysis, vegetation change detection, ecosystem monitoring, and damage assessment [1]. The traditional way of analyzing changes in remotely sensed images manually is time-consuming and laborious, which makes the automation of this process an important and practically needed research area. Automatic change detection on time-series images is of great scientific and application value, and research in this field has been carried out in the remote sensing community for decades [2][3][4][5][6]. In recent years, artificial intelligence techniques represented by deep learning have developed rapidly and have been applied in various fields, such as computer vision [7], speech recognition [8], and information retrieval [9], especially computer vision. The authors of [10] proposed the fully convolutional neural network, which achieves end-to-end pixel classification of images by using convolutional layers instead of fully connected layers. Hughes et al. proposed a pseudo-twin convolutional neural network-based method for detecting changes between SAR images and optical images [11]. As research progressed, in 2017 Vaswani et al. published the landmark paper that ushered in the era of large models [12], in which they proposed the well-known transformer architecture. In 2018, the BERT model took the NLP community by storm, setting new SOTA records on 11 NLP tasks, and the transformer was responsible for this success. The transformer has since been used extensively in remote sensing, especially for semantic segmentation. Gori et al. [13] used RNNs to compress node information and learn graph node labels and first proposed the concept of the graph neural network (GNN). Graph neural networks are a deep learning-based method for operating in the graph domain. Later, the graph convolutional network (GCN) was proposed in [14], which formally applied CNNs to modeling graph-structured data. Although both transformers and graph neural networks are popular deep learning methods, relatively little research has applied them to dual-temporal remote sensing image change detection. Therefore, twin convolutional neural networks are still used in this paper.
At present, scholars at home and abroad have proposed various remote sensing image change detection methods based on deep learning. In terms of framework, these methods can be broadly divided into three categories. The first type extracts features first and then detects changes, i.e., the image is passed through a deep network for feature extraction, followed by change detection based on the features [15]. The second category performs preclassification followed by detection. That is, the images are first preclassified using traditional algorithms, the deep network is then trained with clearly changed and unchanged samples, and finally the uncertain samples are fed into the trained network to obtain the result map [16]. Although both types of methods are based on deep learning and their results are also better than those of traditional methods, they remain subject to human experience and prone to errors in the threshold judgment, clustering, and sample selection steps required during detection. The third category comprises methods based on fully convolutional networks. This is a completely end-to-end learning framework with no human interference in between, making the whole process more robust and efficient [6]. Depending on how the images are input, this type of method can be subdivided into single-input networks and dual-input networks.
Although the fully convolutional network-based approach achieves better change detection performance, there are still some shortcomings. The cascading pooling operations of encoders from high resolution to low resolution lead to a decrease in spatial resolution that is difficult for the decoder to recover. Direct use of a fully twin convolutional change detection network suffers from low detection completeness, frequent false detections, and missed detections. This is primarily caused by the limited feature extraction capability of the network and the ineffective use of contextual semantic information in the spatial and channel domains. Fully convolutional networks also struggle to obtain the edge information of images when extracting the deep features of dual-temporal images. To this end, this paper proposes a twin context aggregation network (TCANet) to address the above problems. First, to obtain a high-resolution representation, we use the HRNet network as our backbone. Second, to enhance the feature extraction capability of the network and to effectively utilize channel-domain contextual semantic information, we propose the context aggregation module (CAM). Finally, in the decoding part, we introduce the side output embedding module (SOEM) to obtain the edge information and small-target information of the image while suppressing useless information, further improving the accuracy of change detection. Remote sensing image change detection remains a relatively under-studied, cutting-edge research direction, and the three proposed modules together yield the best experimental results. The main contributions are as follows.
(1) We introduce a dual HRNet as the backbone of our twin network. This network maintains high resolution from beginning to end, and information interaction between different branches can compensate for the information loss caused by the reduced number of channels.
(2) In the encoding part, we propose the context aggregation module (CAM) to improve network feature extraction and to efficiently utilize channel-domain contextual semantic information. The module splits the output feature map into four 1/4-channel feature maps and then uses dilated convolutions with different dilation rates to integrate multichannel contextual information in parallel.
(3) In the decoding part, we introduce the side output embedding module (SOEM) to obtain the edge change information of the dual-temporal images as well as the finer image details and complex texture features of high-resolution remote sensing images.
(4) Our network achieves impressive results on all three datasets. More specifically, we obtain an 89.87% F1 score on the challenging DSIFN test set.
The rest of the paper is organized as follows: Section 2 describes the work related to change detection. Section 3 presents the proposed method. Section 4 describes the experimental datasets and evaluation metrics. Section 5 presents the experimental design and results. Finally, Section 6 discusses our work and conclusions.

Related Work
In recent years, many neural network techniques and components for scene segmentation have been applied to change detection tasks to extract deeper representations. U-Net [17] first pioneered the benchmark model, and the Siamese network [18][19][20][21][22][23][24] then became the standard approach for change detection. To improve change detection performance, a lot of work has been conducted on deep feature extraction and refinement. Change detection methods based on fully convolutional networks are roughly divided into two categories: early fusion approaches (single-input) and twin network approaches (dual-input). A single-input network concatenates the dual-temporal images into one image before feeding it into the network [25,26]. For example, in [26], dual-temporal image pairs are concatenated as input to an improved UNet++ network, and change maps at different semantic levels are merged to generate the final change map. In contrast to single-input networks, dual-input networks borrow from twin networks [18,22,27], where the front-end feature extraction part of the fully convolutional network is replaced by two networks with the same structure. For instance, [18] proposed three fully convolutional neural network frameworks for change detection in remote sensing images, one of which is single-input and the other two dual-input. The results of many change detection experiments demonstrate that a dual-input network architecture is more suitable for change detection. Many scholars have studied Siamese network-based change detection methods. Yu et al. proposed the NestNet [28] network, a model that introduces two parallel modules to extract the respective features of the diachronic images and then processes the features of the two images with absolute difference operations. The literature [22] proposes the IFN method, a dual-input, fully convolutional network-based method. In this method, deep features are extracted from the dual-temporal images by a twin network, and the down-sampled change maps are fed directly into the middle layers of the network during training; the network parameters are then updated by calculating the losses independently. Fang et al. proposed the SNUNet-CD network [29]. This network is a modification of UNet++; it differs from UNet++ in that it uses two twin convolutional branches to extract features from the two images and aggregates and refines features from multiple semantic layers by integrating a channel attention module, from which the final results are obtained.

Contextual Information Aggregation.
Since no pixel in an image is isolated, each pixel must be related to its surrounding pixels in some way. The interconnection of large numbers of pixels is what produces the various objects in an image, so the contextual features of an image are of great importance. Insufficient access to rich contextual information during the change detection task can affect detection results. Many approaches add modules on top of the encoding network to expand its effective receptive field and integrate more contextual information. In [30], global pooling operations are introduced to learn the scene-level global context, and the importance of the receptive field is discussed. PSPNet [31] extends the application of global pooling to image subregions and proposes a parallel spatial pooling design that aggregates multiscale contextual information. Dilated convolution is another design that can enlarge the receptive field of CNNs without significantly increasing the computational effort [32,33]. Combining atrous convolution with the multilevel pooling design of PSPNet, the atrous spatial pyramid pooling (ASPP) module was proposed in [34] and improved in [35][36][37]. The attention mechanism [38][39][40], which uses a sigmoid function to generate "attention" descriptors after global pooling operations, is another contextual aggregation design. To better serve the change detection of remote sensing images, this paper uses multiple parallel dilated convolutions to obtain global and local contextual information.

The Proposed Methodology
In this section, we describe in detail the proposed twin context aggregation network (TCANet) for remote sensing image change detection. First, the overall structure is outlined. After this, we illustrate the architecture of HRNet, our baseline network. Finally, the design of each module is presented, including the CAM module and the SOEM module.

Overview of the Network.
As shown in Figure 1, a twin context aggregation network (TCANet) is designed in this paper. The model feeds dual-temporal images into two networks with shared parameters to extract features separately. First, the dual-temporal images are input into the backbone network HRNet to obtain four change feature maps with different sizes and numbers of channels. This structure preserves spatial detail information but does not take full advantage of contextual information. Therefore, we integrate more contextual information to improve network performance by introducing a scale-dependent contextual aggregation module in each of the four branches. Since the four parallel outputs generated by HRNet are information-dispersed, we embed different levels of local contextual information from the context aggregation module (CAM) into these features to make the outputs informative. Then, the two feature values obtained from the feature encoding stage are differenced, and the absolute values are taken to obtain dual-temporal feature fusion information at different scales. Finally, we input the fused feature maps to the side output embedding module to facilitate the detection of edges and small targets. The detailed process of the network is as follows.
As shown in Figure 1, B1, B2, B3, and B4 are the four parallel branches generated by HRNet. CAM processing is performed first, then the outputs of the CAM modules are up-sampled (the output of the first CAM module is not up-sampled), and the associated feature maps are concatenated:

$$U_i = \big\{\mathrm{CAM}(B_i),\ \mathrm{up}[\mathrm{CAM}(B_{i+1})],\ \ldots,\ \mathrm{up}[\mathrm{CAM}(B_4)]\big\}, \quad i = 1, 2, 3, 4, \tag{1}$$

where CAM(•) denotes the output feature map after CAM processing, up[•] denotes the up-sampling operation, and {•} represents the tandem stitching (concatenation) operation. Immediately afterwards, U_1, U_2, U_3, and U_4 are channel-compressed using 1 × 1 convolutions so that the four output feature maps have the same number of channels. Then, a difference absolute value operation is performed with the output of the other encoder to fuse the features of the two images. That is,

$$F_i = \mathrm{sub}\big[\mathrm{conv}(U_i),\ \mathrm{conv}(U_i^{*})\big], \quad i = 1, 2, 3, 4, \tag{2}$$

where conv(•) represents the 1 × 1 convolution operation that achieves channel compression, U_i is the feature map extracted from the image at time T1 by the encoding structure, U_i^* is the feature map extracted from the image at time T2 by the encoding structure, and sub[•] represents fusing the features of the two images by taking the absolute value of their difference. Finally, the obtained F_1, F_2, F_3, and F_4 are input to the side output embedding module (SOEM) to obtain the final predicted map.
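The following is a minimal tf.keras sketch of this encoding-and-fusion step, assuming shared-weight encoders; the function name and the channel count `out_channels=64` are our own illustrative placeholders, not the authors' exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def difference_fusion(u_t1, u_t2, out_channels=64):
    """Fuse same-scale features from the two encoders (equation (2)).

    u_t1, u_t2: concatenated CAM features U_i and U_i* for the T1 and T2
    images. A single (shared) 1x1 convolution compresses both maps to the
    same channel count; the element-wise absolute difference then gives F_i.
    In a full model this layer would be built once and reused across scales.
    """
    compress = layers.Conv2D(out_channels, 1, padding="same")  # conv(.)
    return tf.abs(compress(u_t1) - compress(u_t2))             # sub[.]
```

Applying the same `compress` layer to both inputs mirrors the twin network's parameter sharing, so the two temporal branches are projected into an identical feature space before the difference is taken.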

Baseline: HRNet.
Most existing encoder methods perform cascading pooling operations (down-sampling) from high resolution to low resolution to obtain deep semantic features. But the cascading pooling operations cause a loss of spatial accuracy that is difficult for the decoder to recover. To overcome this limitation, HRNet [41,42] introduced a multiscale parallel design. It maintains a high-resolution output throughout and fuses multiscale information so that the network can extract as many image features as possible from the start.
As shown in Figure 2, HRNet consists of parallel subnetworks from high resolution to low resolution, with repeated information exchange across the multiresolution subnetworks (multiscale fusion). Specifically, the four parallel subnetworks are B1, B2, B3, and B4, where B1 always maintains a high-resolution representation. At the same time, the feature map after each convolution block is down-sampled by a 3 × 3 convolution with stride 2 to reduce its spatial size, and the feature maps after each convolution block are up-sampled to connect to different branches for multiscale fusion (the first convolution block of B1 does not need to be up-sampled). Finally, the network generates four sets of feature maps with different resolutions. For segmentation, they would first be up-sampled to the same size as branch B1 and then fused, and the fused feature maps could be used to generate segmentation results, which we do not need to generate here. The four branches of HRNet correspond to 1/4, 1/8, 1/16, and 1/32 of the original input size.
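To make the exchange concrete, here is a simplified tf.keras sketch of one such multiscale fusion step, assuming the branches are ordered from high to low resolution with factor-2 gaps; this is our own simplified reading of the fusion rule described above, not the authors' exact implementation:

```python
from tensorflow.keras import layers

def exchange_unit(feats):
    """One HRNet-style multiscale fusion step (simplified sketch).

    feats: list of feature maps ordered high -> low resolution, where
    branch j has spatial size 1/2**j of branch 0. Each output branch sums
    contributions from every input branch: higher-resolution inputs are
    down-sampled with stride-2 3x3 convolutions, lower-resolution inputs
    are channel-adjusted with 1x1 convolutions and bilinearly up-sampled.
    """
    outputs = []
    for i, target in enumerate(feats):
        c = target.shape[-1]
        contribs = [target]
        for j, src in enumerate(feats):
            if j < i:    # higher resolution: repeated stride-2 3x3 convs
                x = src
                for _ in range(i - j):
                    x = layers.Conv2D(c, 3, strides=2, padding="same")(x)
                contribs.append(x)
            elif j > i:  # lower resolution: 1x1 conv, bilinear up-sampling
                x = layers.Conv2D(c, 1, padding="same")(src)
                x = layers.UpSampling2D(2 ** (j - i),
                                        interpolation="bilinear")(x)
                contribs.append(x)
        outputs.append(layers.ReLU()(layers.Add()(contribs)))
    return outputs
```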

Contextual Aggregation Module (CAM).
Contextual information is essential for distinguishing the two categories of objects, changed and unchanged, and many approaches add modules to the top of the encoding network to expand their effective receptive fields and integrate more contextual information. Atrous (dilated) convolution is one design that can enlarge the receptive fields of a convolutional neural network without significantly increasing the computational effort [32,33]; since the dilated convolution structure is simple and easy to understand, it is used directly or indirectly by most papers. Therefore, this paper proposes a contextual aggregation module that not only obtains global information but also provides more detailed local information, and it can be used together with the other two modules proposed in this paper for better performance.
The detailed design of the CAM is shown in Figure 3. Given an input feature map $T \in \mathbb{R}^{C \times H \times W}$, the number of channels of T is first reduced to C/4 by a 1 × 1 convolution. After that, four parallel dilated convolutions with dilation rates of [1, 2, 4, 8] are used to integrate more contextual information. This method increases the receptive field of the convolution kernel to capture a larger range of information while keeping the number of parameters constant, and it ensures that the size of the output feature map remains unchanged. Finally, the convolved feature maps are concatenated with T to obtain $X \in \mathbb{R}^{2C \times H \times W}$, and a 1 × 1 convolution then compresses X back to $C \times H \times W$. The receptive field is calculated by the following formula:

$$k' = (r - 1) \times (k + 1) + k, \tag{3}$$

where r denotes the dilation rate, and k is the original convolutional kernel size. In this paper, the parallel dilated convolutions sample features with different dilation rates to obtain feature maps with different receptive fields, and these features are then fused across channels to gather information from different channels.

If the size of the input image is 512 × 512, the output sizes of the four branches of HRNet are 128 × 128 × 64, 64 × 64 × 128, 32 × 32 × 256, and 16 × 16 × 512. Based on experience, we set the dilation rates of the dilated convolutions to 1, 2, 4, and 8 (a dilated convolution with rate = 1 is equivalent to an ordinary convolution); by equation (3), the corresponding receptive fields are 3 × 3, 7 × 7, 15 × 15, and 31 × 31. Since the output feature maps of the last two branches of HRNet are 32 × 32 and 16 × 16, respectively, these receptive fields basically cover their main areas and achieve global awareness. Although the four CAM modules share the same design, the four branches of HRNet have different output scales, so multiscale information is also obtained. A feature map that incorporates these different receptive fields further improves the performance of the network.
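A minimal tf.keras sketch of the CAM as described above; the ReLU activations are our own assumption, while the channel arithmetic follows the text:

```python
from tensorflow.keras import layers

def context_aggregation_module(t, rates=(1, 2, 4, 8)):
    """Sketch of the CAM for an input feature map t of shape (H, W, C).

    A 1x1 convolution reduces t to C/4 channels; four parallel 3x3
    dilated convolutions (rates 1, 2, 4, 8) then gather context over
    receptive fields of 3x3, 7x7, 15x15, and 31x31. Their outputs are
    concatenated with t (2C channels in total) and compressed back to
    C channels by a final 1x1 convolution.
    """
    c = t.shape[-1]
    reduced = layers.Conv2D(c // 4, 1, padding="same", activation="relu")(t)
    branches = [
        layers.Conv2D(c // 4, 3, padding="same", dilation_rate=r,
                      activation="relu")(reduced)
        for r in rates
    ]
    x = layers.Concatenate()([t] + branches)       # 2C channels
    return layers.Conv2D(c, 1, padding="same")(x)  # compress to C channels
```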

Side Output Embedding Module (SOEM).
The side output embedding module (SOEM) proposed in this paper combines a feature pyramid network (FPN) with intermediate supervision. Feature pyramid networks can fuse shallow and deep features and improve the accuracy of small-target and edge detection. Intermediate supervision allows the shallow layers to be trained more fully, avoiding vanishing gradients and slow convergence. This module therefore lets us improve the network in both speed and accuracy.

As shown in Figure 4, we use the four feature maps (F_1, F_2, F_3, F_4) of different sizes obtained after dual-temporal feature fusion to replace the top-down part of the pyramid network. First, each 1 × 1 convolved feature map is summed with the up-sampled feature map of the next level to obtain a feature map with both coarse-grained and fine-grained features. The obtained small-size feature maps are then compressed to 2 channels by 1 × 1 convolution and up-sampled to a size of 128 × 128 (the large-size feature map is not up-sampled). Although the four groups of feature maps obtained in this way are of the same size, their semantic levels differ, and their spatial location representations also differ. Finally, these four feature maps are concatenated, compressed to 2 channels using a 1 × 1 convolution, and up-sampled to a size of 512 × 512. The loss between the ground truth and the obtained feature map then supervises the generation of the final prediction map. The process of obtaining the prediction map is as follows:

$$P = \mathrm{sup}\big(\big\{\mathrm{conv}(D_1),\ \mathrm{up}[\mathrm{conv}(D_2)],\ \mathrm{up}[\mathrm{conv}(D_3)],\ \mathrm{up}[\mathrm{conv}(D_4)]\big\}\big), \quad D_n = \mathrm{sum}\big\{\mathrm{conv}(F_n),\ \mathrm{up}[D_{n+1}]\big\},\ D_4 = \mathrm{conv}(F_4), \tag{4}$$

where D_n denotes the feature map obtained by the feature pyramid network, sum{•} denotes the summation of the horizontal and vertical results, up[•] denotes the up-sampling operation, conv(•) denotes the 1 × 1 convolution operation, and sup(•) denotes compressing the feature map to 2 channels and up-sampling it to a size of 512 × 512; the final prediction map is generated under loss supervision between the ground truth and the obtained feature map.

Loss Function.
Change detection suffers from a significant imbalance between the changed and unchanged sample categories, and the changing targets show diverse scale characteristics while occupying a small fraction of the background. We therefore use a loss function that combines balanced binary cross-entropy and dice coefficient loss, which is effective for sample balancing; the loss function L [43] is a weighted sum of the two:

$$L = \lambda L_{bce} + (1 - \lambda) L_{dice}, \tag{5}$$

where $L_{bce}$ is the balanced binary cross-entropy loss, $L_{dice}$ is the dice coefficient loss, and λ is the weighting factor, which takes the value 0.5. In the standard forms of these two terms,

$$L_{bce} = -\sum_{i} \big[ \eta\, y_i \log \delta(i) + (1 - \eta)(1 - y_i) \log(1 - \delta(i)) \big], \qquad L_{dice} = 1 - \frac{2 \sum_{i} y_i\, \delta(i)}{\sum_{i} y_i + \sum_{i} \delta(i)},$$

where η = Y/(X + Y) and 1 − η = X/(X + Y); X and Y represent the numbers of changed and unchanged pixels in the ground truth label images, respectively; δ(i) is the sigmoid output at pixel i; and y_i is the ground truth label at pixel i.
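A minimal sketch of this combined loss in the paper's Keras/TensorFlow setting; the function name and the batch-level estimate of η are our own illustrative assumptions:

```python
import tensorflow as tf

def combined_loss(y_true, y_pred, lam=0.5, eps=1e-7):
    """Sketch of L = lam * L_bce + (1 - lam) * L_dice (equation (5)).

    y_true: binary change labels; y_pred: sigmoid outputs in [0, 1].
    eta is estimated from the batch as the fraction of unchanged pixels,
    so the rarer changed pixels receive the larger weight.
    """
    y_true = tf.cast(y_true, tf.float32)
    n_changed = tf.reduce_sum(y_true)
    n_total = tf.cast(tf.size(y_true), tf.float32)
    eta = (n_total - n_changed) / (n_total + eps)

    # Balanced binary cross-entropy.
    bce = -(eta * y_true * tf.math.log(y_pred + eps)
            + (1.0 - eta) * (1.0 - y_true) * tf.math.log(1.0 - y_pred + eps))
    l_bce = tf.reduce_mean(bce)

    # Dice coefficient loss.
    intersection = tf.reduce_sum(y_true * y_pred)
    l_dice = 1.0 - (2.0 * intersection + eps) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)

    return lam * l_bce + (1.0 - lam) * l_dice
```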

Experimental Dataset and Evaluation
To evaluate the effectiveness of the method, we conducted comprehensive experiments on three datasets: CDD, DSIFN, and SYSU-CD. Precision (P), recall (R), F1 score (F1), and overall accuracy (OA) were used as evaluation metrics.

The CDD Dataset.
The CDD [44] dataset with real seasonal variation was used as the first experimental dataset.

The SYSU-CD Dataset.
The SYSU-CD dataset contains image pairs acquired between 2007 and 2014. The main types of changes in the SYSU-CD dataset include (a) new urban buildings; (b) suburban sprawl; (c) preconstruction foundation works; (d) changes in vegetation; (e) road expansion; and (f) marine construction. The 20,000 pairs of images are divided into a training set, a validation set, and a test set in the ratio of 3 : 1 : 1. There are 12,000 pairs of images in the training data set, 4,000 pairs of images in the validation data set, and 4,000 pairs of images in the test data set.

Evaluation Metrics.
Remote sensing image change detection usually uses precision (P), recall (R), F1 score (F1), and overall accuracy (OA) as evaluation indexes, as shown in equations (6) to (9). The F1 score is the harmonic mean of precision and recall, and the higher the F1 score, the more robust the model. In the CD task, a large value of P denotes a small number of false alarms, and a large value of R represents a small number of missed detections. Meanwhile, F1 and OA reveal the overall performance, where larger values indicate better performance. The four evaluation metrics are defined as follows:

$$P = \frac{TP}{TP + FP}, \tag{6}$$

$$R = \frac{TP}{TP + FN}, \tag{7}$$

$$F1 = \frac{2 \times P \times R}{P + R}, \tag{8}$$

$$OA = \frac{TP + TN}{TP + TN + FP + FN}, \tag{9}$$

where P and N represent the positive and negative judgments of the model, and T and F indicate whether those judgments are correct: TP refers to true positive cases, TN refers to true negative cases, FP refers to false positive cases, and FN refers to false negative cases.
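For reference, these four metrics can be computed from binary prediction and label maps as follows (a small illustrative sketch; guards against empty classes are omitted):

```python
import numpy as np

def change_detection_metrics(pred, label):
    """Compute P, R, F1, and OA (equations (6)-(9)) from binary maps.

    pred and label are numpy arrays of 0s and 1s, where 1 marks a
    changed pixel.
    """
    tp = np.sum((pred == 1) & (label == 1))  # true positives
    tn = np.sum((pred == 0) & (label == 0))  # true negatives
    fp = np.sum((pred == 1) & (label == 0))  # false positives
    fn = np.sum((pred == 0) & (label == 1))  # false negatives

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, oa
```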

Experimental Design and Results
Our network is implemented with the Keras framework using TensorFlow as the backend. The lab is equipped with a dedicated server for training the network, on which we use mini-batch gradient descent with a batch size of 4. We chose the Adam optimizer to optimize the network, with the initial learning rate for each dataset set to 0.001, and all experiments were trained for 500 epochs.
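Put together, the reported settings correspond to a training configuration along these lines; `build_tcanet`, `t1_images`, `t2_images`, and `labels` are hypothetical placeholders, and `combined_loss` is the loss sketch given above:

```python
import tensorflow as tf

# Hypothetical training setup matching the reported settings: Adam with
# an initial learning rate of 0.001, a batch size of 4, and 500 epochs.
# build_tcanet, t1_images, t2_images, and labels stand in for the model
# constructor and the dual-temporal training data.
model = build_tcanet(input_shape=(512, 512, 3))
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=combined_loss)
model.fit([t1_images, t2_images], labels, batch_size=4, epochs=500)
```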

Intermodule Ablation Experiments.
In this section, we conduct intermodule ablation experiments on the CDD, SYSU-CD, and DSIFN datasets. Table 1 shows the quantitative analysis of the different modules on the CDD and DSIFN datasets. Table 2 shows the quantitative analysis of the different modules on the SYSU-CD dataset. Figures 5-7 show the qualitative analysis on the three datasets.

Ablation Study of the Baseline Network.
We conduct experiments with vgg_16 and HRNet as the backbone networks, respectively. The experimental results in Tables 1 and 2 demonstrate better performance when HRNet is used as the backbone network. Therefore, our benchmark network first uses two twin high-resolution networks (HRNet) as the feature encoding module for feature extraction.

The feature maps of different sizes output by the feature encoding module are first channel-normalized. Then, an up-sampling operation and a tandem splicing (concatenation) operation are performed to obtain feature information of the same dimensionality. Next, the feature values of the dual-temporal images are differenced, and the absolute values are taken to obtain dual-temporal feature fusion information at different scales. In the feature decoding module, the feature maps of different sizes obtained from the difference fusion are subjected to different levels of up-sampling and fused for output. We quantitatively evaluated the performance of the benchmark network, as shown in the second row of Tables 1 and 2. The fourth and fifth columns in Figures 5-7 show that the results with HRNet as the backbone network are better than those with vgg_16 as the backbone network; with HRNet as the backbone, the visualization captures the general outline of the changed regions.

Ablation Studies of CAM.
We have designed the contextual aggregation module (CAM). This module can enlarge the receptive field of the convolutional neural network without increasing the computational effort, obtaining not only global but also (for different channels) detailed local contextual information. As can be seen from Table 1, adding the CAM module to the baseline network results in a significant improvement in all metrics. On the CDD dataset, the precision (P), recall (R), F1 score, and overall accuracy (OA) were 95.55%, 86.59%, 90.85%, and 96.73%, respectively, after adding this module to the baseline network. Compared to the baseline network, P, R, F1, and OA improved by 0.71%, 0.82%, 0.78%, and 0.70%, respectively. On the DSIFN dataset, the precision (P), recall (R), F1 score, and overall accuracy (OA) were 90.02%, 84.43%, 87.14%, and 94.84%, respectively, after adding the module. Compared to the baseline network, P, R, F1, and OA improved by 3.59%, 0.88%, 2.18%, and 0.48%, respectively. On the SYSU-CD dataset, the precision (P), recall (R), F1 score, and overall accuracy (OA) were 88.07%, 81.97%, 87.77%, and 92.25%, respectively, after adding this module to the baseline network. Compared to the baseline network, P, R, F1, and OA improved by 1.23%, 2.09%, 1.53%, and 2.51%, respectively. From the sixth column of Figures 5-7, it is clear that the CAM module improves the boundaries compared to the baseline network, and some small targets can be seen in the sixth column of Figures 5 and 6: their outlines are fully revealed, but some detailed features have still not fully emerged and need to be extracted further. This indicates that parallel atrous convolution over multiple channels ensures maximal information extraction.

Ablation Studies of SOEM.
We also studied the contribution of the SOEM module to the network. The side output embedding module (SOEM) fuses shallow and deep features and improves the accuracy of small-target and edge detection. It also enables the shallow layers to be trained more fully, avoiding vanishing gradients and slow convergence. As can be seen in Table 1, significant gains in precision (P), recall (R), F1 score, and overall accuracy (OA) were achieved with the addition of the SOEM module compared to the first two ablation experiments. On the CDD dataset, the addition of this module improves P, R, F1, and OA by 1.86%, 2.21%, 2.06%, and 1.83%, respectively, compared to the baseline network. Compared to baseline + CAM, P, R, F1, and OA improved by 1.15%, 1.39%, 1.28%, and 1.15%, respectively. On the DSIFN dataset, P, R, F1, and OA improved by 4.97%, 5.38%, 4.91%, and 1.01%, respectively, compared to the baseline network after adding this module. Compared to baseline + CAM, P, R, F1, and OA were improved by 1.02%, 4.50%, 2.73%, and 0.53%, respectively. On the SYSU-CD dataset, the addition of this module improves P, R, F1, and OA by 4.36%, 4.05%, 2.47%, and 5.15%, respectively, compared to the baseline network. Compared to baseline + CAM, P, R, F1, and OA improved by 3.13%, 1.96%, 0.94%, and 2.64%, respectively. As can be seen in the seventh column of Figures 5-7, the addition of this module not only improves the detection performance in general but also makes the detected edges more complete. The seventh column of Figures 5 and 6 also shows that some small targets are displayed more accurately. At the same time, the detailed features extracted by this module make the predicted result maps closer to the real labels. Therefore, using the three modules together lets the network perform at its best.

Comparison with Other Methods.
We compare our method with five change detection networks: the fully convolutional network with pyramid pooling (FCN-PP) [45], fully convolutional Siamese-concatenation (FC-siam-conc) [46], fully convolutional Siamese-difference (FC-siam-diff) [46], Unet++_MSOF [26], and IFN [22]. Table 3 shows the experimental comparison of our method with the other five methods on the CDD and DSIFN datasets. Table 4 shows the experimental comparison on the SYSU-CD dataset. As shown in Figure 8, the performance of the different methods on the CDD, DSIFN, and SYSU-CD datasets is quantitatively analyzed in line-graph form, and Figures 9-11 show the visualizations of the different methods on the three datasets. As Tables 3 and 4 and Figure 8 show, our network has the highest performance metrics on all three datasets. On the CDD dataset, the precision, recall, F1 score, and OA of TCANet were 96.70%, 87.98%, 92.13%, and 97.88%, respectively, exceeding IFN by 1.00%, 0.18%, 0.55%, and 0.17%, respectively. On the DSIFN dataset, these four metrics for TCANet were 91.04%, 88.93%, 89.87%, and 95.37%, respectively, exceeding IFN by 2.19%, 3.73%, 3.18%, and 6.51%, respectively. On the SYSU-CD dataset, the precision, recall, F1 score, and OA of TCANet were 91.20%, 83.93%, 87.02%, and 94.89%, respectively, exceeding IFN by 3.36%, 0.60%, 1.37%, and 3.78%, respectively. In Figures 9-11, the red boxes mark the improved areas. Comparing the visualization results of the experiments, it can be seen that the prediction maps of our proposed method are closer to the real labels, demonstrating the effectiveness of the proposed method.

Conclusion
In this paper, a twin context aggregation network (TCANet) is investigated. This study extracts features separately by feeding dual-temporal images into two networks with shared parameters. In the feature extraction stage, the limitations of the traditional "encoder-decoder" structure are considered: we introduce the parallel multiscale branching HRNet to reduce the loss of spatial information. In addition, we designed a separate contextual aggregation module (CAM) for each branch, expanding its effective receptive field and integrating more contextual information. Then, the two feature values obtained from the feature encoding stage are differenced, and the absolute values are taken to obtain dual-temporal feature fusion information at different scales. Finally, we input the fused feature maps to the side output embedding module to facilitate the detection of edges and small targets. Our proposed architecture substantially improves on existing architectures and achieves better results on three remote sensing image datasets (CDD, DSIFN, and SYSU-CD). One limitation of the method is that, to avoid intensive computation, HRNet reduces the spatial size of the input data in its early layers.
In the future, we plan to improve the training speed and accuracy of change detection by reducing the computation and by trying to merge more parallel branches of HRNet. Since the transformer has been widely used in the remote sensing field and many scholars have migrated it to semantic segmentation, we also plan to introduce the transformer to change detection in the next step.

Figure 5: Visualization of the three modular ablation experiments on the CDD dataset. Improved areas are marked with red boxes.

Figure 6: Visualization of the three modular ablation experiments on the DSIFN dataset. Improved areas are marked with red boxes.

Figure 7: Visualization of the three modular ablation experiments on the SYSU-CD dataset. Improved areas are marked with red boxes.

Figure 8: A shows the comparison of different networks on the CDD test set, and B shows the comparison of different networks on the DSIFN test set.

Table 1: Results of the ablation experiments of the modules on the CDD dataset and DSIFN dataset.

Table 2: Results of the ablation experiments of the modules on the SYSU-CD dataset. The best ones are marked in bold.

Table 3: Results of the quantitative evaluation of the different methods on the CDD and DSIFN datasets.

Table 4: Results of the quantitative evaluation of the different methods on the SYSU-CD dataset. The best ones are marked in bold.