Semantic Segmentation of Remote Sensing Image Based on Convolutional Neural Network and Mask Generation

High-resolution remote sensing images usually contain complex semantic information and easily confused targets, so their semantic segmentation is an important and challenging task. To resolve the inadequate utilization of multilayer features by existing methods, a semantic segmentation method for remote sensing images based on a convolutional neural network and mask generation is proposed. In this method, the bounding box is used as the initial foreground segmentation contour, and the edge information of the foreground object is obtained from the multilayer features of the convolutional neural network. To obtain a rough object segmentation mask, the general shape and position of the foreground object are estimated from the high-level features during layer-by-layer iteration. Then, based on the obtained rough mask, the mask is updated layer by layer using the neural network features to obtain a more accurate mask. To address the difficulty of training deep neural networks and the problem of degradation after convergence, a framework based on residual learning is adopted, which simplifies the training of very deep networks and improves the accuracy of the network. For comparison with other advanced algorithms, the proposed algorithm was tested on the Potsdam and Vaihingen datasets. Experimental results show that, compared with other algorithms, the algorithm in this article effectively improves the overall precision of semantic segmentation of high-resolution remote sensing images and shortens the overall training time and segmentation time.


Introduction
The task of high-resolution remote sensing image semantic segmentation is to assign a semantic label to each pixel in the remote sensing image. In recent years, with the rapid development of remote sensing mapping technology, it has become easy to obtain ultrahigh-resolution optical remote sensing images with a ground sampling distance (GSD) of 5-10 cm [1]. The urban high-resolution remote sensing image is mainly composed of artificial buildings, supplemented by some natural vegetation land. Artificial buildings mainly include houses, airports, roads, and bridges. If the housing target is accurately segmented, urban land use indicators such as residential density can be quickly obtained, which provides a basis for further urban planning. Therefore, accurately understanding the context semantics of the image and labeling its pixels has become a research hotspot in the field of remote sensing image segmentation.
High-resolution remote sensing images contain rich semantic information that most traditional methods cannot represent effectively, so their segmentation results are not ideal. In the early days, the segmentation of man-made objects and buildings mainly relied on vectorization model extraction technology [2], such as region-based segmentation, line analysis, and shadow analysis. A large number of studies rely on manually designed features, and the segmentation is realized by a supervised classifier. Manually designed features often generalize poorly when they represent high-level semantic information. Deep learning technology can not only extract features automatically [3] but also fully mine the high-level semantic information in the image.
Deep learning technology has achieved great success in the field of computer vision, such as image classification [4], object detection [5], and semantic segmentation [6]. The deep convolutional neural network receives the original image data as input, learns in an end-to-end manner, and produces the final segmentation result according to the specific task. In the field of remote sensing image interpretation, deep convolutional neural networks have been widely studied. The semantic information contained in urban remote sensing images is complex [7], with various artificial targets and many small targets [8]. Easily confused objects are often adjacent or interleaved in spatial distribution, which makes segmentation difficult. The fully convolutional network (FCN) was the first to be applied to remote sensing image segmentation tasks [9]. It accepts input images of any size for training and testing and avoids the repeated storage and computation caused by the use of pixel blocks. The fully convolutional approach is more efficient than traditional convolutional neural networks (CNNs) with fully connected layers, but its segmentation results are not fine enough, and the details of the image are not complete enough. On this basis, "hourglass-shaped" networks, including DeconvNet [10], SegNet [11], U-Net [12], and DeepUNet [13], as well as other methods, have been proposed and applied to remote sensing image segmentation. These networks have made different adjustments in their decoder structure and achieved higher segmentation accuracy.
Based on atrous (hole) convolution, DeepLab [14] proposed a spatial pyramid module, which applies atrous convolutions with multiple sampling rates, convolutions with multiple receptive fields, or pooling operations to the input feature map to explore multiscale context information. DeepLabV3+ [15] not only uses separable convolution but also introduces the encoder-decoder structure commonly used in semantic segmentation to fuse multiscale information, gradually recovering spatial information to capture clear target boundaries and refine the segmentation results. This network framework performs well in the field of image semantic segmentation.
Many methods have been tried to classify high-resolution remote sensing images. Literature [16] used a support vector machine (SVM) to classify remote sensing image objects. Literature [17] used an unsupervised clustering algorithm to segment houses in remote sensing images. Literature [18] improved the traditional edge detection method so that small objects in remote sensing images could also be segmented. Remote sensing images contain rich spectral information, so traditional feature extraction methods cannot achieve good segmentation results. From the perspective of pattern recognition, the selection of typical features is the bottleneck in improving recognition accuracy [19]. It is impossible to accurately classify all types of ground objects using only one specific set of features. Therefore, automatically learning and classifying features from the corresponding dataset can improve target classification accuracy more effectively than manually designed features [20].
In order to improve the semantic segmentation accuracy of high-resolution remote sensing images, a semantic segmentation method based on CNN and mask generation is proposed in this article. Firstly, an edge extraction method based on an iterative Gaussian mixture model (GMM) is used to fuse CNN multilayer features iteratively, with the feature map of one layer used as the feature input in each round of iteration. Then, the manually marked bounding box is used as the initial value of the foreground target contour, and the segmentation mask is modified step by step using the GMM model. Experiments show that the proposed algorithm has better semantic segmentation performance on the CCF dataset.
The rest of the article is organized as follows. In Section 2, the related work is introduced. In Section 3, the method proposed by this article is introduced. Section 4 describes the network training. Section 5 gives the experimental results. Finally, Section 6 draws the conclusion of this article.

Related Work
Image semantic segmentation segments the input image into multiple semantic regions, that is, it assigns a semantic category to each pixel in the image. In recent years, many studies on image semantic segmentation based on deep learning have been conducted at home and abroad. At present, the mainstream methods are based on the fully convolutional network [6], with improvements aimed at feature encoding and decoding [21], expanding the receptive field of the convolution operation [22], and multiscale feature fusion.
Deep learning requires a large number of training samples, and image semantic segmentation requires every pixel in the training samples to be labeled. On the one hand, pixel-level labeling is difficult. On the other hand, a large number of sample labels means a high cost of manual labeling. Therefore, image semantic segmentation based on weak supervision has become a research hotspot. The annotation methods used in weakly supervised image semantic segmentation include bounding box annotation, point annotation, and image-level annotation. Among the many weakly supervised labeling methods, bounding box labeling takes little time and yields high segmentation accuracy. Using the method in [23], the segmentation accuracy of the weakly supervised training result based on bounding box annotation can reach 88.2% of that of the training result based on pixel annotation. According to that research, pixel-level annotation takes 15 times as long as bounding box annotation.
Therefore, under the same annotation workload, more training samples can be obtained by using bounding box annotation, and a segmentation model with better generalization and robustness can be obtained. The theory of deep learning has received extensive attention since it was put forward. The basic motivation of deep learning is to build a deep neural network that simulates the learning and analysis mechanism of the human brain. Compared with traditional machine learning algorithms, deep learning emphasizes the automatic learning of features from huge amounts of data through multiple layers of neurons. Typical deep learning structures include the recurrent neural network (RNN), deep belief network (DBN), and CNN [24]. CNN has achieved remarkable results in computer vision tasks, such as image classification and target recognition, and has achieved excellent results in competitions on authoritative datasets such as ImageNet and PASCAL VOC. Literature [25] designed a 7-layer CNN model (named AlexNet) and won the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition. Many scholars have conducted semantic analysis research on remote sensing images based on CNN. Literature [26] proposed a 5-layer network structure to complete the target classification of remote sensing images. Hu et al. [27] used a pretrained CNN model to classify different remote sensing image scenes. Mnih [28] proposed a CNN-based large-scale context feature learning structure for aerial images, but the effect still needs to be improved. The number of pixels in high-resolution images is huge, so pixel-by-pixel classification is very difficult, and the current pixel-level target classification accuracy is not ideal.
The key problem of weakly supervised image semantic segmentation is how to use weak annotation to supervise the training of semantic segmentation networks. For bounding box annotation, the existing research methods can be divided into two categories: methods based on response region extraction and methods based on pseudolabels.
The methods based on response region extraction usually train the FCN network directly by designing specific regularization terms. Literature [29] imposed a similarity constraint on the categories of adjacent pixels in the output of the FCN and added this constraint to the loss function as a regularization term. Literature [30] used a target loss function based on class activation mapping (CAM) and added a regional similarity regularization term based on global weighted pooling and a segmentation boundary regularization term based on the conditional random field (CRF).
Since regularization terms are usually calculated from the prediction results of image segmentation, they rely more on high-level semantic features. The high-level features have a large receptive field and are not sensitive enough to the edges of the target object. Therefore, the accuracy of semantic segmentation models trained by such methods is relatively low. The cut and paste method generates "fake" samples by segmenting high-response regions and uses generative adversarial networks to improve the segmentation accuracy of high-response regions [31]. The pseudolabel-based methods generate pseudolabel masks by using multiscale combinatorial grouping, CRF, GrabCut, and other methods and use the pseudolabel masks in training as a substitute for pixel-level labels [32]. In the bounding box supervision method, multiscale combinatorial grouping is adopted to generate candidate pseudolabel areas.
The IoU between the pseudolabel areas and the areas labeled by the bounding box is used as a weight parameter, and all pseudolabels are used for weighted training [33]. According to that research, multiple iterations of training can gradually reduce the noise of the pseudolabels, thus improving the segmentation accuracy. The weakly semisupervised method adopted the EM algorithm as the iterative training mode and the CRF algorithm to generate pseudolabels. The mask sort method uses GrabCut to generate pseudolabels and takes the output of the FCN network as a unary constraint term in the GrabCut objective function. The existing pseudolabel generation methods use probabilistic graphical models, such as CRF and GrabCut. Most of them only use the underlying image features as binary constraints. Rajchl [23] used only the colors of the three channels in the RGB image, while Khoreva [34] used only the edge features. The underlying image features do not contain semantic information. If the input image includes foreground objects with more complex colors, such methods are prone to produce high-noise pseudolabels. In order to compensate for the lack of semantic information, most of these methods use high-level features as the input of the unary constraints. When high-level features and low-level features enter the objective function as separate parts, they are handled independently of each other during training, and the multiscale image features in the FCN network are not fully utilized. Although the influence of pseudolabel noise can be reduced through multiple iterations during training, this also multiplies the training time.

Algorithm Implementation
The framework of the weakly supervised training method is shown in Figure 1. For each training sample, the training process is divided into three steps. The first step is forward propagation: for the input training sample image, the residual network module is used for multilevel feature extraction. The second step is mask generation: multilayer image features are extracted from the residual network module and used to generate the mask. The third step is backpropagation training: the mask is used as the supervision information to complete the backpropagation training of the residual network module.

Residual Network.
A deep convolution neural network has made a series of breakthroughs in the field of image classification. Recent research results show that the depth of the model plays a crucial role, and many recognition algorithms benefit from very deep models.
However, the deeper the neural network is, the more difficult it is to train. Even when a deep network converges, degradation occurs: as the network depth increases, the accuracy tends to saturate. To solve this problem, a residual learning-based framework is used here, which simplifies the training of very deep networks. The optimal solution mapping is denoted by $H(x)$, where $x$ is the input to these layers. The stacked nonlinear layers are used to fit another mapping $F(x) := H(x) - x$, and the original optimal solution mapping $H(x)$ is then rewritten as $F(x) + x$. It is assumed that the residual mapping is easier to optimize than the original mapping. In the extreme case where the identity mapping is optimal, it is much easier to push the residual to 0 than to fit an identity mapping with a stack of nonlinear layers. Residual learning is applied to the stacked layers. A building block is shown in Figure 2 and is defined as follows:

$$y = F(x, \{W_i\}) + x,$$

where $x$ and $y$ represent the input and output of the corresponding layers, respectively, and $F(x, \{W_i\})$ represents the learned residual mapping function. The example in Figure 2 contains two layers, $F = W_2 \sigma(W_1 x)$, where $\sigma$ represents ReLU and the bias term is omitted for simplicity. $F + x$ is obtained by a shortcut connection and element-wise addition. After the addition, another nonlinear operation is applied (that is, $\sigma(y)$). The shortcut connections described here are attractive because they add neither parameters nor computational complexity, which is especially important when comparing plain and residual networks: it allows a fair comparison of the two networks with the same parameters, depth, width, and computational cost (except for the negligible element-wise addition). In the above equation, the dimensions of $x$ and $F$ must be the same.
If the dimensions are inconsistent, for example when the number of input and output channels changes, a linear mapping on the shortcut connection is required to match the dimensions:

$$y = F(x, \{W_i\}) + W_s x,$$

where $W_s$ is a linear mapping matrix, which is only used to solve the dimension mismatch problem here.
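As an illustration of the building block defined above, the following is a minimal sketch in plain Python, assuming fully connected layers on vectors rather than the convolutional layers of the actual network. The helper names (`relu`, `matvec`, `residual_block`) are hypothetical and chosen only for this example.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def residual_block(x, W1, W2, Ws=None):
    """Compute sigma(F(x) + shortcut(x)) with F(x) = W2 * sigma(W1 * x).

    Ws is the optional linear mapping used only when the dimensions of
    x and F(x) differ; otherwise the shortcut is the identity.
    """
    residual = matvec(W2, relu(matvec(W1, x)))
    shortcut = x if Ws is None else matvec(Ws, x)
    if len(shortcut) != len(residual):
        raise ValueError("dimension mismatch: supply a projection matrix Ws")
    return relu([r + s for r, s in zip(residual, shortcut)])


# When the residual weights are all zero, the block reduces to sigma(x),
# which illustrates why pushing the residual to 0 is easy.
x = [1.0, -2.0, 3.0]
zero = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zero, zero)
```

The identity shortcut adds no parameters; only the optional `Ws` projection does, and it is needed only when the input and output sizes differ.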

Dynamic Mask Generation Method.
The dynamic mask generation method starts from the rectangular edge marked by the bounding box, modifies the segmentation mask used for supervised training through iterative updating, and generates the dynamic mask according to the final contour. The process is shown in Figure 1. The input data of the dynamic mask generation method include the bounding box annotation data and the CNN multilayer feature maps. Firstly, the feature vectors of all feature maps are normalized. Secondly, the GMM model is initialized with the feature vectors of the sampling points. Finally, the probability of each sampling point under the GMM model is calculated to complete the mask update.

Normalization of Feature Maps.
As the intermediate-layer data extracted from the CNN network usually have no upper and lower bounds, and the feature values of different dimensions vary greatly, feature bias will occur in the calculation of feature distance, so the feature vectors need to be normalized. $F^0 \in \mathbb{R}^{H \times W \times C}$ is used to represent the CNN feature map of a layer, where $H$ is the height of the feature map, $W$ is the width of the feature map, and $C$ is the number of channels of the feature map, namely, the dimension of the feature vector of a sampling point. Then, the calculation equation of the normalized feature map $F$ is expressed as follows:
Figure 1 labels: Step 1, forward propagation; Step 2, mask generation.

Mathematical Problems in Engineering
In the equation, $F_{ijc} = F(i, j, c)$ represents the value at coordinates $(i, j, c)$ in the normalized feature map $F$, and $F^0_{ijc} = F^0(i, j, c)$ represents the value at coordinates $(i, j, c)$ in the CNN feature map $F^0$.
Due to the boundary padding operation in CNN convolution, the pixel positions in the feature map are not strictly proportional to the pixel positions of the corresponding input image. As shown in Figure 2, the data at the edge of the CNN feature map are redundant data caused by the padding of the convolution operation, so the feature map needs to be clipped during normalization.
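The clipping and normalization step can be sketched as follows. This is only an illustration under stated assumptions: it uses per-channel min-max scaling to [0, 1], which is a common choice and may differ from the exact equation used in this article, and the `border` width and the function name `normalize_feature_map` are hypothetical.

```python
def normalize_feature_map(f0, border=0, eps=1e-12):
    """Clip `border` cells from every side of an H x W x C feature map
    (nested lists), then min-max scale each channel to [0, 1].

    Min-max scaling is an assumption made for this sketch.
    """
    h, w = len(f0), len(f0[0])
    cropped = [row[border:w - border] for row in f0[border:h - border]]
    channels = len(cropped[0][0])
    # Per-channel minimum and maximum over all remaining sampling points.
    lo = [min(px[c] for row in cropped for px in row) for c in range(channels)]
    hi = [max(px[c] for row in cropped for px in row) for c in range(channels)]
    return [
        [[(px[c] - lo[c]) / (hi[c] - lo[c] + eps) for c in range(channels)]
         for px in row]
        for row in cropped
    ]


# A 2 x 2 map with two channels whose value ranges differ by an order of magnitude.
f0 = [[[0.0, 10.0], [2.0, 20.0]],
      [[4.0, 30.0], [8.0, 50.0]]]
f = normalize_feature_map(f0)
```

After scaling, both channels contribute comparably to a feature distance, which is the purpose of the normalization described above.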
As shown in Figure 1, in the first iteration, the bounding box annotation is taken as the initial image edge. The position and size of the bounding box are scaled to 1/32 of the annotated data. The foreground GMM model and background GMM model are initialized by collecting sample points based on the current edge contour. The feature vectors of the GMM model are taken from the normalized 1/32 feature map $F$. Then, the foreground and background GMM models are used to classify all the sampling points in the feature map, and the classification results are used as the updated image edge data.
In round 1, the width and height of the mask update result map are each 1/32 of those of the input image. The result map is used as the basis for the GMM model initialization of the next round, and so on. The feature input of the last round is the original input image, so the size of the mask update result is the same as that of the input image, and the classification result of that round is taken as the final output mask.
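The coarse-to-fine hand-over between rounds can be sketched as below. The sketch assumes that the bounding box is rounded outward when scaled and that a mask is carried to the next, finer round by nearest-neighbour upsampling; neither detail is specified in the text, and the function names are hypothetical.

```python
import math


def scale_box(box, factor):
    """Scale an (x_min, y_min, x_max, y_max) box by `factor`.

    Rounding outward (floor for minima, ceil for maxima) is an assumption,
    chosen so the scaled box still covers the object.
    """
    x0, y0, x1, y1 = box
    return (math.floor(x0 * factor), math.floor(y0 * factor),
            math.ceil(x1 * factor), math.ceil(y1 * factor))


def upsample_nearest(mask, factor):
    """Enlarge a binary mask (list of rows) by an integer factor."""
    return [[v for v in row for _ in range(factor)]
            for row in mask for _ in range(factor)]


# Round 1 works at 1/32 of the input resolution.
coarse_box = scale_box((64, 32, 200, 130), 1 / 32)
# The round-1 result initializes the next round at twice the resolution.
next_init = upsample_nearest([[1, 0]], 2)
```

Each round then refits the foreground and background models on the features of its own layer, using the upsampled mask as the starting contour.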

Mask Update.
The flow of the mask update method is shown in Figure 3. The input data for each round of mask update include the input foreground edge and the normalized feature map. The input foreground edge is either the bounding box annotation or the result of the previous round of edge update. The mask update first determines the samples of the foreground GMM and the background GMM, where all pixel points on the input feature map are sampling points. The two types of samples are then used to initialize the foreground GMM model $G_f$ and the background GMM model $G_b$, respectively. Finally, the sampling points at the edge between foreground and background are classified again.
In Figure 3, the sampling points for foreground and background are divided as follows: the sampling points inside the existing edge contour are the foreground samples, and the sampling points outside the edge contour are the background samples.
In Figure 3, the method to obtain the list of boundary sampling points is expressed as follows. If the sampling point S is adjacent to the existing edge contour, then the sampling point is added to the list of boundary sampling points.
In Figure 3, GMM is used to classify the sampling point $S$ as follows. If $G_f(S) > G_b(S)$, the pixel is classified into the foreground category. Otherwise, it is classified into the background category.
In Figure 3, the method to update the list of boundary sampling points is to add all sampling points adjacent to the current sampling point to the list of boundary sampling points if the classification result of the current sampling point is different from the initial value.
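The boundary reclassification loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: `g_f` and `g_b` stand in for the fitted foreground and background GMMs and are passed as arbitrary callables that return a likelihood for a feature value, 4-connectivity is assumed for "adjacent", and each sampling point is visited at most once.

```python
from collections import deque


def neighbors(i, j, h, w):
    """Yield the 4-connected neighbours of (i, j) inside an h x w grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < h and 0 <= nj < w:
            yield ni, nj


def update_mask(mask, features, g_f, g_b):
    """Reclassify boundary sampling points and grow the boundary list
    wherever a label changes. mask holds 1 (foreground) or 0 (background).
    """
    h, w = len(mask), len(mask[0])
    new_mask = [row[:] for row in mask]
    # Initial boundary list: points adjacent to the existing edge contour.
    queue = deque(
        (i, j) for i in range(h) for j in range(w)
        if any(mask[ni][nj] != mask[i][j] for ni, nj in neighbors(i, j, h, w))
    )
    visited = set(queue)
    while queue:
        i, j = queue.popleft()
        label = 1 if g_f(features[i][j]) > g_b(features[i][j]) else 0
        if label != new_mask[i][j]:
            # The label differs from its initial value: add the neighbours.
            new_mask[i][j] = label
            for p in neighbors(i, j, h, w):
                if p not in visited:
                    visited.add(p)
                    queue.append(p)
    return new_mask


# A one-row example: the initial box covers four cells, but only the first
# two have foreground-like features (stand-in likelihoods v and 1 - v).
refined = update_mask([[1, 1, 1, 1, 0]],
                      [[0.9, 0.8, 0.2, 0.1, 0.1]],
                      lambda v: v, lambda v: 1.0 - v)
```

Because only points near the contour are re-evaluated, the cost of a round depends on how far the contour moves rather than on the full size of the feature map.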

Semantic Segmentation Method.
The dynamic mask generated in Section 3.2 serves as the supervision information during the semantic segmentation training and provides feedback to the CNN network. In training, the mask is dynamically generated from the forward propagation results of each input sample image, and it replaces the traditional pixel-level annotation in the calculation of the loss function. $y \in \mathbb{R}^{H \times W \times C}$ is used to represent the dynamic mask obtained in Section 3.2, where $W$ and $H$ are the width and height of the input image, respectively, and $C$ is the number of semantic categories. $h \in \mathbb{R}^{H \times W \times C}$ is used to represent the pixel-level prediction results in Figure 3, whose width and height are also $W$ and $H$. The loss function used in training is $L(\theta) = \sum_{i,j} l(h_{ij}, y_{ij})$, where $l(h_{ij}, y_{ij})$ represents the softmax loss function, and $h_{ij}$ and $y_{ij}$ represent the data of the prediction result $h$ and the pseudolabel $y$ at coordinates $(i, j)$, respectively. The weakly supervised training framework for the image semantic segmentation network is shown in Figure 2. It consists of three parts: the residual network module, the RefineNet module, and the segmentation mask generation module. In the segmentation mask generation module, the output feature maps of Conv2, Conv3, Conv4, and Conv5 are extracted from the residual network module as the input data for edge extraction. The output mask image size of the segmentation mask generation module is $(W/4, H/4)$.
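The loss $L(\theta) = \sum_{i,j} l(h_{ij}, y_{ij})$ can be sketched in plain Python as follows. For brevity, the sketch assumes that the pseudolabel $y$ is stored as a map of class indices rather than as a one-hot $H \times W \times C$ array, and that $h$ holds unnormalized class scores; the function names are hypothetical.

```python
import math


def softmax_loss(scores, target):
    """Softmax (cross-entropy) loss of one pixel: -log softmax(scores)[target]."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in scores))
    return log_sum - scores[target]


def total_loss(h, y):
    """Sum the per-pixel softmax loss over all coordinates (i, j)."""
    return sum(
        softmax_loss(h[i][j], y[i][j])
        for i in range(len(h)) for j in range(len(h[0]))
    )


# A 1 x 2 image with two classes and uninformative (equal) scores.
loss = total_loss([[[0.0, 0.0], [0.0, 0.0]]], [[0, 1]])
```

With equal scores for two classes, each pixel contributes log 2, which gives a convenient sanity check for an implementation.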
The generation of the segmentation mask relies on the semantic information contained in the CNN features. In the early stage of training, however, the feature maps extracted by the CNN do not yet contain semantic information and cannot be used as features to extract edge information. If the training method shown in Figure 2 is used directly, an effective image semantic segmentation model cannot be obtained. A pretraining process is therefore required. In pretraining, only the input image is used as the feature, the mask generation method is adopted to generate the pseudolabels of all samples, and the pseudolabels are used as pixel-level labels to complete the pretraining. After the pretraining, the training is completed using the method shown in Figure 2.
In the test process, only the residual network module and the RefineNet module are needed for forward propagation, and the image semantic segmentation results can be obtained without dynamic mask generation.

Network Training
In the weakly supervised semantic segmentation comparison experiment, RefineNet was used as the image semantic segmentation training framework, and ResNet101 was used as the image feature extraction network. As the feature maps of the CNN layers used in the segmentation mask generation module differ in size, the GMM model parameters also need to be set differently. On the four CNN feature layers and the input image used during dynamic mask generation, the number of submodels $K$ of the GMM model is set to 3, 5, 10, 15, and 20, respectively. During model training, mini-batch gradient descent was adopted, and the number of samples in each batch was set to 2. During the initial training, the learning rate was $5 \times 10^{-4}$, and 40 training cycles were completed. After that, the training method shown in Figure 2 was adopted to continue the training. The initial learning rate was $5 \times 10^{-4}$. After 40 training cycles, the learning rate was changed to $5 \times 10^{-5}$, and another 40 training cycles were completed. The momentum parameter of the parameter update is 0.9, and the attenuation coefficient is $1 \times 10^{-4}$.

Data Sources and Preprocessing.
The method is evaluated on the ISPRS 2D semantic labeling benchmark. There are two airborne image datasets, consisting of high-resolution true orthophoto (TOP) tiles and the corresponding digital surface models (DSMs). The Vaihingen dataset contains 33 patches (of different sizes), each consisting of a TOP extracted from a larger TOP mosaic. Each patch has 3-band IRRG (infrared, red, and green) image data and a DSM. The selected experimental image is shown in Figure 4.
The Potsdam dataset contains 38 patches (of the same size), each consisting of a TOP extracted from a larger TOP mosaic. Each patch contains 4-band IRRGB (infrared, red, green, and blue) image data and a DSM. The selected experimental image is shown in Figure 5. Notably, the IRRG images from the Vaihingen dataset and the RGB images from the Potsdam dataset are used in this article. The images contained in these datasets are all large high-resolution remote sensing images, and a deep CNN has a huge number of parameters, which requires high computer performance. Such images cannot be processed directly and must first be cut into small images. The high-resolution images in the datasets are cut into 256 × 256 images: starting from the upper left corner of the image, sliding cutting with a step size of 256 pixels is performed. The number of weight parameters between deep CNN neurons is huge, which places high demands on the number of training samples. For this reason, the existing samples are expanded by data augmentation. The small images obtained by the cutting are flipped and rotated 90°, 180°, and 270° clockwise. These methods greatly expand the data volume of the samples, thereby effectively preventing overfitting. Some of the transformed images are shown in Figure 6.
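The cutting and augmentation steps can be sketched as follows on images stored as nested lists. The sketch assumes that incomplete tiles at the right and bottom borders are discarded and that both horizontal and vertical flips are used; the text does not specify either point, and the function names are hypothetical.

```python
def tile(image, size=256):
    """Cut a 2-D image (list of rows) into non-overlapping size x size tiles,
    sliding from the upper left corner with a step equal to the tile size.
    Incomplete border tiles are dropped (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    return [
        [row[left:left + size] for row in image[top:top + size]]
        for top in range(0, h - size + 1, size)
        for left in range(0, w - size + 1, size)
    ]


def rotate90_cw(img):
    """Rotate an image 90 degrees clockwise."""
    return [list(r) for r in zip(*img[::-1])]


def augment(img):
    """Return the tile, its two flips, and its 90/180/270 degree
    clockwise rotations."""
    r90 = rotate90_cw(img)
    r180 = rotate90_cw(r90)
    r270 = rotate90_cw(r180)
    h_flip = [row[::-1] for row in img]
    v_flip = [row[:] for row in img[::-1]]
    return [img, h_flip, v_flip, r90, r180, r270]


# A toy 4 x 4 "image" cut into 2 x 2 tiles, then augmented.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = tile(image, size=2)
samples = [variant for t in tiles for variant in augment(t)]
```

Under these assumptions, each tile yields six training samples, so the augmented set is six times the size of the tiled set.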

Parameter Setting and Experimental Platform.
The experiment divided the dataset according to the ratio of 5 : 1. There are 75,000 images taken as the training set and 25,000 images taken as the verification set. The learning rate is the parameter that controls the learning speed of the network. In the experiment, the learning rate was set at 0.01 according to the empirical value, at which time the model converges slowly and overfitting occurs. When the learning rate is increased to 0.05, the training curve oscillates greatly, and the model cannot reach the optimal value. After several adjustments, the appropriate learning rate was finally selected as 0.01, the batch size was set to 10, the number of experimental iterations was set to 30, and the optimizer was set to stochastic gradient descent (SGD). SGD randomly selects a sample for the gradient update as follows:

$$W' = W - \alpha \nabla_W J_t(W),$$

where $W'$ is the updated weight, $W$ is the original weight, $\alpha$ is the learning rate, and $J_t(W)$ is the loss of the $t$-th sample. The number of convolution kernels was set to increase gradually from 64 to 512, which improved the network prediction performance. For the activation function, ReLU and ELU are each used in this article to train the network, and their final semantic segmentation effects are compared. In order to ensure the accuracy of the comparison, the same loss function was adopted in all networks, in which $y^{*N}_j$ denotes the category of sample $j$ predicted by the network and $y^{N}_j$ denotes the true category of sample $j$. The experimental environment is as follows: the CPU is an Intel(R) Core(TM) i7-9700K processor, the graphics cards are two NVIDIA GeForce GTX 1080 Ti cards, and the total memory capacity is 32 GB. All CNN models are run under the PyTorch framework.
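The SGD update described above, in which the weight is moved against the gradient of the loss of one randomly selected sample, can be sketched as follows. The squared-error loss and the toy data are assumptions made only to keep the example self-contained; they are not the loss or data used in the experiments.

```python
import random


def sgd_step(w, grad, lr):
    """One update: new weight = old weight - learning rate * gradient."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def squared_error_grad(w, x, y):
    """Gradient of J_t(W) = 0.5 * (w . x - y)^2 for one sample (x, y).
    This toy loss is an assumption of the sketch.
    """
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def train(samples, lr=0.01, steps=2000, seed=0):
    """Run SGD, drawing one random sample per update."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    for _ in range(steps):
        x, y = rng.choice(samples)
        w = sgd_step(w, squared_error_grad(w, x, y), lr)
    return w


# Toy data generated by the weights [2, -1].
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w_fit = train(data, lr=0.1)
```

The learning rate plays the role discussed in the text: a small value makes each step cautious and convergence slow, while a value that is too large makes the iterates oscillate.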

Evaluation Index.
The IoU index is a common model prediction index, which reflects the ratio of the intersection to the union of the predicted region and the ground-truth label. The calculation equation is as follows:

$$\mathrm{IoU} = \frac{t_p}{t_p + f_p + f_n}.$$

In the equation, $t_p$ is the number of pixels that are predicted as this class and actually belong to this class, $f_p$ is the number of pixels that are predicted as this class but actually belong to other classes, and $f_n$ is the number of pixels that actually belong to this class but are predicted as other classes.
In general, if the IoU value exceeds 50%, the network is considered to have good image segmentation performance. The precision rate represents the accuracy of the predicted results, that is, the proportion of the pixels predicted as a class that actually belong to that class. Its calculation equation is as follows:

$$\mathrm{Precision} = \frac{t_p}{t_p + f_p}.$$

The precision rate is also a common evaluation index in semantic segmentation. In addition, the recall rate is used to evaluate the semantic segmentation effect of the different networks on remote sensing images. Its calculation equation is as follows:

$$\mathrm{Recall} = \frac{t_p}{t_p + f_n}.$$

The recall rate is the ratio of the number of correctly predicted pixels of a class to the actual number of pixels of that class. It reflects the classification performance of the model on all positive samples in the dataset.
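The three indexes can be computed per class from a prediction and a ground-truth label map. The following sketch assumes both maps have been flattened into lists of class indices; the function name `class_metrics` is hypothetical.

```python
def class_metrics(pred, truth, cls):
    """Return (IoU, precision, recall) of class `cls` from two flat
    lists of per-pixel class indices."""
    pairs = list(zip(pred, truth))
    tp = sum(p == cls and t == cls for p, t in pairs)  # correctly predicted
    fp = sum(p == cls and t != cls for p, t in pairs)  # predicted, but other class
    fn = sum(p != cls and t == cls for p, t in pairs)  # missed pixels of the class
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return iou, precision, recall


# Five pixels: two true positives, one false positive, one false negative.
iou, precision, recall = class_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], cls=1)
```

In this example the IoU is 2/4 and both precision and recall are 2/3, which shows that the IoU is never larger than either of the other two indexes for the same class.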

Experimental Results and Analysis
Several classical semantic segmentation networks are used to perform semantic segmentation on the predicted images, and the segmentation effects are compared. The different categories are marked as listed in Table 1, where R, G, and B represent red, green, and blue, respectively. The networks used for comparison are U-Net, CASIA3 [35], and HUSTW5 [36], as shown in Table 2.
It can be seen from the training results that U-Net has the fastest convergence speed [37,38], taking 15.9 hours. CASIA3 training takes 20.6 hours, and HUSTW5 training takes 17.1 hours. The proposed network, which uses RefineNet, takes 16.6 hours. The RefineNet used in this article is further refined in its structure. The mean value and standard deviation of each mini-batch are used to adjust the intermediate output of the neural network so that the intermediate output of each layer of the whole network is more stable and convergence is accelerated. At the same time, the multiscale feature extraction structure is adopted to ensure the segmentation precision and obtain a relatively better segmentation effect. It is efficient in memory and computation time and easy to train. The ELU activation function has negative values, which can push the mean output of the activation unit closer to zero, reduce the offset effect, and bring the gradient closer to the natural gradient [39]. Compared with the ReLU method, some recent works have also found that the U-Net method can make the network fit faster [40,41].
For remote sensing images of the Vaihingen and Potsdam datasets, the semantic segmentation results obtained by the different methods are visualized, as shown in Figures 7 and 8.
According to Figures 7 and 8, the segmentation result of the improved method reduces wrong segmentation to the minimum and is the closest to the ground truth. The segmentation effect is more intuitive, and the overall visual perception is the best. The specific evaluation index values obtained by each segmentation method are shown in Tables 3 and 4. It can be seen from Tables 3 and 4 that the method proposed in this article achieves the best results in segmenting low vegetation, buildings, impervious surfaces, cars, and trees. The activation function ELU has a better effect than ReLU because the negative exponential term in ELU can prevent the emergence of silent neurons and improve the learning efficiency.

Figure panels: original image, ground truth, U-Net, CASIA3, proposed.

Conclusion and Future Work
Aiming at the difficulty of semantic segmentation of high-resolution remote sensing images, this article proposed a semantic segmentation method based on CNN and mask generation. The edge extraction method based on the iterative GMM model is used to fuse the multilayer features of the CNN iteratively, with the feature map of one layer used as the feature input in each iteration. The manually marked bounding box is used as the initial value of the foreground object contour, and the segmentation mask is modified step by step using the GMM model. At the same time, the framework of residual learning is used to solve the problem of deep network degradation after convergence. The experiment shows that the edge information of the foreground object is extracted by fusing the high-level, middle-level, and low-level image features. Semantic features are used to reduce semantic-level errors of the target contour, and the underlying features are used to improve the accuracy of the edge contour. The experiment was performed on the Potsdam and Vaihingen datasets. The results show that the proposed algorithm effectively improves the overall precision of semantic segmentation of high-resolution remote sensing images and shortens the overall training time and segmentation time.
Future work will focus on optimizing the algorithm, extending the design to low-resolution images, and completing the image recognition work.

Data Availability
The labeled datasets used to support the findings of this study are available from the corresponding author upon request.

The bold values represent the maximum values of the evaluation indexes (precision, recall, and IoU) in the same column, and the index values of this paper are greater than those of the other algorithms.