A Deep Multiscale Fusion Method via Low-Rank Sparse Decomposition for Object Saliency Detection Based on Urban Data in Optical Remote Sensing Images

Urban data provide a wealth of information that supports people's life and work. In this work, we study object saliency detection in optical remote sensing images, which is conducive to the interpretation of urban scenes. Saliency detection selects the regions carrying important information in remote sensing images, closely imitating the human visual system. It plays an important role in other image processing tasks and has been applied successfully to change detection, object tracking, temperature reversal, and other tasks. Traditional methods suffer from poor robustness and high computational complexity. Therefore, this paper proposes a deep multiscale fusion method via low-rank sparse decomposition for object saliency detection in optical remote sensing images. First, we perform multiscale segmentation of the remote sensing images. Then, we calculate the saliency value and generate the proposal region. The superpixel blocks of the remaining proposal regions of the segmentation map are input into the convolutional neural network. By extracting the depth feature, the saliency value is calculated and the proposal regions are updated. The feature transformation matrix is obtained by the gradient descent method, and the high-level semantic prior knowledge is obtained by the convolutional neural network. The process is iterated to obtain the saliency map at each scale. The low-rank sparse decomposition of the transformed matrix is carried out by robust principal component analysis. Finally, the weight cellular automata method fuses the multiscale saliency maps with the saliency map calculated from the sparse noise obtained by the decomposition. Meanwhile, the object prior knowledge filters out most of the background information, reduces unnecessary depth feature extraction, and improves the saliency detection rate. The experimental results show that the proposed method improves the detection effect compared with other deep learning methods.


Introduction
With the rapid development of information technology, urban data have become an important information source, and the amount of information people receive has increased exponentially [1,2]. How to select the object regions of human interest from the mass of urban image information has become a significant research problem. Studies have found that in a complex scene, the human visual processing system focuses on several objects, named regions of interest (ROI) [3]. The ROI is relatively close to human visual perception. Saliency detection, as an image preprocessing step, can be widely applied in remote sensing tasks such as visual tracking, image classification, image segmentation, and target relocation.
Saliency detection methods fall mainly into two categories: top-down and bottom-up. The top-down saliency detection method [4][5][6] is a task-driven process. The ground-truth images are labeled manually for supervised training, so the method integrates more human perception to obtain the saliency map. In contrast, the bottom-up method is a data-driven process that pays more attention to image features such as contrast, position, and texture to compute the saliency map (SM). Itti et al. [7] proposed a spatial visual model that takes full advantage of local contrast and obtains the saliency map from the image differences between the center and the surround. Hou and Zhang [8] put forward a saliency detection algorithm based on the Spectral Residual (SR). Achanta et al. [9] proposed a frequency-tuned (FT) method that calculates saliency in the image frequency domain. A histogram-based detection method was presented to calculate global contrast [10]. Furthermore, other relevant methods were proposed and showed better results [11][12][13][14][15]. However, they do not analyze the image from the dimensions.
Yan et al. [16] treated the saliency region of the image as sparse noise and the background as a low-rank matrix, and calculated the saliency of the image by using sparse representation and the robust principal component analysis algorithm. Firstly, the image was decomposed into 8 × 8 blocks. Every image block was sparsely encoded and merged into a coding matrix. Then, the coding matrix was decomposed by robust principal component analysis. Finally, the sparse matrix obtained by the decomposition was used to establish the saliency factor of the corresponding image block. However, because a large saliency object covers many image blocks, the saliency object no longer satisfies the sparsity assumption, which greatly degrades the detection effect. Lang et al. [17] utilized a multitask low-rank recovery approach for saliency detection. The multitask low-rank representation algorithm was used to decompose the feature matrix while constraining the consistency of all feature sparse components in the same image blocks. The algorithm exploited the consistency information of the multifeature description, and its effect was improved. However, since a large target contains a large number of feature descriptions, the feature is no longer sparse. The reconstruction error could not solve this problem, so this method could not completely detect large saliency objects. To improve on the above methods, Shen and Wu [18] proposed a low-rank matrix recovery (LRMR) algorithm combining bottom-up and top-down cues (providing low-level and high-level information, respectively). First, superpixel segmentation was performed on the image and several features were extracted. Then, the feature transformation matrix and prior knowledge, including size, texture, and color, were obtained by learning and used to transform the feature matrix. Finally, the low-rank and sparse decomposition of the transformed matrix was carried out by using the robust principal component analysis algorithm. This method alleviated the deficiency to some extent. However, due to the limitation of the center prior and the failure of the color prior in complex scenes, this algorithm was not ideal for detecting images with complex backgrounds.
Saliency detection methods that use different low-level features are usually effective only for a specific type of image and are not suitable for multiobject images in complex scenes [19][20][21]. Figure 1 shows an example of saliency detection. The low-level features of visual stimuli lack an understanding of the nature of saliency objects and cannot represent the features at a deeper level. Noisy objects in the image that have similar low-level features but do not belong to the same category are often wrongly detected as saliency objects. Yang et al. [22] presented a bag-of-words model to detect saliency. Firstly, the prior probability saliency map was obtained through the object feature, and a bag-of-words model representing the middle-level semantic features was established to calculate the conditional probability saliency map. Finally, the two saliency maps were combined by Bayesian inference. The middle-level semantic features represent the image content more accurately than the low-level features, so the detection was more accurate. Jiang et al. [23] treated saliency detection as a regression problem and integrated regional attributes, contrast, and feature vectors of regional background knowledge under multiscale segmentation. The saliency map was obtained by supervised learning. Owing to the introduction of background knowledge features, the algorithm had a better ability to identify background objects and thus obtained more accurate foreground detection results.
Deep learning (DL) combines low-level features to form more abstract high-level features; a typical representative is the convolutional neural network (CNN). Many saliency detection methods have adopted CNNs to optimize the result. Li et al. [24] proposed a deep CNN (DCNN) to detect saliency. Firstly, region and edge information were obtained by using the superpixel algorithm and bilateral filtering. The DCNN was utilized to extract the region and edge features in raw images. Finally, the region confidence map and edge confidence map generated by the CNN were integrated into a conditional random field to judge the saliency. Wang et al. [25] proposed a recurrent fully convolutional network (RFCN) for saliency detection, which mainly included two steps: pretraining and fine-tuning. The RFCN was trained on the original image to correct the saliency prior image. Then, a traditional algorithm was used to further optimize the corrected saliency map.
Wireless Communications and Mobile Computing
Lee et al. [26] proposed a deep saliency (DS) algorithm for saliency detection using low-level and high-level information in a unified CNN framework. VGG-Net was used to extract the high-level features. The low-level features were extracted as well, and a CNN was then used to encode the low-level distance map. Finally, the encoded low-level distance map was connected with the high-level features. A fully connected CNN classifier was adopted to evaluate the feature information and obtain the saliency map [27]. The above DL methods show excellent performance in terms of saliency detection rate, but they still have disadvantages such as slow speed and highly complex calculations.
In this paper, we propose a deep multiscale fusion method via low-rank sparse decomposition for object saliency detection in optical remote sensing images. The main contributions are as follows.
(a) First, multiscale segmentation is executed for remote sensing images. For the first segmentation map, the depth features of all the superpixel blocks are extracted by the CNN.
(b) Then, we calculate the saliency value, and the proposal region is generated. The superpixel blocks of the remaining proposal regions of the segmentation map are input into the CNN. By extracting the depth feature, the saliency value is calculated and the proposal regions are updated. Meanwhile, the color, texture, and edge feature mean values of all the pixels in each superpixel are calculated to construct the feature matrix. To make the image background amenable to low-rank sparse decomposition, the feature matrix is transformed so that the background can be represented as a low-rank matrix in the new feature space.
(c) To make use of the high-level information and improve the detection effect of the ROI, the fully convolutional neural network is used for learning features, and the high-level semantic prior knowledge matrix is obtained. The feature matrix is transformed by using the feature transformation matrix and the high-level semantic prior knowledge. The robust principal component analysis algorithm is used to decompose the transformed matrix into low-rank and sparse parts to obtain a saliency map. The process is iterated to obtain the saliency map at each scale.
(d) Finally, the weight cellular automata method fuses the multiscale saliency maps. It is shown that the proposed method improves the detection effect compared with other DL methods.
The remainder of the paper is organized as follows. The proposed deep multiscale fusion method for saliency detection is analyzed in section II. Section III introduces the saliency region extraction based on multiscale segmentation. Saliency is calculated based on the deep features in section IV. The performance and robustness are evaluated in section V. The conclusion is drawn in section VI.

Deep Multiscale Fusion for Saliency Detection
The proposed deep multiscale fusion method for saliency detection in optical remote sensing images is shown in Figure 2.
Firstly, the image $I$ is segmented into a small number of superpixel blocks by using the superpixel segmentation algorithm. The deep feature is extracted from all the superpixel blocks. The color, texture, and edge feature mean values of all the pixels in each superpixel are calculated to construct the feature matrix. To make the image background amenable to low-rank sparse decomposition, the feature matrix is transformed so that the background can be represented as a low-rank matrix in the new feature space. The multidimensional feature containing the key information of the image is extracted by PCA (principal component analysis). The rough segmentation saliency map is obtained from the key features, from which we extract the initial saliency region to obtain the superpixel set $Suppix$. Then, we adopt $Suppix$ to centralize the similarity degree between the superpixels and the nonobject region. The input image is segmented at different scales. The region containing the superpixel blocks in the $Suppix$ set is selected for depth feature extraction. Saliency maps and $Suppix$ sets at the next scale are obtained by the same method. The robust PCA is used to decompose the transformed matrix into low-rank and sparse parts to obtain a saliency map. Weight cellular automata fusion is used to obtain the final SM $M_{final}$.

Figure 2: The framework of proposed saliency detection.

Saliency Region Extraction Based on Multiscale Segmentation
Superpixel segmentation gathers adjacent similar pixels into image regions of different sizes according to low-level features such as brightness, thus reducing the complexity of the saliency calculation. Superpixel segmentation algorithms mainly include the watershed [28] and simple linear iterative clustering (SLIC) [25] methods. In this study, we combine their respective characteristics: the SLIC method is used to obtain segmentation results with regular shape and uniform size during rough segmentation, and the watershed algorithm is used to obtain better object contours during fine segmentation.
$Sp^{j} = \{Sp^{j}_{1}, \cdots, Sp^{j}_{N_j}\}$ denotes the superpixel set obtained at a segmentation scale $s_j$, and $N_j$ denotes the number of superpixels at scale $s_j$. $Sp^{j}_{i}(v) = \{R, G, B, L, a, b\}$ is the color feature vector of a pixel $v$ in the superpixel.
For the input image, we extract color, texture, and edge features to construct the feature matrix.
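As a minimal sketch of how such a feature matrix could be assembled, the following assumes a label map that assigns a superpixel id to every pixel and a per-pixel feature vector such as (R, G, B, L, a, b). The function name `build_feature_matrix` and the nested-list data layout are illustrative assumptions, not part of the original method.

```python
def build_feature_matrix(labels, features):
    """Average per-pixel feature vectors inside each superpixel.

    labels:   2-D list of integer superpixel ids, one per pixel.
    features: 2-D list of per-pixel feature vectors (e.g. [R, G, B, L, a, b]).
    Returns (ids, F) where F is d x N: F[k][n] is the mean of feature k
    over the superpixel ids[n].
    """
    sums = {}
    counts = {}
    for row_labels, row_feats in zip(labels, features):
        for sp_id, vec in zip(row_labels, row_feats):
            if sp_id not in sums:
                sums[sp_id] = [0.0] * len(vec)
                counts[sp_id] = 0
            for k, value in enumerate(vec):
                sums[sp_id][k] += value
            counts[sp_id] += 1
    ids = sorted(sums)
    d = len(sums[ids[0]]) if ids else 0
    # One row per feature dimension, one column per superpixel.
    F = [[sums[sp][k] / counts[sp] for sp in ids] for k in range(d)]
    return ids, F
```

Each column of the returned matrix is the mean feature vector of one superpixel, which matches the $d \times N$ layout used for the feature matrix in the text.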
The saliency region of the image is regarded as sparse noise and the background as a low-rank matrix. In a complex background, the similarity of the image background after clustering is still not high, so the features of the original image are not conducive to low-rank sparse decomposition. To find a suitable feature space in which most image backgrounds can be represented as low-rank matrices, this paper obtains the eigentransformation matrix by the gradient descent method. The process of obtaining the eigentransformation matrix is as follows:
(a) Construct the marker matrix $Q = \mathrm{diag}\{q_1, q_2, \cdots, q_N\}$. If the superpixel $p_i$ is within the manually marked saliency region, $q_i = 0$; otherwise, $q_i = 1$.
(b) Learn the transformation matrix $T$ from the features of the raw images with the following optimization model:
$$\min_{T} \sum_{k} \| T F_k Q_k \|_{*} - \gamma \| T \|_{*} \quad \text{s.t.} \quad \| T \|_{2} = c,$$
where $F_k \in R^{d \times N_k}$ is the feature matrix of the kth image, $N_k$ represents the superpixel number of the kth image, and $Q_k \in R^{N_k \times N_k}$ is the labeled matrix of the kth image. $\|\cdot\|_{*}$ represents the nuclear norm of the matrix, that is, the sum of all singular values of the matrix. $\gamma$ is the weight coefficient. $\|T\|_{2}$ denotes the $\ell_2$ norm of the matrix $T$, and $c$ is a constant that prevents $T$ from arbitrarily increasing or decreasing. If the eigentransformation matrix $T$ is appropriate, then $TFQ$ is low rank. The term $-\gamma\|T\|_{*}$ avoids the trivial solution in which the rank of $T$ becomes arbitrarily small.
(c) Find the gradient descent direction of $T$.
(d) Update the eigentransformation matrix $T$ along this direction with step size $\alpha$ until the algorithm converges to a local optimum.

3.1. Extracting Proposal Region. The segmentation map at a rough segmentation scale $s_j$ is taken as input. The saliency map $Map_j$ is obtained by depth feature extraction and saliency value calculation. $Map_j$ serves as the object prior knowledge in the next segmentation and guides the proposal region extraction. To binarize $Map_j$, its values are divided into $K$ channels by the adaptive threshold strategy, where $p(i)$ denotes the number of pixels in channel $i$. The channel $k$ with the largest number of pixels among all channels is determined, and the threshold value $T$ is calculated by formula (4).
To prevent $T$ from becoming too large, so that salient pixels are not binarized to 0 when the saliency object occupies most of the image, the pixel number in each channel must satisfy $p(i)/area(I) < \varepsilon$, where $area(I)$ is the pixel number of image $I$ and $\varepsilon \in [0.6, 0.9]$ is an empirical value. The binarized object prior map is denoted as $MapB_j$ and is adopted as the prior knowledge. The corresponding superpixels in the next scale $s_{j+1}$ constitute the proposal saliency superpixel set $Suppix_{j+1}$. $M_{j+1}$ is the number of proposal saliency superpixels at the scale $s_{j+1}$, with $M_{j+1} < N_{j+1}$. Let $Num_i$ be the total number of pixels in the superpixel $Sp^{j+1}_{i}$, and let $num$ be the number of those pixels with a value of 1 at the corresponding position of the binary map $MapB_j$. If $num/Num_i > 0.5$, the superpixel is considered to belong to $Suppix_{j+1}$.
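The membership rule for the proposal set can be sketched as follows. This is a minimal illustration that assumes the binary prior map has already been resampled to the same shape as the superpixel label map of the next scale; the function name `select_proposals` and the nested-list layout are hypothetical.

```python
def select_proposals(labels_next, binary_prior, ratio=0.5):
    """Return the sorted ids of superpixels at scale s_{j+1} kept as proposals.

    labels_next:  2-D list of superpixel ids at the finer scale.
    binary_prior: 2-D list of 0/1 values (the binarized prior map), same shape.
    ratio:        a superpixel is kept when more than this fraction of its
                  pixels is marked 1 in the prior (0.5 in the text).
    """
    total = {}   # pixels per superpixel
    inside = {}  # pixels per superpixel where the prior equals 1
    for row_labels, row_bits in zip(labels_next, binary_prior):
        for sp_id, bit in zip(row_labels, row_bits):
            total[sp_id] = total.get(sp_id, 0) + 1
            inside[sp_id] = inside.get(sp_id, 0) + (1 if bit else 0)
    return sorted(sp for sp in total if inside[sp] / total[sp] > ratio)
```

Superpixels that are not selected are never passed to the network, which is where the reduction in depth feature extraction comes from.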

Region Optimization.
The proposal object superpixel set may contain some background areas or miss some saliency areas, so the proposal object area needs to be optimized. The optimization removes the possible background areas from $Suppix_{j+1}$ and adds the possible saliency areas found in the background. The difference matrix $Difmat$ is built from the Euclidean distance between superpixels in the two color spaces. It is a symmetric matrix of order $N_{j+1}$. Here $F_{i,k}$ is the kth feature of superpixel region $Sp_i$, and $k = [1, \cdots, 6]$ corresponds to R, G, B, L, a, and b, respectively. For $Sp_k \in Suppix_{j+1}$, the local average dissimilarity degree is calculated through equation (6), where $Sp_k, Sp_l \in Suppix_{j+1}$ and $M_{j+1}$ is the superpixel number in the proposal saliency region set $Suppix_{j+1}$. We also calculate the average dissimilarity degree between each superpixel $Sp_k$ in $Suppix_{j+1}$ and its adjacent background region, where $Sp_k \in Suppix_{j+1}$, $Sp_l \notin Suppix_{j+1}$, and $Sp_k$ is adjacent to $Sp_l$. $M'_{j+1}$ represents the number of such adjacent background superpixels.
Assuming it is not the first time the superpixels are segmented, the local and global features are extracted for superpixel $Sp_i$. The local features of the superpixel include two parts: (1) the deep feature $F_{self}$ containing its own region; (2) the deep feature $F_{local}$ containing itself and the adjacent superpixel region.
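As a sketch of the region optimization computation, the difference matrix and the average dissimilarity degrees can be computed as below. The plain arithmetic mean is an assumption, since the exact form of the dissimilarity equations is not reproduced here, and the function names are illustrative.

```python
import math


def difference_matrix(feats):
    """Pairwise Euclidean distances between superpixel color vectors.

    feats[i] is the color vector (R, G, B, L, a, b) of superpixel i.
    The result is a symmetric matrix with a zero diagonal.
    """
    n = len(feats)
    return [[math.dist(feats[i], feats[j]) for j in range(n)] for i in range(n)]


def mean_dissimilarity(difmat, k, others):
    """Average distance from superpixel k to the superpixels in `others`.

    Pass the proposal set for the local dissimilarity, or the adjacent
    background superpixels for the background dissimilarity.
    """
    others = [l for l in others if l != k]
    if not others:
        return 0.0
    return sum(difmat[k][l] for l in others) / len(others)
```

Comparing the two averages for a superpixel indicates whether it resembles the proposal set or the neighboring background more closely.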
First, according to the $Suppix$ set, the minimum bounding rectangle $Rect_{self}$ of each superpixel $Sp_i$ is extracted. Since most superpixels are not regular rectangles, the extracted rectangle necessarily contains pixels from outside the superpixel. These pixels are replaced by the average value of the superpixel. The depth feature $F_{self}$, which contains only the superpixel's own region, is then obtained through the deep CNN.
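The rectangle extraction with mean filling can be sketched as follows. For brevity the sketch assumes a single-channel image stored as nested lists; the function name `extract_rect_self` is hypothetical, and a color image would apply the same rule per channel.

```python
def extract_rect_self(image, labels, sp_id):
    """Crop the minimum bounding rectangle of one superpixel.

    image:  2-D list of pixel values (single channel, an assumption).
    labels: 2-D list of superpixel ids with the same shape as image.
    Pixels inside the rectangle that belong to other superpixels are
    replaced by the mean value of the superpixel itself.
    """
    coords = [(r, c) for r, row in enumerate(labels)
              for c, lab in enumerate(row) if lab == sp_id]
    if not coords:
        raise ValueError("superpixel id not present in the label map")
    mean = sum(image[r][c] for r, c in coords) / len(coords)
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    return [[image[r][c] if labels[r][c] == sp_id else mean
             for c in range(c0, c1 + 1)]
            for r in range(r0, r1 + 1)]
```

The resulting patch has a regular shape and can be resized to the input size of the network.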
Calculating saliency from $F_{self}$ alone is meaningless: it is impossible to determine whether a superpixel is salient without comparing it with the adjacent superpixels. Therefore, $Rect_{local}$ is also extracted to obtain the deep local feature $F_{local}$. The location of a region in the image is another important factor in judging whether it is salient. It is generally believed that a region in the center of the image is more likely to be salient than a region at the edge. Therefore, the whole image is taken as the input, and the deep feature $F_{global}$ of the global region is extracted.
If only low-level features are used to extract the saliency map, the final saliency map is not ideal because of the many interfering objects. Therefore, high-level information needs to be added to improve the detection effect. The adopted high-level semantic prior knowledge mainly predicts the most likely ROI based on previous experience (i.e., training samples). The FCNN is trained to provide the high-level semantic prior knowledge, which is integrated into the feature transformation process to optimize the final saliency map. Higher-order features can be learned from the raw data without preprocessing in the multistage global training process of the CNN.
The FCNN can accept input images of any size. The difference between the FCNN and the CNN is that a deconvolution layer replaces the fully connected layer. Pixel classification is then carried out on the upsampled feature map: a binary prediction is produced for each pixel, and a pixel-level classification result is output. Thus, the problem of image segmentation at the semantic level is solved. The semantic prior is important high-level information that can assist the detection of the ROI. Therefore, this paper adopts the FCNN to obtain high-level semantic prior knowledge and applies it to the detection of the ROI.
The network structure of the FCNN is shown in Figure 4. Starting from the original classifier, this paper uses the back propagation algorithm to fine-tune the parameters in all FCNN layers. In the network structure, the first row obtains the feature map after alternating seven convolutional layers and five pooling layers. The last step, the deconvolution layer, upsamples the feature map with a step size of 32 pixels; this network structure is denoted as FCNN-32s. The precision decreases because of the max pooling operations: directly upsampling the downsampled feature map results in very rough output and loss of detail. Therefore, in this paper, the features with a step size of 32 pixels are upsampled by a factor of 2 and summed with the features with a step size of 16 pixels. The obtained feature is then recovered to the original image size for training, which gives the FCNN-16s model. It provides more accurate detailed information than FCNN-32s. We adopt the same method to train the network and obtain the FCNN-8s model, whose prediction of detailed information is more accurate still. Experiments show that although fusing even lower-level features can make the detail prediction more accurate, it does not significantly improve the result of the low-rank sparse decomposition, while the training time increases sharply. Therefore, this paper adopts the FCNN-8s model to acquire the high-level prior knowledge of images.
The deep CNN model comprises an input layer, multiple convolution layers, downsampling layers, full connection layers, and an output layer. The downsampling layer and  The former is used for feature extraction, and the latter is for feature calculation. The fully connected layer is connected with the downsampling layer and outputs the feature.
The output of the convolution layer is
$$d^{l}_{n} = f\Big(\sum_{m} d^{l-1}_{m} * k^{l}_{m,n} + b^{l}_{n}\Big),$$
where $d^{l}_{n}$ and $d^{l-1}_{m}$ are the feature maps of the current layer and the previous layer, $k^{l}_{m,n}$ is the convolution kernel of the model, $f(x) = 1/(1 + e^{-x})$ is the neuron activation function, and $b^{l}_{n}$ is the neuron bias. In the feature extraction result of the downsampling layer, $s \times s$ is the downsampling template scale and $k^{l}_{n}$ is the template weight. In this paper, the trained GoogleNet model is used to extract the depth features of the proposal object region. The labeled output layer of this model is removed to obtain a depth feature. The convolution layer C1 uses 96 filters of size 11 × 11 × 3 to filter the input image of size 224 × 224 × 3. The convolution layers C2, C3, C4, and C5 each take the output of the preceding downsampling layer as their input. The convolution is carried out with the layer's own filters, and the resulting output feature maps are transmitted to the next layer. The full connection layers F6 and F7 have 4096 features.

3.4. Saliency Calculation Based on Deep Feature. PCA [28] is a common method for the dimension reduction of high-dimensional data, which can replace $p$ high-dimensional features with a smaller number of $m$ features. For $n$ superpixels, the output features constitute a sample matrix $W = (x_{ij})$ of dimension $n \times p$. The correlation coefficient matrix $R = (r_{ij})_{p \times p}$ of the sample is calculated by formula (11):
$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_{i})(x_{kj} - \bar{x}_{j})}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_{i})^{2} \sum_{k=1}^{n} (x_{kj} - \bar{x}_{j})^{2}}}, \quad (11)$$
where $\bar{x}_{j} = \frac{1}{n} \sum_{k=1}^{n} x_{kj}$ is the mean of the jth feature. By solving the equation $|\lambda I - R| = 0$, we find the eigenvalues and sort them in descending order. Then, we calculate the contribution rate $\lambda_i / \sum_{k=1}^{p} \lambda_k$ and the cumulative contribution rate $\sum_{k=1}^{i} \lambda_k / \sum_{k=1}^{p} \lambda_k$ of each eigenvalue $\lambda_i$. We calculate the corresponding orthogonal unit vector $z_i = [z_{i1}, z_{i2}, \cdots, z_{ip}]^{T}$ of each eigenvalue $\lambda_i$. The unit vectors corresponding to the first $m$ features, whose cumulative contribution rate reaches 95%, are selected to form the transformation matrix $Z = [z_1, z_2, \cdots, z_m]_{p \times m}$. The high-dimensional matrix $W$ is reduced to $m$ dimensions by formula (13), that is, by projecting it onto $Z$ as $WZ$.
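The choice of $m$ from the cumulative contribution rate can be illustrated with a short sketch. It assumes the eigenvalues of the correlation matrix have already been computed by some eigensolver; the function name `select_components` is hypothetical.

```python
def select_components(eigenvalues, threshold=0.95):
    """Pick the number of principal components to keep.

    eigenvalues: eigenvalues of the correlation matrix R, in any order.
    threshold:   required cumulative contribution rate (0.95 in the text).
    Returns (m, rates, cumulative) with eigenvalues sorted in descending order.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    rates = [lam / total for lam in ordered]
    cumulative = []
    running = 0.0
    for rate in rates:
        running += rate
        cumulative.append(running)
    # Smallest m whose cumulative contribution reaches the threshold;
    # fall back to all components if rounding keeps the sum just below it.
    m = next((i + 1 for i, c in enumerate(cumulative) if c >= threshold),
             len(ordered))
    return m, rates, cumulative
```

The first $m$ eigenvectors, in the same descending order, then form the columns of the transformation matrix.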
3.5. Contrast Feature. The contrast feature reflects the degree of difference between a region and its adjacent regions. The contrast feature $w_c(Sp_i)$ of the superpixel $Sp_i$ is defined by its distance from the features of the other superpixels, as given in equation (14), where $n$ denotes the number of superpixels and $\|\cdot\|_{2}$ is the 2-norm.
3.6. Spatial Feature. The human visual system pays different amounts of attention to different spatial positions. The distance between pixels at different positions and the image center is modeled with a Gaussian distribution. For any superpixel $Sp_i$, its spatial feature $w_s(Sp_i)$ is calculated from $Sp_{i,x}$, the central coordinate of superpixel $Sp_i$, and $c$, the central region. The smaller the average distance from the image center, the larger the spatial feature. The saliency value of the superpixel $Sp_i$ is then computed from these features. We obtain the SM of the first segmented image and use it as the object prior knowledge to guide the proposal region extraction and optimization.
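A common way to express such a center prior is a Gaussian falloff with the distance from the image center. The sketch below assumes that form; the parameter `sigma` and the function name `spatial_feature` are illustrative and are not taken from the paper.

```python
import math


def spatial_feature(sp_center, image_center, sigma):
    """Gaussian weight that decreases with distance from the image center.

    sp_center:    (x, y) central coordinate of the superpixel.
    image_center: (x, y) coordinate of the image center.
    sigma:        spread of the Gaussian (hypothetical parameter).
    """
    dx = sp_center[0] - image_center[0]
    dy = sp_center[1] - image_center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
```

A superpixel located exactly at the center receives the weight 1, and the weight decays smoothly toward the image border.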

Saliency Detection Based on Low-Rank Sparse Decomposition.
The background in the image can be expressed as a low-rank matrix, and the saliency region can be regarded as sparse noise. For an original image, the eigenmatrix $F = [f_1, f_2, \cdots, f_N] \in R^{d \times N}$ and the eigentransformation matrix $T$ are obtained. Then, we use the FCNN to obtain the high-level prior knowledge $P$. The low-rank sparse decomposition of the transformed matrix is carried out by robust PCA.

$$\min_{L,S} \|L\|_{*} + \lambda \|S\|_{1} \quad \text{s.t.} \quad TFP = L + S,$$
where $F$ is the eigenmatrix, $T$ is the learned eigentransformation matrix, $P$ is the high-level prior knowledge matrix, $L$ is a low-rank matrix, $S$ represents the sparse matrix, and $\lambda$ balances the two terms. $\|\cdot\|_{*}$ represents the nuclear norm of the matrix, that is, the sum of all singular values of the matrix. $\|\cdot\|_{1}$ represents the $\ell_1$-norm of the matrix, that is, the sum of the absolute values of all the elements in the matrix. Supposing that $S^{*}$ is the optimal solution for the sparse matrix, the saliency map can be calculated by the following equation:
$$Sal(p_i) = \|S^{*}(:, i)\|_{1},$$
where $Sal(p_i)$ represents the saliency value of superpixel $p_i$ and $\|S^{*}(:, i)\|_{1}$ represents the $\ell_1$-norm of the ith column vector of $S^{*}$, that is, the sum of the absolute values of all the elements in the vector.
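Once the sparse component is available from a robust PCA solver, reading off the saliency values is a column-wise sum of absolute values. The sketch below assumes the sparse matrix is given as a list of rows; the function name `saliency_from_sparse` is hypothetical.

```python
def saliency_from_sparse(S):
    """Column-wise l1 norms of the sparse matrix.

    S is a d x N matrix stored as a list of d rows; column i belongs to
    superpixel p_i. Returns a list of N saliency values.
    """
    if not S:
        return []
    n_cols = len(S[0])
    return [sum(abs(row[i]) for row in S) for i in range(n_cols)]
```

Columns that are entirely zero correspond to superpixels explained by the low-rank background and therefore receive a saliency of zero.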

Saliency Map Fusion Based on Weight Cellular Automata.
Wang and Wang [29] adopted the multilayer cellular automata (MCA) for object fusion. Each pixel represents a cell. In the m-layer cellular automata, a cell of one saliency map has m-1 neighbors, which are located at the same position in the other saliency maps.
If cell $i$ is labeled as foreground, the foreground probability of its neighbor $j$ at the same position in the other SMs is $\lambda = P(\eta_i = +1 \mid i \in F)$. Saliency maps obtained by different methods are considered to be independent, and when updating synchronously, all saliency maps are considered to have the same weight. However, there are guiding and refining relationships between the saliency maps at different segmentation scales, so the weights cannot be treated as equal during the fusion process. It is assumed that the weight of the SM obtained at the first segmentation scale is $\lambda_1$, represented by $w_i = \lambda_1$. In the expression for the SM weights at the other scales, $O_i$ denotes the total pixel number in the proposal object set and $o_i$ is the superpixel number in the ith saliency map. We set $\lambda_1 = 1$. The synchronous updating mechanism $f: Map^{M-1} \rightarrow Map$ is defined on $Map^{t}_{m} = [Map^{t}_{m,1}, \cdots, Map^{t}_{m,H}]^{T}$, which represents the saliency values of all the cells of the mth SM at time $t$. Matrix $I$ is a matrix with $H$ elements. If the neighbor of a cell is judged as foreground, then the saliency value of the cell should be increased. We obtain the final saliency map by formula (21). T2 is the next time.
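The following sketch illustrates a weighted synchronous update in the spirit of multilayer cellular automata. Several choices are assumptions made for illustration only: the log-odds update form, using the mean of each map as its foreground threshold, multiplying each neighbor's vote by its weight, and averaging the maps with the same weights at the end. The names `fuse_maps`, `lam`, and `steps` are hypothetical.

```python
import math


def logit(x, eps=1e-6):
    """Log-odds of x, clipped so that 0 and 1 stay finite."""
    x = min(max(x, eps), 1.0 - eps)
    return math.log(x / (1.0 - x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_maps(maps, weights, lam=0.6, steps=5):
    """Fuse M flattened saliency maps with values in [0, 1].

    maps:    list of M lists, each with H cell values.
    weights: one positive weight per map.
    lam:     probability that a neighbor of a foreground cell is foreground.
    """
    gain = math.log(lam / (1.0 - lam))
    current = [list(m) for m in maps]
    for _ in range(steps):
        # Mean value of each map as its foreground threshold (assumption).
        thresholds = [sum(m) / len(m) for m in current]
        updated = []
        for m_idx, m in enumerate(current):
            new_map = []
            for h, value in enumerate(m):
                vote = 0.0
                for k_idx, other in enumerate(current):
                    if k_idx == m_idx:
                        continue
                    sign = 1.0 if other[h] > thresholds[k_idx] else -1.0
                    vote += weights[k_idx] * sign
                # Foreground votes raise the cell's log-odds, others lower it.
                new_map.append(sigmoid(logit(value) + gain * vote))
            updated.append(new_map)
        current = updated
    total_w = sum(weights)
    return [sum(w * m[h] for w, m in zip(weights, current)) / total_w
            for h in range(len(current[0]))]
```

When the maps agree on a cell, its value is pushed toward 0 or 1 over the iterations; when they disagree, the weighted votes partly cancel and the value changes less.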
The proposed deep multiscale fusion method for object saliency detection is summarized in Algorithm 1.
Input: Raw image I, multiscale segment number N, and segment parameter at each scale.
Output: Saliency map.
for i = 1 : N {
  if i = 1 then
    (1) Segment image I with SLIC according to the determined parameters;
    (2) Determine the input regions Rect_self, Rect_local, Rect_global of each superpixel;
    (3) Feed these regions into GoogleNet to extract the deep features F_self, F_local, F_global;
    (4) Assemble the deep features of all superpixels into a matrix W, and calculate the transformation matrix A of W by PCA to obtain the principal component features;

Experiments and Analysis
In this section, the experimental data are obtained from Google Earth. The remote sensing image size ranges from 512 × 512 pixels to 2048 × 2048 pixels, and the spatial resolution is 1 m. The experimental environment is an Intel(R) Core(TM) i7-8750 CPU at 2.2 GHz and a GeForce GTX 1060, with the MATLAB 2017a platform.

Evaluation Index and Parameter Setting.
In the experiment, the PR curve, F-measure, and mean absolute error (MAE) of the saliency map are compared to evaluate the effect of saliency detection and to select a better segmentation scale. Precision and Recall are the two most commonly used evaluation criteria in image saliency detection. The higher the PR curve, the better the saliency detection effect. For the manually labeled Ground Truth $G$ and the saliency map $S$, Precision and Recall are defined in equation (22):
$$Precision = \frac{sum(S, G)}{sum(S)}, \quad Recall = \frac{sum(S, G)}{sum(G)}, \quad (22)$$
where $sum(S, G)$ represents the sum of the pixelwise products of the visual feature map $S$ and $G$, $sum(S)$ is the sum of all pixels in the visual feature map $S$, and $sum(G)$ represents the sum of all pixels in $G$.
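As a small illustration of these two definitions, the sketch below computes precision and recall for maps stored as nested lists. The function name `precision_recall` and the convention of returning 0 for an empty denominator are assumptions.

```python
def precision_recall(S, G):
    """Precision and recall of a saliency map against the ground truth.

    S: 2-D list with saliency values (binary or in [0, 1]).
    G: 2-D list with 0/1 ground-truth labels, same shape as S.
    """
    overlap = sum(s * g for row_s, row_g in zip(S, G)
                  for s, g in zip(row_s, row_g))
    sum_s = sum(sum(row) for row in S)
    sum_g = sum(sum(row) for row in G)
    precision = overlap / sum_s if sum_s else 0.0
    recall = overlap / sum_g if sum_g else 0.0
    return precision, recall
```

Sweeping a binarization threshold over the saliency map and recording the pair at each threshold yields the points of a PR curve.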
When calculating the F-measure, the adaptive threshold $T$ of each image is used to segment the image, where $W$ and $H$ denote the width and height of the image, respectively. The average precision and recall of the SM are calculated, and the average F-measure value is calculated according to equation (24):
$$F_{\beta} = \frac{(1 + \beta^{2})\, Precision \times Recall}{\beta^{2}\, Precision + Recall}. \quad (24)$$
A higher F-measure value indicates a better saliency detection effect. The F-measure is used for the comprehensive evaluation of precision and recall. $\beta^{2}$ is often set to 1.
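The F-measure follows directly from precision and recall. The sketch below uses the standard weighted harmonic mean with $\beta^2$ as a parameter; returning 0 when both inputs are 0 is a convention assumed here.

```python
def f_measure(precision, recall, beta2=1.0):
    """Weighted harmonic mean of precision and recall.

    beta2 is the squared weight beta^2; 1.0 treats both terms equally.
    """
    denom = beta2 * precision + recall
    if denom == 0:
        return 0.0
    return (1.0 + beta2) * precision * recall / denom
```

With `beta2=1.0` the value is the ordinary harmonic mean, so it is high only when precision and recall are both high.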
MAE evaluates the saliency model by comparing the difference between the SM and the GT. We use formula (25) to compute the MAE value of each input image. The calculated MAE values can be used to draw a histogram. A lower MAE value indicates a better algorithm.
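A minimal sketch of the MAE computation, assuming the saliency map and the ground truth are nested lists of the same shape with values in [0, 1]; the function name is illustrative.

```python
def mean_absolute_error(S, G):
    """Average absolute per-pixel difference between the SM and the GT."""
    total = 0.0
    count = 0
    for row_s, row_g in zip(S, G):
        for s, g in zip(row_s, row_g):
            total += abs(s - g)
            count += 1
    return total / count if count else 0.0
```

Unlike precision and recall, this measure also penalizes nonzero saliency assigned to background pixels.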
4.2. Segment Scale Determination. The main parameter of this algorithm is the segmentation scale. Too many segmentation scales increase the computational complexity, while too few affect the accuracy of saliency detection. Therefore, 15 segmentation scales are set according to experience. We conduct experiments on randomly selected remote sensing image data. Then, we extract the depth features of all superpixels in the segmentation maps and calculate the saliency map. The histogram of the PR curve with different segmentation scales is shown in Figure 5. Through comparative analysis, it is found that the three segmentation scales 10, 11, and 12 have a relatively better saliency detection effect. These three segmentation scales are selected as the final segmentation scales of the proposed method.

PCA Parameter Determination.
To verify the effectiveness of PCA on selecting principal components from depth features, this section adopts the depth features extracted from each superpixel block as the data set. The percentage of explained variance (PEV) is used to measure the importance of the principal component in the overall data as formula (26). PEV is a main index to describe the distortion rate of data.
Here $R^{2}_{ii}$ is the right matrix of the principal component matrix $M'$ after singular value decomposition, and $\Sigma$ denotes the covariance matrix. Figure 6 shows the relation between PEV and the top 50 principal components. The PEV shows an upward trend, but it grows slowly. When the number of principal components exceeds 20, the PEV reaches 90%, which is considered to represent the overall information of the data. In this paper, the top 20 principal components are selected for saliency calculation.
Figure 7.
Figure 8 shows the comparison of saliency detection results with different methods. It can be seen that the detection effect of this algorithm is clearly better than that of the other algorithms. Table 1 gives the F-measure results. As Recall changes, the Precision of the method in this paper stays at a high level. In terms of F-measure value, our method is 7.18% higher than the second-best method. Under complex background information, both the PR curve and the F-measure value of the proposed method are significantly higher than those of the other algorithms. This demonstrates the advantages of the proposed algorithm on relatively complex image information. Similarly, the MAE of the proposed algorithm is lower than that of the other algorithms. Figures 9-14 are the subjective evaluation results for the six objects.
We also adopt IoU (Intersection-Over-Union) to illustrate the effectiveness of the proposed method [35,36]. The IoU is the ratio of the area of the intersection of the detected region and the ground-truth region to the area of their union. A greater IoU indicates a better effect. The results are shown in Table 2. From Table 2, we can see that our proposed method has a better saliency detection effect than the other methods.
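For binary masks, the IoU can be computed as in the sketch below. The function name `iou` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def iou(pred, truth):
    """Intersection-over-Union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (assumed convention).
    return inter / union if union else 1.0
```

The saliency map has to be binarized first, for example with the same adaptive threshold used for the F-measure.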
There are also apparent differences in the detection time among different algorithms. In terms of the speed of saliency detection, the proposed method is faster than the other methods, as shown in Figure 15. Although deep learning-based algorithms need to be trained on many samples, the processing efficiency of the proposed method is improved by about 4% compared with other deep learning methods. Overall, the deep multiscale fusion method has a better effect on saliency detection for remote sensing images.

Conclusions
Saliency detection algorithms based on DL can overcome the shortcomings of traditional saliency detection algorithms. However, their detection efficiency is clearly insufficient. Therefore, we present a deep multiscale fusion method for object saliency detection in optical remote sensing images based on urban data. Through deep feature extraction, we calculate the saliency value and use the weight cellular automata to integrate and optimize the saliency maps at each scale. The results reveal that the proposed method acquires saliency detection results more efficiently than other methods. In the future, new models based on deep learning will be studied, and the new methods will be applied to practical engineering.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.