An Extremely Effective Spatial Pyramid and Pixel Shuffle Upsampling Decoder for Multiscale Monocular Depth Estimation

To estimate accurate depth from a single image, we propose a novel and effective depth estimation architecture that addresses the missing and blurred contours of small objects in the depth map. The architecture consists of Extremely Effective Spatial Pyramid (EESP) modules and Pixel Shuffle upsampling Decoders (PSD). The results of this study show that multilevel information and the upsampling method in the decoder are essential for recovering an accurate depth map. The proposed model achieves competitive performance compared with state-of-the-art methods in terms of the reconstruction of object boundaries and the detection rate of small objects. Our approach has wide applications in higher-level visual tasks, including 3D reconstruction and autonomous driving.


Introduction
Monocular depth estimation is a long-standing task that aims to predict the continuous depth value of each pixel from a single RGB image. This task has a wide range of applications in various fields, such as scene understanding [1], scene segmentation [2], 3D reconstruction [3], and simultaneous localization and mapping (SLAM) [4]. Traditional image-based depth estimation methods are usually based on a binocular camera: the disparity between the two 2D images is computed through stereo matching, and triangulation then yields a depth map. However, the binocular depth estimation method requires at least two fixed cameras, and it is difficult to capture enough features in the image to match when the scene has little or no texture. Therefore, researchers have turned their attention to monocular depth estimation. Monocular depth estimation uses only one camera to obtain an image and does not require additional complicated equipment or professional techniques. Thus, there has been an increasing demand for monocular depth estimation in recent years.
With the great success of deep learning, researchers have studied various monocular depth estimation methods based on convolutional neural networks (CNNs). Eigen et al. [5] first proposed a multiscale deep network to regress dense depth maps from coarse to fine. Laina et al. [6] proposed a fully convolutional residual network, which explores a new upsampling method to obtain more accurate depth predictions. Liu et al. [7] proposed a deep convolutional neural field, which combines a deep convolutional neural network with a continuous conditional random field to extract the structural information of features. In recent years, there have been new advances in monocular depth estimation. Wofk et al. [8] proposed an efficient and lightweight encoder-decoder network architecture and applied network pruning to further reduce computational complexity and latency for real-time monocular depth estimation. In [9], an attention mechanism and a multiscale feature fusion dense pyramid were used to further improve the estimated depth maps of distant small-scale objects. In [10], an adversarial loss was introduced into the training stage of self-supervised depth estimation to optimize the depth network with high-level information.
Although monocular depth estimation research has made great progress, some problems remain, such as missing small objects and distorted object shapes. Some works address these issues by introducing auxiliary information such as geometric constraints and semantic information [11]. Others use feature fusion methods to improve the depth estimation of small objects [9]. CNN-based methods use a convolutional neural network to extract image features and then regress the features into image depth values, so the quality of the extracted features directly affects the quality of monocular depth estimation. Inspired by this observation, we propose a novel U-shaped network based on EESP skip connection modules and an upsampling method based on PSD modules. These modules result in fewer parameters, clearer object contours, less object distortion, and fewer missing small objects. The main contributions can be summarized as follows:

U-Shaped Networks.
Since the FCRN [6] network was proposed, the encoder-decoder structure has been adopted in most monocular depth estimation methods. The encoder downsamples the input image several times to extract the global features, and the decoder upsamples the feature map to obtain the depth map. However, when only the final high-level features are used to estimate the depth of each pixel, the performance is unsatisfactory because local detail information is lacking. Therefore, more recently, U-shaped networks such as U-Net have become common; in these, the decoder utilizes features from all layers through shortcut connections between the corresponding decoder layers and encoder layers. In [8], the multilevel features extracted from the encoder are directly added to the decoder. Their experimental results demonstrate that enhancing local features improves the accuracy of local depth in the prediction maps. In contrast, in this paper, the spatial pyramid EESP module is added to the skip connections of the U-Net network structure to fuse features of different scales from different encoder layers.

Upsampling Methods.
Depth estimation models usually use a backbone network such as ResNet [12] to extract the features of the input image, and the resolution of the output features is usually small, which limits the resolution of the depth prediction map. Therefore, it is necessary to use an encoder-decoder structure with upsampling operations in the decoder to improve the resolution of the features. Commonly used upsampling methods are interpolation, deconvolution, and pixel shuffle. Bilinear interpolation is the most commonly used interpolation method [13,14], but it dilutes the feature information, which blurs the image edges and ultimately degrades the depth estimation. Deconvolution is another popular upsampling method. Its advantage is that the upsampling can be improved through training, but multiple deconvolutions may produce artifacts [15]. Therefore, some works [6,8] add convolution layers after deconvolution to alleviate, to a certain extent, the artifacts caused by successive deconvolutions.
Pixel shuffle, also known as subpixel convolution, is widely used in the field of image super-resolution [16,17]. This method improves the resolution of the feature maps by reducing the number of feature channels, which also results in fewer parameters for subsequent convolution operations. Pixel shuffle retains all the feature information, which better alleviates the edge blur and artifacts caused by information loss. Because of these advantages, some researchers have applied pixel shuffle to other computer vision areas, such as image reconstruction [18] and semantic segmentation [19]. In order to improve the resolution of the output feature maps and further learn feature fusion, this paper uses the pixel shuffle operation in the decoder. By rearranging the input features, pixel shuffle moves values from different channels into a single channel, so it also plays the role of feature fusion.
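As a rough illustration of the rearrangement described above, the following Python sketch implements pixel shuffle on nested lists. The channel ordering follows the common subpixel convolution convention and is an assumption here, since the text does not spell out the layout; the function name `pixel_shuffle` and the toy input are illustrative only.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumed layout (common sub-pixel convention): output pixel
    (c, h*r + i, w*r + j) is read from input channel c*r*r + i*r + j.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for hh in range(h):
                    for ww in range(w):
                        out[c][hh * r + i][ww * r + j] = src[hh][ww]
    return out


# Four 1x1 channels become one 2x2 channel; every value is kept,
# nothing is averaged or interpolated.
features = [[[1]], [[2]], [[3]], [[4]]]
shuffled = pixel_shuffle(features, 2)
```

Every input value appears exactly once in the output, which is the sense in which pixel shuffle retains all the feature information while trading channels for resolution.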

Depthwise Separable Convolution.
The depthwise separable convolution [20] consists of two parts: a depthwise convolution and a pointwise convolution. The depthwise convolution convolves each channel independently, while the pointwise convolution fuses features across channels. The depthwise separable convolution is widely used in lightweight networks such as MobileNet [21] and FastDepth [8] because it has fewer parameters. Moreover, compared with the standard convolution, the depthwise separable convolution pays more attention to the fusion of features within a single channel. Since the pixel shuffle operation improves the resolution by reshaping multiple channels into one channel, this paper applies the depthwise separable convolution to the output of the pixel shuffle layer, both to reduce the number of network parameters and to improve the feature fusion effect of pixel shuffle. In addition, the dilated convolution [22] can increase the receptive field of the convolution kernel without introducing extra parameters. Therefore, in this paper, depthwise dilated separable convolutions are used in the decoder to obtain a larger receptive field and multiscale fused features while keeping the number of parameters as small as possible.
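The parameter saving mentioned above can be checked with a short calculation. The sketch below counts weights only (biases and normalization layers are ignored, which is an assumption), and the channel counts in the example are arbitrary illustrative values rather than figures from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise convolution plus a 1 x 1 pointwise one."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative case: 64 input and 64 output channels, 3 x 3 kernel.
standard = standard_conv_params(64, 64, 3)          # 36864
separable = depthwise_separable_params(64, 64, 3)   # 4672
```

The dilation rate does not appear in either formula: dilation spreads the same k × k weights over a wider area, so the receptive field grows at no parameter cost.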

3.1. The Overview of Our Method.
In recent years, the U-shaped network [8] has been commonly used in most monocular depth estimation methods. On this basis, we propose the pixel shuffle encoder-decoder convolutional neural network (PSDNet). As shown in Figure 1, PSDNet uses ResNet [12] as the backbone network to extract features from the input image. Three EESP connection modules transfer information from three residual blocks of the feature extraction backbone to the upsampling modules of the decoder network. The decoder network contains four upsampling modules, namely, the pixel shuffle decoders (PSD). In addition, residual connections between every two adjacent PSD modules, each consisting of bilinear interpolation and a 5 × 5 convolution layer, are designed to enhance information sharing and alleviate the vanishing gradient problem. Finally, the depth map is obtained by applying a 3 × 3 convolution layer to the output of the last PSD module, and the resolution of the prediction map is then increased to the size of the input image using bilinear interpolation.

EESP Connection Module.
Motivated by the spatial pyramid feature fusion method of a lightweight network [23], we design different EESP skip connection modules to bring more comprehensive information to the decoder. The dilated convolution [22] increases the receptive field while avoiding a surge in the number of parameters. The depthwise separable convolution [20,21] also reduces the number of parameters by factorizing a convolution into a depthwise convolution and a pointwise convolution. Combining the two methods both increases the receptive field and reduces the number of parameters, which makes the module much lighter and more efficient than other feature fusion methods. The EESP module extracts features using depthwise separable convolutions with different dilation rates and fuses the extracted features using the hierarchical feature fusion (HFF) method [23]. In HFF, feature maps from the branch with the lowest dilation rate are combined with the feature maps from the branch with the next highest dilation rate, and then all the features are concatenated and input into a 1 × 1 convolution layer to further fuse the features (see Figure 2). HFF enhances the convolutions with small dilation rates and thus can effectively alleviate the grid artifacts [15] caused by dilated convolutions. In this paper, the EESP module is added to the skip connections of the U-Net structure to supply lower-level features to the decoder. Because the residual blocks in the backbone network output features of different resolutions, different EESP modules are designed to make full use of these features and improve the performance of depth estimation.
As shown in Figure 1, there are four residual blocks in the encoder. The output features of all but the last residual block are fed into three EESP skip connection modules (denoted as EESP3, EESP2, and EESP1), respectively, which connect to the decoder. Features extracted from shallower residual blocks have higher resolution and fewer channels, so they are considered to contain more local information reflecting the depth of details. The deeper features, on the other hand, have lower resolution and thus carry more global depth information. In order to balance the local and global information, the dilation rates of the EESP connection modules connecting different residual blocks are set to different values. The dilation rates of the depthwise dilated separable convolutions in the EESP3 connection module are set to 1, 2, 4, and 8. The dilation rates in the EESP2 connection module are set to 1, 2, and 4, while the dilation rates in the EESP1 connection module are set to 1 and 2.
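To make the effect of these dilation rates concrete, the sketch below computes the spatial extent covered by a single dilated kernel using the standard relation k_eff = k + (k - 1)(d - 1). The dictionary simply restates the rates listed above; the helper name `effective_kernel_size` is illustrative and not from the paper.

```python
def effective_kernel_size(k, dilation):
    """Side length of the area spanned by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


# Dilation rates of the 3 x 3 branches in each EESP connection module.
EESP_DILATIONS = {
    "EESP3": [1, 2, 4, 8],
    "EESP2": [1, 2, 4],
    "EESP1": [1, 2],
}

extents = {
    name: [effective_kernel_size(3, d) for d in rates]
    for name, rates in EESP_DILATIONS.items()
}
# extents["EESP3"] is [3, 5, 9, 17]: the widest branch spans a 17 x 17 area
# while still using only nine weights per channel.
```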
The EESP connection module not only unifies the resolution and channel number of the multiscale features but also learns extra features with only a few parameters. In the main process shown in Figure 2, X denotes the input feature maps from the corresponding residual block of the backbone network, $\mathrm{DDConv}_{3\times 3,i}$ denotes a 3 × 3 depthwise dilated separable convolution with dilation rate i, where i is set to 1, 2, 4, and 8, respectively, and $\Phi_{\mathrm{HFF}}$ denotes the fusion of the features of the different branches. Feature maps from the branch with the lowest dilation rate are combined with the feature maps from the branch with the next highest dilation rate. Then, the features of all branches are concatenated as the output of HFF.
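The fusion step can be sketched in a few lines of Python. This is a minimal illustration, assuming that "combined" means element-wise addition applied cumulatively from the lowest dilation rate upward, as in the hierarchical feature fusion of [23]; each branch output is reduced to a flat list of numbers, and the function name is hypothetical.

```python
def hierarchical_feature_fusion(branch_outputs):
    """Fuse branch outputs ordered by increasing dilation rate.

    Assumption: each branch is added element-wise to the running sum of
    all branches with lower dilation rates, then everything is concatenated.
    """
    fused = [list(branch_outputs[0])]
    for branch in branch_outputs[1:]:
        previous = fused[-1]
        fused.append([a + b for a, b in zip(previous, branch)])
    # Concatenate the fused branches along the channel axis.
    return [value for branch in fused for value in branch]


# Toy outputs of four branches (dilation rates 1, 2, 4, 8), two values each.
branches = [[1, 1], [2, 2], [4, 4], [8, 8]]
hff_output = hierarchical_feature_fusion(branches)
```

In the module itself, the concatenated result would then pass through the 1 × 1 convolution layer mentioned earlier, which is omitted here.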

Pixel Shuffle Decoder.
Pixel shuffling is widely used in the field of image super-resolution [17]. By rearranging the input features, pixel shuffling not only upsamples the features but also reduces the number of channels, which in turn greatly reduces the parameters of the subsequent convolution layer. In addition, unlike other upsampling methods (such as deconvolution [15]), it has no parameters of its own. Therefore, we design the PSD module based on pixel shuffling. The structure of the PSD modules (denoted as PSD2, PSD3, and PSD4 in Figure 1) is shown in Figure 3. The PSD module first adds the features from the previous PSD module and the output features of the corresponding EESP connection module. PSD1 (see Figure 1) has only one input, the deepest-level features output by the last residual block; the rest of its structure is the same as in Figure 3. A 1 × 1 convolution layer then doubles the number of channels of the summed features, and the pixel shuffle unit rearranges the result into a feature map of size 2H × 2W × C/2; that is, for an H × W × C input, the height and width are doubled, while the number of channels is reduced to one-fourth of that after the 1 × 1 convolution (from 2C to C/2). At the same time, pixel shuffle achieves the effect of feature fusion by rearranging the feature values. However, pixel shuffling also destroys the connections between feature values in each channel, so this paper uses depthwise separable convolution to reconstruct these connections. In the proposed PSD module, after the pixel shuffle layer, a 5 × 5 depthwise separable convolution and a 3 × 3 depthwise separable convolution with a dilation rate of 2 are applied in parallel to establish new connections in each channel.
Computational Intelligence and Neuroscience
The feature maps of these two convolution branches are summed and then further fused using a 3 × 3 convolution to produce the final output of the PSD module. In this process, x denotes the input feature maps from the previous PSD module and its residual connection, Y denotes the input features from the EESP module, and $\varphi_{ps}$ denotes the pixel shuffle operation.
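The shape bookkeeping of one PSD module can be traced with a small helper. This is a sketch under the assumptions that the input has shape H × W × C, that the 1 × 1 convolution doubles the channels to 2C, and that the pixel shuffle uses an upscale factor of 2; the function name and the example sizes are illustrative and not taken from the paper.

```python
def psd_output_shape(h, w, c, r=2):
    """Shape after the 1 x 1 channel-doubling convolution and pixel shuffle."""
    expanded = 2 * c                      # 1 x 1 convolution: C -> 2C
    if expanded % (r * r) != 0:
        raise ValueError("expanded channels must be divisible by r*r")
    # Pixel shuffle trades r*r channels for an r-fold gain in height and width.
    return (h * r, w * r, expanded // (r * r))


# Illustrative input of size 8 x 10 with 512 channels.
shape = psd_output_shape(8, 10, 512)      # (16, 20, 256)

# Chaining four modules doubles the resolution four times (16x overall).
h, w, c = 8, 10, 512
for _ in range(4):
    h, w, c = psd_output_shape(h, w, c)
```

Under these assumptions each module halves the channel count relative to its input (C to C/2), because the doubling and the division by four combine.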

Loss Function.
In order to improve the sharpness of the object edges in the predicted depth map, the loss function $L_{total}$ is composed of three parts: the $L_1$ loss function, the gradient loss function, and the $L_2$ regularization term, as shown in equation (5). In our experiments, α is set to 0.5 and β to 0.0001.
The $L_1$ loss function calculates the absolute error between the predicted value $\hat{y}_i$ and the true value $y_i$, as shown in equation (2), where i is the pixel index and N denotes the number of pixels in the depth map. It measures the overall error between the prediction and the ground truth, so minimizing it makes the predicted depth map accurate.
The gradient loss function is shown in equation (5), where $\nabla_h \hat{y}_i$ and $\nabla_v \hat{y}_i$ denote the horizontal and vertical gradients of the predicted depth map, respectively, and $\nabla_h y_i$ and $\nabla_v y_i$ denote the horizontal and vertical gradients of the true depth map, respectively. The gradient of the depth map reflects the rate of change of the depth values. This loss function is designed to make the depth changes in the prediction map more realistic.
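A minimal sketch of such a loss on small depth maps stored as nested lists is given below. Several details are assumptions rather than the paper's exact definitions: the total is taken as L1 + α · gradient loss + β · regularization, gradients are forward differences, the gradient loss is the mean absolute difference of the horizontal and vertical gradients, and the regularization term is passed in as a precomputed number.

```python
def l1_loss(pred, true):
    """Mean absolute error over all pixels."""
    n = len(pred) * len(pred[0])
    return sum(abs(p - t)
               for pred_row, true_row in zip(pred, true)
               for p, t in zip(pred_row, true_row)) / n


def gradients(depth):
    """Forward differences; the last column and row get a zero gradient."""
    h, w = len(depth), len(depth[0])
    grad_h = [[depth[r][c + 1] - depth[r][c] if c + 1 < w else 0.0
               for c in range(w)] for r in range(h)]
    grad_v = [[depth[r + 1][c] - depth[r][c] if r + 1 < h else 0.0
               for c in range(w)] for r in range(h)]
    return grad_h, grad_v


def gradient_loss(pred, true):
    """Mean absolute difference of horizontal and vertical gradients."""
    pred_h, pred_v = gradients(pred)
    true_h, true_v = gradients(true)
    return l1_loss(pred_h, true_h) + l1_loss(pred_v, true_v)


def total_loss(pred, true, l2_reg=0.0, alpha=0.5, beta=1e-4):
    """Assumed combination: L1 + alpha * gradient loss + beta * L2 term."""
    return l1_loss(pred, true) + alpha * gradient_loss(pred, true) + beta * l2_reg


# A flat prediction misses the depth step present in the ground truth.
true_depth = [[1.0, 2.0], [1.0, 2.0]]
flat_pred = [[1.0, 1.0], [1.0, 1.0]]
loss = total_loss(flat_pred, true_depth)
```

In this toy case the gradient term penalizes the missing depth edge in addition to the per-pixel error, which is the behavior the gradient loss is meant to encourage.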

Evaluation Metrics.
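The following evaluation metrics are used to measure the performance of monocular depth estimation methods: absolute relative error (rel, lower is better), root mean squared error (rms, lower is better), mean log10 error (log10, lower is better), and threshold accuracy (δ1, δ2, and δ3, higher is better). With $\hat{y}_i$ the predicted depth, $y_i$ the true depth, and N the number of pixels, the metrics take their standard forms:

$$\mathrm{rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{y}_i - y_i|}{y_i}, \qquad \mathrm{rms} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2},$$

$$\log_{10} = \frac{1}{N}\sum_{i=1}^{N}\left|\log_{10}\hat{y}_i - \log_{10} y_i\right|, \qquad \delta_k = \frac{1}{N}\left|\left\{ i : \max\left(\frac{\hat{y}_i}{y_i}, \frac{y_i}{\hat{y}_i}\right) < 1.25^k \right\}\right|, \quad k = 1, 2, 3.$$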
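These metrics can be computed as in the following sketch, which uses their standard definitions on flat lists of strictly positive depths. The threshold base of 1.25 is the conventional choice for the threshold accuracy, and the function name and sample values are illustrative.

```python
import math


def depth_metrics(pred, true):
    """Standard depth metrics for two equal-length lists of positive depths."""
    n = len(pred)
    pairs = list(zip(pred, true))
    rel = sum(abs(p - t) / t for p, t in pairs) / n
    rms = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    log10 = sum(abs(math.log10(p) - math.log10(t)) for p, t in pairs) / n
    ratios = [max(p / t, t / p) for p, t in pairs]
    # Fraction of pixels whose ratio is below 1.25, 1.25^2, and 1.25^3.
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {"rel": rel, "rms": rms, "log10": log10,
            "delta1": deltas[0], "delta2": deltas[1], "delta3": deltas[2]}


# Two exact predictions, one 2x overestimate, and one 1.5x overestimate.
metrics = depth_metrics([1.0, 2.0, 4.0, 3.0], [1.0, 2.0, 2.0, 2.0])
```

For this sample, rel is 0.375, δ1 is 0.5, and δ2 and δ3 are both 0.75, since the 2x overestimate exceeds even the loosest threshold of 1.25³ ≈ 1.95.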

Compared with Other Advanced Methods.
The depth estimation results on the NYU Depth V2 dataset of our method and of other recent state-of-the-art deep-learning-based monocular depth estimation methods are reported in Table 1. The other methods in the comparison are the fully convolutional network method of Laina et al. [6], the conditional random field optimization of superpixel depth proposed by Liu et al. [7], the cascaded conditional random field depth optimization method of Xu et al. [25], the efficient and lightweight encoder-decoder network architecture proposed by Wofk et al. [8], the gradient optimization proposed by Li et al. [26], the multiscale feature fusion method of Xu et al. [27], the augmented ordinal depth relationship method of Cao et al. [28], the method based on geometric cues and scene parsing of He et al. [11], the successive encoder-decoder style subnetworks proposed by Dong et al. [29], and the attention mechanism and multiscale feature fusion method proposed by Xu et al. [9]. Different from the feature fusion method of [9], our multiscale feature fusion method transfers the features of different receptive fields to the decoder network, constructing the connections between the encoder and the decoder. Among these methods, those of Laina et al. [6], Wofk et al. [8], Xu et al. [9], Cao et al. [28], Dong et al. [29], and this paper do not perform additional refinement steps, while Liu et al. [7], Xu et al. [25], and Xu et al. [27] used conditional random fields for postprocessing. Li et al. [26] used image gradients for depth map optimization, and He et al. [11] used the geometric constraints and semantic information of the scene to alleviate the ambiguity in monocular depth estimation. As can be seen from the results in Table 1, the method in this paper achieves competitive results on all metrics.
Figure 4 illustrates some depth estimation results of our method and of Laina et al. [6], where column (a) shows the input RGB image, column (b) the predicted depth map of Laina et al. [6], column (c) the predicted depth map of this paper, and column (d) the ground-truth depth map. The brighter the color of a pixel, the smaller its depth value; the darker the color, the larger the depth value. The prediction maps in Figure 4 show that our method performs better than that of Laina et al. [6] in terms of local depth values. For example, the depth estimation results marked by the blue rectangular box in column (c) indicate that our method better predicts the depth of small objects such as chair legs and table lamps. The results within the green box in column (c) show that the prediction map of our method has clearer edges than that of Laina et al. [6]. Specifically, the outline of the rocking chair in the second row can be clearly seen in our predicted depth map.

Ablation Experiments
In this section, we conduct experiments to illustrate the effectiveness of each component of our PSDNet.

Comparison of Different EESP Modules.
In this experiment, we compare the depth estimation effects of several other structure options of the three EESP modules (see Figure 1), and the experimental results are shown in Table 2. In Table 2, "3-EESP" indicates that the three EESP connection modules adopt the same design as in Figure 2; that is, they all use four depthwise separable convolutions with dilation rates of 1, 2, 4, and 8, respectively. "EESP-4" indicates one depthwise separable convolution with the dilation rate of 1 for EESP1, two depthwise separable convolutions with dilation rates of 1 and 2, respectively, for EESP2, and three depthwise separable convolutions with dilation rates of 1, 2, and 4, respectively, for EESP3. "EESP-HFF" indicates that the HFF fusion is removed from the EESP modules, and others remain the same as in our proposed design, in which EESP1 has two depthwise separable convolutions with dilation rates of 1 and 2, respectively, EESP2 has three depthwise separable convolutions with dilation rates of 1, 2, and 4, respectively, and EESP3 has four depthwise separable convolutions with dilation rates of 1, 2, 4, and 8, respectively.
From the comparison with "3-EESP" and "EESP-4," it can be seen that our proposed design scheme obtains the best performance because features of different receptive fields are fused in a complementary fashion. Compared with the results of "EESP-HFF," the rms is improved after using HFF fusion, while the rel remains unchanged. Since the rel is more sensitive to areas of smaller depth than the rms, the improvement in rms suggests that HFF fusion slightly improves the prediction accuracy in areas of larger depth.

Comparison of Different Feature Fusion Methods in PSD Modules.
In order to study how different fusion methods after pixel shuffling in the PSD modules affect the performance of the model, we compare different design options. The results are compared in Table 3, where "A" indicates that only a standard 5 × 5 convolution is used for feature fusion after pixel shuffling; "B" indicates that only a depthwise separable convolution with a kernel size of 5 × 5 and a dilation rate of 1 is used for feature fusion after pixel shuffling; "C" means that, after pixel shuffling, the features are fed in parallel into a 3 × 3 depthwise separable convolution with a dilation rate of 2 and a 5 × 5 depthwise separable convolution with a dilation rate of 1, and the outputs of these two branches are simply added to form the resulting features; "D" is similar to the method proposed in Figure 3, except that the dilation rate of the 3 × 3 depthwise dilated separable convolution is reduced to 1. The results in Table 3 show that the "proposed" structure in Figure 3 has the best performance. This suggests that the proposed method can further integrate the features of different receptive fields and reconstruct the relationship between different channels.

Ablation Experiments on Decoder Network.
The decoder mainly contains three parts: the EESP connection modules, the PSD modules, and the residual connections. This section verifies the effects of the three modules on the depth estimation performance through ablation experiments. The experimental results are shown in Figure 5. Compared with the predicted depth maps in column (b) of Figure 5, the sharpness of object edges in the predicted depth maps in column (c) is significantly improved, and the depth prediction for small objects, such as table lamps and chair legs, is enhanced, which indicates that the EESP connection module can effectively improve the depth prediction performance of the network for object edges and small objects. Comparing the area in the blue box in column (c) of the last row in Figure 5 with the corresponding areas in columns (b) and (e), it can be seen that the EESP connection modules disrupt the continuity of depth values in the depth prediction maps, while the combination of the EESP modules and the residual connections can effectively improve the continuity of the predicted depth values. Compared with the predicted depth maps in columns (b), (c), and (d) in Figure 5, the areas marked by red and green boxes in column (e) show that the predicted depth maps obtained by using these modules jointly not only have clearer object contours but also greatly reduce the problem of missing small objects.

Conclusion
A depth estimation encoder-decoder architecture based on spatial pyramid EESP modules and pixel shuffle is proposed in this paper to address the problems of object distortion and missing small objects in monocular depth estimation. The spatial pyramid EESP modules are used to fully utilize the features of different scales. The proposed pixel shuffle decoder upsamples the features extracted from the backbone network and generates the depth prediction map by fusing the features of different scales step by step. Compared with other state-of-the-art methods, the depth map estimated by our method has clearer object contours, less object distortion, and fewer missing small objects. The experimental results demonstrate the role of the EESP connection module and the residual connection in feature fusion and verify the reliability of our method in solving the problem of missing and blurred contours of small objects. However, the depth estimation performance of our method is unsatisfactory in areas with very small depth values and on objects that reflect or refract light, such as mirrors, so the next step will be to try masking and other methods to improve the accuracy of depth estimation.

Data Availability
The data used to support the findings of this study are available at https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest.