Cascading and Residual Connected Network for Single Image Superresolution

Convolutional neural networks (CNNs) have driven significant progress in single image super-resolution (SISR). However, most existing CNN-based models suffer from large numbers of parameters and excessively deep structures. Moreover, models that rely on deep features commonly ignore the cues in low-level features, resulting in poor performance. This paper presents a network for SISR with cascading and residual connections (CASR), which alleviates these problems by extracting features in a small network, called the head module, built on depthwise separable convolution and deformable convolution. We also include a cascading residual block (CAS-Block) for the upsampling process, which benefits gradient propagation and feature learning while easing model training. Extensive experiments conducted on four benchmark datasets demonstrate that the proposed method is superior to the latest SISR methods in terms of quantitative indicators and realistic visual effects.


1. Introduction
Superresolution (SR) image reconstruction is widely used in various applications, such as military surveillance, medical diagnostics [1,2], remote sensing [3], and video streaming [4,5]. Single image superresolution (SISR) aims at reconstructing a high-resolution (HR) image from its low-resolution (LR) counterpart, which is an essential and classic task in computer vision. Recently, the demand for HR images has grown. However, physical constraints limit the acquisition of high-resolution pictures. A series of successful works has drawn the attention of the research community.
Recovering an HR image $I_{SR}$ from its LR counterpart $I_{LR}$ is an ill-posed task. Researchers have devoted considerable effort to this task and proposed numerous algorithms, including interpolation-based, reconstruction-based, and learning-based methods [1].
Traditional SISR algorithms, for instance, bicubic interpolation [6], are fast but suffer from poor accuracy and easily fail in practice. To constrain the space of possible solutions, researchers presented more advanced reconstruction-based algorithms [7,8], which introduce available prior knowledge. These algorithms may restore clear details (e.g., texture details), but extensive experiments show that they degrade sharply as the scale factor increases. Subsequently, algorithms with learnable parameters [9] were proposed to analyze the relationship between the $I_{LR}$ image and its counterpart $I_{HR}$ image by training on concrete instances [10,11]. Although such learning-based methods perform very well, the time-consuming optimization problems they involve are difficult to solve.
In recent years, CNNs have been introduced to advance the SISR field because of their excellent feature representation ability. Dong et al. [12] were the first to propose a three-stage convolutional network to solve the SISR problem, which has become a milestone in this field. Since then, the research community has set out to design more complex networks to improve performance. EDSR, a very large network with residual blocks, was presented by Lim et al. [13] and achieved satisfactory performance in both PSNR and SSIM [14]. However, these state-of-the-art methods still have some limitations:
(1) The state-of-the-art (SotA) models [13] mainly improve performance by considerably increasing the depth and width of the network. Therefore, massive parameter counts and growing resource consumption are inevitable.
(2) Many advanced models do not take full advantage of the hierarchical information from the original LR images, which is essential for improving visual performance.
To address these shortcomings, we present a model named CASR, which explores two separate strategies to extract features for precise SISR. Figure 1 shows the ×4 SR results of our proposed model on the DIV2K dataset [15]. First, we propose a small but functional depthwise separable convolution network, named the head module, aimed at more systematic feature extraction.
Second, we present a cascaded residual network (CAS-Block) for better feature and gradient propagation. With this architecture, our proposed method combines features from multiple layers at both the regional and global levels. Moreover, stacked, broader local residual connections are applied to exploit the features of $I_{LR}$ and let the large number of low-level mappings be transmitted. This scheme unites nonlocal operations to capture remote spatial features from former inputs.
As the crucial component of the presented method, the CAS-Block includes six subtrunks, each of which consists of two convolutional layers and a nonlinear pReLU activation. Because using the activation function in bottlenecks affects performance, we expand the channels before the pReLU layer to construct an inverted residual block, resulting in a performance improvement.
The three main contributions of this article are summarized as follows.
(1) We propose a head module that applies a series of depthwise separable convolution operations for feature extraction. In addition, we replace all existing conventional convolution operations with deformable convolution layers in the module. At the same time, in order to effectively retain the features, we extend the low-dimensional representation to a high-dimensional one before passing it through the activation function. This strikes a balance between the number of parameters and performance.
(2) In order to effectively improve feature fusion and gradient propagation, we introduce a cascaded block called CAS-Block. This mechanism allows our network to combine features from diverse layers. Furthermore, such a structure is also used to construct the network and promote its functional expression.
(3) We utilize the $L_1$ loss with the addition of the total variation loss $L_{TV}$ instead of the traditional sole $L_1$ loss function, which significantly improves the quality of the reconstructed image $I_{SR}$. Meanwhile, to obtain better optimized weights, we explored various parameter settings.

2. Related Works
2.1. SISR Using Deep CNNs. In the field of superresolution, compared to conventional image restoration methods, CNN-based models have a stronger feature expression ability and have achieved great success. Dong et al. [12] first proposed an algorithm named SRCNN, which is an end-to-end algorithm based on CNN. It consists of three convolutional layers, and its performance is impressive compared to traditional methods (e.g., sparse coding [7] and bicubic interpolation [6]). Later, the research community designed more intricate CNN architectures and developed deeper networks. For example, in order to increase the depth of the network, VDSR [16] introduced residual learning, and experiments showed that this strategy improves SR image quality and promotes convergence.
DRCN [17] uses deep recursion to construct a neural network and applies the same convolution kernel 16 times in the reference network, effectively reducing the number of parameters. He et al. [18], inspired by ordinary differential equations (ODEs), proposed a network named OISR, which provides a new understanding of network design. It is worth noting that most of these methods use interpolated images as input, which not only oversmooths details but also incurs additional computational cost and time consumption.

2.2. Skip Connection.
ResNet [19] was the first to adopt the concept of skip connection, and the idea was then extended to various computer vision tasks, such as image restoration [20] and semantic segmentation [21]. Since it is difficult to build extremely deep SR networks, employing various skip connections avoids the vanishing-gradient trap and boosts performance. The strategy is roughly divided into two categories, namely, global or local residual connections and dense connections. Jiang et al. [25] proposed a hierarchical dense network (HDRN) in 2019, which can effectively establish realistic mapping relationships between the LR and HR images, promoting information interaction and representation. Different from the above models, CARN also uses a cascade mechanism at the local and global levels to integrate features from multiple layers, which can reflect input representations at different levels and thus receive more information [26]. Haris et al. [27] proposed D-DBPN, which connects the features of the up- and downsampling stages and improves the SR result.

2.3. Depthwise Separable Convolution.
The cross-channel correlation and spatial correlation of a convolutional layer can be decoupled and mapped separately to achieve better results. Some lightweight networks, such as MobileNet [28], apply depthwise separable convolution, which combines a depthwise (DW) and a pointwise (PW) convolution to extract feature maps. Compared with the conventional convolution operation, the number of parameters and the operation cost are relatively small. As shown in Figure 2, we use depthwise separable convolution in the head module.
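The saving can be made concrete by counting weights. The sketch below is illustrative only: the channel widths and kernel size are assumed values, not settings reported in this paper, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k (one filter per input channel) plus pointwise 1 x 1."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumed, illustrative sizes (not taken from the paper).
c_in, c_out, k = 64, 64, 3
standard = conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard  # equals 1 / c_out + 1 / k**2
print(standard, separable, round(ratio, 3))
```

For these assumed sizes the separable layer needs roughly one eighth of the weights of the standard layer, and the ratio approaches 1/k² as the number of output channels grows.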

2.4. Multiscale Learning.
To utilize computing resources more efficiently and extract more features under the same amount of computation, Szegedy et al. [29] presented the inception module. The inception structure makes two main contributions: one is to use 1 × 1 convolution to perform dimensionality reduction; the other is to simultaneously perform convolution and reaggregation at multiple sizes. Inspired by [29] and [30], MSRB [31] was proposed. Multiscale feature fusion and local residual learning can be applied to adaptively detect image features at different scales with convolution kernels of different sizes. The results show that applying kernels of different sizes provides better extraction capabilities. However, this method cannot further expand the receptive field and cannot generate more detailed structural information.

2.5. Deformable Convolution.
Conventional convolution kernels are usually of fixed size (for example, 3 × 3, 5 × 5, and 7 × 7). The biggest problem with such a kernel is that it adapts poorly to unknown changes and generalizes weakly. To solve the problem of spatial deformation of objects, deformable convolution [32] was proposed to strengthen the transformation modeling ability of CNNs. Deformable convolution builds on traditional convolution by adding direction vectors that adjust the convolution kernel, so that the shape of the kernel follows the feature more closely.

2.6. Real-World Image Superresolution. In real-world image restoration scenarios, the lack of corresponding high-quality references usually leads to poor experimental results. We additionally introduce the naturalness image quality evaluator (NIQE) [33] and the Perceptual Index (PI) [22] to perform the evaluation. These indicators sensitively reflect content sharpness, detail contrast, and texture diversity. They are highly consistent with subjective quality and can effectively reflect the visual quality of images without a reference. In particular, smaller values of NIQE/PI indicate better perceptual quality and clearer content. We intend to apply them to the DRIVE [34] dataset to estimate the restoration capability of the proposed method.

3. Proposed Approach
3.1. Network Architectures. SISR algorithms such as ESPCN [35] and FSRCNN [36] do not take full advantage of low-level feature information, and deeper structures bring more parameters. As shown in Figure 3, the proposed CASR consists of three components: (1) head module, (2) cascading block, and (3) upsample module. Our goal is a balance between performance and cost.
To address the issues mentioned above, we adopt two different strategies: (1) original feature extraction and (2) cascading connection structure.
3.1.1. Original Feature Extraction. We denote $I_{LR}$ and $I_{SR}$ as the input and output of our model, respectively. Figure 2 illustrates how the head module extracts the original information from LR images:
$$F_{ext} = H_{ext}(I_{LR}),$$
where $H_{ext}(\cdot)$ denotes a series of convolution operations. In the head module, we first replace the conventional convolution layer with a depthwise convolution layer to reduce parameters.
After an activation layer, the feature maps are sent to another specific convolution layer, the deformable convolution. As discussed in Section 2, deformable convolution adds an offset to each convolution sampling point, thus achieving free deformation of the sampling grid. Then, after passing through a convolutional layer with a 1 × 1 kernel and another pReLU activation function, $F_{ext}$ is sent to the next stage for a higher-level abstraction.
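The offset mechanism can be illustrated with a minimal sketch of a single 3 × 3 deformable kernel response. This is not the head module itself: in deformable convolution the offsets are predicted by an additional convolution layer, whereas here they are simply passed in, and the function names `bilinear` and `deformable_conv_at` are hypothetical.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly sample a 2D list at a fractional position; outside is zero."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deformable_conv_at(feature, weights, offsets, cy, cx):
    """Response of one 3 x 3 deformable kernel centred at (cy, cx).

    offsets[i][j] is a (dy, dx) pair added to the regular grid position,
    so all-zero offsets reduce to an ordinary 3 x 3 convolution.
    """
    total = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            sample = bilinear(feature, cy + i - 1 + dy, cx + j - 1 + dx)
            total += weights[i][j] * sample
    return total


# Toy input: a linear ramp, so shifted samples are easy to check by hand.
feature = [[10.0 * y + x for x in range(5)] for y in range(5)]
mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
no_offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
half_pixel = [[(0.5, 0.5)] * 3 for _ in range(3)]
regular = deformable_conv_at(feature, mean_kernel, no_offsets, 2, 2)
shifted = deformable_conv_at(feature, mean_kernel, half_pixel, 2, 2)
```

Because the offsets are fractional, each sample is obtained by bilinear interpolation, which keeps the operation differentiable with respect to the offsets.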

3.1.2. Cascading Connection Structure.
Now, we present the CAS-Block. The cascade connection allows information to spread across multiple paths in the network, which greatly enhances feature fusion. It [10] has been widely applied in various computer vision tasks. As shown in Figure 4, the mapping process of our cascade network includes C CAS-Blocks, each with a skip connection, where $H_{map}^{i}$ denotes the output of the $i$th CAS-Block $C_i$. Each CAS-Block contains one group convolution layer (with a 3 × 3 or 1 × 3 kernel), one traditional convolution layer for adjusting the number of channels, and a pReLU layer. We prefer stacking several kernels of smaller size (such as 1 × 3 and 3 × 3) to directly applying larger kernels (such as 5 × 5 and 7 × 7), because this enlarges the receptive field of the feature extraction module while decreasing the number of learnable parameters. Here, $H_{map}^{m}$, $H_{map}^{m-1}$, and $H_{map}^{m-2}$ denote the outputs of the middle three CAS-Blocks, $H_{cas}$ denotes the cascading operation, and $H_{map}(\cdot)$ denotes our proposed mapping function. Finally, we use a common upsampling module to fuse the hierarchical structural features and amplify the image size.
Figure 2: The architecture of the head module, which makes extensive use of depthwise separable convolution and deformable convolution operations. ⊕ stands for element-wise addition.

Wireless Communications and Mobile Computing
Here, $H_{up}(\cdot)$ indicates an upscale module. In recent years, many upsampling methods have been proposed, such as [12,27,36]. We adopt the postupsampling method, which has been proven effective. The process of our model roughly includes three steps. First, taking the low-resolution image $I_{LR}$ as the original input, the feature extraction module obtains the initial features from the low-quality image. Then, these features are delivered to a higher abstraction layer. Finally, we adopt a simple upsampling block, consisting of a convolutional layer and a pixel-shuffle layer, to enlarge the SR image.
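The pixel-shuffle step of such an upsampling block can be sketched as a pure rearrangement of channels into space. The version below follows the widely used sub-pixel convention (channel index c·r² + i·r + j maps to spatial offset (i, j)); this is an assumption for illustration, and the preceding convolutional layer is omitted.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    channels_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0, "channels must be divisible by r*r"
    c_out = channels_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Which input channel supplies this sub-pixel position.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Four 1 x 1 channels become one 2 x 2 channel for scale factor r = 2.
tiny = [[[0]], [[1]], [[2]], [[3]]]
shuffled = pixel_shuffle(tiny, 2)

# Eight 2 x 3 channels become two 4 x 6 channels.
bigger = [[[ch] * 3 for _ in range(2)] for ch in range(8)]
upscaled = pixel_shuffle(bigger, 2)
```

Since the rearrangement has no learnable weights, all computation before it runs at LR resolution, which is the efficiency argument for postupsampling.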

Total Variation Loss.
Aly and Dubois brought the total variation (TV) loss [37] to the SR field in order to suppress noise in generated images, and Yuan et al. also selected the TV loss to impose spatial smoothness. In its definition, $\hat{I}$ denotes the reconstructed HR image, $h$ and $w$ represent the dimensions of the corresponding feature maps, and $c$ is the number of channels. On the other hand, although the mean square error (MSE) is available, previous work [38] showed that it is not a good choice. Thus, the second loss function is the $L_1$ loss, and the two terms are combined into the overall loss $L = L_1 + \lambda L_{TV}$. We applied these loss functions in the training process of our presented model. In our experiments, we found that the model achieves better performance with the loss $L$ than with the simple $L_1$ loss, and that setting $\lambda = 10^{-4}$ works well. As shown in Figure 5, the loss $L$ enables the network to generate smoother recovered images, and Figure 6 comparatively illustrates that the combined loss function may produce sharper SR results.
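A minimal sketch of an $L_1$ plus weighted TV objective is given below. It assumes the common isotropic form of the TV loss, i.e., the square root of the summed squared vertical and horizontal differences, averaged over h · w · c; the exact form used in this paper may differ, and the function names are hypothetical.

```python
import math


def l1_loss(sr, hr):
    """Mean absolute error between two [C][H][W] nested lists."""
    total, count = 0.0, 0
    for ch_sr, ch_hr in zip(sr, hr):
        for row_sr, row_hr in zip(ch_sr, ch_hr):
            for a, b in zip(row_sr, row_hr):
                total += abs(a - b)
                count += 1
    return total / count


def tv_loss(img):
    """Isotropic total variation of a [C][H][W] image, divided by h * w * c."""
    c, h, w = len(img), len(img[0]), len(img[0][0])
    total = 0.0
    for k in range(c):
        for i in range(h - 1):
            for j in range(w - 1):
                dv = img[k][i + 1][j] - img[k][i][j]  # vertical difference
                dh = img[k][i][j + 1] - img[k][i][j]  # horizontal difference
                total += math.sqrt(dv * dv + dh * dh)
    return total / (h * w * c)


def combined_loss(sr, hr, lam=1e-4):
    """L1 reconstruction term plus TV smoothness term weighted by lam."""
    return l1_loss(sr, hr) + lam * tv_loss(sr)


flat = [[[1.0, 1.0], [1.0, 1.0]]]  # constant image: no variation
ramp = [[[0.0, 1.0], [0.0, 1.0]]]  # one horizontal step
loss_value = combined_loss(ramp, flat)
```

With a weight as small as 1e-4, the TV term acts as a mild regularizer on the reconstruction rather than competing with the $L_1$ term.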

Comparison with MSRN.
Compared with MSRN, our CASR differs as follows. First, the basic module design is distinct. In MSRN, the multiscale residual block (MSRB) incorporates parallel convolutions with multiple feature channels. The output of each multiscale residual block is cascaded together through hierarchical feature fusion to produce the final result, which leads to a large amount of computation. In contrast, our multiscale modules are branch-based and use regional skip connections and cascades extensively, which scales down the number of parameters. Second, the activation function differs. MSRN utilizes the ReLU function, while we employ PReLU as the activation function. According to the comparison in Figure 7, PReLU improves on ReLU. With almost no increase in the amount of computation, the PReLU function effectively mitigates the overfitting problem of the model, accelerates convergence, and lowers the error. Therefore, our proposed multiscale module has more effective representation capabilities.
Figure 3: Network design for our model. The green cuboid denotes the head module, the yellow one represents the cascading block, and the last blue rectangle depicts the upsampling process.
MemNet employs stacked memory blocks and massive shortcuts, while our method avoids extensive dense connections in order to lower the number of parameters. What is more, Lim et al. trained their network with the $L_2$ loss, but we prefer the $L_1$ loss to the $L_2$ loss function. Besides, MemNet takes interpolated images as input. In contrast, our proposed method directly extracts hierarchical features from the original LR images and upsamples at the end of the process, which improves computational efficiency and SR performance.

Training Details.
We use depthwise separable convolution operations, which were first illustrated in the Inception network, in the head module shown in Figure 2; they effectively reduce the number of network parameters. Figure 4 graphically illustrates the cascading process. The outputs of the intermediate layers are cascaded into the later layers and finally assembled in a convolutional trunk consisting of a depthwise separable convolution operation with three times the input and output features, followed by a pReLU activation function.
We prefer $L_1 + L_{TV}$ to the $L_2$ loss as the loss function, although the latter has been widely applied in computer vision tasks because of its close relation to the calculation of PSNR. However, recent work in the research community indicates that the $L_1$ loss provides better accuracy and faster convergence, while the TV loss ($L_{TV}$) imposes spatial smoothness on reconstructed images. Specifically, we set training patches with a size of 128 × 128 and a batch size of 16. We employ
Figure 5: Loss function comparison for ×3 SR. The first row shows the comic image in the Set14 dataset. The image processed with $L_1 + L_{TV}$ has precise details in the area around the eyes. The bottom row shows the "img020" image in the Urban100 benchmark dataset. Training with $L_1 + L_{TV}$ reconstructs clear details, such as the windows.

4.2. Datasets. DIV2K [15] is a high-definition dataset containing various image contents. It has 800 training images, 100 validation images, and 100 test images. We employ the 800 training images to train the proposed model and randomly select ten validation images for evaluation. In the testing process, we adopt the following benchmark datasets as test datasets: Set5 [40], Set14 [41], B100 [42], and Urban100 [43]. They contain various scenes from real life, such as landscapes, buildings, and people. The Digital Retinal Images for Vessel Extraction (DRIVE [34]) dataset, in contrast, is a dataset for retinal vessel segmentation. It consists of a total of 40 JPEG color fundus images, including 7 abnormal pathology cases.
As mentioned earlier, the B100 dataset contains many real-world images. As seen in Figure 8, the vase image recovered by our method has clearer edges, reaching 32.33 dB. Figure 8 shows visual comparisons on the B100 and Urban100 datasets with scales ×2 and ×4, respectively.
The Urban100 dataset consists of 100 pictures of various buildings, which usually contain clear edges and rich textures. According to [24], RDN is expected to perform well on such superresolution tasks, reaching 33.09 dB at ×2. Our method performs well on the ×3 and ×4 superresolution tasks, reaching 28.90 dB and 26.67 dB, respectively, which is approximately 0.1 dB and 0.15 dB lower than RDN, while CASR costs much less than its competitors.
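For reference, the dB values above are PSNR scores. A minimal sketch of the metric is given below; it assumes 8-bit images with a peak value of 255 and ignores protocol details, such as the colour channel used or border cropping, which are not stated in this section.

```python
import math


def psnr(reference, restored, peak=255.0):
    """PSNR in dB between two equally sized 2D lists of pixel values."""
    squared_error, count = 0.0, 0
    for row_ref, row_res in zip(reference, restored):
        for a, b in zip(row_ref, row_res):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


reference = [[100, 100], [100, 100]]
off_by_one = [[101, 99], [101, 99]]  # every pixel differs by 1, so MSE = 1
score = psnr(reference, off_by_one)
```

Because PSNR is logarithmic in the mean squared error, a 0.1 dB gap corresponds to roughly a 2% difference in MSE.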

Comparison Results on Time Complexity.
We also compare the efficiency of the models in terms of time complexity on the public dataset Set14 (taking ×4 as an instance), as tabulated in Table 2. The table shows that the CASR model achieves results competitive with VDSR [16] and RDN [24], reaching 28.96/0.7899 on the PSNR/SSIM metrics, while spending less time (0.1017 s on a single image) and consuming the fewest resources for image restoration.
We may conclude that our proposed CASR model takes the least time and uses an acceptable number of parameters compared with VDSR and RDN. Table 3 indicates that our proposed CASR method is highly competitive, achieving the lowest average values of NIQE/PI on the benchmark dataset Set14 with scale factor ×4. Figure 9 illustrates the visual image restoration comparison with several SR methods on the real-world image chip. The results show that our method achieves better restoration performance than the others. It not only achieves competitive PI and NIQE values but also yields more pleasant visual quality in terms of image, edge, texture, color, and feature-rich regions. Besides, as shown in Figure 10, the restoration performance at the larger scale, e.g., ×4, is also acceptable. The vessels in the retinal image are clearer than those of the competitors, and the edge of the retina is also sharp, as we expected. Since the whole experiment was designed and conducted on DIV2K, a supervised public dataset with ground-truth images, which is an acceptable compromise, we believe that exploring a more realistic image SR process with a better degradation kernel on a real-world image dataset is a promising direction for further research.

Ablation Study.
In order to further explore the details of the experiment, we design two ablation experiments: one investigates the influence of different dilation factors on deformable convolution, and the other examines the effects of different loss functions. Figure 11 illustrates two training processes with different dilation scales. We examine whether the dilation scale of the deformable convolution affects recovery performance. As shown in Figure 11, both training results improve as the epochs increase; the model with a dilation of two achieves better performance but fluctuates more.

Study of the Deformable Convolution.
We also compare the effect of different scale factors on the experimental performance, as shown in Table 4. With the same scale factor ×2, our proposed method, which replaces the conventional convolution with deformable convolution, achieves better results on both the Set5 and B100 datasets. As the dilation factor increases, the performance improves. This mainly occurs because the operation can effectively and dynamically expand the receptive field. Because different input feature maps may correspond to objects with different deformation scales, for some tasks it is essential to adaptively determine the ratio or receptive field size.
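The effect of the dilation factor on the receptive field can be sketched with the standard formula for dilated kernels. The snippet covers only the fixed dilated grid; the learned offsets of deformable convolution shift the sampling points further and are not modelled here. The layer configurations are assumed for illustration.

```python
def effective_kernel_size(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of stacked stride-1 layers given (k, dilation) pairs."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel_size(k, d) - 1
    return rf


# Two stacked 3 x 3 layers, without and with dilation 2 (assumed settings).
plain = stacked_receptive_field([(3, 1), (3, 1)])
dilated = stacked_receptive_field([(3, 2), (3, 2)])
print(plain, dilated)
```

The dilated stack covers a wider area with the same number of weights, which is consistent with the trend reported in Table 4.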

Study of the Loss Function.
To examine the effect of the loss functions mentioned above, we design an ablation experiment. Formally, let the first model be "$L_1$" and the other one be "$L_1 + L_{TV}$" (i.e., using the enhanced loss function $L_1 + L_{TV}$). We tried different combinations of scale factor and loss function to examine which one achieves better performance on the DIV2K dataset, as demonstrated in Figure 6 and Table 5. Afterward, validation on the Set14 and Urban100 datasets shows that the enhanced loss function results in a clearer image with more details, as seen in Figure 5.

Conclusion
This paper presents two novel CNN architectures, namely, the head module and the CAS-Block, to improve SISR performance. Compared with the state-of-the-art (SotA) CNN-based algorithms, our head module takes low-level feature expression into account by applying depthwise separable convolution and deformable convolution, which is shown not only to extract patterns effectively but also to reduce the parameter size. At the same time, the CAS-Block employs a global residual connection and makes extensive use of cascading connections to capture remote spatial features from former inputs. Extensive experiments have shown that our model effectively improves both the quality of the reconstructed images and the processing speed compared with the SotA methods, in terms of quantitative indicators and realistic visual effects.

Data Availability
The image data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.