Adaptive Residual Channel Attention Network for Single Image Super-Resolution

Single image super-resolution (SISR) is a classical image restoration problem. Given a low-resolution (LR) image, the task of SISR is to recover the corresponding high-resolution (HR) image. Because the problem is ill-posed, it has been studied from many different points of view. Recently, deep learning has shown strong performance in a variety of image processing tasks, and many works address image super-resolution with convolutional neural networks (CNNs). In this paper, we propose an adaptive residual channel attention network for image super-resolution. We first analyze the limitation of the residual connection structure and propose an adaptive design for suitable feature fusion. Besides the adaptive connection, channel attention is used to adjust the importance distribution among different channels. Combining channel attention and the adaptive connection, we propose a novel adaptive residual channel attention block (ARCB). Then, a simple but effective upscale block design is proposed for different scales. We build our adaptive residual channel attention network (ARCN) from the proposed ARCBs and upscale block. Experimental results show that our network not only achieves better PSNR/SSIM performance on several testing benchmarks but also recovers structural textures more effectively.


Introduction
Super-resolution (SR) is an important problem in image restoration. The task of single image super-resolution (SISR) is to recover high-resolution (HR) images from low-resolution (LR) images. Since the problem is ill-posed, many potential HR images correspond to the same LR image. SISR methods have practical applications such as video quality enhancement, remote sensing image processing, and MRI analysis. To find the most suitable HR images, various methods have been proposed for SISR and other image restoration tasks.
Deep learning has shown strong performance in various tasks [1][2][3][4][5]. Nowadays, many convolutional neural network (CNN)-based works focus on the SISR problem. As far as we know, SRCNN [6] is the first work using a three-layer CNN for image super-resolution. After SRCNN, Dong et al. proposed FSRCNN [7], a deeper but narrower network that achieved better performance with less time cost. Usually, a deeper network performs better. VDSR [8], proposed by Kim et al., used a very deep network design with global residual learning. Inspired by VDSR and residual connections, EDSR [9], proposed by Lim et al., applied an enhanced deep network with residual blocks to the SISR problem. Besides EDSR, MDSR [9] designs an upscale module that handles different scaling factors jointly. Motivated by the structure of the Laplacian pyramid, Lai et al. proposed LapSRN [10] with a progressive structure. Similar to MDSR, the progressive LapSRN can upscale images with different scaling factors concurrently. Recursive structures can effectively enlarge the receptive field and make full use of shared parameters. DRCN [11], proposed by Ghifary et al., used shared convolution layers to increase the receptive field. DRRN [12], proposed by Tai et al., combined residual and recursive structures and achieved good performance.
ResNet [13], proposed by He et al., has proved to be a successful network design. In ResNet, a residual block with effective gradient flow was proposed for image classification and beyond. Several super-resolution works use residual blocks. EDSR stacked residual blocks with a global shortcut to build a very deep network. RDN [14], proposed by Zhang et al., introduced a residual dense block (RDB) with feature fusion and achieved good performance. Besides RDN, RCAN [15], proposed by Zhang et al., designed a residual-in-residual structure to build a deeper network.
There is a shortcut in the residual block: the sum of the feature maps before and after processing is regarded as the final result. In fact, the above methods combine the two parts with a fixed ratio, which does not distinguish their different importance.
Attention mechanisms are inspired by the human brain. When looking at a picture, the human brain usually focuses on the more important areas.
There are attention methods for image processing tasks. SENet [16], proposed by Hu et al., introduced a channel attention method that distributes importance among channels. SENet requires few parameters, which makes it flexible for different network designs. To the best of our knowledge, RCAN [15] is the first image super-resolution work with a channel attention mechanism. After RCAN, IMDN [17], proposed by Hui et al., modified the vanilla channel attention layers and achieved good performance with few parameters. SAN [18], proposed by Dai et al., introduced a second-order attention mechanism with channel and nonlocal attention.
In this paper, we propose a novel adaptive residual channel attention block (ARCB) for image super-resolution. Different from vanilla residual blocks, an adaptive weight is learned from paired data to combine the information of the main path and the shortcut. Considering the different importance of channels in residual blocks, channel attention is introduced in ARCB to distribute weights among channels. Besides block designs, recent works design special upscale modules for different scaling factors. In this paper, we introduce a simple but effective general upscale block design for different factors. The adaptive residual channel attention network (ARCN) is built from ARCBs and the proposed upscale block. Experiments are performed on several testing benchmarks. The results show that our ARCN not only achieves better PSNR/SSIM performance but also recovers complex structural textures more effectively. The contributions of this paper are as follows:
(1) We propose a novel block named ARCB with a channel attention mechanism. In ARCB, we propose an adaptive residual connection with learned weights. The weight factors find suitable ratios for combining information from different paths. The channel attention mechanism in ARCB distributes different weights among channels to concentrate more on important information.
(2) We propose a tiny but effective upscale block design method. With the proposed design, our network can be flexibly adapted to different scaling factors.
(3) Experimental results show that our proposed ARCN achieves better PSNR/SSIM results on several testing benchmarks and recovers more complex structural textures than other methods.

Related Works
2.1. Single Image Super-Resolution.
Let $I_{HR}$ and $I_{LR}$ denote the HR and LR images, respectively; the observation model of the degradation step can be described as $I_{LR} = (I_{HR})\downarrow + n$, where $(\cdot)\downarrow$ denotes the degradation and $n$ denotes the noise.
Usually the degradation model is chosen as bicubic downsampling with different scaling factors. Given $I_{LR}$, the task of single image super-resolution (SISR) is to find the corresponding $I_{HR}$.
However, several potential HR images degrade to the same LR image. Since SISR is an ill-posed problem, it is challenging to find the solution.
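The ill-posedness described above can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: it replaces bicubic downsampling with simple block averaging, omits the noise term, and uses two hypothetical 2 × 2 images (`hr_a`, `hr_b`) that are not taken from any dataset.

```python
def downsample_mean(image, factor):
    """Average each factor x factor block (a simple stand-in for bicubic-down)."""
    height, width = len(image), len(image[0])
    low_res = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[top + dy][left + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        low_res.append(row)
    return low_res


# Two different toy "HR images" (hypothetical data for illustration only).
hr_a = [[0.0, 1.0], [1.0, 0.0]]  # checkerboard texture
hr_b = [[0.5, 0.5], [0.5, 0.5]]  # flat gray

lr_a = downsample_mean(hr_a, 2)
lr_b = downsample_mean(hr_b, 2)
print(lr_a, lr_b)  # both HR images collapse to the same single LR pixel
```

Because both inputs map to the same LR pixel, no method can tell from the LR image alone which HR image was the source; a learned prior is needed to choose among the candidates.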
Convolutional neural networks (CNNs) have proved to be an effective tool for image restoration [1][2][3][4][5]. Recently, CNN-based works have been proposed for the SISR problem [19,20]. To the best of our knowledge, SRCNN [6], proposed by Dong et al., is the first deep learning work for SISR. There are three convolution layers in SRCNN, corresponding to the three steps of the sparse coding method: feature extraction, nonlinear mapping, and restoration. After SRCNN, FSRCNN [7], proposed by Dong et al., applied a deeper network to the SISR problem. Different from SRCNN, FSRCNN used a deconvolution layer to upscale the feature map. By using shrinking and expanding modules, FSRCNN decreased the number of parameters and made the network deeper and faster. ESPCN [21], proposed by Shi et al., introduced a pixel shuffle layer to replace the deconvolution layer for upscaling. Similar to FSRCNN, several convolution layers process the LR feature maps. At the end of ESPCN, a convolution layer changes the channel number of the feature maps, and the pixel shuffle layer performs the spatial transformation. From practical experience, a deeper network usually results in better performance. VDSR [8], proposed by Kim et al., applied a very deep network with twenty convolution layers and a global residual connection. Besides, batch normalization (BN) was used in VDSR to improve the performance. To preserve the resolution of the feature maps, a zero padding strategy was applied in VDSR. RED [22], proposed by Mao et al., introduced a symmetrical encoder-decoder structure with convolution and deconvolution layers. To transmit information to the bottom layer, residual connections were designed between blocks.
A deeper network requires a large number of parameters. Recursive design with shared parameters is one way to build lightweight networks.
There are recursive networks for the SISR problem. DRCN [11], proposed by Ghifary et al., applied shared convolution layers to enlarge the receptive field with limited parameters. Similar to SRCNN, there are three modules in DRCN. The embedding network extracts the feature maps from input images. After feature extraction, the inference network with shared parameters performs the nonlinear mapping. Finally, the reconstruction network restores the HR images from the feature maps. To increase the network performance, there is a skip connection in DRCN to learn the residual information. DRRN [12], proposed by Tai et al., designed a recursive residual connection with shared parameters to prevent vanishing gradients. By using a recursive design and shared parameters, DRRN built a 52-layer network with few parameters and performed better than VDSR.
Recently, works with good performance have focused on different block designs and network pipelines [23][24][25][26][27]. Dense connection has proved effective for image super-resolution [23,25,26]. SRDenseNet [28], proposed by Tong et al., introduced dense connection to the SISR problem and concatenated all feature maps as the final output.
There are four components in SRDenseNet. First, one convolution layer is used to extract low-level features. After extraction, several dense blocks are used to extract high-level features. Deconvolution layers are used to upscale the feature maps. Finally, a convolution layer produces the HR image. SRResNet [29] used residual blocks with skip connections to build a deep network for better performance. EDSR [9], proposed by Lim et al., removed redundant batch normalization layers from SRResNet and built a deeper network. Residual connection has proved to be an effective design for better network performance. RDN [14], proposed by Zhang et al., introduced a residual dense block named RDB that combines the two kinds of structures. By stacking RDBs with global feature fusion, the deep network RDN achieved good performance. MSRN [30], proposed by Li et al., introduced a multiscale residual block combining residual and inception blocks. A global fusion structure was applied in MSRN for feature extraction. To build a deeper network, a novel residual-in-residual structure was proposed in RCAN [15], which turned out to be a success.

Attention Mechanism.
The attention mechanism was first proposed to simulate the human brain. When watching an image or reading a sentence, more attention is paid to the important areas. There are different attention methods used in image processing.
There are four kinds of attention mechanisms: item-wise soft attention, item-wise hard attention, location-wise soft attention, and location-wise hard attention. The difference between item-wise attention and location-wise attention is the input form. Item-wise attention requires special sequential items, while location-wise attention needs a single feature map. From another point of view, attention can be separated into soft and hard attention. Soft attention focuses more on different areas and channels. After training, soft attention is generated by the network. Besides soft attention, hard attention concentrates more on different pixels. Hard attention is a stochastic prediction procedure, which is usually implemented by reinforcement learning.
Spatial transformer network (STN) [31] is an attention method in the spatial domain. In STN, information from the original images is transformed into another space with keypoints. The authors proposed a spatial transformer module for the transformation. There are also works on channel domain attention. SENet [16], proposed by Hu et al., introduced a channel attention method for concentrating more on important channels. In SENet, a squeeze-and-excitation (SE) module is proposed to automatically learn the importance of different channels. In the SE module, a squeeze operation is first introduced to get the global channel features. After squeezing, the excitation module is used to learn the relations among channels. There are two fully connected layers with a ReLU activation in the excitation module. Finally, a scale module is applied after excitation to reweight the feature maps. SENet focuses on the importance of channels and treats different areas of the feature maps equally. To consider the global information of feature maps, nonlocal neural networks [32] introduced a long-range dependency attention block for better performance. However, the proposed nonlocal blocks require more memory and higher computational complexity.

Method
In this section, we describe the proposed ARCN. In ARCN, an adaptive residual channel attention block named ARCB is proposed to compose the network. Adaptive factors in ARCB, which account for the different importance of information, are learned during training. After the adaptive residual connection, the channel attention mechanism distributes weights among channels, which considers the importance from another point of view. The main body of ARCN is composed of several ARCBs and a padding structure. A global skip connection is introduced to ARCN for residual learning. After the main body, an effective and tiny upscale module is designed so that the scaling factor can be changed flexibly. We introduce the proposed ARCN in the following manner: First, the network design is described in general. Then, the details of ARCB are discussed together with channel attention. A detailed introduction of the flexible upscale block follows the description of ARCB. Finally, some comparisons are made with other SISR works.

Network Design.
The entire network structure is shown in Figure 1.
There are three modules in the proposed ARCN. First, the feature extraction module extracts feature maps from the input LR images. After feature extraction, the nonlinear mapping module maps the feature maps from LR space into HR space. A skip connection is applied to the nonlinear mapping module for global residual learning. Finally, the restoration module with a flexible upscale block restores the HR images from the processed feature maps.
There is one convolution layer in the feature extraction module. The layer extracts low-level features from the LR image and builds the feature maps. Let $f_{FEM}(\cdot)$ be the feature extraction module; then the operation can be written as $H_0 = f_{FEM}(I_{LR})$, where $I_{LR}$ denotes the input LR image and $H_0$ denotes the extracted feature maps.

Scientific Programming
After feature extraction, several ARCBs are applied in the nonlinear mapping module to map feature maps from LR space to HR space. Let $H_k$ denote the output of the k-th ARCB; then $H_k = f_{ARCB}^{k}(H_{k-1})$, where $f_{ARCB}^{k}(\cdot)$ denotes the operation of the k-th ARCB and the first block takes the output of the feature extraction module, $H_0$, as its input. After K blocks, there is a padding structure composed of two convolution layers with ReLU activation. The padding structure is used to increase the network depth and to weight the information from the main path for global residual learning. The operation of the padding structure and global residual learning can be written as $H_R = f_{PAD}(H_K) + H_0$, where $f_{PAD}(\cdot)$ denotes the padding structure and $H_R$ denotes the feature maps after padding. Finally, an effective upscale block is applied in the restoration module. In the restoration module, the final HR image $I_{SR}$ is restored from the processed feature maps. The operation of the restoration module can be written as $I_{SR} = f_{UP}(H_R)$, where $f_{UP}(\cdot)$ denotes the upscale block.
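The data flow of the three modules can be sketched as a plain function composition. This is a structural sketch only, assuming that the global skip connection adds the output of the feature extraction module to the output of the padding structure; `arcn_forward` and the toy callables below are hypothetical stand-ins operating on flat lists, not the trained convolution layers.

```python
def arcn_forward(lr_image, f_fem, arcbs, f_pad, f_up):
    """Compose feature extraction, K blocks, padding with global skip, and upscaling."""
    h0 = f_fem(lr_image)            # shallow features from the LR input
    h = h0
    for block in arcbs:             # output of block k feeds block k + 1
        h = block(h)
    padded = f_pad(h)
    # Assumed global residual learning: add the shallow features back.
    h_r = [p + s for p, s in zip(padded, h0)]
    return f_up(h_r)                # restore the super-resolved output


# Toy stand-ins (hypothetical) that only make the data flow visible.
toy_fem = lambda x: [2.0 * v for v in x]
toy_block = lambda h: [v + 1.0 for v in h]
toy_pad = lambda h: [0.5 * v for v in h]
toy_up = lambda h: [v for v in h for _ in range(2)]  # repeat each value twice

sr = arcn_forward([1.0, 2.0], toy_fem, [toy_block] * 3, toy_pad, toy_up)
print(sr)  # [4.5, 4.5, 7.5, 7.5]
```

Keeping the modules as interchangeable callables mirrors the text: only the last callable needs to change when the scaling factor changes.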

Adaptive Residual Connection Block.
ARCB is introduced to the network with an adaptive residual connection and channel attention. An illustration of the proposed ARCB is shown in Figure 2(b). There are two convolution layers with ReLU activation in ARCB. Different from ResBlock, which is used in most SISR works, a channel attention layer is placed after the convolution layers. The attention layer weights information from different channels. After that, a learned adaptive factor $W_\alpha$ is used to scale the processed feature maps.
Let $f_{RES}(\cdot)$ and $f_{CA}(\cdot)$ denote the main processing path and the attention layer, respectively; then the operation of the k-th ARCB can be written as $H_k = H_{k-1} + W_\alpha \cdot f_{CA}(f_{RES}(H_{k-1}))$, where $H_{k-1}$ and $H_k$ denote the input and output of the block and $W_\alpha$ denotes the adaptive factor learned during training. The adaptive factor $W_\alpha$ is one of the main differences between ResBlock and ARCB. In the vanilla ResBlock, the ratio in which information from the two paths is mixed is fixed, so the block does not distinguish the importance of different information. In the proposed ARCB, the weight factor $W_\alpha$ is a learnable parameter. In other words, the ratio is adjusted according to the training data, which is more suitable for information fusion.
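The effect of the adaptive factor can be illustrated with a few lines of code. The sketch below assumes a single scalar factor that scales the processed branch before it is added to the shortcut; `adaptive_residual` and `toy_branch` are hypothetical names, and the branch is a toy function rather than the convolution and attention layers of ARCB.

```python
def adaptive_residual(x, branch, alpha):
    """Return x + alpha * branch(x), element-wise on a flat list of features."""
    processed = branch(x)
    return [shortcut + alpha * value for shortcut, value in zip(x, processed)]


toy_branch = lambda feats: [v * v for v in feats]  # hypothetical main path
x = [1.0, 2.0, 3.0]

vanilla = adaptive_residual(x, toy_branch, alpha=1.0)   # fixed 1:1 mixture
adaptive = adaptive_residual(x, toy_branch, alpha=0.5)  # a learned ratio, for example
print(vanilla)   # [2.0, 6.0, 12.0]
print(adaptive)  # [1.5, 4.0, 7.5]
```

With `alpha=1.0` the function reduces to the vanilla residual connection, so the fixed-ratio block is a special case of the adaptive one; in training, `alpha` would be updated by the optimizer together with the other weights.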
Another main difference between the vanilla ResBlock and ARCB is the channel attention. Convolution layers treat information from different channels equally. To concentrate more on important channels, channel attention is introduced to ARCB. The structure of the channel attention is shown in Figure 3.
As shown in Figure 3, global average pooling is first applied to estimate the information in each channel. The hypothesis is that the more complex the feature maps are, the more important their information is. From this point of view, the global average pooling operation extracts the information quickly and effectively. After this extraction, a squeeze-and-excitation design is introduced for nonlinear mapping. In the squeezing step, the channel number shrinks for information distillation. The most important information is weighted after squeezing. Then, the excitation module restores the channel number to that of the original feature maps. Finally, a Sigmoid activation and a multiplication are adopted to distribute different importance among channels.
Let $x_{in}$ and $x_{out}$ denote the input and output of the channel attention. The channel attention applies global average pooling to $x_{in}$, passes the pooled vector through the squeeze module, a ReLU activation, the excitation module, and a Sigmoid activation $\sigma(\cdot)$, and multiplies the resulting weights channel-wise with $x_{in}$ to produce $x_{out}$. FE(·) and FS(·) denote the squeeze and excitation modules. The two modules are made of fully connected layers. The Sigmoid activation between the excitation and the multiplication serves two purposes. On one hand, it adds nonlinearity. On the other hand, the Sigmoid activation maps the weights into the range (0, 1), so they are never negative. Since there is no negative response in the human visual system, this design fits the biological process.
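The steps of the channel attention can be sketched in plain Python. This is a minimal illustration under stated assumptions: the fully connected layers are written as bias-free weight matrices, the feature maps are nested lists, and the function name, argument names, and toy weights are hypothetical rather than taken from the trained network.

```python
import math


def channel_attention(x, w_squeeze, w_excite):
    """x: list of C channels, each an H x W nested list; weights: lists of rows."""
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Squeeze: fully connected layer that reduces the channel number, then ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)))
              for row in w_squeeze]
    # Excitation: fully connected layer that restores C channels, then Sigmoid.
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w_excite]
    # Rescale every channel by its attention weight.
    out = [[[weights[c] * v for v in row] for row in ch]
           for c, ch in enumerate(x)]
    return out, weights


# Toy input: two 2 x 2 channels and hand-picked weights (illustrative only).
features = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w_squeeze = [[0.5, 0.5]]       # 2 channels -> 1 hidden unit
w_excite = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channels

scaled, weights = channel_attention(features, w_squeeze, w_excite)
print(weights)  # both weights lie strictly between 0 and 1
```

Because the Sigmoid output is always inside (0, 1), each channel is attenuated by a nonnegative factor and never sign-flipped, which is the property the paragraph above relies on.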

Flexible Upscale Block.
The upscale block is widely used in SISR works; it increases the resolution of the feature maps and restores the final HR image.
Existing works use different upscale block designs for different scaling factors, without a unified pattern. In this paper, we propose a flexible upscale block design pattern. With the proposed design, the structure can be easily modified for different scaling factors. The structure of our proposed flexible upscale block is shown in Figure 4.
As shown in Figure 4, there are three traditional upscale block designs: A, B, and C [30]. In A, a nonlinear upscale design is introduced with ReLU activation. Design B removes the ReLU activations but keeps the convolution and pixel-shuffle layers. Design C uses deconvolution in place of the pixel-shuffle layers. Different from other works, there is only one convolution layer with a pixel-shuffle layer in the proposed upscale block. There are two benefits of using the flexible upscale block design. On one hand, there is only one convolution layer in the block, which saves parameters and decreases the computational complexity. On the other hand, when the scaling factor is changed, the only modification of the block is the channel number of the convolution layer. After this change, the main body of the network can be fine-tuned for a new factor with few iterations.
There is a main difference between the proposed block and the others. In other designs, there is a convolution layer after the last pixel-shuffle or deconvolution layer. Usually, it is used to restore the HR images with 3 channels from the feature maps. However, in our proposed block, the restoration is performed by the only convolution layer. This mirrors the feature extraction module, which is also composed of only one convolution layer.
To introduce the design in detail, examples for different scaling factors are given. The specific configurations are shown in Table 1. Notice that when the scaling factor is changed, the only modification is the channel number. From this point of view, the proposed upscale block is flexible for different factors.
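How a single convolution followed by a pixel-shuffle layer can produce the final image may be easier to see in code. The sketch below makes two assumptions that are not stated in the text: the output image has 3 channels, so the convolution is taken to output 3·r² channels for scaling factor r, and the pixel shuffle follows the common sub-pixel channel ordering. The function names are hypothetical, and the convolution itself is omitted.

```python
def upscale_conv_channels(scale, image_channels=3):
    """Assumed output channel number of the single convolution for a given scale."""
    return image_channels * scale * scale


def pixel_shuffle(x, scale):
    """Rearrange C*scale^2 channels of size H x W into C channels of size (H*scale) x (W*scale)."""
    in_channels, height, width = len(x), len(x[0]), len(x[0][0])
    if in_channels % (scale * scale) != 0:
        raise ValueError("channel number must be divisible by scale squared")
    out_channels = in_channels // (scale * scale)
    out = [[[0.0] * (width * scale) for _ in range(height * scale)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for h in range(height):
            for w in range(width):
                for i in range(scale):
                    for j in range(scale):
                        src = c * scale * scale + i * scale + j
                        out[c][h * scale + i][w * scale + j] = x[src][h][w]
    return out


for r in (2, 3, 4):
    print(r, upscale_conv_channels(r))  # 12, 27, and 48 channels

# Four 1 x 1 channels become one 2 x 2 channel.
toy = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
print(pixel_shuffle(toy, 2))  # [[[1.0, 2.0], [3.0, 4.0]]]
```

Under these assumptions, moving from one scaling factor to another changes only the argument of `upscale_conv_channels`, which matches the observation that the channel number is the only modification.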

Discussion
(1) Difference from DRDN [33]: In DRDN, there are dense block (DB) structures for feature exploitation. The entire DRDN holds a global residual dense connection design to process the features efficiently. In ARCN, global and local residual connections are utilized to process the features. There is no dense connection in ARCN, which keeps the channel numbers small. There is no channel attention and there are no adaptive weights in DRDN. In ARCN, the two components are utilized to exploit the features more effectively.
(2) Motivation for global and local residual learning: In ARCN, global and local residual learning strategies are jointly applied for feature exploration. The residual connections effectively alleviate the gradient vanishing problem, which allows the network to be deeper. The local residual connection in ARCB ensures the gradient flow, while the global residual learning in ARCN guarantees the identity information transmission, which improves the network capacity and representation.

Our ARCN is trained on the DIV2K [34] dataset. DIV2K is a recently proposed dataset for the SISR problem. It contains 1000 real-world images with up to 2K resolution. In DIV2K, there are 800 images for training, 100 images for validation, and 100 images for testing. In this paper, we train our ARCN with the 800 training images and validate the model with 5 images. The paired training data are cropped so that the LR patches have a resolution of 48 × 48. The batch size for training is set to 20. The model is updated for 1000 epochs by the Adam optimizer. The learning rate of the Adam optimizer is set to $lr = 10^{-4}$ and halved every 200 iterations.
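The step decay of the learning rate and the size of the paired patches can be written out explicitly. This is a small sketch under assumptions: the counter `step` is measured in whatever unit the halving interval of 200 refers to, the HR patch is taken to be the LR patch size multiplied by the scaling factor, and the function names are hypothetical.

```python
def learning_rate(step, base_lr=1e-4, halve_every=200):
    """Step decay: start at base_lr and halve it after every halve_every steps."""
    return base_lr * 0.5 ** (step // halve_every)


def hr_patch_size(scale, lr_patch=48):
    """Side length of the HR patch paired with an lr_patch x lr_patch LR patch."""
    return lr_patch * scale


for step in (0, 199, 200, 400, 1000):
    print(step, learning_rate(step))

print(hr_patch_size(2), hr_patch_size(3), hr_patch_size(4))  # 96 144 192
```

The schedule is piecewise constant, so the rate stays at its initial value up to step 199 and drops to half of it at step 200.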

Experiment Results.
We compare our ARCN with some SISR works: SRCNN [6] and FSRCNN [7]. The quantitative PSNR/SSIM comparisons are shown in Table 2. As shown in Table 2, our model achieves competitive or better performance on five benchmarks compared to other works. For Urban100 and Manga109, our ARCN achieves better performance than the others.
Urban100 contains high-resolution real-world images, and Manga109 is composed of comic book covers. From this point of view, our ARCN recovers complex structural textures more effectively.
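For readers unfamiliar with the PSNR metric used in the comparison, the standard definition can be sketched as follows. The evaluation protocol (color space, border cropping) is not specified here, so the sketch assumes single-channel 8-bit images stored as nested lists; the function name and the toy pixel values are illustrative only.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    total, count = 0.0, 0
    for ref_row, est_row in zip(reference, estimate):
        for r, e in zip(ref_row, est_row):
            total += (r - e) ** 2
            count += 1
    mse = total / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by exactly one gray level, so MSE = 1.
hr = [[100.0, 110.0], [120.0, 130.0]]
sr = [[101.0, 109.0], [121.0, 129.0]]
print(round(psnr(hr, sr), 2))  # 48.13
```

Higher PSNR means a smaller mean squared error with respect to the reference HR image; SSIM complements it by comparing local structure.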
Visual comparisons are shown in Figure 5. Four images are chosen from the Urban100 testing benchmark to compare the performance. From the visualization, our ARCN performs better than the others on structural texture recovery.

Model Analysis
(1) Study on parameters: By design, our proposed flexible upscale block saves parameters. To compare parameters and performance, we test the model on five benchmarks. The quantitative results are shown in Table 3. We compare our ARCN with several recent lightweight works for the SISR problem: CARN [43], EDSR-baseline [9], SRMDNF [42], and DRCN [11]. The results show that our ARCN achieves competitive or better performance with fewer parameters. Our ARCN uses around 18.7% fewer parameters with similar performance.
(2) Study on adaptive factors: To demonstrate the effect of the adaptive factors, we illustrate the learned features from two different parts of ARCB. As shown in Figure 6, (a) denotes the features processed by the main path; (b) denotes the input features from the shortcut. The shortcut contains the original information of the input features, while the processed features concentrate on the high-frequency information. After the adaptive fusion, the high-frequency information is enhanced by aggregation.
As a substitution, the vanilla cascading design and the proposed efficient one hold competitive performance. However, the proposed upscale block holds a much smaller number of parameters. A comparison of the two upscale blocks with different scaling factors is shown in Table 5. From the table, the proposed blocks have far fewer parameters and a lower computation cost for upscaling the images.
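The source of the parameter saving in the upscale block can be estimated with simple arithmetic. The sketch below is a rough estimate under assumptions that are not stated in the text: 3 × 3 kernels with biases, 64 feature channels, 3 image channels, and a single-stage baseline (convolution, pixel shuffle, convolution) standing in for the vanilla design. The function names are hypothetical, and the numbers are not those of Table 5.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=True):
    """Number of weights (and biases) in one 2D convolution layer."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


def single_conv_upscale_params(features, scale, image_channels=3):
    """One convolution to image_channels * scale^2 channels, then pixel shuffle."""
    return conv_params(features, image_channels * scale * scale)


def conv_shuffle_conv_upscale_params(features, scale, image_channels=3):
    """Assumed baseline: expand features, pixel shuffle, then restore the image."""
    expand = conv_params(features, features * scale * scale)
    restore = conv_params(features, image_channels)
    return expand + restore


features = 64  # assumed width of the feature maps
for scale in (2, 3, 4):
    print(scale,
          single_conv_upscale_params(features, scale),
          conv_shuffle_conv_upscale_params(features, scale))
```

Under these assumptions the single-convolution block needs 6,924 parameters for a scaling factor of 2, against 149,443 for the baseline, because it never expands the full feature width before the pixel shuffle.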

Conclusion
In this paper, we proposed a novel adaptive residual channel attention network named ARCN for the single image super-resolution (SISR) problem. In the proposed ARCN, an adaptive residual channel attention block (ARCB) was designed for better performance. Mixture factors in ARCB were learned during training and adaptively weighted the information from the two paths in each block. A channel attention mechanism was introduced to ARCB to distribute importance among different channels. Besides ARCB, a tiny but flexible upscale block design was proposed for different scaling factors. Experimental results showed that our proposed ARCN not only achieves competitive or better performance with fewer parameters than other lightweight works but also recovers complex structural textures more effectively.
In the future, more reference-free perceptual assessments will be performed to demonstrate the network performance. Furthermore, more experiments will be conducted on real-world datasets.
Data Availability
The image and quantitative comparison data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.