Image Super-Resolution Network Based on Feature Fusion Attention



Introduction
Image super-resolution (SR) is an important task in computer vision [1][2][3][4][5]. In the SR process, high-resolution (HR) images are reconstructed from low-resolution (LR) images, which breaks through the resolution limit of the original image. It can therefore be applied to fields such as the enhancement of LR film and television works, intelligent monitoring, video processing, and HR film and television production.
In recent years, with the development of convolutional neural networks (CNNs) and deep learning methods, image super-resolution based on deep learning has gradually become a research hot spot, since deep learning methods are good at dealing with this kind of nonlinear problem. Convolutional neural networks are increasingly widely used in image processing. Among them, SRCNN [6,7] proposed a super-resolution network model based on 3 convolution layers, which perform image feature extraction and representation, nonlinear feature mapping, and final reconstruction. The depth of this network is constrained because it consists of only 3 convolution layers connected in series. Building deeper networks became practical only after the residual network [8] emerged. For instance, VDSR [9] built a deeper super-resolution network combined with residual modules [8,10,11], and the network performance was greatly improved. Since then, residual connections and deeper networks have become one of the most active areas of super-resolution research [12], while dense residual connections [13][14][15][16] and cascaded [17] structures with multiple skip connections have become increasingly popular. These methods reuse the features of various network layers, allowing residual and gradient information to propagate more quickly and thereby enhancing performance.
Among them, SRDenseNet [13] is mainly constructed from DenseNet [15] modules. DenseNet not only greatly reduces the number of network parameters but also alleviates the vanishing gradient problem to a certain extent by reusing the feature maps of shallow layers and bypassing redundant features. This architecture makes full use of the features extracted by the shallow layers and reuses them in deeper layers. RDN [16] mainly introduces a contiguous memory (CM) mechanism, global connections, and local residual connections to increase residual propagation. EDSR [19] mainly removes redundant modules in SRResNet [18], such as batch normalization (BN), which allows the depth of the model to be expanded to improve the quality of the reconstructed image.
Recently, many attractive methods have also appeared, such as HAN [20] based on a heterogeneous graph, RFANet [21] based on residual feature aggregation, and DeFiAN [22] with extra Hessian filtering; all of them take advantage of the attention mechanism. For SwinIR [23], which is built on the Swin Transformer, it is hard to transfer feature information between neighboring windows, which may limit its application to some extent. Moreover, in the feature learning process of a traditional deep learning network, nonlinear mappings such as convolution and activation functions alone cannot make full use of high-frequency information, which makes it hard to restore detailed features. Adding an attention mechanism to the network improves its ability to extract high-frequency information. In recent years, attention mechanisms have been widely used in image processing. They include channel attention and spatial attention. Channel attention is used to capture the correlation between the different channel features of the model.
After passing through different convolution kernels, each channel generates new weight information. The convolution kernel output is decomposed into multichannel information components, and the attention mechanism assigns each learned weight value to the corresponding channel. The problem of information overload can be resolved by introducing the attention mechanism, which allows the network to concentrate on the information that is more important for the current task among the many inputs, pay less attention to other information, and even filter out unimportant information. At the same time, the accuracy and processing efficiency of the task can be enhanced. Feature attention is popular in many disciplines, including text images [24], CT images [25], medical images [26], binocular camera images [27], 3D remote sensing images [28], infrared images [29], and video [30].
A comprehensive feature attention system has not yet been developed among these existing attention techniques. Therefore, we propose an integrated multichannel attention network based on channel attention, pixel attention, and residual attention to realize image super-resolution and further recover the detailed features in images. Our main contributions are as follows:
(1) A dense residual connection network based on a feature attention mechanism, including pixel attention and channel attention, is proposed. Combined with the residual scaling dense connection module and our new hybrid loss function, the new method achieves a significant performance improvement, especially for the L∞ error.
(2) We proposed and verified a residual scaling dense block (RSDB) based on residual scaling. Compared with the traditional residual dense block (RDB), it incorporates a residual scaling layer and a feature attention layer to further improve the performance. The experimental results also show that our method has a better residual convergence speed.
(3) A new hybrid loss function L_H is proposed and verified, which can effectively reduce the L∞ error of the network. Compared with the traditional L1 or L2 loss function, it can not only improve the PSNR and SSIM performance but also greatly improve the L∞ error performance, and it can be easily adjusted according to user requirements, which is of great value in L∞-sensitive applications.
As we know, the attention mechanism can focus the network's attention on the relevant areas, enhancing the learning performance in essential areas while decreasing the computational load on the network. The channel attention model can automatically learn the importance of each feature channel and assign a different weight to each one: a larger weight is assigned if the feature information is important, and vice versa. Such a capability is essential for automotive systems in autonomous driving applications. As a result, the combination of the two approaches has substantial research significance. The experimental results indicate that the residual dense connection equipped with pixel attention and channel attention has a positive influence on the final results.

Our Method
2.1. RDAN and the Basic Block. The attention mechanism in a neural network is a resource allocation scheme that assigns computing resources to the more important tasks first and alleviates information overload when computing resources are limited, especially in mobile vehicle systems [38,39]. Generally speaking, the more parameters a deep learning model has, the stronger its expressive ability and the greater the amount of information it stores, but this may bring the problem of information overload.
By incorporating the attention mechanism, we can focus on the more critical information in the current task among the many inputs and even filter out irrelevant information, so as to solve the problem of information overload and improve the efficiency and accuracy of task processing. The attention mechanism has been widely applied to tasks such as image segmentation and super-resolution in the field of computer vision. Most of the visual attention models in use today are spatial, and typically the feature map of the final convolution layer is weighted. We propose a model called the residual dense attention network (RDAN), based on multichannel feature fusion attention. The structural breakdown of the method is depicted in the network architecture figure. The multichannel attention module and the pixel attention module can be combined with RDN to restore the detailed feature information.

Figure 3: Multipath channel attention and pixel attention modules. The multichannel attention block is on the left side. The input is divided into two branches after the first convolution layer. The features flow into two different pooling channels and are then concatenated and activated before flowing out of this block. The pixel attention (PA) module is on the right side; it has two convolution layers and two activation layers connected in alternating order.

Journal of Sensors
RDAN is mainly composed of residual dense attention blocks (RDABs) joined through dense residual connections. Each RDAB is composed of basic blocks (BBs) joined through dense connections, in which pixel-wise connections are used. The specific structure is shown in Figure 2. With the dense residual connection module, the network allows gradient and feature information to propagate more fluently, which further improves the network performance.
The basic block is composed of a residual dense block (RDB), a multipath channel attention (MCA) block, and a pixel attention (PA) block; the attention network adds residual connections after MCA and PA to speed up the propagation of residual and gradient information, as can be seen in Figures 2 and 3. Local residual learning and feature attention modules are both part of the basic block structure. The local residual learning modules enable the network to bypass less essential areas, such as low-frequency regions, so the backbone network can focus on the more effective critical information thanks to the various local residual connections.
MCA. In the previous channel attention module [24], average pooling was used to calculate the channel weights. However, average pooling tends to smooth edges and ignore detailed information. Hence, we added a maximum pooling path to our multipath channel attention module in order to restore more detailed information. The convolution and activation layers are appended after the average and maximum pooling layers. This strategy enables more efficient detail acquisition and, as a result, increased learning efficiency.
PA. The pixel attention module uses convolution and activation layers in sequence, as well as pixel-wise weighting and residual multiplication. The network treats distinct features and pixels unequally in the attention module, allowing more flexibility in processing diverse types of data and extending the representational ability of convolutional neural networks. The feature weights in the feature fusion attention (FFA) layer, which operates on multiple levels of features, are automatically learned by the feature attention (FA) module, which assigns higher weights to essential features. This structure can also store shallow information and pass it on to deeper layers.

Multipath Channel Attention Module.
Based on the input feature information, the MCA module primarily determines the weight of each channel. Both global average pooling weights and global maximum pooling weights are produced in this module, which allows more specific information to be collected. The input and output of the attention module have dimensions of C × H × W and C × 1 × 1, respectively. After applying global average and maximum pooling to the input data, we can obtain the pooled channel descriptors, where H_ap and H_mp are the global average pooling and global max pooling functions, respectively, and F_c is the input. The channel weights are then computed from these descriptors, where σ and δ are the ReLU and sigmoid activation functions, respectively, Concat is the pixel-wise concatenation, and Conv is the 1 × 1 convolution. Finally, the weights and the inputs are multiplied pixel-wise to obtain the final channel-aware output,
where ⊗ is the point-wise multiplication, and the detailed structure is shown in Figure 3.
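The channel weighting described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mca`, the weight shapes, the reduction by a factor of 8, and the exact ordering (concatenate the two pooled descriptors, then 1 × 1 convolution, ReLU, 1 × 1 convolution, sigmoid) are all choices made for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mca(feat, w1, w2):
    """Multipath channel attention on a C x H x W feature map (illustrative).

    w1: (C_mid, 2C) weights of the first 1x1 convolution (assumed shape)
    w2: (C, C_mid)  weights of the second 1x1 convolution (assumed shape)
    """
    avg = feat.mean(axis=(1, 2))       # global average pooling, shape (C,)
    mx = feat.max(axis=(1, 2))         # global max pooling, shape (C,)
    desc = np.concatenate([avg, mx])   # concatenated descriptor, shape (2C,)
    hidden = relu(w1 @ desc)           # a 1x1 conv on a 1x1 map is a matrix product
    weights = sigmoid(w2 @ hidden)     # one weight in (0, 1) per channel
    return feat * weights[:, None, None], weights

rng = np.random.default_rng(0)
C, H, W = 16, 8, 8
feat = rng.standard_normal((C, H, W))
w1 = 0.1 * rng.standard_normal((C // 8, 2 * C))  # assumed reduction by a factor of 8
w2 = 0.1 * rng.standard_normal((C, C // 8))
out, weights = mca(feat, w1, w2)
print(out.shape, weights.shape)
```

The max-pooling path is what distinguishes this sketch from a plain average-pooling channel attention: a channel with a few strong responses contributes to the descriptor even when its mean is small.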

Pixel Attention Module.
The pixel attention module concentrates on the pixel weights. It detects the weight distribution over the entire image and applies this information to perform targeted weight calculations. The input-output shape of the pixel attention module changes from C × H × W to 3 × H × W, and the output contains the weights of the RGB color components. When the MCA output is used as the PA input, we get the pixel attention weights, where F*_c is the output of the MCA, σ and δ are the ReLU and sigmoid activation functions, respectively, and Conv is the 1 × 1 convolution. Finally, the output of the pixel-aware module is obtained by multiplying the weights with the input pixel by pixel.
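A rough NumPy sketch of a pixel attention step is given below for illustration only. Several details are assumptions: the function names, the hidden width, the use of 1 × 1 convolutions for both layers, and in particular the choice of a 3-channel input so that the 3 × H × W weight map lines up with the feature map in the pixel-by-pixel product.

```python
import numpy as np

def conv1x1(feat, weight):
    # A 1x1 convolution mixes channels independently at every pixel.
    return np.einsum("oc,chw->ohw", weight, feat)

def pixel_attention(feat, w1, w2):
    """Conv -> ReLU -> Conv -> sigmoid, then pixel-wise reweighting (illustrative)."""
    hidden = np.maximum(conv1x1(feat, w1), 0.0)           # ReLU
    weights = 1.0 / (1.0 + np.exp(-conv1x1(hidden, w2)))  # sigmoid, shape (C_out, H, W)
    if weights.shape[0] not in (1, feat.shape[0]):
        raise ValueError("weight map channels must be 1 or match the input channels")
    return feat * weights, weights

rng = np.random.default_rng(0)
feat = rng.standard_normal((3, 8, 8))    # assumed 3-channel input
w1 = 0.1 * rng.standard_normal((2, 3))   # hypothetical hidden width of 2
w2 = 0.1 * rng.standard_normal((3, 2))   # 3 output channels, one per RGB component
out, weights = pixel_attention(feat, w1, w2)
print(out.shape, weights.shape)
```

Unlike the channel weights, which are constant over each feature map, these weights vary from pixel to pixel, so edges and flat regions can be scaled differently.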
Examples of the multipath channel attention weights and pixel attention weights are shown in Figures 4 and 5.
2.4. Hybrid Loss Function. The RDAN network adopts a hybrid loss function composed of the L1 error and the L∞ error, which mitigates the instability of the L∞ error of the original RDN network. In addition to the PSNR/SSIM improvement, the L∞ error is significantly decreased, demonstrating the benefits of the enhanced network. Further details are given in the comparative test analysis.
2.4.1. L1-Norm. The L1-norm is one of the most common norms. For a vector x = (x_1, ..., x_n), it is defined as follows:
$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$
The L1-norm can be used to measure the difference between two vectors, such as the mean absolute error (MAE):
$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
2.4.2. L∞-Norm. The L∞-norm is mainly used to measure the maximum value of a vector. It is defined as
$$\|x\|_\infty = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$
In general, it can be expressed by the following formula:
$$\|x\|_\infty = \max_{1 \le i \le n} |x_i|$$
A very good feature of the L∞-norm is that it is independent of the vector dimension.
In practice, we usually use the mean absolute error instead of the sum of absolute errors in order to avoid the dependence of the L1-norm on the vector dimension. However, the direct use of the L1-norm may reduce the average error while the absolute error of individual pixels remains large. This situation does occur: because the L1-norm only reduces the average error, there is no constraint on the maximum error of an individual pixel. Therefore, a new loss function is required that not only reflects the overall error but also effectively reduces the maximum error of a single pixel, so as to further improve the quality of image recovery.
The L∞-norm satisfies exactly this requirement: it allows us to easily compare the maximum error of single pixels, independent of the vector dimension. Therefore, we propose a hybrid loss function combining the L1-norm and the L∞-norm, which can not only control the overall error of the image but also effectively reduce the maximum error of individual pixels. Through our tests and analysis, we found that the recommended range of β is [0.002, 0.01]. A value that is too large or too small will deteriorate the performance of the network.
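To make the role of β concrete, the following NumPy sketch combines a mean absolute error term with a maximum absolute error term. The additive form L1 + β · L∞ and the function name `hybrid_loss` are assumptions made for illustration; the exact definition of L_H may differ.

```python
import numpy as np

def hybrid_loss(pred, target, beta=0.01):
    """Illustrative hybrid loss: mean absolute error plus beta times max absolute error."""
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    l1 = err.mean()    # average error over all pixels
    linf = err.max()   # largest single-pixel error
    return l1 + beta * linf

target = np.zeros(4)
pred_even = np.array([0.1, 0.1, 0.1, 0.1])   # error spread evenly
pred_spike = np.array([0.4, 0.0, 0.0, 0.0])  # same mean error, one large outlier

loss_even = hybrid_loss(pred_even, target)
loss_spike = hybrid_loss(pred_spike, target)
print(loss_even, loss_spike)
```

Both predictions have the same mean absolute error, so a pure L1 loss cannot tell them apart; with β > 0 the prediction containing the single large error is penalized more, which is the behavior the hybrid loss is meant to encourage.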
2.5. Implementation Details. The residual scaling dense block (RSDB) is composed of RDBs, a residual scaling layer, and a feature fusion attention layer joined through dense connections. The detailed pseudocode is shown in Algorithm 1. One of the primary modifications to the original RDN model is the use of a more effective feature fusion attention layer and a residual scaling layer in order to use hardware resources more efficiently. The RSDB also serves as the fundamental building block of our RDAN method.
Each RSDB module has 6 convolution layers. All connections are made by pixel-wise concatenation, with the exception of the residual connection between the first layer and the last layer, which is made by pixel-wise addition. In the dense connection, the activation function is ReLU, and the convolution kernel size is 3. The RSDB module excludes BN, dropout, pooling, and other such structures and includes only structures such as the convolution layer and activation layer. In addition, PA, MCA, and other feature fusion attention structures are added to the RSDB. In MCA and PA, we reduce the number of feature maps by 8 through a convolution layer. The convolution kernel sizes of MCA and PA are set to 1 × 1 and 3 × 3, respectively. In multichannel attention, the pooling functions are average pooling and maximum pooling, and the activation functions are ReLU and sigmoid. In the MCA output, the number of final output channels is 1. In PA, the number of final output channels is 3, that is, the RGB component weights of the corresponding color channels.
In the upscale module, the subpixel method provides a flexible way to realize image scaling, but the number of feature maps input into the subpixel module must be a multiple of the square of the magnification; otherwise, integer magnification cannot be achieved. Unlike ordinary upsampling or downsampling methods, this method neither loses nor introduces pixel information. The number of feature maps can be set through a 1 × 1 convolution layer.
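The channel-count requirement of the subpixel method can be seen in a short NumPy sketch of the rearrangement. The function name `pixel_shuffle` and the channel-first layout are assumptions made for this example; deep learning frameworks provide their own versions, whose channel ordering may differ.

```python
import numpy as np

def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r) without changing any value."""
    c, h, w = feat.shape
    if c % (r * r) != 0:
        raise ValueError("channel count must be a multiple of r * r")
    c_out = c // (r * r)
    x = feat.reshape(c_out, r, r, h, w)   # split channels into an r x r sub-grid
    x = x.transpose(0, 3, 1, 4, 2)        # reorder to (c_out, h, r, w, r)
    return x.reshape(c_out, h * r, w * r)

small = np.arange(16).reshape(4, 2, 2)
up = pixel_shuffle(small, 2)              # 4 channels, r = 2 -> 1 channel, 4 x 4

lr_feat = np.zeros((12, 64, 64))          # 12 = 3 * 2 * 2 feature maps
hr = pixel_shuffle(lr_feat, 2)            # 3-channel 128 x 128 output
print(up.shape, hr.shape)
```

Because the operation is only a reshape and a transpose, every input value appears exactly once in the output, which is why no pixel information is lost or invented. A 1 × 1 convolution placed before it can bring the number of feature maps to the required multiple of r².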
During the training process, the preprocessor crops the input images to a size of 64 × 64 for the distinct training sets, with no overlap between the cropped images. The training output is the same size as the original input image.
Input and output: 64 × 64 images are used as input for the networks with different magnifications, while the output high-resolution images vary with the magnification and are 128 × 128, 192 × 192, and 256 × 256, respectively. The images are cropped to various sizes according to the magnification required.
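The relationship between the 64 × 64 inputs and the magnified outputs can be illustrated with a small patch-extraction sketch. The function name `paired_patches`, the height × width × channel layout, and the tiling strategy are assumptions for illustration; the actual preprocessing pipeline may differ.

```python
import numpy as np

def paired_patches(lr, hr, scale, size=64):
    """Cut non-overlapping size x size LR patches and the matching HR patches."""
    h, w = lr.shape[:2]
    pairs = []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            lr_patch = lr[top:top + size, left:left + size]
            hr_patch = hr[top * scale:(top + size) * scale,
                          left * scale:(left + size) * scale]
            pairs.append((lr_patch, hr_patch))
    return pairs

scale = 2
lr_image = np.zeros((128, 192, 3))                   # hypothetical LR image
hr_image = np.zeros((128 * scale, 192 * scale, 3))   # its HR counterpart
pairs = paired_patches(lr_image, hr_image, scale)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

With a scale of 2, 3, or 4, each 64 × 64 input patch is paired with a 128 × 128, 192 × 192, or 256 × 256 target patch, matching the sizes listed above.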

Experiment Results
3.1. Training Platform, Data, and Evaluation Metrics. For the SR task, the public datasets used for testing and validation mainly include Set5 [40], Set14 [41], BSD100 [42], Urban100 [43], DIV2K [44], and DTD [45]. The evaluation metrics include PSNR and SSIM [46]. PSNR (peak signal-to-noise ratio) is the most common and widely used objective measure of image quality; the greater the PSNR value, the less distorted the image. SSIM (structural similarity) is a full-reference image quality metric that measures image similarity in three aspects: brightness, contrast, and structure. The SSIM value range is [0, 1], and the larger the value, the smaller the image distortion. The training platform and relevant parameters used in this method are shown in Table 1.
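As a small illustration of two of the quantities reported in the experiments, the sketch below computes PSNR from the mean squared error and the L∞ error as the largest absolute pixel difference. The function names and the assumption that pixel values lie in [0, 1] (so the peak value is 1.0) are choices made for this example.

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    mse = np.mean((np.asarray(ref, dtype=float) - np.asarray(test, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

def linf_error(ref, test):
    """Largest absolute difference over all pixels; lower is better."""
    return float(np.max(np.abs(np.asarray(ref, dtype=float) - np.asarray(test, dtype=float))))

ref = np.zeros((4, 4))
test = np.full((4, 4), 0.1)   # every pixel off by 0.1
quality = psnr(ref, test)
worst = linf_error(ref, test)
print(quality, worst)
```

PSNR depends only on the average squared error, so two reconstructions with similar PSNR can still differ noticeably in their worst-case pixel error, which is why the L∞ error is tracked separately.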
Training settings and results: the training optimizer is Adam, with a learning rate (lr) of 0.0001, β1 = 0.9, and β2 = 0.999. The convolution kernel size is 3 × 3, the dataset is set according to reference [31], the input image size is 64 × 64, and the training platform is Keras 2.7. The proposed method achieves good results on different datasets and scaling factors and has better stability than the original RDN method.

Comparison with Existing Methods.
The comparison of the PSNR/SSIM results of the RDAN method with various recent significant methods is shown in Table 2. Our method achieves excellent PSNR/SSIM results on the standard datasets.
Our method achieves the best result in 6 of the 9 PSNR comparisons and in 6 of the 9 SSIM comparisons. One PSNR result ranks second, and two SSIM results rank second. The training settings follow the literature [31]. The size comparison is shown in Table 3.

Ablation Experiment.
This ablation experiment is designed to evaluate the impact of the feature attention structure on network performance. The results, which compare the RDN and RDAN methods with different settings, are displayed in Figure 6 and Table 4. The network performance improves after the attention module is added: compared with the original RDN method, the PSNR/SSIM/L∞ performance is improved by 0.70%/0.52%/2.03%, respectively (the smaller the L∞ error, the better the network performance). In contrast, when the CA module or the PA module is added alone, the PSNR/SSIM/L∞ performance decreases by 0.65%/0.82%/13.17% and 1.01%/0.91%/6.26%, respectively.
Thus, compared with the original RDN technique, performance cannot be increased by adding the CA or PA module alone; only by combining the two can the performance be enhanced. The comparison experiments demonstrate that adding the full attention structure improves the PSNR/SSIM/L∞ error and other performance metrics relative to the original RDN network.

Residual Scaling Layer.
We developed a residual scaling layer in the network to further enhance its residual convergence performance. To quantitatively assess and compare the performance of the residual scaling layer, we designed four groups of comparative tests. Together with the original RDAN0 network as the baseline, they are RDAN1: RDAN with residual scaling in the feature block; RDAN2: RDAN with the residual scaling layer in the feature group; and RDAN3: RDAN with the residual scaling layer and the L∞ error loss function. Figure 7 and Table 5 both show the results. These results demonstrate that the performance of the RDAN1 network is superior to that of RDAN2. The two networks differ in the position of the residual scaling layer: in RDAN1, residual scaling is set in the basic block, while in RDAN2 it is set in the RSDB. With the residual scaling layer, the PSNR/SSIM/L∞ of both methods is improved compared with the original RDAN0 network, by 0.82%/0.78%/14.56% and 0.28%/0.24%/11.08%, respectively. This also shows that scaling in the basic block performs better than scaling in the RSDB. Besides, the performance improvement is more pronounced when our L_H loss function is used: the performance metrics improve by 1.19%, 1.18%, and 14.24%, respectively. This indicates that residual scaling and the L_H hybrid loss function both play an important role in promoting the performance and that the two techniques can be used simultaneously, with particular benefits for the PSNR and SSIM metrics. In summary, the position of residual scaling has some impact on the network's performance, residual scaling in the basic block is superior to that in the RSDB, and the combined use of the L_H hybrid loss function and residual scaling achieves even better results.
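For readers unfamiliar with the idea, residual scaling is commonly implemented by multiplying the output of a residual branch by a small constant before adding it back to the input. The sketch below shows that common form only; the function name, the scale value of 0.2, and the toy branch are assumptions, and the scaling layer used in RDAN may be defined differently.

```python
import numpy as np

def residual_scaling(x, branch, scale=0.2):
    """Return x + scale * branch(x); scale damps the residual branch (illustrative)."""
    return x + scale * branch(x)

def toy_branch(t):
    # Stand-in for a block of convolution and attention layers.
    return 2.0 * t

x = np.ones((3, 4, 4))
scaled = residual_scaling(x, toy_branch)                # 1 + 0.2 * 2
identity = residual_scaling(x, toy_branch, scale=0.0)   # branch fully suppressed
print(scaled[0, 0, 0], identity[0, 0, 0])
```

Where the scaling is applied (inside each basic block or once per larger block) changes how strongly each stage can modify its input, which is the kind of placement difference compared between RDAN1 and RDAN2.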

Influence of β in the Loss Function on the Performance Metrics.
We found through the previous two sets of comparative experiments that the L_H loss function can boost the network's performance. Hence, we designed four groups of ablation experiments to further verify the effect of the loss function on network performance. These four sets of comparative tests use β values of 0.0, 0.002, 0.01, and 0.02, successively. Figures 8-9 and Table 6 show the influence of different β values on the PSNR/SSIM/L∞ error performance. The outcomes demonstrate that the network performance metrics improve when β is within [0.002, 0.01] after the new hybrid loss function is adopted.

Figure 9: The effect of β on the performance metrics L∞ error.

As can be seen in Figure 9, the network's L∞ error is nearly saturated before the adoption of the hybrid loss function, which means that more training cycles will not significantly lower its L∞ error. The L∞ error is further decreased when the new hybrid loss function is used, demonstrating the effectiveness of the hybrid loss function in decreasing the L∞ error. When β = 0.002, the PSNR/SSIM performance is enhanced by 0.55%/0.70%, respectively, while the L∞ performance is enhanced by 11.85%, which is a clear improvement.
As β continues to increase, the PSNR/SSIM metrics increase by 0.32%/0.38%, which is not a pronounced gain, and the L∞ error decreases by 0.56%. When β increases further, for example to β = 0.02, the PSNR/SSIM metrics decrease by 0.98%/1.12%, respectively, and the performance deteriorates significantly. This experiment therefore shows that a β value in the range [0.002, 0.01] can effectively improve the network performance, while the performance may deteriorate when β falls outside this range. In terms of visual performance, the results restored by our approach are clearly better than those of the SwinIR and DeFiAN methods in Figure 10. For example, the L∞ errors of SwinIR and DeFiAN are 0.2784 and 0.2471, respectively, which are larger than the 0.2314 L∞ error of our method, demonstrating that the RDAN method can effectively reduce the L∞ error.
Besides, as shown in Figure 11, the PSNR/SSIM of the SwinIR method is 25.0648/0.5559, which is better than that of the RDN method with 24.9222/0.5438, but the L∞ error of the SwinIR method is 0.4902, which is higher than that of the RDN method with 0.4157. This result shows that a better PSNR/SSIM result does not ensure better L∞ performance. Similar results can also be found in Figures 12 and 13.
3.7. Super-Resolution for Real-World Images. The training data and the test data for the preceding experiments come from the same dataset, and data from the same dataset have certain similarities even when there is no overlap between the training data and the test data. To further compare the generalization performance of the various approaches on unknown datasets, we train and test on different datasets, which enables quantitative comparison and gets around the previous restriction that real-image comparison could rely only on visual comparison.
In this test, the training dataset is DIV2K, and the test datasets are Set14, BSD100, and DTD. Since these images also have high-resolution counterparts for comparison, PSNR, SSIM, and other indicators can be compared quantitatively. The performance of the RDN, SwinIR, and DeFiAN methods is compared in Figures 14-17.
Taking Figure 16 as an example, the PSNR/SSIM/L∞ metrics of the SwinIR and DeFiAN methods are 25.5228/0.6711/0.2902 and 26.3475/0.7021/0.2667, respectively, while those of our method are 26.6747/0.7099/0.2510. Compared with the SwinIR and DeFiAN methods, the performance is clearly improved.
From this result, we can see that our method achieves the best performance, with both better PSNR/SSIM metrics and a better L∞ error. This demonstrates that the RDAN method has better generalization performance on unfamiliar datasets because it reduces not only the average error but also the maximum error of local pixels.

Conclusion
Combining the advantages of the residual dense network and the feature attention mechanism, we proposed a more effective residual dense attention network (RDAN) for image super-resolution, which primarily consists of pixel attention, channel attention, and residual dense structures. By introducing the attention mechanism, the network can focus on the more critical information in the current task among the many inputs, reduce the attention paid to other areas, and even filter out irrelevant information to solve the problem of information overload.
Meanwhile, we proposed a hybrid loss function based on the combination of the L1 and L∞ errors. A suggested range for its parameter is given according to a set of comparison experiments; a value that is too large or too small may deteriorate the final performance. The experimental results show that the new method can obtain not only better PSNR/SSIM performance but also better L∞ performance in our proposed RDAN network based on the feature attention mechanism. Moreover, performance verification on real-world images shows that our proposed method also has clear advantages. The experimental results indicate that the residual dense connection equipped with pixel attention and channel attention has a positive influence on the final results. We can therefore conclude that the attention approach together with the residual dense structure can improve the performance of the network if it is properly designed.

Data Availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest
We declare that there are no conflicts of interest in this research.