DFAN: Dual Feature Aggregation Network for Lightweight Image Super-Resolution

With the power of deep learning, super-resolution (SR) methods enjoy a dramatic boost in performance. However, they usually have a large model size and high computational complexity, which hinders the application in devices with limited memory and computing power. Some lightweight SR methods solve this issue by directly designing shallower architectures, but it will adversely affect the representation capability of convolutional neural networks. To address this issue, we propose the dual feature aggregation strategy for image SR. It enhances feature utilization via feature reuse, which largely improves the representation ability while only introducing marginal computational cost. Thus, a smaller model could achieve better cost-effectiveness with the dual feature aggregation strategy. Specifically, it consists of Local Aggregation Module (LAM) and Global Aggregation Module (GAM). LAM and GAM work together to further fuse hierarchical features adaptively along the channel and spatial dimensions. In addition, we propose a compact basic building block to compress the model size and extract hierarchical features in a more efficient way. Extensive experiments suggest that the proposed network performs favorably against state-of-the-art SR methods in terms of visual quality, memory footprint, and computational complexity.


Introduction
Single image super-resolution (SISR) aims to reconstruct a visually natural high-resolution (HR) image from its low-resolution (LR) counterpart, which is an inherently ill-posed inverse problem. Owing to its essential role in video processing [1], surveillance systems [2], and object restoration [3], super-resolution (SR) is still an active research area.
Recently, deep learning-based image super-resolution methods [4][5][6][7] have shown prominent performance over conventional methods such as Bicubic interpolation and Lanczos resampling. After the proposal of residual learning [8], which simplifies the optimization of deep convolutional neural networks (CNNs), SR networks tend to become even deeper and larger. However, it is impractical to simply pursue performance gains without considering the model size and computational complexity. For devices with limited memory and battery capacity, cost-effective methods are preferred, which encourages the design of lightweight SR models. To reduce the number of parameters, some approaches adopt a recursive manner or parameter sharing scheme [9,10]. However, to compensate for the performance drop, these methods have to increase the network width or depth, thus, resulting in high computational complexity as shown in Figure 1. Some other methods directly design shallower network architectures, which reduce parameters and calculations simultaneously. For example, [11,12] are such compact models with fewer than 40 layers. However, their representation ability is restricted by the shallow architecture.
To address these drawbacks, we propose the Dual Feature Aggregation Network (DFAN), which strikes a better trade-off between SR performance and computational cost, as illustrated in Figure 1. The key component of DFAN is the dual feature aggregation strategy. It aggregates local features and global features in a coarse-to-fine manner and largely improves feature utilization via feature reuse. Specifically, the dual feature aggregation strategy consists of two modules: the Local Aggregation Module (LAM) and the Global Aggregation Module (GAM). LAM uses an efficient connection method and one convolutional layer to adaptively fuse hierarchical features along the channel dimension. Then, GAM further fuses the local aggregated features along the spatial dimension in an iterative manner. This progressive aggregation strategy fully leverages all hierarchical features, which enables the lightweight model to achieve better SR performance. In this paper, we also design an Efficient Convolutional Block (ECB) as the basic building block of DFAN. It comprises group convolutional layers with a channel shuffle operation. Although ECB is compact, DFAN can still achieve competitive results with the help of the dual feature aggregation strategy.
In summary, our main contributions are as follows: (i) We propose DFAN, which can achieve better SR performance with limited computational cost. It is more practical in real applications
[14]. However, these methods are computationally expensive for real applications. Thus, more and more lightweight SR methods have been proposed. Deep Recursive Residual Network (DRRN) [9] and Memory Network (MemNet) [10] introduce recursive learning or weight sharing schemes to reduce parameters. However, they need to increase the computational complexity to compensate for the performance drop. Another idea is to build relatively shallower models, which can cut down the model size and calculations at the same time. Cascading Residual Network (CARN) [11], Information Distillation Network (IDN) [12], and Information Multi-Distillation Network (IMDN) [15] are all lightweight networks that have fewer than 40 layers. However, the shallow architecture could restrict their representation ability to some extent. In our method, we improve the feature utilization through dual feature aggregation, which can better balance the SR performance and computational cost.
[20] where the model is distributed over two GPUs, resulting in gains in accuracy and convergence speed. Depthwise convolution is a special case of group convolution and was originally introduced in [21]. In depthwise convolution, the number of groups is equal to the number of channels. Based on depthwise convolution, Mobile Network (MobileNet) [18] gains state-of-the-art results among lightweight models in many visual tasks. Then, group convolution and depthwise convolution are generalized in a novel form in [22]. The channel shuffle operation is also proposed in [22] to overcome the side effect of group convolution. Recently, group convolution has been used in some lightweight image super-resolution methods. Ahn et al. [11] proposed an efficient residual block containing group convolutional layers, and Hui et al. [12] introduced group convolution to some specific layers. However, there is still room for improvement in the reconstruction performance of these two models. In our DFAN, group convolution is used as a basic building unit without affecting the reconstruction performance.

Deep Feature Aggregation.
As the feature representation capability of a single network layer is limited [23,24], deep feature aggregation is typically used to fuse features of different layers, which can improve the representation capability in a computationally economical way. For instance, the Densely Connected Network (DenseNet) [25] and the Feature Pyramid Network (FPN) [26] are the dominant architectures for semantic feature aggregation and spatial feature aggregation [27]. DenseNet can better propagate features and gradients through dense connections that connect each layer to every other layer in a feed-forward fashion. FPN can equalize resolution and standardize semantics across the levels of a pyramidal feature hierarchy through top-down and lateral connections. Besides, Residual Network (ResNet) [8] is also a typical feature aggregation method which aggregates features via simple element-wise summation. Recently, Yu et al. [28] proposed an iterative aggregation method and a hierarchical aggregation method, which can further improve the performance of the aforementioned dominant architectures in many visual tasks. Inspired by this work, we introduce an iterative and adaptive global feature aggregation module to DFAN, obtaining more comprehensive information and improving reconstruction performance.

Proposed Method
3.1. Network Architecture.
As depicted in Figure 2(a), DFAN mainly consists of four parts: the shallow feature extraction layer, stacked local feature aggregation modules, the global feature aggregation module, and the upsampling module. The shallow feature extraction layer contains only one convolutional layer. It extracts shallow features $F_0$ from the LR image. Then, $F_0$ is input into the stacked LAMs for global residual learning. There are $M$ stacked LAMs, and the local aggregated feature from the $m$-th LAM can be formulated as
$$F_m = f_{LAM}^{m}(F_{m-1}), \quad m = 1, \dots, M,$$
where $f_{LAM}^{m}$ refers to the operation of the $m$-th LAM, and $F_m$ is the local aggregated feature from it. As shown in Figure 2, each LAM is composed of a series of ECBs; therefore, $f_{LAM}^{m}$ can be viewed as a composite function.
After that, GAM fully leverages the local aggregated features from the LAMs in an iterative way, which can be expressed as
$$F_A = f_{GAM}(F_1, F_2, \dots, F_M),$$
where $F_A$ is the global aggregated feature and $f_{GAM}$ denotes the operation of GAM. Then, the global long skip connection adds $F_0$ to $F_A$, obtaining the final aggregated feature $F = F_A + F_0$. The global skip connection can better propagate information and gradients, thus stabilizing the training of DFAN. Finally, we use the upscale module proposed in [29] to restore the final SR image $I_{SR}$ from $F$. This step involves $f_{gconv}$, the group convolution, $f_{conv}$, the standard convolution, and $f_{\uparrow}$, the upscaling module.

Local Feature Aggregation.
Since features of different layers contain differently weighted information, adaptively aggregating all hierarchical features could effectively improve the representation ability. Referring to [28], the key axes of feature fusion are semantic and spatial, which are closely related to the channel and spatial dimensions, respectively. Thus, we propose the dual feature aggregation strategy, in which features are locally aggregated along the channel dimension and then globally aggregated along the spatial dimension. In this subsection, we first explain the local feature aggregation. As shown in Figure 2(c), ECB is the basic building block of LAM. ECB is a residual learning module consisting of two group convolutional layers with a channel shuffle operation [22] and a channel attention module [7]. Group convolution with channel shuffle can extract useful features in a computationally economical way. Assuming an $s \times s$ group convolutional kernel has $g$ groups, its parameter count and computational complexity are both $1/g$ of those of an $s \times s$ standard convolutional kernel. Moreover, the channel shuffle operation enhances the information exchange among channels without extra parameters or calculations. There are $B$ ECBs in each LAM. LAM fuses hierarchical features from ECBs by exploring the interchannel relationship. The local aggregated feature $F_m$ from the $m$-th LAM is obtained by fusing the outputs of its $B$ ECBs with a single 1 × 1 convolution.
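The 1/g saving and the effect of channel shuffle can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `conv_params` and `channel_shuffle` are hypothetical, biases are ignored, the 3 × 3 kernel size is an assumption, and the shuffle follows the reshape-transpose-flatten scheme commonly used to implement [22], applied here to a list of channel indices rather than to feature maps.

```python
def conv_params(c_in, c_out, s, groups=1):
    """Weight count of an s x s convolution with `groups` groups (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels.
    return groups * (c_in // groups) * (c_out // groups) * s * s


def channel_shuffle(channels, groups):
    """Reorder channels so that each output group draws from every input group."""
    n = len(channels)
    if n % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # View as (groups, per_group), transpose to (per_group, groups), then flatten.
    return [channels[g * per_group + k]
            for k in range(per_group)
            for g in range(groups)]


# C = 64 channels and g = 8 groups match the DFAN setting; 3 x 3 is assumed.
standard = conv_params(64, 64, 3)
grouped = conv_params(64, 64, 3, groups=8)
print("standard:", standard, "grouped:", grouped, "ratio:", standard // grouped)

# Channels 0-3 form the first input group and 4-7 the second.
print(channel_shuffle(list(range(8)), groups=2))
```

Without the shuffle, stacked group convolutions would keep each group of channels isolated from the others; interleaving the channels between layers costs no parameters, which is why it pairs naturally with the grouped kernels.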

Balanced Connection.
The connection method in LAM is what we call balanced connection. As shown in Figure 3, compared with two commonly used connection methods in SR, i.e., skip connection and dense connection, our balanced connection is more flexible than skip connection and more lightweight than dense connection. The analysis is as follows:
(1) Difference to Skip Connection. As shown in Figure 3(b), for each LAM, if we only use skip connection, which takes the element-wise sum of the hierarchical feature maps, all hierarchical features will contribute equally to the final aggregated feature. This may be inflexible since different features contain information of different importance. Our balanced connection solves this issue with a 1 × 1 convolutional kernel. This convolutional kernel assigns specific learned weights to each pixel of local features, thus adaptively aggregating them along the channel dimension.
(2) Difference to Dense Connection. As shown in Figure 3(c), dense connection concatenates and compresses the outputs of all preceding ECBs as the input to each subsequent ECB, which requires more 1 × 1 convolutional kernels and harms the overall efficiency. In contrast, our balanced connection directly connects each ECB for feature aggregation, which not only fully uses local features but also greatly reduces the number of parameters and computation operations.

3.3. Global Feature Aggregation.
The spatial dimension is orthogonal to the channel dimension. Thus, further fusing local aggregated features along the spatial dimension could supplement more information. Besides, since local aggregated features contain abundant information, it could be suitable to aggregate them in a coarse-to-fine fashion.

Wireless Communications and Mobile Computing
Therefore, we design GAM, which further fuses local aggregated features with a spatial attention mechanism in an iterative manner.
In Figure 2(d), $F_{A,i}$ represents the global aggregated feature in the $i$-th iteration, and $F_i$ represents the output of the $i$-th LAM. The iterative fusion of GAM can be formulated as
$$F_{A,i} = f_G(F_{A,i-1}, F_i),$$
where $F_{A,i}$ is initialized with $F_1$, which is the output of the first LAM, and $f_G$ represents the global aggregation of GAM.
The main parts of GAM are (1) spatial attention generation and (2) iterative feature aggregation. First, the spatial attention $\theta_i$ is generated by
$$\theta_i = \sigma(f_{dconv}(f_{conv}([F_{A,i-1}, F_i]))),$$
where $f_{conv}$ denotes a 1 × 1 convolutional kernel that reduces the channel number of the concatenation $[F_{A,i-1}, F_i]$ by half, and $f_{dconv}$ denotes a 3 × 3 depthwise convolutional kernel that extracts spatial information. Depthwise convolution applies a single filter to each input channel, which is more efficient than common convolution in terms of memory and computation. $\sigma$ is the Sigmoid activation function constraining the spatial attention to $(0, 1)$. The spatial attention $\theta_i$ is the same size as $F_i$ and $F_{A,i-1}$. Second, as shown in Figure 2(d), the feature fusion in the A-Unit combines $F_{A,i-1}$ and $F_i$ using the complementary weights $\theta_i$ and $I - \theta_i$, applied with the Hadamard product $\otimes$, where $I$ is the tensor with all elements being 1.
After the $G$-th iteration, we obtain the final global aggregated feature $F_A$. The overall iterative global aggregation can be summarized as
$$F_A = \sum_{i} \Theta_i \otimes F_i, \tag{8}$$
where $\Theta_i$ denotes the final spatial attention for $F_i$, which is determined by all the local aggregated features from the LAMs and is thus highly comprehensive. Additionally, Eq. (8) indicates that the global feature aggregation is a convex combination of the local aggregated features.
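To see why iterative fusion yields a convex combination, it helps to follow a single pixel of a single channel. The sketch below assumes, as one plausible reading of the A-Unit, that each step blends the new LAM output with weight θ and the running aggregate with weight 1 − θ; which operand receives θ is an assumption, as are the function names and the sample numbers. Under either assignment, the effective weights are nonnegative and sum to 1.

```python
def effective_weights(thetas):
    """Effective weight of each LAM output after iterative blending.

    thetas[j] is the attention value in (0, 1) used when the (j + 2)-th LAM
    output is fused; the first LAM output initializes the aggregate.
    """
    weights = [1.0]  # the aggregate starts as the first LAM output
    for theta in thetas:
        if not 0.0 < theta < 1.0:
            raise ValueError("sigmoid outputs lie strictly between 0 and 1")
        # Earlier contributions shrink by (1 - theta); the new one enters with theta.
        weights = [w * (1.0 - theta) for w in weights] + [theta]
    return weights


def aggregate(features, thetas):
    """Run the assumed A-Unit blend step by step on scalar features."""
    agg = features[0]
    for feature, theta in zip(features[1:], thetas):
        agg = theta * feature + (1.0 - theta) * agg
    return agg


features = [0.2, 0.9, 0.4, 0.7]   # hypothetical values of F_1..F_4 at one pixel
thetas = [0.5, 0.25, 0.8]         # hypothetical attention values

weights = effective_weights(thetas)
direct = aggregate(features, thetas)
as_sum = sum(w * f for w, f in zip(weights, features))
print("weights:", weights, "sum:", sum(weights))
print("iterative:", direct, "weighted sum:", as_sum)
```

The two printed results agree up to floating-point rounding, which is the scalar analogue of writing the iterative aggregation as a single weighted sum over all LAM outputs.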

Study on Efficient Convolutional Block.
Different from most super-resolution networks, our DFAN uses group convolutional kernels instead of standard convolutional kernels in an ECB to extract features. Since group convolution is a basic operation of our ECB, we design DFAN_W and DFAN_D to validate the effectiveness of ECB. These two models have the same structure as DFAN, but the group convolutional kernels in ECBs are replaced with standard convolutional kernels. All three models have a similar number of parameters and computation operations, i.e., approximately 900 K and 60 G, respectively.
We denote the number of ECBs in each LAM as B, the number of LAMs as M, and the number of channels of each intermediate feature as C. For our DFAN, we set B, M, and C to 10, 6, and 64, respectively, and the group number of each group convolutional kernel in ECBs is 8. We set B, M, and C to 3, 2, and 64, respectively, for DFAN_W, and to 10, 6, and 27, respectively, for DFAN_D. In other words, the width of DFAN_W is the same as that of DFAN, while the depth of DFAN_D is the same as that of DFAN.
As shown in Table 1, group convolution makes an outstanding trade-off between representation capability and computational costs. Compared with standard convolution, group convolution can make the model deeper or wider with limited parameters and calculations, which is beneficial to obtain richer hierarchical information.

Study on Dual Feature Aggregation.
In this section, we experimentally investigate the effectiveness of the dual feature aggregation strategy. LAM0_GAM0 is the baseline network obtained by removing balanced connections in LAM and GAM from DFAN. LAM1_GAM0 is built by removing GAM from DFAN. LAM1_GAM1 has both LAM and GAM, which is the same as DFAN. As shown in Table 2, when only LAM is added, PSNR is improved by approximately 0.05 dB. When both LAM and GAM are added, the performance is improved by a large margin (PSNR: +0.15 dB on Set14).

LAM Analysis.
To intuitively show the effectiveness of LAM, we plot the training curves of LAM0_GAM0 and LAM1_GAM0 in Figure 4(a). Benefitting from the balanced connection in LAM, gradients could be better propagated. The margin between the two curves indicates that LAM could not only help the network converge faster but also help it converge to a better point. Additionally, the weight distribution is visualized in Figure 4(b). It indicates how much the information of each ECB in an LAM contributes to the local aggregated feature generated by this LAM. Features from different ECBs contribute differently to local aggregated features, which suggests that LAM could adaptively aggregate hierarchical features to improve the final performance.

GAM Analysis.
We experimentally show that GAM also works well for some other networks. We use a shallower RCAN [7] as the baseline network (denoted as sRCAN). To facilitate network training, we set the RG number to 3 and the RCAB number to 5 for sRCAN. Then, we apply our GAM to sRCAN, which is denoted as sRCAN_GAM. As Table 3 shows, with only a small increase in parameters and computational complexity (Parameters: +9 K, Mult-Adds: +0.9 G), GAM can significantly improve the SR performance on all the benchmark datasets with scaling factor × 4. Therefore, GAM could be used as a general lightweight tool to improve the performance of some existing SR methods.
To better understand the adaptive and iterative aggregation strategy of GAM, we visualize the spatial attention heatmaps generated by GAM in Figure 5. The 3D attention is transformed to 2D by taking the absolute mean along the channel dimension and then normalized to [0, 1] over the spatial dimension. We can see that (1) spatial attention for different LAMs focuses on regions of different frequencies. For example, the spatial attention for LAM_1 (Figure 5(a)) focuses on low-frequency regions such as the background, while the spatial attention for LAM_6 (Figure 5(f)) focuses more on high-frequency regions with rich textures. Thus, both high-frequency and low-frequency information is important for SR. (2) Although some spatial attention maps focus on high-frequency regions, they emphasize different parts. In LAM_6, more attention is given to regions of the main object, whereas in LAM_5, high-frequency regions in the background are emphasized. It indicates that GAM provides
[38], DRRN [9], MemNet [10], CARN [11], IDN [12], and IMDN [15].
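The heatmap conversion described for Figure 5 (a channel-wise absolute mean followed by spatial normalization to [0, 1]) can be sketched as follows. This is an illustration under stated assumptions, not the authors' plotting code: "absolute mean" is read here as the mean of absolute values over channels (the absolute value of the channel mean would be another reading), the attention is a nested list indexed as [channel][row][column], and `attention_to_heatmap` is a hypothetical helper.

```python
def attention_to_heatmap(attention):
    """Collapse a [channel][row][col] attention tensor to a 2D map in [0, 1]."""
    channels = len(attention)
    rows, cols = len(attention[0]), len(attention[0][0])

    # Mean of absolute values along the channel dimension gives a 2D map.
    mean_abs = [
        [sum(abs(attention[c][r][k]) for c in range(channels)) / channels
         for k in range(cols)]
        for r in range(rows)
    ]

    # Min-max normalization over the spatial dimension.
    lowest = min(min(row) for row in mean_abs)
    highest = max(max(row) for row in mean_abs)
    if highest == lowest:
        return [[0.0] * cols for _ in range(rows)]  # flat map, nothing to contrast
    return [[(v - lowest) / (highest - lowest) for v in row] for row in mean_abs]


# Hypothetical 2-channel, 2 x 2 attention tensor.
attention = [
    [[0.1, 0.9], [0.5, 0.3]],
    [[0.3, 0.7], [0.5, 0.1]],
]
heatmap = attention_to_heatmap(attention)
for row in heatmap:
    print(row)
```

Because the normalization is applied per map, heatmaps of different LAMs show where each one attends relative to its own range; absolute magnitudes are not comparable across panels.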

Quantitative Results.
We evaluate the average PSNR and SSIM on five benchmark datasets. In particular, we also calculate the number of parameters and Mult-Adds of these models by assuming the HR image size to be 720p (1280 × 720). In Table 4, the proposed DFAN performs favorably against these methods on all benchmark datasets for 2×, 3×, and 4× SR. Note that the number of parameters of our method is inconsistent for different scales because we apply the pixel-shuffle operation [29] for upscaling, and the convolutional kernels in the upscaling module are of different sizes for different scales. CARN [11] used to be a strong baseline for lightweight SR models, but our DFAN outperforms it by a large margin (PSNR: +0.17 dB,  fewer Mult-Adds on scale × 4. It indicates that our method can achieve a better trade-off between computational cost and effectiveness. Therefore, feature aggregation has promising prospects in the research of lightweight image SR.
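The 720p convention and the scale-dependent parameter count can be illustrated with simple arithmetic. The sketch assumes a sub-pixel upsampler in the style of [29] whose last convolution maps C features to 3·r² channels before the pixel-shuffle rearrangement. The 3 × 3 kernel size, the single-stage layout, and the helper names are assumptions rather than details given in the text, and biases are ignored.

```python
def lr_size(scale, hr_width=1280, hr_height=720):
    """LR resolution that yields a 720p output (floored; 1280 is not divisible by 3)."""
    return hr_width // scale, hr_height // scale


def conv_mult_adds(width, height, c_in, c_out, k, groups=1):
    """Multiply-adds of a k x k convolution evaluated at every output pixel."""
    return width * height * (c_in // groups) * c_out * k * k


def upsampler_conv_params(c_in, scale, k=3, out_channels=3):
    """Weights of a conv producing out_channels * scale^2 maps for pixel shuffle."""
    return c_in * out_channels * scale * scale * k * k


for scale in (2, 3, 4):
    width, height = lr_size(scale)
    # One grouped 3 x 3 layer with C = 64 and g = 8, run at LR resolution.
    body = conv_mult_adds(width, height, 64, 64, 3, groups=8)
    tail = upsampler_conv_params(64, scale)
    print(f"x{scale}: LR {width}x{height}, "
          f"grouped conv Mult-Adds {body}, upsampler conv params {tail}")
```

Two effects are visible in this toy setting: the body of the network runs on fewer pixels as the scale grows, so its Mult-Adds fall, while the convolution feeding the pixel shuffle must produce r² times more channels, so its parameter count rises with the scale.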

Visual Results.
In Figure 6, we show visual comparisons on scale × 4. Our method restores the letter "g" in "ppt3" more clearly, while most other methods encounter artifacts or edge distortion. For "img030" in Urban100 and "img86000" in BSD100, most methods do not reconstruct the contour of the window well, but our method can reconstruct these edges better.
As shown in Table 5, compared with the networks that are stacked by several elaborately designed building blocks, such as CARN, IDN, and IMDN, our lightweight network with the dual feature aggregation strategy can better leverage the hierarchical features. In addition, the visual comparison in Figure 7 also demonstrates the superiority of our method.

Conclusions
We propose DFAN, which strikes a better trade-off between SR performance and computational cost. The proposed dual feature aggregation strategy performs local and global feature aggregation adaptively. Through feature reuse, it simultaneously improves feature utilization and representation ability. Benefitting from the dual feature aggregation strategy, our network achieves competitive performance with fewer parameters and lower computational complexity, which is more practical for real applications.

Data Availability
The image datasets supporting this work are from previously reported studies and datasets, which have been cited. The processed data are available at the repository: BasicSR (https://github.com/xinntao/BasicSR/blob/master/docs/DatasetPreparation.md#Image-Super-Resolution).

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Acknowledgments
This work was supported by the National Key R&D Program of China (2019YFB1406200) and was also the research achievement of the