A Multi-Attention Feature Distillation Neural Network for Lightweight Single Image Super-Resolution



Introduction
As image acquisition and analysis techniques develop, image-based intelligent information processing systems play an important role in various fields such as healthcare, education, transportation, and entertainment. Apparently, high-resolution images are expected in these applications to guarantee accuracy and reliability. However, images acquired in real scenarios are easily degraded due to the limitations of imaging systems and conditions, decreasing the performance of information processing systems and even causing abnormal operations. Therefore, it is of great importance to ensure the resolution and clarity of the images used for intelligent analysis and decision-making.
Super-resolution (SR) aims at enhancing the quality of observed images or videos, especially by improving their resolution [1, 2], and it has garnered widespread interest for its potential applications in a wide range of fields such as smart industry [3], intelligent monitoring [4], medical imaging [5], and remote sensing [6, 7]. In particular, single image SR (SISR) refers to the process of reconstructing a high-resolution (HR) output from a given low-resolution (LR) observation. Initially, various interpolation approaches [8-10] were proposed for single image upsampling. After this period, SISR methods were mainly based on priors derived from natural images, e.g., non-local similarity [11, 12], local smoothness [13], and sparsity [14]. Meanwhile, conventional learning models including sparse coding [15], neighborhood regression [16], and random forest [17] also had a significant impact in this field. In recent years, SR capacity has been significantly improved by deep convolutional neural networks (CNN) [18-23], as in other information processing areas such as object detection [24], disease classification [25], segmentation [26], emotion and activity recognition [27, 28], image generation [29], manipulation and deepfake detection [30, 31], and quality enhancement and assessment [32-36]. However, most deep CNN-based SISR methods boost performance by deepening or widening networks, resulting in a large number of network parameters and high computational complexity. This limitation hinders their application in real scenarios. Therefore, it is essential to develop lightweight SISR models that are more suitable for practical applications. Benefiting from various network designs based on feature distillation [37-39] and attention mechanisms [40-42], recent research on lightweight SISR has made considerable progress. Nevertheless, how to fully utilize deep features to balance the capacity and complexity of an SR network remains an open problem [43].
To address the above issues, on the basis of the well-known information multi-distillation model [37], this paper proposes a multi-attention feature distillation network termed MAFDN for lightweight and accurate SISR. Specifically, we combine complementary attention mechanisms to assist the distillation of the features used for reconstruction, and employ an over-parameterized block to further enhance representation ability without incurring any complexity increase in inference, thus achieving a good balance between reconstruction capacity and model complexity. The main contributions of this study are summarized as follows:

Related Work
2.1. Deep CNN for Performance-Oriented SISR. The three-layer SR network SRCNN [18, 19] is the pioneering work among deep CNN-based SISR methods. After this, a large number of networks were developed in pursuit of greater reconstruction capacity, i.e., higher quality of the reconstructed HR images. Kim et al. [44] deepened the network to twenty layers based on residual learning. Lim et al. [20] won the super-resolution challenge in NTIRE 2017 by enhancing both the network depth and width. To take advantage of hierarchical features from different convolutional layers, Zhang et al. [45] proposed the residual dense network for SISR. Furthermore, they developed a residual channel attention network with over four hundred layers and 16M parameters, achieving excellent reconstruction performance [21]. For greater ability to learn feature expression and feature correlation, Dai et al. [46] designed a second-order attention network. Niu et al. [22] proposed an SR network based on channel-spatial attention and layer attention modules, which utilizes the holistic interdependencies among positions, channels, and layers. Huang et al. [47] developed a pyramid super-resolution network based on cross-layer non-local attention, achieving coarse-to-fine reconstruction of the HR image. To address the problem of over-smoothed details, Hsu et al. [48] proposed to individually reconstruct low-frequency structures and high-frequency details using different subnetworks. Liang et al. [49] built a strong image restoration baseline, SwinIR, based on the Swin Transformer, achieving superior performance over previous advanced methods including RCAN [21] and HAN [22] on benchmark datasets with comparable network parameters. Considering that the computation of the self-attention-based SR transformer is expensive and redundant, Zhang et al. [50] proposed to characterize the long-range dependency property in images with the efficient long-range attention network, obtaining better SR results than SwinIR [49] with less complexity. More recently, Su et al. [23] proposed a feature fusion strategy based on global learnable attention and further designed a deep learnable similarity network, achieving outstanding reconstruction accuracy.
On the whole, the performance of recent SR networks such as ELAN [50] and DLSN [23] has grown significantly compared with the pioneering work SRCNN [18]. However, the improvement in reconstruction ability is generally accompanied by increasing model size and computational complexity. As a result, these methods are often difficult to apply in resource-constrained scenarios like industrial production.

International Journal of Intelligent Systems

2.2. Lightweight CNN for Complexity-Oriented SISR.
Different from the above reconstruction performance-oriented networks, lightweight SISR models focus more on model complexity, including model size (i.e., the number of network parameters) and the number of operations (i.e., multi-adds) [43]. This kind of SISR method has received increasing attention for its potential application in resource-limited systems.
Early attempts [51-53] use recursive structures to reduce model size. However, this strategy does not decrease computational complexity. Hui et al. [54] used distillation blocks composed of enhancement and compression units to gradually extract compact features for HR image reconstruction. Subsequently, they developed the multi-distillation block, which contains distillation and selective fusion parts, to extract hierarchical features and fuse them according to their importance as evaluated by channel attention [37]. After that, feature distillation and attention mechanisms became the two main elements of lightweight SISR models. For example, a skip connection and channel separation-based multi-stage residual distillation block was designed for SISR by Yang et al. [55]. For efficient SR reconstruction, Zong et al. [56] proposed an asymmetric information distillation block with the capabilities of distillation information multiplexing and asymmetric information extraction. Yang et al. [57] developed the ranked information distillation block for SR reconstruction, in which the extracted feature channels are sorted according to the degree of channel redundancy. Wang et al. [39] designed a dynamic distillation fusion module for SISR, which refines dynamic features gradually to maximize the role of hierarchical dynamic information. To learn more discriminative feature representations, Liu et al. [58] adopted multiple connections of feature distillation.
To exploit inter-channel dependencies and capture long-range spatial dependencies simultaneously, Behjati et al. [59] introduced the directional variance attention mechanism. In a different way, Zhu et al. [60] developed the expectation-maximization attention block in their network for modeling long-range feature dependencies. Park et al. [40] designed a residual self-attention module to produce 3D attention maps in their dynamic residual network, leading to better performance without complex computation. Song and Zhong [41] built a lightweight SISR network using the local-global attention block, which includes three attention parts, i.e., long-range attention, window attention, and shifted window attention. Gao et al. [38] developed the wide-residual attention weighting unit with strong feature distillation capabilities for lightweight SISR. Wang et al. [61] integrated channel attention and multi-aware attention modules to capture multiple kinds of content-aware information at lower model complexity. Tang et al. [62] developed a lightweight SR network using sparse self-attention and spatial/channel masks.
To strike a good balance between computational cost and reconstruction quality, Zhang et al. [63] simplified feature aggregation by using residual modules for feature learning. Wang et al. [64] proposed to reduce repetitive feature information via feature deredundancy and self-calibration, thus enhancing model efficiency. Liu et al. [65] proposed to use linear transformations to generate similar feature maps, so as to significantly reduce network parameters with comparable SR performance. Sun et al. [66] utilized pixel-unshuffled downsampling and self-residual depthwise separable convolutions to achieve competitive SR performance with fewer parameters and lower computational costs. Considering the complementary characteristics of CNN and transformer, there are also frameworks combining them for lightweight SR [67, 68].
Overall, compared with performance-oriented methods, lightweight SISR networks achieve a better balance between model ability and complexity, and are thus more practical in resource-limited or real-time applications. Nevertheless, although many effective lightweight SISR networks have been proposed in recent years, how to more fully utilize deep features within a limited model size remains a challenge. In this work, we aim to enhance the distillation of the features used for reconstruction by introducing complementary attention mechanisms, and to further boost model representation ability using the over-parameterized block without additional parameter or computation cost at inference. In this way, a good balance between reconstruction capacity and model complexity can be achieved.

Proposed MAFDN
The proposed lightweight SISR network MAFDN is presented in this section. The overall architecture of MAFDN is first described, and then its core component MAFDB is introduced in detail.

3.1. Architecture of MAFDN. The diagram of MAFDN is presented in Figure 1, in which activation layers are omitted for brevity. To keep the network lightweight, MAFDN is designed following the idea of feature distillation. Specifically, MAFDN contains three stages: feature extraction, multi-stage feature distillation (MSFD), and image reconstruction.
Given an LR image I_LR ∈ R^(H×W×C), the shallow feature F_0 is first obtained in the feature extraction stage, and this process can be expressed as follows:

F_0 = f_SF(I_LR), (1)

where f_SF(·) denotes the 3 × 3 convolutional layer ("Conv3") for shallow feature extraction from I_LR. It is worth noting that convolutional layers with other filter sizes (e.g., 5 × 5) can also be used in this process. Here, we follow previous studies and set the kernel size to 3 × 3. Then, F_0 goes through the multi-stage feature distillation module for deeper and finer feature extraction, and the whole process can be expressed as follows:

F_OUT = f_MSFD(F_0), (2)

where f_MSFD(·) denotes MSFD and F_OUT represents its output. To better extract image features, MSFD consists of multiple cascaded MAFDBs ("MAFDB"), convolutional layers ("Conv1" and "Conv3"), and a skip connection. Specifically, given the shallow feature F_0 as the input of MSFD, it is gradually distilled by K cascaded MAFDBs as follows:

F_k = f_MAFDB_k(F_(k−1)), k = 1, 2, ..., K, (3)

where f_MAFDB_k(·) represents the k-th MAFDB, F_(k−1) denotes the input of the k-th MAFDB, and F_k is the corresponding output. To take full advantage of hierarchical features, the intermediate outputs F_k produced by different MAFDBs are concatenated and combined using a 1 × 1 convolutional layer ("Conv1"), which also compresses feature channels to decrease computational complexity. Then, the compressed feature is further refined by the 3 × 3 convolutional layer ("Conv3"). The concatenation and combination of hierarchical features can be expressed as follows:

F_f = f_conv3(f_conv1(Concat(F_1, F_2, ..., F_K))), (4)

where Concat(·) denotes the operation of channel concatenation, and f_conv1(·) and f_conv3(·) represent the 1 × 1 and 3 × 3 convolutions, respectively.
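The MSFD stage described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: it assumes K = 6 blocks and 48 feature channels (the settings given later in the implementation details), and replaces each MAFDB with a simple channel-preserving convolution stand-in since the block's internals are detailed in the next subsection.

```python
import torch
import torch.nn as nn

class MSFD(nn.Module):
    """Multi-stage feature distillation: K cascaded blocks whose intermediate
    outputs are concatenated, fused by a 1x1 conv ("Conv1"), refined by a
    3x3 conv ("Conv3"), and added to F_0 through a long skip connection."""
    def __init__(self, channels=48, num_blocks=6, block=None):
        super().__init__()
        # Stand-in for MAFDB: any channel-preserving block fits here.
        make = block or (lambda: nn.Conv2d(channels, channels, 3, padding=1))
        self.blocks = nn.ModuleList(make() for _ in range(num_blocks))
        self.fuse = nn.Conv2d(channels * num_blocks, channels, 1)   # Conv1
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)   # Conv3

    def forward(self, f0):
        feats, f = [], f0
        for blk in self.blocks:        # F_k = f_MAFDB_k(F_{k-1})
            f = blk(f)
            feats.append(f)
        ff = self.refine(self.fuse(torch.cat(feats, dim=1)))  # tail feature F_f
        return ff + f0                 # long skip: F_OUT = F_f + F_0

x = torch.randn(1, 48, 24, 24)
out = MSFD()(x)
print(out.shape)  # torch.Size([1, 48, 24, 24])
```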
To improve the stability and efficiency of model learning, a skip connection is involved in MSFD. Specifically, the shallow feature F_0 is connected to the tail feature F_f by a skip connection, thus forming the residual structure. This process can be expressed as follows:

F_OUT = F_f + F_0, (5)

where F_OUT denotes the output of MSFD. Finally, F_OUT is passed through the image reconstruction module to obtain the super-resolved image I_SR ∈ R^(Hs×Ws×C), where s denotes the upscaling ratio. The reconstruction process can be expressed as follows:

I_SR = f_REC(F_OUT), (6)

where f_REC(·) represents the image reconstruction process, which is composed of the operations of 3 × 3 convolution ("Conv3") and pixel shuffle ("Pixel Shuffle") [73]. Specifically, the convolution operation for image reconstruction first changes the channel number of its input F_OUT and produces a tensor F_OUT1 ∈ R^(H×W×Cs²), and then the following pixel shuffle operation reshapes the elements of F_OUT1 into a tensor of shape Hs × Ws × C (i.e., I_SR). Mathematically, the pixel shuffle operation can be expressed as follows:

I_SR(x, y, c) = F_OUT1(⌊x/s⌋, ⌊y/s⌋, C·s·mod(y, s) + C·mod(x, s) + c). (7)

The proposed MAFDN is optimized using the following mean absolute error-based loss function:

L = (1/N) Σ_(i=1)^(N) ‖I_SR^(i) − I_HR^(i)‖_1, (8)

where N represents the quantity of training samples, and I_SR^(i) and I_HR^(i) are the i-th SR output and the corresponding original HR image, respectively.
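The reconstruction head and the L1 training loss can be sketched as follows. This is a minimal, self-contained example under illustrative settings (48 feature channels, RGB output, scale factor 4), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Conv3 expands channels to C*s^2, then pixel shuffle rearranges the
    tensor from (H, W, C*s^2) into the super-resolved shape (H*s, W*s, C)."""
    def __init__(self, channels=48, out_channels=3, scale=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, out_channels * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, f_out):
        return self.shuffle(self.conv(f_out))  # I_SR

head = ReconstructionHead(scale=4)
i_sr = head(torch.randn(1, 48, 48, 48))       # F_OUT from a 48x48 LR patch
i_hr = torch.randn(1, 3, 192, 192)            # matching HR target
loss = nn.L1Loss()(i_sr, i_hr)                # mean absolute error loss
print(i_sr.shape)  # torch.Size([1, 3, 192, 192])
```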

3.2. Multi-Attention Feature Distillation. As shown in Figure 1, the MAFDB detailed in this section is the core of MAFDN. The overall architecture of MAFDB is first described, and then its components are detailed, i.e., the OPCRB and three attention modules including the pixel attention (PA), enhanced spatial attention (ESA), and contrast-aware channel attention (CCA).

3.2.1. MAFDB.
Following prior feature distillation blocks [37, 58], MAFDB has a two-branch structure. The upper is the feature extraction branch that gradually extracts deep features, while the lower is the feature distillation branch for reducing the number of feature channels. All intermediate features are fused by channel concatenation and compression operations, and then the fused feature is further processed by the ESA and CCA blocks to obtain more representative features. Specifically, given an input feature F_in, the two branches for feature extraction and distillation in MAFDB perform the following procedure:

F_dis_1 = f_conv1_1(f_PA_1(F_in)), F_r_1 = f_OPCRB_1(f_PA_1(F_in)),
F_dis_2 = f_conv1_2(f_PA_2(F_r_1)), F_r_2 = f_OPCRB_2(f_PA_2(F_r_1)),
F_dis_3 = f_conv1_3(f_PA_3(F_r_2)), F_r_3 = f_OPCRB_3(f_PA_3(F_r_2)),
F_dis_4 = f_conv3_1(F_r_3), (9)

where f_PA_i(·) denotes the i-th PA module ("PA"), and f_conv1_i(·) and f_OPCRB_i(·) represent the i-th 1 × 1 convolutional layer ("Conv1") and the i-th OPCRB module ("OPCRB"), respectively. The OPCRB is a residual block improved by the depthwise over-parameterized convolutional layer, which can boost performance without increasing complexity in inference. More details are given in Section 3.3. f_conv3_1(·) denotes the 3 × 3 convolutional layer ("Conv3"). F_r_i and F_dis_i are the i-th retained and distilled features, respectively. For the purpose of aggregating features of different levels, all distilled features are concatenated and combined via 1 × 1 convolution ("Conv1"). This process can be expressed as follows:

F_all = f_conv1_4(Concat(F_dis_1, F_dis_2, F_dis_3, F_dis_4)), (10)

where f_conv1_4(·) is the 1 × 1 convolution for combining multi-level distilled features and compressing feature channels.
Finally, to obtain more representative features and improve performance, ESA and CCA are introduced to modulate the fused distillation feature F_all as follows:

F_mafdb = f_CCA(f_ESA(F_all)), (11)

where f_ESA(·) and f_CCA(·) represent the ESA and CCA blocks, respectively, and F_mafdb is the output of MAFDB. More details about the multi-attention blocks (PA, ESA, and CCA) in MAFDB are given as follows.
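The two-branch procedure above can be sketched in PyTorch as follows. This is a hedged sketch: the PA, ESA, CCA, and OPCRB components are replaced by simple stand-ins (detailed in the following subsections), and the distilled channel width of C/2 is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

class MAFDB(nn.Module):
    """Two-branch sketch: PA feeds both a 1x1 distillation conv and an OPCRB
    retention path, repeated three times, followed by a final Conv3
    distillation, channel concatenation, 1x1 fusion, and ESA/CCA modulation.
    PA/OPCRB/ESA/CCA are stand-ins here; dc = C//2 is an assumption."""
    def __init__(self, c=48, attn=None):
        super().__init__()
        dc = c // 2
        ident = attn or nn.Identity          # placeholder for PA/ESA/CCA
        self.pa = nn.ModuleList(ident() for _ in range(3))
        self.distill = nn.ModuleList(nn.Conv2d(c, dc, 1) for _ in range(3))
        self.retain = nn.ModuleList(         # OPCRB stand-in (plain Conv3)
            nn.Conv2d(c, c, 3, padding=1) for _ in range(3))
        self.last_distill = nn.Conv2d(c, dc, 3, padding=1)   # f_conv3_1
        self.fuse = nn.Conv2d(4 * dc, c, 1)                  # f_conv1_4
        self.esa, self.cca = ident(), ident()

    def forward(self, f_in):
        f, dis = f_in, []
        for pa, d, r in zip(self.pa, self.distill, self.retain):
            a = pa(f)
            dis.append(d(a))    # F_dis_i
            f = r(a)            # F_r_i
        dis.append(self.last_distill(f))          # F_dis_4
        f_all = self.fuse(torch.cat(dis, dim=1))
        return self.cca(self.esa(f_all))          # F_mafdb

x = torch.randn(1, 48, 16, 16)
y = MAFDB()(x)
print(y.shape)  # torch.Size([1, 48, 16, 16])
```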

3.2.2. Multi-Attention.
As shown in Figure 1, PA, CCA, and ESA are introduced into the MAFDB to enhance reconstruction performance with only a small increase in complexity. Figure 2 shows the architectures of the three attention blocks used in our network. For PA [74] and CCA [37], the original architectures are maintained for their elegant design and excellent performance. For ESA [75], we improve it by introducing the depthwise separable convolution (DSConv) [76] to achieve a better tradeoff between capacity and complexity. As shown in Figure 2, PA [74] mainly consists of a convolutional layer with a 1 × 1 kernel ("Conv1") and a sigmoid activation function ("Sigmoid"). A 3D matrix of attention coefficients is learned for all pixels in the feature maps, and it is multiplied with the input feature to achieve pixel-level evaluation and improve the expression capacity of convolutions. As presented in Figure 1, the PA block is placed, based on our experiments, in the feature extraction branch prior to each distillation, thus benefiting both feature extraction and distillation. In addition to PA [74], following prior studies [75, 77], the ESA [75] and CCA [37] blocks are introduced to further modulate the fused distillation feature in each MAFDB, thus enhancing model representational ability along the spatial and channel dimensions, respectively.
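The PA block as described reduces to a few lines of PyTorch: a 1 × 1 convolution followed by a sigmoid produces a full C × H × W attention map that rescales the input feature pixel by pixel. A minimal sketch:

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """PA: 1x1 conv + sigmoid yields a 3D (C x H x W) attention map,
    multiplied elementwise with the input feature."""
    def __init__(self, channels=48):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

x = torch.randn(2, 48, 8, 8)
out = PixelAttention()(x)
print(out.shape)  # torch.Size([2, 48, 8, 8])
```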
Compared with the plain spatial attention block [78], the ESA designed by Liu et al. [75] is more powerful due to its large receptive field, which is achieved by combining strided convolution and maxpooling. In this work, we further introduce DSConv [76] to replace the regular convolution in the original ESA, so as to better balance model complexity and performance. Meanwhile, a short skip connection is added around the convolution group. The short and long skip connections in the improved ESA facilitate gradient propagation and keep the original information from being lost in training, which is conducive to model performance improvement. Specifically, in the improved ESA presented in Figure 2, the feature channels are first compressed using a 1 × 1 convolutional layer ("Conv1") to make the feature thinner, and the receptive field is expanded with a strided convolutional layer ("Conv3") and a maxpooling layer ("Pooling"). Then, with the help of the residual connection, a convolution group ("Conv Group") formed by three consecutive 3 × 3 DSConv layers is integrated for deep feature extraction. Next, corresponding to the preceding maxpooling and channel compression operations, an upsampling layer ("Upsampling") and a 1 × 1 convolutional layer ("Conv1") are adopted to recover the spatial resolution and channel dimension, respectively. Finally, a spatial attention map is learned via the sigmoid activation layer ("Sigmoid") and multiplied with the input feature to focus more on the regions of interest.
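The improved ESA can be sketched as follows. The channel reduction ratio (4) and the pooling window/stride (7/3) are assumptions made for illustration, as the text does not specify them; the structure (compress, stride + pool, DSConv group with short skip, upsample, recover channels, sigmoid gate with long skip) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dsconv(c):
    # Depthwise separable conv: depthwise 3x3 followed by pointwise 1x1.
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                         nn.Conv2d(c, c, 1))

class ESA(nn.Module):
    """Improved ESA sketch: channel compression, strided conv + maxpooling
    for a large receptive field, a 3-layer DSConv group with a short skip,
    upsampling, channel recovery, and a sigmoid spatial attention map."""
    def __init__(self, c=48, reduction=4):
        super().__init__()
        m = c // reduction
        self.squeeze = nn.Conv2d(c, m, 1)                    # Conv1
        self.stride_conv = nn.Conv2d(m, m, 3, stride=2, padding=1)
        self.group = nn.Sequential(dsconv(m), dsconv(m), dsconv(m))
        self.expand = nn.Conv2d(m, c, 1)                     # Conv1

    def forward(self, x):
        f1 = self.squeeze(x)
        f = F.max_pool2d(self.stride_conv(f1), kernel_size=7, stride=3)
        f = self.group(f) + f                                # short skip
        f = F.interpolate(f, size=x.shape[2:], mode='bilinear',
                          align_corners=False)               # Upsampling
        m = torch.sigmoid(self.expand(f + f1))               # long skip
        return x * m

x = torch.randn(1, 48, 32, 32)
y = ESA()(x)
print(y.shape)  # torch.Size([1, 48, 32, 32])
```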
Unlike the native channel attention layer, the CCA [37] illustrated in Figure 2 is developed for low-level vision by taking information about structures, edges, and textures into consideration. Specifically, the contrast degree of the input feature ("Contrast"), which is evaluated as the sum of the mean and standard deviation, is used to learn the channel attention map via 1 × 1 convolution ("Conv1") and sigmoid activation ("Sigmoid"). Then, the input feature is multiplied with the channel attention map. In this way, all distillation features can be effectively aggregated according to their importance, thus improving feature representational ability and reconstruction accuracy.
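A minimal sketch of CCA follows. The text specifies contrast pooling (mean plus standard deviation), 1 × 1 convolution, and sigmoid; the two-layer bottleneck with reduction ratio 16 is an assumption borrowed from common channel attention designs.

```python
import torch
import torch.nn as nn

def contrast_pool(x, eps=1e-5):
    # Per-channel contrast: spatial mean plus spatial standard deviation.
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = ((x - mean) ** 2).mean(dim=(2, 3), keepdim=True).add(eps).sqrt()
    return mean + std

class CCA(nn.Module):
    """Contrast-aware channel attention: contrast pooling, a 1x1-conv
    bottleneck (reduction=16 is an assumption), and sigmoid gating."""
    def __init__(self, c=48, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(contrast_pool(x))

x = torch.randn(2, 48, 8, 8)
y = CCA()(x)
print(y.shape)  # torch.Size([2, 48, 8, 8])
```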

3.3. Over-Parameterized Convolution-Based Residual Block.
The proposed OPCRB is presented in Figure 3. On the whole, the architecture of OPCRB is inspired by the shallow residual block in RFDN [58] and recent re-parameterization techniques [79]. The motivation behind this design is to learn over-parameterized linear operations in training and fold them into a single-layer operation at inference, thus obtaining performance gains while keeping the inference complexity equivalent to that of a conventional convolutional layer. Specifically, as shown in Figure 3(a), a DO-Conv layer ("DO-Conv")-based residual block is learned in training, and then the DO-Conv layer is folded into a regular 3 × 3 convolutional layer ("Conv3") in testing. Compared with plain convolution, an additional depthwise convolution is added in the training of DO-Conv. In this way, the two convolutional layers constitute an over-parameterization with more trainable network parameters, thus boosting the performance of the converged model used for inference. The DO-Conv is detailed as follows.
Figure 3 illustrates the diagram of DO-Conv [79]. Let P ∈ R^((M×N)×C_in) denote an input patch, where C_in represents the channel number of the input, and M and N denote the width and height of the kernel, respectively. As shown in Figure 3, DO-Conv is formed by a depthwise convolutional layer ("Depthwise Conv") and a conventional convolutional layer ("Conv3"), whose kernels are denoted as D ∈ R^((M×N)×D_mul×C_in) and W ∈ R^(C_out×D_mul×C_in), respectively. D_mul represents the depth multiplier, and C_out denotes the channel number of the output O. Specifically, DO-Conv can be formulated as follows:

O = W * (D ∘ P), (12)

where ∘ and * denote the depthwise and conventional convolution, respectively. As proved in [79] and shown in Figure 3(b), the two cascaded convolution operations (i.e., ∘ and *) can be combined using a composite kernel W′, which is obtained by performing depthwise convolution on W with the kernel D^T. Then, DO-Conv can be expressed as a conventional convolution between W′ and P as follows:

O = W′ * P, (13)

where W′ = D^T ∘ W.
Compared with the conventional convolution, the composition of the two convolutions (D and W) in equation (12) adds learnable parameters in training and constitutes an over-parameterization, thus boosting performance. It is worth noting that, as shown in equation (13), the two composite convolutions can be folded into a single operator at inference time, decreasing the complexity to that of a single convolutional layer. Therefore, performance gains are achieved without any complexity increase in inference.
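The folding property described above — computing W * (D ∘ P) during training and the equivalent single convolution W′ * P at inference — can be verified numerically on unfolded patches. The shapes below (C_in = 4, C_out = 8, 3 × 3 kernel, depth multiplier 9) are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Shapes: C_in=4, C_out=8, kernel M=N=3 (MN=9), depth multiplier D_mul=9.
C_in, C_out, MN, D_mul = 4, 8, 9, 9
D = torch.randn(MN, D_mul, C_in)     # depthwise kernel D
W = torch.randn(C_out, D_mul, C_in)  # conventional kernel W
x = torch.randn(1, C_in, 6, 6)

# Training-time, two-stage computation on unfolded patches:
# P has shape (batch, C_in, MN, L) with L sliding-window positions.
P = F.unfold(x, kernel_size=3, padding=1).view(1, C_in, MN, -1)
P_prime = torch.einsum('mdc,bcml->bdcl', D, P)       # D o P (depthwise)
O_train = torch.einsum('odc,bdcl->bol', W, P_prime)  # W * (D o P)

# Inference-time folded kernel: W' = D^T o W, a plain 3x3 conv kernel.
W_fold = torch.einsum('mdc,odc->omc', D, W)          # (C_out, MN, C_in)
O_infer = torch.einsum('omc,bcml->bol', W_fold, P)   # W' * P

print(torch.allclose(O_train, O_infer, atol=1e-4))  # True
```

The two results agree up to floating-point error, confirming that the extra depthwise kernel adds trainable parameters without changing the inference-time operator.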
4.1.2. Implementation Details. Specifically, 6 MAFDBs are stacked in our MAFDN to balance accuracy and complexity, and the number of feature channels is 48. MAFDN is trained and evaluated on one NVIDIA GeForce RTX 3090 GPU under the PyTorch framework. In model training, the minibatch size is 32 and the LR patch size is 48 × 48. The cosine learning rate decay strategy is adopted in training, and the initial learning rate is 1 × 10^−3. We train the model for 1 × 10^6 iterations in total. Following previous methods, three MAFDN models are trained for ×2, ×3, and ×4 SR reconstruction, respectively.
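The stated training hyper-parameters can be assembled as follows. This is a sketch only: the text does not name the optimizer, so Adam is an assumption, and a placeholder model stands in for MAFDN.

```python
import torch

# Placeholder model standing in for MAFDN; Adam is an assumption.
model = torch.nn.Conv2d(3, 48, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr 1e-3
total_iters = 1_000_000                                    # 1e6 iterations
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_iters)                          # cosine decay

for _ in range(3):  # illustration only; the real loop runs 1e6 iterations
    optimizer.zero_grad()
    batch = torch.randn(32, 3, 48, 48)     # minibatch 32, 48x48 LR patches
    loss = model(batch).abs().mean()       # dummy loss for the sketch
    loss.backward()
    optimizer.step()
    scheduler.step()

print(optimizer.param_groups[0]['lr'])  # ~1e-3: decay is negligible this early
```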

Complexity Comparisons.
To more comprehensively compare the capability and efficiency of different methods, Figure 5 presents PSNR vs. multi-adds vs. parameters for ×4 SR on Set5. For the scale factor of ×4, the proposed MAFDN contains 597K parameters and requires 33.79G multi-adds operations. Compared with the recent DRSAN [40] (730K/49.00G), ESRT [67] (751K/67.70G), FDSCSR [64] (839K/54.80G), LESR [62] (1020K/42.10G), AFAN [61] (692K/50.90G), and FRN [65] (778K/44.50G), MAFDN achieves higher SR accuracy with lower computational complexity and fewer parameters. Overall, MAFDN outperforms the compared lightweight models in terms of the tradeoff between reconstruction capacity and model complexity, owing to its multi-attention feature distillation and over-parameterized convolution. For a more intuitive illustration of processing efficiency, Table 4 reports the runtime of MAFDN. In order to cover more representative image resolutions, the test images in DIV2K are used in this experiment. Specifically, to obtain the runtime at representative image resolutions, we produce LR images of size 320 × 180, 480 × 270, and 640 × 360, and then super-resolve them with a factor of ×4 using the proposed SR model to HR images of size 1280 × 720 (720p), 1920 × 1080 (1080p), and 2560 × 1440 (2K), respectively. As shown in Table 4, the average runtime per test image in the three aforementioned configurations is 15.5 ms, 27.6 ms, and 37.0 ms, respectively. Therefore, the proposed method achieves real-time or near real-time processing under these representative conditions.

For the baseline model MAFDN_v1, the plain residual block is used as the basic module. For MAFDN_v2, feature distillation is introduced to the baseline. On this basis, MAFDN_v3 incorporates the CCA and ESA to enhance feature distillation as in previous studies. Then, the PA is further introduced in MAFDN_v4, in which the three attention mechanisms are fully utilized. Finally, the over-parameterization is added to MAFDN_v4, forming the full version of the proposed MAFDN. It can be observed from Table 5 that the SR performance improves gradually with the use of feature distillation, multi-attention, and over-parameterization. Specifically, compared with the plain residual block-based baseline model MAFDN_v1, MAFDN_v2 obtains 0.09 dB/0.0011 PSNR/SSIM gains with the introduction of feature distillation. With the use of the CCA and ESA, MAFDN_v3 outperforms MAFDN_v2 by 0.08 dB/0.0007 PSNR/SSIM, and the PA in MAFDN_v4 further brings 0.02 dB/0.0004 PSNR/SSIM gains. Overall, compared with MAFDN_v2, the multi-attention used in MAFDN_v4 results in 0.10 dB/0.0011 PSNR/SSIM improvements. On this basis, the over-parameterization in MAFDN produces 0.03 dB/0.0005 PSNR/SSIM gains with no increase in inference complexity. These results demonstrate the effectiveness of the feature distillation, multi-attention, and over-parameterization in the proposed MAFDN.

Conclusion
This paper presents MAFDN for lightweight SISR. Specifically, MAFDN mainly enhances the basic information multi-distillation model from two perspectives. On the one hand, multi-attention layers including pixel attention, channel attention, and spatial attention are incorporated to learn more discriminative and representative features. On the other hand, the depthwise over-parameterized convolution-based residual block is introduced to enhance model ability without increasing the complexity in the inference stage.

Table 4: The average runtime of MAFDN at representative image resolutions.

Table 5: Comparison of different variants of MAFDN on Set5 (×4).

Extensive comparisons and ablation studies demonstrate that MAFDN outperforms existing state-of-the-art lightweight SISR models while maintaining a good balance between reconstruction capacity and model complexity. However, there is still room for improvement in MAFDN. For example, MAFDN only integrates lightweight attention operators due to complexity restrictions, which limits the reconstruction performance to some extent. In addition, the design of MAFDN does not take the hardware characteristics of target applications into account. How to simplify more powerful attention mechanisms such as non-local attention and how to design hardware-friendly network architectures remain open questions worth further exploration. Moreover, we would like to improve the proposed network and expand its applications to image denoising and other restoration tasks in the future.