A Multistage with Multiattention Network for Single Image Dehazing

For single image dehazing, an end-to-end multistage with multiattention network is proposed in this paper. The network contains two different stages: the first stage uses an encoder-decoder subnet to obtain contextual features, and the second stage adopts a single-scale pipeline to provide spatial image details. Ground-truth supervision is provided at each stage, and an attention mechanism is used between the two stages, so that the features learned in the first stage are refined before being passed to the second stage. A basic multiattention unit that combines channel attention, spatial attention, and pixel attention is designed to learn more weight from important features, and a positional normalization that normalizes exclusively across channels is used inside the multiattention unit. Experimental results on several benchmarks indicate that the proposed network outperforms the state-of-the-art methods both quantitatively and qualitatively.


Introduction
Image dehazing is a challenging task in the field of image restoration. Since there are infinitely many feasible solutions, it is a highly ill-posed problem. The atmosphere scattering model [1,2] provides a simple and effective formulation of the problem:
$$I(x) = J(x)\,t(x) + A\,(1 - t(x)), \qquad (1)$$
where I(x) is the hazy image, J(x) stands for the clear image, the global atmospheric light A represents the intensity of the scattered light of the scene, and the transmission map t(x) describes the attenuation in intensity. Taking the clear image J(x) as the output, formula (1) can be rewritten as follows:
$$J(x) = \frac{I(x) - A}{t(x)} + A. \qquad (2)$$
It can be observed from formula (2) that the goal of single image dehazing is to restore the clear image J(x) from the hazy image I(x) by estimating A and t(x). Since only I(x) is known, it is difficult to restore the clear image J(x).
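As a quick illustration of formulas (1) and (2), the following minimal sketch synthesizes a hazy image from a toy clear image and then inverts the model with the true A and t(x). The function names `add_haze` and `remove_haze`, the clamp `t_min`, and all numeric values are assumptions made for this example only; in practice A and t(x) are unknown and must be estimated.

```python
import numpy as np

def add_haze(J, t, A):
    """Forward model, formula (1): I = J * t + A * (1 - t)."""
    return J * t + A * (1.0 - t)

def remove_haze(I, t, A, t_min=0.1):
    """Inverse model, formula (2): J = (I - A) / t + A.

    t_min is an assumed lower bound that avoids dividing by a
    near-zero transmission; it is not part of formula (2).
    """
    t = np.maximum(t, t_min)
    return (I - A) / t + A

# Toy data (illustrative values only): a 4x4 RGB image in [0, 1].
rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, size=(4, 4, 3))   # clear image
t = rng.uniform(0.3, 0.9, size=(4, 4, 1))   # transmission map, broadcast over channels
A = 0.8                                     # global atmospheric light

I = add_haze(J, t, A)
J_rec = remove_haze(I, t, A)
max_err = float(np.abs(J_rec - J).max())    # reconstruction error with the true A and t
```

With the true A and t(x) the inversion is exact up to floating-point error, which is why the difficulty of dehazing lies entirely in estimating these two quantities from I(x) alone.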
Recent decades have witnessed significant progress in image dehazing, and many techniques have been proposed. Early works were mostly based on priors built on the atmosphere scattering model, and these methods often design hand-crafted features to estimate t(x) in formula (2). However, such methods are sensitive to image variations such as changes in viewpoint, illumination, and scene [3]. With the success of deep convolutional neural networks (CNNs) in image processing and computer vision, many CNN-based image dehazing methods have been proposed [4-6], which directly regress the intermediate transmission map or the final haze-free image. Compared to early methods based on hand-designed features, CNN-based methods achieve superior performance and robustness. The network design is a primary reason for this superior performance. Many network modules have been introduced for image dehazing, including residual dense connections [7,8], attention mechanisms [8,9], encoder-decoders [10,11], and generative models [12,13]. Nevertheless, most of these methods are single-stage models. On the other hand, multistage models have been shown to be more effective than single-stage models in different vision tasks such as segmentation and pose estimation. Recently, a few efforts have adopted multistage networks to solve image deblurring and image deraining [14-16]. We analyze those methods and find several bottlenecks that limit their performance. First, the existing multistage networks use the same architecture in every stage, either an encoder-decoder architecture or a single-scale pipeline.
The encoder-decoder [11] architecture provides broad contextual information but lacks spatial image details, whereas the single-scale pipeline preserves spatial accuracy but is unreliable in extracting semantic information. We combine the two architectures in a multistage network for image dehazing. As far as we know, this is the first attempt to do so. Second, we do not naively pass the output of the previous stage to the next stage [15]. Ground-truth supervision is provided in the first stage to refine the feature map before it moves to the next stage.
Third, most attention modules are single and limited, such as channel attention, which can extract the interdependencies among channels but lacks spatial information. We combine different attention mechanisms to address these limits: the proposed multiattention unit (MAU) combines channel attention, spatial attention, and pixel attention to extract more important information.
In summary, the main contributions of the work are as follows:
(1) We employ a multistage network that combines two different architectures. The proposed multistage network is capable of extracting both broad contextual information and spatially detailed information.
(2) At each stage, ground-truth supervision is provided, and an attention mechanism is used between adjacent stages. Under the supervision of the ground-truth image, the features learned in the previous stage are refined before moving to the next stage.
(3) A multiattention unit (MAU) is proposed that combines channel-wise channel attention, spatial-wise spatial attention, and pixel-wise pixel attention to learn more weight from important features.
(4) The positional normalization [17] (PONO), which is position-dependent and reveals structural information at a particular layer of the deep net, is adopted to improve the training performance.

Related Work
Most dehazing approaches follow a similar three-step methodology based on the atmospheric scattering model: (1) estimating the transmission map t(x) from the hazy image samples; (2) estimating the global atmospheric light A using empirical methods; and (3) computing the clear image J(x) according to formula (2). Most of the work focuses on the first step. There are two ways to estimate t(x): physically grounded priors and fully data-driven approaches. Early methods based on physically grounded priors often require multiple images of the same scene under different conditions [2, 18-20]. However, these methods do not work when there is only one image of a scene. The dark channel prior (DCP) [21] is the most successful prior-based method and has been followed by many successors. Gibson et al. [22] adopted a standard median filter to improve the DCP computing speed. An effective contextual regularization based on boundary constraints is proposed in [23] to restore the hazy image. Based on depth estimation, a color attenuation prior [24] is proposed for haze removal. Berman et al. [25] assume that an image contains only several hundred distinct colors and propose a nonlocal method. However, these priors are computationally expensive and unreliable.
With the success of deep learning in diverse computer vision tasks, the data-driven dehazing approaches have become popular. To avoid estimating the parameters inaccurately and designing hand-crafted features, algorithms use convolutional neural networks (CNNs) to directly learn t(x) from data.
Single-Stage Networks: currently, most single image dehazing methods are based on single-stage networks. AOD-Net [26] is the first end-to-end network to generate clean images directly. It is a lightweight CNN but still performs much better than prior-based methods. EPDN [27] adopts a generative adversarial network to solve image dehazing without relying on the physical scattering model. Zhao et al. proposed a weakly supervised refinement framework called RefineDNet [28], which outperforms other weakly supervised methods but is weaker than supervised networks. DehazeFlow [29] proposes a framework based on conditional normalizing flows for single image dehazing.
Multistage Networks: the existing multistage networks usually use an identical architecture in different stages, such as GridDehazeNet [8] and the gated fusion network [30]. The information generated by the previous stage always flows naively to the next stage to refine the restored image [3]. However, the common practice of using the same subnetwork for each stage may yield a suboptimal result, and the naive connection between adjacent stages is also a bottleneck, as shown in our experiments.
Attention: attention mechanisms are widely used in both high-level computer vision tasks, including image classification [31] and object detection [32], and low-level computer vision tasks such as image dehazing [8,9], deraining [16], and deblurring [14,15]. The main idea is to capture long-range interdependencies channel-wise, spatial-wise, or pixel-wise.

Proposed Method
In this section, we discuss the details of the proposed network, MSNet.
MSNet is a multistage with multiattention network, and it is a trainable end-to-end network that does not rely on the atmosphere model. MSNet consists of two stages, as shown in Figure 1: the first stage is based on an encoder-decoder network that learns the contextual information, and the second stage is a single-scale pipeline that provides the spatial image details. Inspired by [33], a supervised attention block (SAB) is used between the two stages. Under the supervision of clear images, the feature maps of the first stage are refined by the SAB before flowing to the second stage.

Multiattention Unit.
In our framework, a multiattention unit (MAU) is proposed as the basic unit. The architecture of the basic MAU is depicted in Figure 2; it consists of two convolution layers, a local residual learning connection, and a multiattention block. The convolution layers are activated by ReLU, and the second convolution layer adopts positional normalization (PONO) with moment shortcuts (MS) [17] to normalize the activations. A global residual learning connection links the input feature and the output feature. With local and global residual learning, the low-frequency regions of the input features can be learned through the skip connections. The multiattention block combines channel attention, pixel attention, and spatial attention, so it provides additional ability in dealing with nonlocal and local information, and the representational ability of CNNs is expanded. The architecture of MAU is depicted in Figure 3.

Channel Attention.
Usually, a network uses a number of convolutional layers to capture the neighboring spatial dependencies within local receptive fields. However, the global spatial patterns also need to be considered under the complicated nonuniform condition. When the neighborhoods of the image contain strong hazy component, the contextual information from clear regions may be required. Recently, a channel attention module [31] has been proposed to capture richer nonlocal features by modeling the interdependencies among channels.
Thus, we use a channel attention module to extract nonlocal context features, so that differently weighted information from the different channel feature maps is learned.
Firstly, global average pooling is used to capture the channel-wise global spatial features:
$$H_p(X_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j),$$
where H_p denotes the global average pooling function and X_c(i, j) is the value of the cth channel of the input at position (i, j). The dimension of the feature map changes from C × H × W to C × 1 × 1, where C denotes the number of channels and H × W is the size of the feature map. Then, two convolution layers are applied to get the weights of the different channels, and the first convolution layer uses PONO to normalize the activations.
where σ stands for the sigmoid function that is used to activate the first convolution layer and δ is the ReLU function used to activate the second convolution layer. Finally, the weighted channel feature $F_c^*$ is computed by element-wise multiplying the input $F_{input}$ and $C_f$: $F_c^* = C_f \otimes F_{input}$.

Pixel Attention.
A pixel attention module is applied to learn weights adaptively from pixels, so that the network can learn more informative features from thick-hazed pixels and high-frequency image regions. The architecture of the pixel attention module is depicted in Figure 3; it consists of two convolution layers and a sigmoid activation function, and the first convolution layer uses PONO to normalize the activations.
Then, we element-wise multiply $F_c^*$ and $C_p$ to obtain the channel-pixel attention map: $F_{CP} = F_c^* \otimes C_p$.

Spatial Attention.
Spatial attention is designed to exploit the spatial attention map from the input convolutional features $F_{input}$. The spatial attention module first applies global average pooling on $F_{input}$ along the channel dimension and outputs a feature map $f \in R^{H \times W}$. The feature f is then passed through a convolution layer and a sigmoid activation to get the spatial attention feature $f_{SA} \in R^{H \times W}$.
Finally, the spatial attention map $f_{SA}$ and the channel-pixel attention map $F_{CP}$ are concatenated, and the concatenated feature map is passed through a convolution layer to obtain the multiattention map.
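The data flow of the multiattention block described above can be sketched in NumPy as follows. This is only an illustrative sketch under several assumptions that are not stated in the text: all convolutions are taken to be 1 × 1, PONO is omitted, the hidden layers use ReLU followed by a final sigmoid (the common channel-attention ordering), pixel attention is computed from the channel-weighted features, and the channel reduction (8 to 2) and random weights are placeholders for trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, params):
    """1x1 convolution on a (C_in, H, W) map; params = (weight (C_out, C_in), bias (C_out,))."""
    w, b = params
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def init(rng, c_out, c_in):
    """Random (weight, bias) pair standing in for trained parameters."""
    return rng.normal(0.0, 0.1, size=(c_out, c_in)), np.zeros(c_out)

def multi_attention(F_in, p):
    # Channel attention: global average pooling -> two convs -> per-channel gate.
    g = F_in.mean(axis=(1, 2), keepdims=True)                        # (C, 1, 1)
    C_f = sigmoid(conv1x1(relu(conv1x1(g, p["ca1"])), p["ca2"]))     # (C, 1, 1)
    F_c = F_in * C_f                                                 # channel-weighted features
    # Pixel attention: two convs -> sigmoid, giving one weight per pixel.
    C_p = sigmoid(conv1x1(relu(conv1x1(F_c, p["pa1"])), p["pa2"]))   # (1, H, W)
    F_cp = F_c * C_p                                                 # channel-pixel attention map
    # Spatial attention: average over channels -> conv -> sigmoid.
    f = F_in.mean(axis=0, keepdims=True)                             # (1, H, W)
    f_sa = sigmoid(conv1x1(f, p["sa"]))                              # (1, H, W)
    # Concatenate both maps and fuse them back to C channels.
    fused = np.concatenate([f_sa, F_cp], axis=0)                     # (C + 1, H, W)
    return conv1x1(fused, p["fuse"])                                 # (C, H, W)

rng = np.random.default_rng(0)
C, H, W = 8, 6, 6
params = {
    "ca1": init(rng, 2, C), "ca2": init(rng, C, 2),   # channel attention
    "pa1": init(rng, 2, C), "pa2": init(rng, 1, 2),   # pixel attention
    "sa": init(rng, 1, 1),                            # spatial attention
    "fuse": init(rng, C, C + 1),                      # fusion after concatenation
}
F_in = rng.normal(size=(C, H, W))
out = multi_attention(F_in, params)
```

The output has the same shape as the input, so the block can sit inside a residual unit such as the MAU without any reshaping.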

Encoder-Decoder Subnetwork.
The encoder-decoder subnetwork is based on the standard U-Net [34], as shown in Figure 4. Each scale of the subnet uses a UBlock, which contains several MAUs to extract feature maps, and two down-sampling layers reduce the size of the input map and hence the computation. The skip connections are also processed by a UBlock and then concatenated with the decoder layer; they enhance the detailed information of the image. Both down-sampling and up-sampling are implemented by a convolution layer.

Single-Scale Subnet.
The single-scale subnet in the second stage consists of several multiattention groups (MAGs), each of which contains several MAUs and a shortcut; the module is depicted in Figure 5. With the dense attention modules, the net can generate high-resolution features with enriched details from the input.

Supervised Attention Block.
Inspired by [30], a supervised attention block (SAB) is used between the two stages, and the architecture of the SAB is shown in Figure 6. The SAB uses a ground-truth image to supervise the feature maps at the encoder-decoder stage. With this supervision, the encoder-decoder stage provides more informative features to the next stage.
The SAB takes the output $F_{input} \in R^{H \times W \times C}$ of the encoder-decoder as its input, where H × W is the spatial dimension of the features and C denotes the number of channels. After being processed by a 1 × 1 convolution, $F_{input}$ is added to the input hazy image to obtain the dehazed image $I_d \in R^{H \times W \times 3}$, and a ground-truth image is provided here to supervise the dehazed image. Then $I_d$ is processed by a convolution layer with a sigmoid activation to generate the attention maps M. Next, we element-wise multiply M and the transformed $F_{input}$, which has been processed by a convolution layer. Finally, a shortcut is used to generate the output, which is passed to the next stage.
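The SAB steps above can be summarized in a short NumPy sketch. It assumes a channel-first layout (C, H, W), 1 × 1 convolutions without bias for every layer (the text specifies 1 × 1 only for the first one), and random placeholder weights; the function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w):
    """Bias-free 1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def supervised_attention_block(F_in, hazy, w_img, w_att, w_feat):
    # Project the features to 3 channels and add the hazy input: dehazed image I_d.
    # During training, I_d is the tensor compared against the ground truth.
    I_d = conv1x1(F_in, w_img) + hazy            # (3, H, W)
    # Attention maps M generated from the dehazed image.
    M = sigmoid(conv1x1(I_d, w_att))             # (C, H, W)
    # Gate the transformed features with M, then add the shortcut.
    F_out = M * conv1x1(F_in, w_feat) + F_in     # (C, H, W)
    return F_out, I_d

rng = np.random.default_rng(0)
C, H, W = 8, 5, 5
F_in = rng.normal(size=(C, H, W))
hazy = rng.uniform(0.0, 1.0, size=(3, H, W))
w_img = rng.normal(0.0, 0.1, size=(3, C))
w_att = rng.normal(0.0, 0.1, size=(C, 3))
w_feat = rng.normal(0.0, 0.1, size=(C, C))

F_out, I_d = supervised_attention_block(F_in, hazy, w_img, w_att, w_feat)
# With a zero feature transform, only the shortcut remains.
F_id, _ = supervised_attention_block(F_in, hazy, w_img, w_att, np.zeros((C, C)))
```

Because of the shortcut, the block can at worst pass the first-stage features through unchanged, while the attention maps let it emphasize features that the intermediate dehazed image indicates are useful.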

Positional Normalization and Moment Shortcut.
Normalization is a common technique for training deep networks, and several normalization methods, such as batch normalization, have been proposed to improve performance. Different from prior normalization schemes, the positional normalization (PONO) is position-dependent and reveals structural information at a particular layer of the deep net. It normalizes exclusively over the channels at all spatial positions, so it is translation, scaling, and rotation invariant. PONO computes the mean μ and standard deviation σ in the layer:
$$\mu_{b,h,w} = \frac{1}{C} \sum_{c=1}^{C} X_{b,c,h,w}, \qquad \sigma_{b,h,w} = \sqrt{\frac{1}{C} \sum_{c=1}^{C} \left( X_{b,c,h,w} - \mu_{b,h,w} \right)^2 + \varepsilon},$$
where X denotes the activations indexed by batch b, channel c, and spatial position (h, w), and ε is a small stability constant.
The moment shortcut (MS) fast-forwards the PONO information μ and σ, as shown in Figure 7.
The two moments of the activations (μ, σ) are extracted from an early layer and sent to the corresponding later layer as
$$\mathrm{MS}(x) = \gamma F(x) + \beta,$$
where F denotes the intermediate layers, and γ and β are predicted from μ and σ via a shallow convolution layer.
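A minimal NumPy sketch of PONO and the moment shortcut is given below. As a simplification, it uses the extracted moments directly (β = μ, γ = σ) instead of predicting them with a shallow convolution layer, and the intermediate layers F are replaced by the identity; the function names and the value of `eps` are assumptions for illustration.

```python
import numpy as np

def pono(x, eps=1e-5):
    """Positional normalization of x with shape (B, C, H, W).

    The mean and standard deviation are computed over the channel axis only,
    so each spatial position (h, w) of each sample has its own moments.
    """
    mu = x.mean(axis=1, keepdims=True)                    # (B, 1, H, W)
    sigma = np.sqrt(x.var(axis=1, keepdims=True) + eps)   # (B, 1, H, W)
    return (x - mu) / sigma, mu, sigma

def moment_shortcut(f, beta, gamma):
    """Re-inject the moments into a later feature map: gamma * f + beta."""
    return gamma * f + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(2, 16, 4, 4))

x_norm, mu, sigma = pono(x)
# Identity stands in for the intermediate layers F in this sketch.
x_rec = moment_shortcut(x_norm, beta=mu, gamma=sigma)
```

With the identity in place of F, re-injecting the moments recovers the original activations, which shows that μ and σ carry exactly the information that the normalization removed.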

Loss Function.
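The perceptual loss, mean squared error (MSE), GAN loss, and L2 loss are widely used in many dehazing networks. The research in [35] points out that the smooth L1 loss provides better PSNR and SSIM metrics in many image restoration tasks. We therefore use the smooth L1 loss to train the network:
$$L_s = \frac{1}{N} \sum_{x=1}^{N} \sum_{i=1}^{3} F_s\left( \hat{J}_i(x) - J_i(x) \right), \qquad F_s(e) = \begin{cases} 0.5\,e^2, & |e| < 1, \\ |e| - 0.5, & \text{otherwise}, \end{cases}$$
where $\hat{J}_i(x)$ and $J_i(x)$ stand for the intensity of the ith color channel of pixel x in the dehazed image and in the ground truth, respectively, and N is the total number of pixels in the image.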
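At each stage, there is a ground truth to predict, so we add the losses from both stages to optimize the net:
$$L_{total} = L_s^{(1)} + L_s^{(2)},$$
where $L_s^{(k)}$ denotes the smooth L1 loss computed on the output of the kth stage.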
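The two-stage objective can be sketched as follows, assuming the standard smooth L1 definition with a threshold of 1 and images stored as (H, W, 3) arrays. The toy stage outputs are constant images chosen so that both branches of the loss are exercised; they are not real network outputs.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss summed over color channels and averaged over the N pixels."""
    e = pred - target
    abs_e = np.abs(e)
    # Quadratic branch for small errors, linear branch for large ones.
    per_elem = np.where(abs_e < 1.0, 0.5 * e ** 2, abs_e - 0.5)
    n_pixels = pred.shape[0] * pred.shape[1]
    return float(per_elem.sum() / n_pixels)

# Toy 2x2 RGB example (illustrative values only).
gt = np.zeros((2, 2, 3))
stage1_out = np.full((2, 2, 3), 0.5)   # error 0.5 -> quadratic branch
stage2_out = np.full((2, 2, 3), 2.0)   # error 2.0 -> linear branch

loss1 = smooth_l1(stage1_out, gt)      # 3 channels * 0.125 = 0.375
loss2 = smooth_l1(stage2_out, gt)      # 3 channels * 1.5   = 4.5
total = loss1 + loss2                  # both stages are supervised by the same ground truth
```

The quadratic branch keeps gradients small near zero error, while the linear branch limits the influence of large errors, which is the usual motivation for preferring smooth L1 over plain L2.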

Dataset.
We evaluate the proposed network on three benchmarks: RESIDE [36], Dense-Haze [37], and a real-world dataset [38]. RESIDE contains both indoor and outdoor synthetic hazy images, which are collected from depth datasets [39] and stereo datasets [40]. After data ... The Synthetic Objective Testing Set (SOTS) of RESIDE is used for testing; it contains 500 indoor images and 500 outdoor images. The images of Dense-Haze and the real-world dataset [38] are collected from the real world.

Training Settings.
We resize the training images to 240 × 240 with 3 channels, randomly rotate the images by 90°, 180°, or 270°, and horizontally flip the images for data augmentation. We choose the Adam optimizer for accelerated training, where β1 and β2 take the default values of 0.9 and 0.999, respectively. In the encoder-decoder subnet, each UBlock contains 3 MAUs. The single-scale subnet of the second stage contains 3 MAGs, each of which consists of 8 MAUs. The channel number of the preprocessing convolution layer is set to 64, and the input and output channel numbers of each MAU are both 64. We adopt the cosine annealing strategy [41] to adjust the learning rate $\eta_t$ from the initial value $\eta = 1 \times 10^{-4}$ to 0 by following the cosine function:
$$\eta_t = \frac{1}{2} \left( 1 + \cos\left( \frac{t\pi}{T} \right) \right) \eta,$$
where T is the total number of training batches and t is the current training batch. The batch size is set to 4, the model is evaluated every 5000 steps, and the total number of steps is set to 1,000,000.
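The cosine annealing schedule can be written in a few lines. The sketch below uses the initial learning rate and total step count given in the training settings; the function name `cosine_lr` is an assumption for this example.

```python
import math

def cosine_lr(t, T, eta0=1e-4):
    """Cosine-annealed learning rate at step t out of T total steps."""
    return 0.5 * (1.0 + math.cos(t * math.pi / T)) * eta0

T = 1_000_000
lr_start = cosine_lr(0, T)        # equals the initial learning rate
lr_mid = cosine_lr(T // 2, T)     # about half of the initial learning rate
lr_end = cosine_lr(T, T)          # decays to (numerically) zero

# Learning rate sampled at each evaluation point (every 5000 steps).
schedule = [cosine_lr(t, T) for t in range(0, T + 1, 5000)]
```

The schedule decreases monotonically, slowly at the beginning and the end and fastest around the midpoint of training.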

Results and Analysis.
In this section, we compare MSNet quantitatively and qualitatively with recent state-of-the-art image dehazing algorithms, namely DCP, AOD-Net, DehazeNet, GCANet, RefineDNet, GridDehazeNet, and FFA-Net. Following those methods, we use the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) for quantitative assessment of the dehazed outputs; higher values are better for both metrics. The quantitative comparison results on SOTS and Dense-Haze are shown in Table 1. Among those methods, DCP is prior-based and is often regarded as the baseline, and the others are deep learning based. From Table 1, we can observe that the results of the data-based methods are better than those of DCP. Among the data-based methods, AOD-Net is the simplest network, so its results are low but still much higher than those of DCP. RefineDNet is a weakly supervised method, so its results are not as good as those of the supervised methods. Compared to FFA-Net, our results increase by about 1.8% on SOTS and about 4.6% on Dense-Haze because of the multistage and multiattention mechanisms used.
The qualitative comparisons of visual effect on SOTS are shown in Figure 8. We select three images each from the indoor dataset and the outdoor dataset; the upper three rows are indoor results, and the remaining three rows are outdoor results. The first column is the hazy input, the last column is the ground truth, and the middle columns are the dehazed results from DCP, AOD-Net, GCANet, and MSNet (ours), respectively. From the results, we can see that the DCP method suffers from severe color distortion, especially in the blue sky and the halo of the sun in the last image, and it loses some details. AOD-Net cannot remove all the hazy regions from the hazy image because of its simple network architecture. In the second row of Figure 8, the fog near the tree is still there, and in the fifth row of Figure 8, there is still fog near the bridge; the brightness of its outputs is also lower. GCANet also does not perform well, especially on the blue sky and the sun halo in the last two rows of Figure 8. The images recovered by our network are almost entirely in line with the real-scene information; in particular, the restoration of the blue sky and halo regions is much better.
We further give the qualitative comparisons on the real-world dataset [38] in Figure 9, and the results are similar to those on the SOTS dataset. DCP and GCANet still suffer from severe color distortions, such as in the blue sky of rows 1 and 2, and AOD-Net cannot remove the haze completely, so its output images are of low brightness, as in the result of row 3. DCP also cannot remove the haze completely, such as in the sky of row 3. Although none of the methods can completely remove the hazy regions in cases such as the last row of Figure 9, the other methods suffer from color distortions compared to ours, and the restoration of our method is more natural. Overall, our method performs better than the other methods in image details and color fidelity.
The ablation results are listed in Table 2. First, we compare the results of the one-stage and two-stage networks without PONO and SAM. The results of the two-stage network (ID 2) are on average 2.3% higher than those of the one-stage network (ID 1), which indicates the effectiveness of the two-stage network. Second, the results of ID 2 demonstrate the need for two different architectures in the two-stage network: when two different architectures are used, the PSNR increases. Finally, in the comparison for ID 3, "✘" in the table means the module is not used and "✔" means the module is used. We can observe that the result is better when PONO and SAM are used.

Conclusion
In this work, we propose a multistage with multiattention network for image dehazing. The model consists of two different stages: one uses an encoder-decoder subnet to obtain contextual features, and the other adopts a single-scale pipeline to provide spatial image details. Ground-truth supervision is provided at each stage, an attention mechanism is used between the two stages, and a multiattention unit with positional normalization is proposed as the building block of the network. The results on several benchmarks show that the proposed network outperforms the state-of-the-art methods and has a clear advantage over them in terms of image detail and color fidelity.

Data Availability
The data used in this paper are all from public datasets, including RESIDE, Dense-Haze, and a real-world dataset, which can be found in the corresponding references in the paper.

Conflicts of Interest
The authors declare that they have no conflicts of interest.