Two-Stage CNN Model for Joint Demosaicing and Denoising of Burst Bayer Images

In the classical image processing pipeline, demosaicing and denoising are separate steps that may interfere with each other. Joint demosaicing and denoising uses shared image prior information to guide the image recovery process, and jointly optimizing the two problems is expected to give better performance. In addition, recovering images from a burst (continuously exposed images) can further improve image details. This article proposes a two-stage convolutional neural network model for joint demosaicing and denoising of burst Bayer images. The proposed CNN model consists of a single-frame joint demosaicing and denoising module, a multiframe denoising module, and an optional noise estimation module. It requires a two-stage training scheme to ensure that the model converges to a good solution. Experiments on multiframe Bayer images with simulated Gaussian noise show that the proposed method has clear advantages in both performance and speed over similar approaches. Experiments on real multiframe Bayer images verify the denoising effect and detail retention ability of the proposed method.


Introduction
A digital camera usually captures a raw image and uses an image processing pipeline to output a full-color image. The raw image is a digital matrix captured by a camera sensor and determined by a color filter array (CFA) on top of the sensor. Each pixel location of the CFA records only one color among red, green, and blue. Therefore, an interpolation process called demosaicing is required to recover the full-color image with three color channels. Besides, the captured raw image is contaminated with noise. Therefore, a denoising step is also required. As a result, demosaicing and denoising are two separate steps that contribute to the output of a clean full-color image in a traditional image processing pipeline. The major drawback of separating demosaicing and denoising is that they interfere with each other. If demosaicing is performed first, the noise distribution is changed by the interpolation process, which makes it harder for denoising to remove noise. If denoising is performed first, color samples in the raw image are changed, which makes it more difficult for demosaicing to recover full colors.
Recovery of the full-color clean image from the noisy raw image is an ill-posed problem. Prior knowledge about image statistics, or image priors, is required to constrain the solution space of the problem to get reasonable results. Demosaicing and denoising can be jointly performed based on the same image priors [1][2][3][4][5][6][7], which come from three sources. (a) The image priors can be manually designed. Condat et al. [8] use a total variation (TV) prior to enforce image smoothness in joint demosaicing and denoising. Heide et al. [9] propose a minimization model that combines TV priors with BM3D and cross-channel priors to improve the quality of the recovered image. Park et al. [10] introduce a convolutional neural network (CNN) model as a prior to further improve image details. (b) The image priors can be learned from an image dataset. Khashabi et al. [2] use random fields to fit the problem of joint demosaicing and denoising. Klatzer et al. [3] model the problem as a minimization problem and learn from the image dataset to improve performance. Khashabi et al. [2] introduce regression tree fields to learn from image datasets through a specific loss function. Gharbi et al. [1] design a deep convolutional neural network (CNN) model for joint demosaicing and denoising, which is the first to introduce a CNN to solve the problem. Liu et al. [6] propose a density-map guidance to help the model deal with a wide range of frequencies, which improves recovered image quality. Xing et al. [7] carefully study the CNN model structure and loss functions to further improve recovered image quality. (c) The image priors can be extracted from multiple frames of the same scene, that is, from burst photography. Kokkinos et al. [4] propose an iterative framework to optimize a burst of raw images separately processed by Gharbi's CNN model. The method combines the CNN model with burst photography for joint demosaicing and denoising, which improves recovered image quality.
However, there are some drawbacks to Kokkinos's method. First, the CNN model and the iterative framework are not jointly optimized. It means when they optimize the results from image bursts using the iterative framework, the weights of the CNN model are fixed. A natural idea is to jointly optimize the two separate steps and further improve recovery performance. Second, the iterative framework is slow in deployment. If the burst input can be processed by a single CNN model without the iterative process, the running speed in deployment can be significantly improved.
In this article, we propose a unified CNN model to solve the problem of joint demosaicing and denoising of burst images. The model contains three submodules to process a single image [11], multiple frames [12], and noise estimation [13]. With a carefully designed network architecture and a two-stage training strategy, the proposed model outperforms comparative methods in both recovery performance and processing speed. Figure 1 shows a comparison of burst demosaicing and denoising methods on a real burst, which will be further explained in Subsection 3.4.

Problem Formation.
Given noisy multiframe raw images (burst images) of the same scene, the goal of joint demosaicing and denoising is to generate a noise-free and clear linear RGB image corresponding to the scene. Using the image redundancy information in multiple frames, joint demosaicing and denoising of burst images may achieve better image quality than single-frame demosaicing and denoising does.
Let $b$ be a burst of noisy raw images of the same scene captured by continuous exposure, and let $y$ be the noise-free linear RGB image of the scene corresponding to the reference frame in $b$. We construct a training dataset $\{(b_i, y_i) \mid i = 1, \ldots, M\}$ over multiple scenes and learn the joint demosaicing and denoising mapping $F(b; \theta)$ so that

$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{M} \left\lVert F(\mathrm{dm}(b_i); \theta) - y_i \right\rVert_1, \qquad (1)$$

where $\theta$ denotes the model parameters and $F(\mathrm{dm}(\cdot); \theta)$ is a multiframe joint demosaicing and denoising function implemented by a CNN, in which $\mathrm{dm}$ represents a demosaicing function included in the CNN. We use the $l_1$-norm rather than the $l_2$-norm here to suppress blurry effects. The above formalization reflects the main idea of dealing with joint demosaicing and denoising of burst images: first, use the network module $\mathrm{dm}$ to jointly demosaic and denoise each image to obtain a linear RGB image, and then use the network module $F$ to perform multiframe denoising on the obtained multiframe linear RGB images. In this process, each frame of the input image undergoes single-frame and multiframe denoising in two stages, contributing to the final clear linear RGB image.
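As a minimal sketch of the choice of norm discussed above, the following pure-Python snippet computes the mean l1 and l2 distances between a predicted image and its groundtruth. The function names and the flat-list image representation are illustrative assumptions, not part of the original method.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally sized flat pixel lists."""
    assert len(pred) == len(target)
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error between two equally sized flat pixel lists."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dataset_l1(preds, targets):
    """Average the per-image l1 loss over a (hypothetical) training set."""
    return sum(l1_loss(p, t) for p, t in zip(preds, targets)) / len(preds)


# A single large error is penalized linearly by l1 and quadratically by l2.
pred = [0.0, 0.0, 0.0, 2.0]
target = [0.0, 0.0, 0.0, 0.0]
print(l1_loss(pred, target), l2_loss(pred, target))
```

Because the l2 loss grows quadratically with the error, it tends to favor averaged (and hence smoother) predictions, which is one common explanation for the blur that the l1 loss helps to reduce.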

Network Design.
The multiframe joint demosaicing and denoising network is mainly composed of two existing modules. The first is a single-frame joint demosaicing and denoising module, which is implemented using the DRDD [11] network structure. It consists of a series of residual blocks that learn a joint demosaicing and denoising mapping directly. The DRDD network has several design choices that improve image recovery performance [11]. First, it splits the pixels of Bayer images into three color channels as input; in contrast, directly feeding Bayer images to the CNN leads to a significant performance drop. Second, it studies the influence of residual blocks, which prove to be effective. Third, it introduces a noise level map at every residual block to strengthen noise information in the deep part of the CNN, which turns out to be beneficial. The second is a multiframe denoising module, which adopts the multiframe denoising network structure MF-SE-DRDD [12]. It performs end-to-end denoising of a burst of images. The input of the module is a burst of initially denoised RGB images, which goes through a stack of residual blocks followed by a convolution and ReLU [14] activation. It first outputs n intermediate clean estimates, then uses a squeeze-and-excitation (SE) module to weight those estimates channel-wise, and finally uses simple element-wise addition to merge them into the final clean output, which will be further explained in Section 2.4. Besides, the noise level map required for denoising can be estimated by the noise estimation module [13], or it can be input together with the multiframe noise level map.
The overall network structure of the multiframe joint demosaicing and denoising network is shown in Figure 2. First, the n input raw frames each go through the joint demosaicing and denoising module, which outputs the initially denoised multiframe linear RGB images. In the figure, the multiple joint demosaicing and denoising modules share weights.
This step performs demosaicing and first-level denoising on each single-frame image. Then, the initial multiframe linear RGB images are concatenated into a 3n-channel tensor and input to the multiframe denoising module. In the multiframe denoising module, the input tensor passes through a series of residual blocks to obtain intermediate results aligned to the reference frame, which then go through the squeeze-and-excitation (SE) module and are added up to obtain the final linear RGB image. In this step, the redundant information in multiple frames is used for the second level of noise removal. The ablation experiment in Section 3.5 shows that this two-stage denoising design removes the noise in the image more effectively when the noise level is large.
In traditional computer vision methods, raw image demosaicing is usually solved by interpolation, whereas multiframe denoising usually requires multiple frames to be aligned and then weighted and averaged; the two approaches differ greatly. In the trial stage of the network design, we tried to complete these two steps in the same network, but the result was poor. The reason is that a slight error in multiframe alignment causes significant color disorder in the raw image interpolation. Therefore, it is extremely difficult to learn to solve these two problems at the same time. In this article, we use two subnetwork structures to solve these two problems, which reduces the difficulty of the entire problem.
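The first DRDD design choice, splitting Bayer pixels into three color channels, can be sketched as follows. This is only an illustration: the RGGB layout, the zero filling of unsampled positions, and the function name `split_bayer_rggb` are assumptions, since the exact input format of [11] is not given here.

```python
def split_bayer_rggb(raw):
    """Split an H x W Bayer mosaic (assumed RGGB) into three H x W planes.

    Positions where a color was not sampled are filled with zeros.
    """
    h, w = len(raw), len(raw[0])
    r = [[0.0] * w for _ in range(h)]
    g = [[0.0] * w for _ in range(h)]
    b = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            v = raw[y][x]
            if y % 2 == 0 and x % 2 == 0:    # even row, even column: red
                r[y][x] = v
            elif y % 2 == 1 and x % 2 == 1:  # odd row, odd column: blue
                b[y][x] = v
            else:                            # remaining positions: green
                g[y][x] = v
    return r, g, b


raw = [[1.0, 2.0],
       [3.0, 4.0]]
r, g, b = split_bayer_rggb(raw)
print(r, g, b)
```

With this layout each plane keeps the spatial position of its samples, so a convolution sees which color every measurement belongs to instead of a single mixed channel.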
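The final merging step, channel-wise weighting followed by element-wise addition, can be sketched as below. In the network the weights come from the learned SE module; here they are supplied by hand, and the function name and nested-list layout are illustrative assumptions.

```python
def fuse_estimates(estimates, weights):
    """Merge n per-frame RGB estimates into one RGB image.

    estimates[k][c] is the flat pixel list of channel c of frame k.
    weights[k][c] is the scalar weight for that channel (hand-set here,
    produced by the SE module in the actual network).
    """
    n_frames = len(estimates)
    n_channels = len(estimates[0])
    n_pixels = len(estimates[0][0])
    fused = [[0.0] * n_pixels for _ in range(n_channels)]
    for k in range(n_frames):
        for c in range(n_channels):
            w = weights[k][c]
            for p in range(n_pixels):
                fused[c][p] += w * estimates[k][c][p]
    return fused


# Two frames, three channels, two pixels per channel.
frame_a = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
frame_b = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
equal = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
print(fuse_estimates([frame_a, frame_b], equal))
```

Setting the weights of one frame to zero removes it from the output entirely, which illustrates how such a weighting can suppress a poorly aligned frame.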

Synthetic Training Data.
Training data are essential for joint demosaicing and denoising performance. Since it is difficult to get real bursts with groundtruth [15], we synthesize training data in two main steps. First, we simulate clean Bayer images from clean RGB images as described in [12] and add a specific type of noise to the Bayer images to obtain clean and noisy image pairs. Then, we design a frame displacement model to simulate displacement in real bursts.

Generating Clean and Noisy Image Pairs.
Groundtruth training images are required to be clean and richly textured. We select the first 4,000 images from the Waterloo exploration dataset [16] to construct training data. Images are cropped into 128 × 128 nonoverlapping patches with a stride of 256. The remaining images from the Waterloo exploration dataset are used as validation data to select a good model weight.
We train two types of models with different additive noise. The first uses white Gaussian noise for easy quantitative comparison with previous methods; simulating white Gaussian noise is easy since we only have to generate random values that follow a Gaussian distribution with a given noise level. The second uses simulated real noise for qualitative comparison on real noisy bursts.
Noise in raw images can be well modeled by a Poisson-Gaussian distribution [17]:

$$n_p \sim \mathcal{N}\left(0, \sigma_r^2 + \sigma_s y_p\right), \qquad (2)$$

where $n_p$ is the noise in pixel $p$ and $y_p$ is the true image pixel intensity. The noise parameters $\sigma_r$ and $\sigma_s$ are fixed within an image but can vary across images as the sensor gain changes [15].
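The cropping rule above (128 × 128 patches taken with a stride of 256) can be sketched as follows. The helper names `patch_origins` and `crop` and the nested-list image are assumptions made for illustration.

```python
def patch_origins(height, width, patch=128, stride=256):
    """Top-left corners of all patches that fit entirely inside the image."""
    ys = range(0, height - patch + 1, stride)
    xs = range(0, width - patch + 1, stride)
    return [(y, x) for y in ys for x in xs]


def crop(image, y, x, patch=128):
    """Cut a patch x patch block out of a nested-list image."""
    return [row[x:x + patch] for row in image[y:y + patch]]


# A 512 x 512 image yields four patches; since the stride exceeds the
# patch size, neighbouring patches never overlap.
print(patch_origins(512, 512))
```

Because the stride is twice the patch size, only part of each image is sampled, which keeps the patches from one image well separated.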
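A minimal sketch of simulating such signal-dependent noise is given below. It assumes the common Gaussian approximation in which the noise at a pixel has zero mean and variance σ_r² + σ_s·y_p; this form, the function names, and the parameter values are assumptions for illustration rather than the exact settings of this article.

```python
import math
import random


def noise_std(y, sigma_r, sigma_s):
    """Noise standard deviation at a pixel with clean intensity y.

    Assumed variance: sigma_r**2 (read noise) + sigma_s * y (shot noise).
    """
    return math.sqrt(sigma_r ** 2 + sigma_s * y)


def add_noise(pixels, sigma_r, sigma_s, rng):
    """Return a noisy copy of a flat list of clean intensities in [0, 1]."""
    return [y + rng.gauss(0.0, noise_std(y, sigma_r, sigma_s)) for y in pixels]


rng = random.Random(0)               # fixed seed for reproducibility
clean = [0.0, 0.25, 1.0]
noisy = add_noise(clean, sigma_r=0.01, sigma_s=0.0004, rng=rng)
print(noisy)
```

Brighter pixels receive noise with a larger variance, which is the key difference from white Gaussian noise with a single global noise level.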

Frame Displacement Model.
Frame displacement directly affects how well denoising methods can take advantage of multiple frames. If the frame displacement is too large, the frame contributes little to denoising and may degrade performance due to misalignment. If the frame displacement is small, the frame is likely to contribute to the denoising performance. However, not every frame in a burst has a small displacement relative to the reference frame. In short, the generated bursts need to contain frames with both large and small displacements, guiding the model to drop frames with large displacements and take advantage of frames with small displacements.
We design a frame displacement model to ensure that these requirements are fulfilled. Let $d_x$ and $d_y$ be the horizontal and vertical displacement, respectively. They follow a uniform distribution: where $a$ is a scalar parameter and $D(a)$ is decided by the following distribution: where $z \in U(0, 1)$. That is, the upper limit of $d_x$, $d_y$ is randomly chosen between $a$ and 16.
With model (4), we can control the distribution of displacement with a single displacement parameter $a$. A smaller $a$ means that the frames have less displacement and are more similar. Since the ablation study of $a$ has already been done in [12], we fix the parameter as $a = 4$ in this article.
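One possible reading of the displacement model can be sketched as follows. The exact distributions are not reproduced here, so this sketch makes several labeled assumptions: the upper limit is either a or 16, the probability `p_small` of picking a is hypothetical, and displacements are drawn from a uniform distribution on [0, limit].

```python
import random

MAX_SHIFT = 16  # largest upper limit mentioned in the text


def sample_upper_limit(a, rng, p_small=0.5):
    """Pick the displacement upper limit: a or MAX_SHIFT.

    p_small is a hypothetical mixing probability, not a value from the text.
    """
    z = rng.random()  # z drawn from U(0, 1)
    return a if z < p_small else MAX_SHIFT


def sample_displacement(a, rng, p_small=0.5):
    """Draw (d_x, d_y) uniformly from [0, limit] (assumed range)."""
    limit = sample_upper_limit(a, rng, p_small)
    return rng.uniform(0, limit), rng.uniform(0, limit)


rng = random.Random(0)
shifts = [sample_displacement(4, rng) for _ in range(5)]
print(shifts)
```

Under these assumptions a burst mixes frames that stay within a pixels of the reference frame with frames that may drift by up to 16 pixels, which matches the stated goal of exposing the model to both small and large displacements.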

Model Training.
Training of the proposed network is carried out in two stages. In the first stage, only the joint demosaicing and denoising module is trained, and the training method and data generation method are the same as for the DRDD network in [11]. We select around 4,000 images from the Waterloo exploration dataset [16] to build clean-noisy image pairs by adding simulated noise to clean images. The type of added simulated noise can be either Gaussian or Poisson-Gaussian as in (2). With the training image pairs, the proposed network can be optimized using a stochastic gradient descent method. If blind denoising is required, the noise estimation module is also trained at this stage. The denoising module and the noise estimation module are optimized simultaneously.
When solving the blind demosaicing and denoising problem, there are two labels corresponding to the noisy raw image block $b_i$: the noise-free linear RGB image block $y_i$ and the noise level map $n_i$. Recall that the joint demosaicing and denoising module is $\mathrm{dm}(\cdot; \theta_1)$ and the noise prediction module is $G(\cdot; \phi)$; the first-stage loss function can then be written as where $\lambda$ is the scalar weight that balances the noise estimation and denoising terms. We take $\lambda = 0.75$ in this article.
In the second stage, the joint demosaicing and denoising module DRDD and the multiframe denoising module MF-SE-DRDD are jointly trained. Recall that the multiframe denoising module is $F(\cdot; \theta)$; the training loss function of the multiframe denoising model is where $f_i(\cdot; \theta_1)$ denotes the intermediate results before the SE module in Figure 2, namely, a part of the whole mapping $F(\cdot; \theta)$ with $\theta_1 \subset \theta$; $\alpha$ is the parameter that controls the simulated annealing rate, $\beta$ is the initial weight of the two terms in the optimization function, and $t$ represents the current iteration number of the optimization process. As $t$ increases, the weight of the second term in the loss function gradually approaches zero, so that only the first term remains in the loss function. The loss function of the second stage is the sum of the loss function of the first stage and the loss function of the multiframe denoising model: Under the constraint of this loss function, the noise estimation module $G(\cdot; \phi)$, the single-frame joint demosaicing and denoising module $\mathrm{dm}(\cdot; \theta_1)$, and the multiframe denoising module $F(\cdot; \theta)$ are optimized at the same time.
The purpose of staged training is to prevent the network from learning the two inconsistent targets of joint demosaicing and multiframe denoising at the same time without initialization, which leads to jitter in the training process and difficulty in convergence; this will be further studied in the ablation study. After training in stages, the two training subproblems have each already been solved, and the training process is stable and converges easily.
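The annealing behavior described above can be sketched as follows. The weight form β·α^t, the summation over intermediate errors, and the default values of `alpha` and `beta` are assumptions chosen for illustration; the text only states that the weight of the second term decays toward zero as t increases.

```python
def annealed_weight(t, alpha=0.9998, beta=100.0):
    """Weight of the intermediate-result term at iteration t.

    Assumed form beta * alpha**t with 0 < alpha < 1, so it decays to zero.
    """
    return beta * alpha ** t


def multiframe_loss(final_err, intermediate_errs, t, alpha=0.9998, beta=100.0):
    """Combine the final-output error with annealed intermediate errors.

    final_err: error of the merged output against the groundtruth.
    intermediate_errs: errors of the per-frame estimates before the SE module.
    """
    return final_err + annealed_weight(t, alpha, beta) * sum(intermediate_errs)


for t in (0, 10_000, 100_000):
    print(t, annealed_weight(t), multiframe_loss(0.02, [0.03, 0.04, 0.05], t))
```

Early in training the intermediate term dominates and pushes every per-frame estimate toward the groundtruth; once the weight has decayed, only the error of the merged output is optimized.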

Experiments
To evaluate the performance of the proposed method and other comparative methods on multiframe joint demosaicing and denoising tasks, we compare the results of each method on simulated Gaussian noise and on real noisy multiframe raw images, and design an ablation experiment to verify the effectiveness of the two-stage design.
The Kodak, McM [18], and BSD [19] datasets are selected as the evaluation datasets of multiframe joint demosaicing and denoising of Gaussian noise. The noise-free images from the datasets are used as groundtruth. The input simulated multiframe Gaussian noisy images are generated by the training data generation method in this article. When experimenting with multiple frames of real raw images, this section selects the HDR+ [20] dataset taken by Google mobile phones.

Compared Methods.
Several comparison methods are selected. The first is a combination of classic methods: Bayer interpolation [21] followed by V-BM4D [22]. This method first performs Bayer interpolation on each frame of the raw image to complete the demosaicing, turning the problem into a multiframe denoising problem; then, the classic V-BM4D method is used to perform multiframe denoising. The combination of these two classic methods is an important reference for measuring the performance of multiframe demosaicing and denoising. The second method for comparison is BDNet [23] from ECCV 2020. This method develops an alternating learning scheme to learn to align adjacent frames and to denoise static frames separately, and applies the learned model to real-world dynamic sequences. The third is a method from ICCV 2019, referred to as M2M [24]. This method first performs single-frame joint demosaicing and denoising for each frame of the raw image based on DeepJoint [1], and then performs an unsupervised adjustment through iterative optimization to obtain the joint demosaicing and denoising result.

Evaluation Metrics.
The groundtruth is available in the Gaussian noise experiment, so the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are used as evaluation metrics. For the denoising results of real multiframe raw images, this section demonstrates the superiority of the proposed method through qualitative analysis and comparison of image details.
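For reference, PSNR follows directly from the mean squared error between a result and its groundtruth. The sketch below uses the standard definition; the function name, the flat-list image representation, and the default peak value of 255 are illustrative assumptions.

```python
import math


def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    assert len(pred) == len(target)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] intensity scale gives about 20 dB.
print(psnr([0.1, 0.6], [0.0, 0.5], peak=1.0))
```

Since PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to roughly a 21% reduction of that error.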

Quantitative Comparison on Simulated Multiframe Raw Images.
Table 1 compares Bayer interpolation [21] + V-BM4D [22], BDNet [23], M2M [24], and the proposed TwoStage method on three datasets, with the compared methods run from their source code. The input is three frames of simulated Gaussian noise Bayer images, and the displacement parameter is a = 4.
With a total of 9 test groups (three datasets and three noise levels), the proposed method ranks first in 8, and the performance improvement increases as the noise level rises. When the noise level is σ = 25, the proposed method has an average PSNR improvement of more than 0.8 dB on the Kodak and BSD500 datasets compared to the second-place method M2M, and the average SSIM improvement exceeds 0.038. On the McM dataset, the average performance of the proposed method is weaker than that of M2M at σ = 5 and close to that of M2M at σ = 15, and the advantage of the proposed method is still obvious when σ = 25. These results show that the proposed method has the best average performance in most test situations of multiframe demosaicing and denoising tasks, and has a stable performance advantage.

Qualitative Comparison on Simulated Multiframe Raw Images.
This subsection shows the details of the demosaicing and denoising results of each method to illustrate its performance. Figure 3 shows the three-frame demosaicing and denoising results of each method when the noise level on the Kodak dataset is 25. It can be seen that in (a), the Bayer interpolation + V-BM4D method cannot effectively remove the noise. The reason may be that the Bayer interpolation changes the noise distribution, which interferes with V-BM4D.
Figure 4 shows the three-frame demosaicing and denoising results when the noise level is 25 on the McM dataset. Bayer interpolation with the V-BM4D method in (a) has a poor denoising effect. The result of BDNet in (b) is blurry with slight changes in brightness. Although the M2M method in (c) removes most of the noise, it leaves more obvious denoising artifacts on flat surfaces such as walls, which affects the visual effect. The proposed method in (d) effectively removes noise, leaves no residual noise on flat areas such as walls, and has the best visual effect. (e) is the groundtruth of the scene.
Figure 5 shows the three-frame demosaicing and denoising results when the noise level is 25 on the BSD500 dataset. Bayer interpolation with the V-BM4D method in (a) cannot effectively remove noise. The result of BDNet in (b) is blurry; the M2M method in (c) leaves more noise residues and burrs at the edges of the apes. The proposed method in (d) has a better denoising effect and fewer defects. (e) is the groundtruth of the scene.
Through the qualitative analysis in this section, it can be seen that the traditional combination of Bayer interpolation with V-BM4D cannot effectively deal with the multiframe demosaicing and denoising problem, and its denoising effect is poor; BDNet is not good at aligning frames in the test case, which generates blurry results with some artifacts; the M2M method can effectively suppress the Moiré problem during demosaicing and preserves more details in some areas, but it tends to leave denoising artifacts in flat areas; the proposed method can also effectively suppress Moiré, with good visual effect and no artifacts.
Computational Intelligence and Neuroscience

Qualitative Comparison on Real Multiframe Raw Images.
This subsection illustrates the performance of each method in processing real multiframe raw images through a qualitative comparison of denoising details. The noise estimation module of the proposed method can obtain the noise level maps of scenes, and the average noise level is then calculated as input for the other two comparison methods. The test multiframe images come from the HDR+ dataset. The original data are in dng format. First, we use the open-source tool DCRAW to convert the dng format to tiff format. Raw image pixels change from 16 bits to 8 bits during the conversion process. There is no other change to the pixel values of the raw images. Figures 1 and 6 show the demosaicing and denoising results of several groups of real multiframe raw images with a sequence length of 3. Figure 1(a) is a larger view of a TwoStage result. (b), (c), (d), and (e) are enlarged parts of the results of the three compared methods and the proposed method. We can find that the grid details in (b) are fuzzy and invisible; the grid details in (c) are much worse than those in (b); the grid details in (d) are better than those in (b) but are still unclear. The grid details in (e) are kept relatively complete. Therefore, the proposed method has the best detail retention ability in this test scenario. In summary, V-BM4D cannot remove noise effectively; BDNet can remove noise but tends to generate blurry results; the results of M2M are less blurry; our proposed TwoStage method generates results with the best visual quality on real multiframe raw images in the test.

Ablation Study.
In the multiframe demosaicing and denoising experiment, the proposed model is mainly composed of DRDD, which is responsible for single-frame joint demosaicing and denoising, and MF-SE-DRDD, which is responsible for multiframe denoising. The ablation experiment tests the following: (1) replacing DRDD in the model with the classic Bayer interpolation method to explore its necessity; (2) using DRDD to directly do single-frame joint demosaicing and denoising to explore the necessity of multiframe denoising; and (3) comparing one-stage with two-stage training to demonstrate the effect of two-stage training. Table 2 lists the results of the first ablation experiment. Replacing DRDD with the classical interpolation leads to a performance drop in 8 out of 9 test cases. Besides, the higher the noise level, the more the performance is reduced. Specifically, when σ = 25 on the Kodak dataset, the average PSNR degradation reaches 1.05 dB, and the average SSIM degradation reaches 0.0224. This shows that DRDD plays an important role in the method, especially under large noise.
Comparing the proposed method with the single-frame denoising method DRDD in the second experiment, it can be found that when σ = 25, there is a consistent performance improvement on different datasets; when the noise level is σ = 15, there is a slight performance gain on the Kodak dataset and a slight performance decrease on the remaining datasets. This shows that the multiframe method gives a larger performance improvement when the noise level is large, and the single-frame method is more appropriate when the noise level is small.
In the third experiment, we conduct a one-stage training experiment for comparison with the two-stage one. Recall that the whole training process takes 2,000 epochs, and each stage takes 1,000 epochs. The one-stage comparison experiment uses the same training settings. It can be found in Figure 7(a) that one-stage training takes around 900 epochs before the training and validation losses reach a reasonable value range. In comparison, two-stage training keeps the training and validation losses in a low value range during the whole training process, as shown in Figure 7(b). The validation losses of one-stage and two-stage training are plotted together in Figure 7(c). It is clear that the validation loss of two-stage training converges more quickly and to a lower level, which suggests that two-stage training is a better policy.
Running Time.
Table 3 lists the average running time on three frames. Among them, interpolation + V-BM4D runs in MATLAB on the CPU, and M2M and TwoStage run in the PyTorch framework on the GPU. Running times are measured on a desktop computer with an Intel I7-5390K CPU and an Nvidia GTX 1080Ti GPU. The first method does not run on the same device as the latter two methods, so its running time is listed only for reference.
It can be seen that the proposed TwoStage has an advantage over M2M in speed. This is because M2M is iteratively optimized, and it takes more time even on the GPU. The proposed method only needs to run model inference and is much faster. BDNet is the fastest method.

Conclusion
A two-stage demosaicing and denoising method for burst images is proposed. The basic idea is to do joint demosaicing and denoising on single frames first, and then to do multiframe denoising on the initial results. In this process, each frame of the input image undergoes single-frame and multiframe denoising in two stages, contributing to the final denoised linear RGB image. For the network design, this article proposes a two-stage training method to ensure that the model converges to a good solution. Experiments on multiframe Bayer images with simulated Gaussian noise show that the proposed method has clear advantages in both performance and speed over similar methods. Experiments on real multiframe Bayer images verify the denoising effect and detail retention ability of the proposed method. The ablation study shows the effectiveness of each CNN module.

Conflicts of Interest
The authors declare that they have no conflicts of interest.