Image Demosaicing Based on Generative Adversarial Network

Digital cameras with a single sensor use a color filter array (CFA) that captures only one color component at each pixel. Therefore, noise and artifacts are generated when the color image is reconstructed, which reduces the resolution of the image. In this paper, we propose an image demosaicing method based on a generative adversarial network (GAN) to obtain high-quality color images. The proposed network does not need any initial interpolation process in the data preparation phase, which can greatly reduce the computational complexity. The generator of the GAN is designed using the U-net to directly generate the demosaiced images. A dense residual network is used for the discriminator to improve the discriminant ability of the network. We compared the proposed method with several interpolation-based algorithms and the DnCNN. The comparative experiments showed that the proposed method can more effectively eliminate image artifacts and can better recover the color image.


Introduction
Images are widely used in people's daily life. Compared to analog images, digital images offer higher resolution and easier storage, and they are more suitable for computer processing. With the development of computer technology, digital imaging technology has attracted a great deal of attention, and digital cameras have gradually become the mainstream imaging equipment, widely used in intelligent transportation [1,2], medical imaging [3,4], remote sensing technology [5,6], and other fields. In daily life, digital color images are most commonly used; they include three color components, that is, red, green, and blue, at each pixel. Ideally, digital cameras with three sensors can obtain full-color images, with each sensor capturing one color component and the three components combined into a color image. However, in practice, the arrangement of the three color sensors affects the subsequent color synthesis, and cameras with three sensors are usually expensive and relatively large. Therefore, most digital cameras use a single sensor with a color filter array (CFA) placed in front of the sensor. The obtained CFA image needs to be processed to acquire the full-color image, and this process is known as image demosaicing [7]. As only one color component is captured for each pixel in the CFA, without image demosaicing, the CFA image can only reflect the general outline of the scenery instead of the complete color information, which consequently affects subsequent image processing [8].
CFA image demosaicing is essentially an ill-posed inverse problem [9]. The methods for image demosaicing generally include interpolation-based algorithms and learning-based algorithms. Generally, image demosaicing using interpolation methods can achieve high accuracy for smooth areas with approximately the same colors and gradually varying brightness. For color images, the red, green, and blue components occupy different color channels. When the high-frequency signals change (high-frequency signal/information refers to regions with strong color variation, such as edges and angles), there may be spatial offsets in each color channel. Therefore, the reconstructed images may display color artifacts and zippering after interpolation [7]. In addition, some traditional interpolation-based methods ignore the correlation among different color channels, which results in unsmooth images [8]. On the whole, the interpolation-based algorithms still have some limitations for image demosaicing, especially in high-frequency areas.
In recent years, neural networks have developed rapidly and have been widely used in image processing tasks such as image classification [10,11], motion recognition [12,13], and image super-resolution [14,15]. The generative adversarial network (GAN) [16] was proposed more recently and has rapidly attracted the attention of many researchers. Ledig et al. [17] proposed a super-resolution generative adversarial network (SRGAN), which used a deep residual network for training and can recover image textures well from heavily downsampled images. Inspired by super-resolution image reconstruction and the conditional generative adversarial network (CGAN), Kupyn et al. [18] applied the CGAN to image deblurring and effectively restored clear images. Pan et al. [19] proposed a physics model constrained learning algorithm that guides the estimation of the specific task in the conventional GAN framework and can directly solve image restoration problems (such as image deblurring and image denoising).
GAN has been used and has played important roles in several areas; however, it has not been used for image demosaicing. In this paper, we propose a novel learning-based image demosaicing method using GAN to improve the ability for color image recovery. Our contributions are as follows:
(1) We proposed a CFA image demosaicing method based on GAN.
(2) We carefully designed each part of the GAN model.
(3) We introduced long jump connections into the improved U-net [20] model to design the generator.
(4) We used the dense residual network, which includes dense residual blocks with long jump links and dense connections, for the discriminator.
(5) We combined the adversarial loss, the feature loss, and the pixel loss to further strengthen the network performance.
In the experimental section, we show the performance of our method using comparative experiments. The results show that the proposed method can more effectively remove artifacts and recover the full-color image, especially in high-frequency areas such as edges and angles.

Interpolation-Based Algorithms.
There are many interpolation-based methods for image demosaicing. The linear interpolation algorithm is the simplest one. However, this method often causes artifacts and blurring at the image edges [21]. The bilinear interpolation algorithm [22] estimates the unknown pixels from their adjacent pixels. This method often causes color distortion in the reconstructed image. Malvar et al. [23] proposed a high-quality linear (HQL) interpolation algorithm, which can greatly reduce the computational complexity. However, artifacts still occur at the high-frequency components of the image. In order to further reduce the artifacts, different interpolation techniques were proposed.
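To make the bilinear baseline concrete, the following is a minimal NumPy sketch, not the implementation used in [22]. It assumes an RGGB Bayer layout and an image of at least 2 × 2 pixels; the function names `bayer_masks`, `correlate_3x3`, and `bilinear_demosaic` are hypothetical. Each missing sample is replaced by a weighted average of the known samples of the same color in its 3 × 3 neighborhood, and the weights are renormalized so that border pixels are handled without special cases.

```python
import numpy as np

def bayer_masks(height, width):
    """Boolean sampling masks (H x W x 3) for an assumed RGGB Bayer layout."""
    rows, cols = np.indices((height, width))
    red = (rows % 2 == 0) & (cols % 2 == 0)
    blue = (rows % 2 == 1) & (cols % 2 == 1)
    green = ~(red | blue)
    return np.stack([red, green, blue], axis=-1)

def correlate_3x3(plane, kernel):
    """Correlate a 2-D plane with a 3x3 kernel using zero padding."""
    padded = np.pad(plane, 1)
    height, width = plane.shape
    out = np.zeros((height, width))
    for dr in range(3):
        for dc in range(3):
            out += kernel[dr, dc] * padded[dr:dr + height, dc:dc + width]
    return out

def bilinear_demosaic(cfa):
    """Estimate each missing sample from the known neighbours of that colour."""
    # Red and blue sit on a quarter-density grid, green on a half-density grid.
    kernel_rb = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]])
    kernel_g = np.array([[0.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 0.0]])
    kernels = [kernel_rb, kernel_g, kernel_rb]
    masks = bayer_masks(*cfa.shape)
    rgb = np.empty(cfa.shape + (3,))
    for ch in range(3):
        mask = masks[..., ch].astype(float)
        weighted = correlate_3x3(cfa * mask, kernels[ch])
        weights = correlate_3x3(mask, kernels[ch])  # renormalise at borders
        rgb[..., ch] = weighted / weights
    return rgb
```

Known samples are left unchanged and a uniformly colored image is recovered exactly; the color distortion mentioned above appears only where neighboring samples differ strongly, for example across edges.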
Within the gradient-based schemes, Hamilton and Adam [24] proposed the Hamilton-Adam algorithm, which uses the second derivative of the sampled color channels when doing interpolation. Therefore, this method considers the correlation among different color channels and significantly improves the image details. Mukherjee et al. [25] proposed a two-line (TL) interpolation algorithm, which used the homogeneity of the cross-ratios of different spectral components around a small neighborhood to interpolate the pixels lying in the low gradient directions, so as to produce high-quality images.
Within the directional interpolation schemes, Chung and Chan [26] used a prior decision between horizontal interpolation and vertical interpolation and obtained the interpolation result according to the trend of the image edges. This method is prone to producing false colors at tiny edges, especially when the edges are not in the horizontal or vertical directions. Zhang et al. [27] proposed a local directional interpolation and nonlocal adaptive thresholding (LDI-NAT) algorithm. This method used the nonlocal redundancy of the image to improve the local color reproduction and can better reconstruct the edges and reduce color artifacts.
Within the residual interpolation schemes, Kiku et al. [28] proposed a minimized-Laplacian residual interpolation (MLRI) algorithm. This method estimated the tentative pixel values by minimizing the Laplacian energy of the residuals, which can effectively reduce the color artifacts. Monno et al. [29] proposed an adaptive residual interpolation (ARI) algorithm, which adaptively selects a suitable iteration number and combines two different types of residual interpolation algorithms at each pixel. Kiku et al. [30] incorporated the residual interpolation algorithm into the gradient-based threshold free (RI-GBTF) algorithm, and the interpolation accuracy was greatly improved. Besides, L. Zhang and D. Zhang [21] proposed a joint demosaicing-zooming scheme. This method used the hyperspectral spatial correlation of the CFA image to calculate the color differences, so as to restore the three color components, which can effectively eliminate color artifacts.

Learning-Based Algorithms.
Recently, neural networks have also been used for image demosaicing. Prakash et al. [31] used a denoising convolutional neural network (DnCNN) to perform demosaicing and denoising independently, which effectively suppressed the noise and artifacts. Tan et al. [32] used a deep residual network for image demosaicing and image denoising, which also effectively obtained high-resolution color images. Shopovska et al. [33] proposed an improved residual U-net and used it for image demosaicing, which achieved high-quality reconstructed color images for different CFA patterns. Generally, the learning-based strategies can achieve better performance than the traditional interpolation-based methods. However, higher-resolution and clearer recovered color images remain the constant pursuit of image demosaicing, which is why we apply GAN to this task.

CFA Image.
To obtain a color image with a detailed description of the natural scene, the best solution is to use three sensors that receive the red, green, and blue components for each pixel, respectively. The color image can then be synthesized by combining the three color components. Considering the cost and volume, most digital cameras use a single image sensor in the image acquisition system. The image acquisition of a camera with a single sensor is shown in Figure 1. The CFA is placed in front of the sensor. For a common CFA, such as the Bayer pattern [34] that is used in this work, the light reaching the sensor mainly consists of the red, green, and blue components. Within the CFA, each pixel only receives one color component. As shown in Figure 1, the obtained Bayer pattern image can only give the approximate gray outline of the scenery instead of the complete color information. The color arrangement of the Bayer pattern can be clearly seen in the locally zoomed-in area. In the Bayer pattern, rows of alternating red and green filters and rows of alternating green and blue filters are interleaved. The number of green pixels is 1/2 of the total number of pixels, while the numbers of red and blue pixels are both 1/4 of the total number of pixels. As only one color component is captured for each pixel, the other two color components need to be recovered according to the color information from adjacent pixels; a full-color image is then obtained from the CFA image. This process is called image demosaicing.
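The sampling described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the RGGB ordering of the 2 × 2 Bayer cell and the function names `bayer_mosaic` and `bayer_fractions` are assumptions, since the text does not state which of the four Bayer orderings is used.

```python
import numpy as np

def bayer_mosaic(rgb):
    """Sample an H x W x 3 image through an assumed RGGB Bayer CFA."""
    height, width, _ = rgb.shape
    cfa = np.zeros((height, width), dtype=rgb.dtype)
    cfa[0::2, 0::2] = rgb[0::2, 0::2, 0]  # red: even rows, even columns
    cfa[0::2, 1::2] = rgb[0::2, 1::2, 1]  # green: even rows, odd columns
    cfa[1::2, 0::2] = rgb[1::2, 0::2, 1]  # green: odd rows, even columns
    cfa[1::2, 1::2] = rgb[1::2, 1::2, 2]  # blue: odd rows, odd columns
    return cfa

def bayer_fractions(height, width):
    """Fractions of red, green, and blue sites in the assumed RGGB layout."""
    rows, cols = np.indices((height, width))
    red = (rows % 2 == 0) & (cols % 2 == 0)
    blue = (rows % 2 == 1) & (cols % 2 == 1)
    green = ~(red | blue)
    total = height * width
    return red.sum() / total, green.sum() / total, blue.sum() / total
```

For an image with even height and width, `bayer_fractions` returns 1/4, 1/2, and 1/4 for red, green, and blue, matching the proportions stated above.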

Theory of GAN.
GAN is a probabilistic generative network that was first introduced into the deep learning field by Goodfellow et al. [16]. The general architecture of GAN is shown in Figure 2. It consists of a generator G and a discriminator D. G performs inverse transform sampling of the probability distribution and captures the distribution of the ground truth data x. Based on noise data z that obeys a certain distribution (such as the Gaussian distribution), G generates a fake sample G(z) similar to x. The output of D represents the probability that the incoming data are real. Thus, if the input is x, the output D(x) is a large probability value; otherwise, D outputs a small probability. The training process of GAN maximizes the discrimination accuracy by training D and minimizes the difference between the generated sample and the real sample by training G. Thus, the training of G and D is a min-max game. The performances of G and D are improved by alternating optimization. Finally, G and D reach a Nash equilibrium, so that the data distribution synthesized by G is similar to that of the ground truth data x. The loss function of the above process is defined as
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_{\mathrm{noise}}(z)}[\log(1 - D(G(z)))], \tag{1}$$
where V(D, G) represents the value function [16].
$x \sim P_{\mathrm{data}}(x)$ represents ground truth data x obeying the real data distribution $P_{\mathrm{data}}(x)$, and $z \sim P_{\mathrm{noise}}(z)$ represents noise data z obeying a simulated distribution $P_{\mathrm{noise}}(z)$ (such as the Gaussian distribution). D(x) and D(G(z)) are the classification outputs of D for the ground truth data x and the generated data G(z), respectively. $\mathbb{E}$ denotes expectation.
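As a small numerical illustration of the value function V(D, G), the sketch below estimates the two expectations by sample means over batches of discriminator outputs. The function name `gan_value` is hypothetical, and the inputs are assumed to lie strictly between 0 and 1.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Sample estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real holds D(x) for ground truth samples, d_fake holds D(G(z)).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
```

A discriminator that separates the two sets well (D(x) near 1, D(G(z)) near 0) pushes the value toward 0 from below, which is what D maximizes. When D can no longer tell the samples apart and outputs 0.5 everywhere, the value equals −log 4, the equilibrium value derived in [16].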

The Proposed Method
In this section, we propose an effective demosaicing algorithm based on GAN. The whole process is shown in Figure 3. The proposed algorithm first extracts the red, green, and blue components from the original CFA image to form the 3-channeled split CFA image. Then the extracted green component is further separated into two channels to form the 4-channeled split CFA image. Subsequently, the algorithm keeps only the nonzero pixel values to compress the 4-channeled split CFA image. The compressed 4-channeled image is taken as the input of G in the GAN. The output of G is the interpolated 3-channeled full-color image. The output images from G and the ground truth images are then fed into D. The parameters of G are optimized according to the output of D. We designed the architectures of G and D and trained the network end to end on the database. In addition, the algorithm combines the adversarial loss, the pixel loss, and the feature loss to design the generator loss function in order to further improve the network performance [35]. In the following, we give a detailed introduction to the different parts of the network.
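The split-and-compress step described above can be sketched as follows. This is a sketch under stated assumptions rather than the authors' code: it assumes an RGGB Bayer layout and even image dimensions, and the function names `pack_bayer` and `unpack_bayer` are hypothetical. Because each of the four split channels is nonzero at exactly one position of every 2 × 2 cell, keeping only the sampled positions is equivalent to taking strided slices, which yields a half-resolution image with four channels.

```python
import numpy as np

def pack_bayer(cfa):
    """Pack an H x W Bayer image into an (H/2) x (W/2) x 4 array (RGGB assumed)."""
    height, width = cfa.shape
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return np.stack([cfa[0::2, 0::2],   # red
                     cfa[0::2, 1::2],   # green on the red rows
                     cfa[1::2, 0::2],   # green on the blue rows
                     cfa[1::2, 1::2]],  # blue
                    axis=-1)

def unpack_bayer(packed):
    """Inverse of pack_bayer: scatter the four planes back onto the mosaic."""
    half_h, half_w, _ = packed.shape
    cfa = np.empty((2 * half_h, 2 * half_w), dtype=packed.dtype)
    cfa[0::2, 0::2] = packed[..., 0]
    cfa[0::2, 1::2] = packed[..., 1]
    cfa[1::2, 0::2] = packed[..., 2]
    cfa[1::2, 1::2] = packed[..., 3]
    return cfa
```

Slicing by position rather than testing for nonzero values also keeps genuinely black samples, which a literal "not 0" test would drop. The packing is lossless, so no initial interpolation is needed before the data reach the generator.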

Generator.
The purpose of the generator G is to convert the 4-channeled compressed CFA images to the 3-channeled output full-color images. The structure of G is shown in Figure 4. We used the improved U-net [20] model for G. Overall, the generator G consists of an encoder (the first half) and a decoder (the second half). One layer in the encoder and the corresponding layer in the decoder form a U-shaped symmetric layer. The long jump links within each symmetric layer in the U-net model can reduce the information redundancy. Besides, we remove the pooling layer in the U-net, which can avoid the loss of useful information in the feature maps and increase the stability of the training process. The encoder is mainly based on the downsampling operation (i.e., the convolution operation). It analyzes the input data to obtain the most significant features and provides feature mappings to its corresponding layer in the decoder. The activation function of the encoder is a leaky rectified linear unit (LReLU), which is defined as
$$\mathrm{LReLU}(x_e) = \begin{cases} x_e, & x_e \ge 0, \\ \kappa x_e, & x_e < 0, \end{cases} \tag{2}$$
where κ is a positive constant (κ ∈ (0, 1)) and $x_e$ represents the input vectors for a specific layer of the encoder. In our experiments, we set κ as 0.1. The decoder is mainly based on the upsampling operation (i.e., the deconvolution operation) to restore the full-color images. The activation function of the decoder is a standard rectified linear unit (ReLU), which is defined as
$$\mathrm{ReLU}(x_d) = \max(0, x_d), \tag{3}$$
where $x_d$ represents the input vectors for a specific layer of the decoder.

Mathematical Problems in Engineering
For the final layer of the decoder, the activation function is the tanh function, which is defined as
$$\tanh(x_f) = \frac{e^{x_f} - e^{-x_f}}{e^{x_f} + e^{-x_f}}, \tag{4}$$
where $x_f$ represents the input vectors for the final layer of the decoder.
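The three activation functions used in the generator can be written compactly in NumPy. The sketch below uses the standard definitions with κ = 0.1 as stated above; the function names are hypothetical.

```python
import numpy as np

def leaky_relu(x, kappa=0.1):
    """LReLU (encoder): identity for x >= 0, slope kappa for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, kappa * x)

def relu(x):
    """ReLU (decoder): negative inputs are clipped to zero."""
    return np.maximum(0.0, np.asarray(x, dtype=float))

def tanh(x):
    """tanh (final decoder layer): squashes inputs into (-1, 1)."""
    return np.tanh(np.asarray(x, dtype=float))
```

Because tanh maps to the interval (−1, 1), the ground truth images would presumably need to be scaled to the same range for the pixel-wise comparison to be meaningful; this scaling is an inference and is not stated in the text.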
In order to accelerate the convergence and improve the network performance, we introduce the batch normalization (BN) operation after each convolution and deconvolution operation to reduce the internal covariate shift and the sensitivity of the network to the initialization weights [36].
Detailed parameters for the convolution and deconvolution layers are shown in Table 1.

Discriminator.
We used a dense residual network, which is inspired by the ResNet [36], for the discriminator D. The ResNet is formed by stacking multiple consecutive residual blocks (RB).
In order to improve the network performance and solve the problem of gradient disappearance and gradient dispersion during the network training, we used an improved residual dense block (RDB). The structure of D is shown in Figure 5. The long jump connection after each RDB helps to transfer the output of this RDB to the final convolution layer. Within each RDB, there are several units, with each unit consisting of the ReLU activation function, the convolution layer, and the BN operation. There are dense connections with different distances among these units. The output from the final convolution layer is mapped into the range (0, 1) using the sigmoid activation function. The sigmoid function performs a probability analysis that normalizes the discriminant result, and it is defined as
$$\mathrm{sigmoid}(x_s) = \frac{1}{1 + e^{-x_s}}, \tag{5}$$
where $x_s$ represents the input vectors for the sigmoid function.
For the convolution layers in D, the kernel size is set as 3 × 3, the stride is set as 1 × 1, and the number of output channels is 64.

Loss Function.
Denote the ground truth images as $x_i$, $i = 1, 2, \ldots, N$, where N represents the number of images. After a series of operations, the CFA images are transformed into the corresponding 4-channeled compressed CFA images, which are denoted as $y_i$, $i = 1, 2, \ldots, N$, and are taken as the input of G. Following the loss function of Alsaiari et al. [35], we combined the adversarial loss, the feature loss, and the pixel loss with appropriate weights to form the final loss function for the generator. The adversarial loss function ($L_a$) is expressed as
$$L_a = \frac{1}{N} \sum_{i=1}^{N} -\log D(G(y_i)), \tag{6}$$
where $y_i$ represents the i-th 4-channeled compressed CFA image. Using Equation (6), the generator G is driven to produce 3-channeled color images that fool the discriminator D. The feature loss function ($L_f$) is defined as
$$L_f = \frac{1}{N} \sum_{i=1}^{N} \| F(G(y_i)) - F(x_i) \|_2, \tag{7}$$
where F(·) represents the feature mapping matrix extracted from the pretrained VGG network [35] and $\| \cdot \|_2$ represents the L2 norm. Using Equation (7), we can extract the image features and restore the image details by comparing the feature data of the generated image $G(y_i)$ and the ground truth image $x_i$. The pixel loss (pixel-to-pixel Euclidean distance) function ($L_p$) is defined as
$$L_p = \frac{1}{N} \sum_{i=1}^{N} \left( \| G(y_i) - x_i \|_2 + \lambda \| \nabla G(y_i) \|_2 \right), \tag{8}$$
where $\| \nabla G(y_i) \|_2$ is the regularization term, with λ representing the regularization weight. Using Equation (8), we can correctly restore the image information by comparing each pixel of the generated image $G(y_i)$ with that of the ground truth image $x_i$. We combine $L_a$, $L_f$, and $L_p$ with appropriate weights to form the final loss function for the generator, which is defined as
$$L_G = \alpha L_a + \beta L_f + \gamma L_p, \tag{9}$$
where α, β, and γ represent predefined positive weights chosen according to empirical values [35]. According to Equation (1), the discriminator D uses the following equation to update its parameters:
$$L_D = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log D(x_i) + \log\left(1 - D(G(y_i))\right) \right]. \tag{10}$$
For the ground truth image $x_i$, the output probability $D(x_i)$ is close to 1. For the generated image $G(y_i)$, the output probability $D(G(y_i))$ is close to 0.
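The way the three generator loss terms are weighted and combined can be illustrated with a short NumPy sketch. It is not the training code of the paper and rests on several assumptions: the adversarial term is taken as the mean of −log D(G(y_i)), the feature and pixel terms use plain (unsquared) L2 norms, the regularization term is the L2 norm of forward differences, and `feature_fn` is a hypothetical stand-in for the pretrained VGG feature extractor. The default weights are the values reported in the training details (α = 0.5, β = 1, γ = 1, λ = 10⁻⁵).

```python
import numpy as np

def gradient_norm(image):
    """L2 norm of the forward-difference image gradient (assumed regulariser)."""
    dy = np.diff(image, axis=0)
    dx = np.diff(image, axis=1)
    return np.sqrt(np.sum(dy ** 2) + np.sum(dx ** 2))

def generator_loss(generated, truth, d_fake, feature_fn,
                   alpha=0.5, beta=1.0, gamma=1.0, lam=1e-5, eps=1e-12):
    """Weighted sum of adversarial, feature, and pixel losses over a batch.

    generated, truth: sequences of H x W x 3 arrays (G(y_i) and x_i).
    d_fake: discriminator outputs D(G(y_i)) in (0, 1].
    feature_fn: hypothetical feature extractor standing in for VGG.
    """
    n = len(generated)
    adv = -np.mean(np.log(np.asarray(d_fake, dtype=float) + eps))
    feat = sum(np.linalg.norm(feature_fn(g) - feature_fn(x))
               for g, x in zip(generated, truth)) / n
    pix = sum(np.linalg.norm(g - x) + lam * gradient_norm(g)
              for g, x in zip(generated, truth)) / n
    return alpha * adv + beta * feat + gamma * pix
```

With these weights the adversarial term contributes half as strongly as the feature and pixel terms, and the regularization acts only as a very mild smoothness prior.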
Based on the above strategy, the generator G and the discriminator D will be alternately optimized.
Based on the above introduction, we give the whole pipeline in Figure 6 to clearly describe the proposed method. The real scenery is captured by the camera and converted to the CFA image. The obtained CFA image is then converted to the 4-channeled compressed CFA image, which is fed into the generator designed with the U-net model. The output of the generator and the ground truth image are then input into the discriminator, which is designed with the dense residual network. Through the network training, the generator finally produces a near-real demosaiced image.

Experiments
In this section, we demonstrate the performance of the proposed network with numerical experiments. The network was trained on a machine with a GPU and an Intel Core i5-8265U CPU. The training sets are created beforehand and then loaded into TensorFlow.

Training Details.
The training database used in this paper is the Waterloo Exploration database (WED) [37], which contains 4744 pristine natural images. We first randomly selected 400 images to create the training set. For the training set, we used data augmentation operations such as cropping and rotation to increase the number of images. To be more specific, we first rescaled each selected image by factors of 1, 0.9, 0.8, and 0.7 and then used a sliding window to crop the scaled images into patches with a size of 40 × 40 pixels. The sliding step lengths in the horizontal and vertical directions are both 20 pixels. Subsequently, the obtained patches are sequentially flipped vertically and horizontally and rotated by 90°, 180°, and 270°, respectively, as shown in Figure 7. Through the above data augmentation operations, we obtained 86400 training images. These images are input in batches during the network training process to reduce the computation and avoid local extreme value problems. During the training, the weighting parameters in the loss functions are set as α = 0.5, β = 1, γ = 1, and λ = 10⁻⁵ according to the empirical values [35]. The batch size is set as 256 and there are 200 iterations for the whole network training. During the training, we used a variable learning rate, where the initial learning rate is set as 0.01 and the value is reduced by 1/10 every 40 iterations. The trained network is tested with the Kodak database and the McMaster database [38]. The Kodak database consists of 24 images with a size of 768 × 512 pixels. The McMaster database consists of 18 images with a size of 500 × 500 pixels.
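Two pieces of the training setup lend themselves to a short sketch: the step learning-rate schedule and the number of 40 × 40 patches a sliding window with a 20-pixel step produces. The code below is illustrative only. It reads "reduced by 1/10 every 40 iterations" as multiplying the rate by 0.1, which is an assumption, and the function names are hypothetical.

```python
def learning_rate(iteration, initial=0.01, drop=0.1, step=40):
    """Step decay: multiply the rate by `drop` once every `step` iterations."""
    return initial * drop ** (iteration // step)

def patches_per_image(height, width, patch=40, stride=20):
    """Number of sliding-window patches that fit in a height x width image."""
    if height < patch or width < patch:
        return 0
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows * cols
```

Under this reading, the rate stays at 0.01 for iterations 0 to 39, drops to 0.001 at iteration 40, and has been reduced four times by the end of the 200 iterations. For the patch count, a 100 × 60 image, for example, yields 4 × 2 = 8 patches before flipping and rotation.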
In order to quantitatively evaluate the performance of the proposed network, we used the color peak signal-to-noise ratio (CPSNR) and the structural similarity index (SSIM) as measurement standards for the demosaicing results. The CPSNR value is calculated as
$$\mathrm{CPSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{3HW} \sum_{t=1}^{3} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x(h, w, t) - y(h, w, t) \right)^2}, \tag{11}$$
where x(h, w, t) and y(h, w, t) represent the pixel values at position (h, w) of the ground truth image x and the demosaiced image y for the t-th color channel, respectively. H and W represent the height and width of the image. The SSIM measures the similarity between two images and is defined as
$$\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \tag{12}$$
where $\mu_x$ and $\sigma_x$ represent the mean intensity and the standard deviation of the ground truth image x, $\mu_y$ and $\sigma_y$ represent the mean intensity and the standard deviation of the demosaiced image y, and $\sigma_{xy}$ is the covariance between x and y. $C_1$ and $C_2$ are two constants used to keep the equation balanced and stable, which are usually set as $C_1 = (K_1 K)^2$ and $C_2 = (K_2 K)^2$, with $K_1 = 0.01$, $K_2 = 0.03$, and $K = 255$.
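Both metrics can be sketched directly from their definitions. The code below is a minimal illustration, not the evaluation code of the paper: it assumes 8-bit images with a peak value of 255, and `global_ssim` evaluates the SSIM statistics once over the whole image, whereas common implementations average the index over local windows. The function names are hypothetical.

```python
import numpy as np

def cpsnr(truth, estimate, peak=255.0):
    """Colour PSNR in dB: the MSE is averaged over all pixels and channels."""
    diff = np.asarray(truth, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def global_ssim(x, y, k1=0.01, k2=0.03, peak=255.0):
    """SSIM computed from whole-image statistics (no sliding window)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
            / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))
```

For both metrics a larger value indicates a reconstruction closer to the ground truth: CPSNR grows without bound as the error vanishes, and SSIM reaches its maximum of 1 for identical images.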

Image Demosaicing Test.
In this section, we evaluate the effectiveness of the proposed method by comparing different demosaicing methods. The methods used for comparison are the Bilinear [22], TL [25], HQL [23], Zhang's [21], LDI-NAT [27], ARI [29], MLRI [28], RI-GBTF [30], and DnCNN [31] methods, as well as the proposed method. Table 2 shows the CPSNR and SSIM of the test results for the Kodak database from different methods. Figure 8 shows the corresponding box plots of the CPSNR and SSIM for easier comparison. Table 3 shows the CPSNR and SSIM of the test results for the McMaster database from different methods. Similarly, Figure 9 shows the corresponding box plots. We can see that the proposed method shows higher CPSNR and SSIM values, which means better performance compared with the other methods.
For additional comparison, Figure 10 shows the reconstructed images using different methods for the 19th image in the Kodak database. The marked portion of the image within the black box (the fence) is enlarged for clearer comparison. It can be seen that this part has obvious vertical textures and is prone to artifacts. Residual images (i.e., the difference between the ground truth image and the demosaiced images) for the enlarged portion are also shown for easier comparison. From the reconstructed images and the residual images, we can see that, compared with other methods, the proposed method can more effectively suppress the artifacts, especially at tiny edges and angle areas. Figure 11 shows the reconstructed images using different methods for the 22nd image in the Kodak database. The marked portion in the black box (the window) is enlarged. It can be seen that this part is prone to color stripes and zippering. The residual images are also shown for easier comparison. From the reconstructed images and the residual images, we can see that most methods obtain satisfactory results for the smooth areas, while wrong colors may appear at the edges. From the comparison, we can see that the results of the proposed method show relatively fewer artifacts and color stripes. Figures 12 and 13 show the reconstructed images using different methods for the 1st and 12th images in the McMaster database, respectively. Similarly, the marked portions in the images are enlarged and the residual images for the enlarged portions are shown for clear comparison. It can be seen that, compared with other methods, the proposed method can better recover the images with fewer artifacts, especially at tiny edges, which demonstrates the validity and performance of the proposed method.

Discussion
In this work, we proposed a new method for image demosaicing based on GAN, which aims to more effectively reconstruct the full-color image. One of the challenges of this task is the recovery of the high-frequency information in the image, such as edges and angles. Many related algorithms have a strong ability to process the smooth parts of the image; however, artifacts, zippering, and color stripes still exist in the high-frequency parts. In the current work, we redesigned the generator and the discriminator of the GAN and combined the adversarial loss, the feature loss, and the pixel loss to further improve the network performance. Numerical experiments showed that the proposed algorithm can effectively reduce the artifacts at the edges and produce near-real reconstructed images, which can be the basis for subsequent image processing, such as image recognition and image transmission. The proposed method can produce better recovered color images; however, the learning-based strategy is relatively time-consuming in the training phase. Therefore, improving the efficiency of the network training is an important direction for further enhancing the performance of the learning-based technology. In practice, there are many kinds of CFA patterns. We used the Bayer pattern in this paper. Different CFA patterns may have different impacts on the reconstructed image, so we will use CFA images with different designs to test the network in the near future. In the current work, we assumed that the CFA images are noiseless. However, images from cameras in practice may be affected by noise. Therefore, we will try combining image demosaicing and denoising in the future. The current work focuses on directly generating the demosaiced images using a neural network. We will also test combinations of traditional demosaicing algorithms and the neural network in the future.

Conclusions
In this paper, we proposed an image demosaicing method based on GAN. The generator is designed by using the improved U-net architecture to directly generate the demosaiced images. For the discriminator, we used the dense residual network including dense residual blocks with long jump connections and dense connections to overcome the problem of gradient disappearance and gradient dispersion during the network training, which can improve the discriminant ability of the network. In addition, we combined the adversarial loss, the pixel loss, and the feature loss together to improve the loss function. The network was trained using images from the Waterloo Exploration database and the trained network was tested with the Kodak database and the McMaster database. Comparisons among different image demosaicing methods showed that the proposed method can better eliminate artifacts in the reconstructed image and can especially better restore high-frequency features, such as edges and angles of the image.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request. The data used to support the findings of this study are open datasets which could be found in general websites, and the datasets are also freely available.