A Generative Adversarial Network with Dual Discriminators for Infrared and Visible Image Fusion Based on Saliency Detection

Infrared and visible image fusion needs to preserve both the salient target of the infrared image and the texture details of the visible image. Therefore, an infrared and visible image fusion method based on saliency detection is proposed. Firstly, the saliency map of the infrared image is obtained by saliency detection. Then, a specific loss function and network architecture are designed based on the saliency map to improve the performance of the fusion algorithm. Specifically, the saliency map is normalized to [0, 1] and used as a weight map to constrain the loss function. At the same time, the saliency map is binarized to extract salient regions and nonsalient regions, and a generative adversarial network with dual discriminators is constructed. The two discriminators are used to distinguish the salient regions and the nonsalient regions, respectively, which pushes the generator to produce better fusion results. The experimental results show that the fusion results of our method are better than those of existing methods in both subjective and objective aspects.


Introduction
Image fusion aims to utilize complementary information of two source images to synthesize a fused image with a more comprehensive understanding of the scene [1,2]. The infrared image can identify the target according to thermal radiation contrast, and the visible image can provide a clear image in line with the human visual system [3,4]. Due to these characteristics, the fusion result of the infrared and visible images can preserve the significant target of the infrared image and the texture detail of the visible image simultaneously [5,6]. Infrared and visible image fusion has been widely used in many fields, such as target recognition, video surveillance, and scene understanding [7][8][9].
The key to image fusion is to integrate the effective information and remove the redundant information of the source images to obtain a better fused image [10,11]. For this purpose, a large number of infrared and visible image fusion methods have been proposed. These methods can be divided into two categories: (i) traditional methods, which usually complete the fusion task based on mathematical transformation and manual design; (ii) deep learning-based methods, which usually use a specific loss function to optimize a neural network to obtain a fusion result [12].
Although the abovementioned methods can complete the fusion task successfully, several aspects still need to be improved. Firstly, manually designed fusion rules make the traditional methods more complex and time-consuming. Secondly, some methods apply deep learning only in part of the fusion process, which makes it difficult to give full play to the advantages of deep learning [13]. Thirdly, due to the lack of ground truth, it is difficult for GAN-based methods to determine the input of the discriminator. Existing methods usually follow one of two technical routes to solve this problem: (i) using one source image as the input of the discriminator, which inevitably leads to the gradual loss of information from the other source image [14]; (ii) using a generative adversarial network with dual discriminators, which takes both source images as input. However, in this scheme it is difficult to control the balance between the two discriminators [15].
To address these challenges, this paper proposes a generative adversarial network with dual discriminators for infrared and visible image fusion based on saliency detection (SDGAN). Firstly, the proposed method is based on deep learning, which optimizes the network through specific loss functions and overcomes the increasing complexity caused by manually designed fusion rules. Secondly, to solve the problem of lacking ground truth, we use a generative adversarial network with dual discriminators to deal with the fusion problem. At the same time, to maintain the balance between the two discriminators, we introduce saliency detection into image fusion. The two discriminators take significant targets and nonsignificant targets as inputs, respectively, to ensure that they can work smoothly without conflict.

Infrared and Visible Image Fusion
The traditional fusion method can be divided into three steps: feature extraction, feature fusion, and feature reconstruction [16]. As feature reconstruction is usually the inverse of feature extraction, the key steps of traditional methods are feature extraction and feature fusion. By employing different strategies for these two steps, a large number of fusion methods have been proposed.
Although these three steps cover most traditional methods, some methods do not fit this pipeline, such as GTF, which is based on gradient transfer and total variation minimization [23].

Deep Learning-Based Methods.
Although the traditional fusion methods can produce a satisfactory fused image, they are generally very complex and time-consuming due to the manually designed fusion rules. With the rise of deep learning, more and more fusion methods based on deep learning have been proposed.
Li and Wu [24] employed an encoder/decoder network architecture and introduced densely connected convolution layers in the encoder to extract the features of the source image without losing information in the convolution process. Yang et al. [25] proposed a fusion model based on visual saliency sparse representation and detail injection to avoid the loss of significant thermal radiation targets of infrared images. Zhang et al. [26] proposed an image fusion network based on proportional maintenance of gradient and intensity, named PMGI, which can preserve source image information through a gradient path and an intensity path. With the rise of generative adversarial networks, Feng et al. [11] applied a GAN to the image fusion problem in a method named FusionGAN. Subsequently, its variant [27] was proposed, which introduces a target-enhancement loss to enhance edge details of the fused image. However, these methods force the fused image to obtain more details from the visible image as the adversarial game proceeds, so the thermal information of the infrared image is gradually lost. To address this issue, Ma et al. [15] introduced dual discriminators into the GAN to avoid excessive loss of information from either source image.

Saliency Detection.
The human visual system focuses on important regions of an image, which helps humans easily obtain important information. Saliency detection aims to simulate the human visual system to extract the significant regions of the image and to prioritize computing resources for those regions in subsequent processing.
Itti et al. [28] first combined multiscale features to get an initial saliency map and then used a neural network to optimize the initial saliency map to get the final result. Hou and Zhang [29] extracted the spectral residual of an image in the spectral domain by analyzing the log spectrum of the input image and proposed a fast method to construct the corresponding saliency map in the spatial domain. Cheng et al. [30] proposed a saliency detection method based on regional contrast, simultaneously evaluating global contrast differences and spatial coherence. Traditional saliency detection methods mainly rely on manual extraction of features and then combine these features to obtain a saliency map. Vig et al. [31] proposed an entirely automatic data-driven method that performs a large-scale search for optimal features to obtain a saliency map. Kümmerer et al. [32] were the first to use deep networks for saliency detection, reusing existing networks pretrained on object recognition in models of fixation prediction. Since then, a large number of saliency detection methods based on neural networks have been proposed and have achieved good results.

Problem Formulation.
The infrared image can highlight the target by the difference of thermal radiation. In comparison, the visible image contains richer texture details. Infrared and visible image fusion can retain the highlighted target of the infrared image and the texture details of the visible image simultaneously. Saliency detection can extract highlighted targets of the image. Therefore, introducing saliency detection into infrared and visible image fusion can improve the performance of the image fusion algorithm.

Mathematical Problems in Engineering
For a given infrared image $I_r$, the saliency value $S(k)$ of pixel $k$ is obtained from the distance between pixel $k$ and all other pixels $i$ in the image, as defined in equation (1). The saliency map $S$ of the infrared image $I_r$ is obtained by computing the saliency value of every pixel. The weight map $w$ is then obtained by normalizing all values of the saliency map $S$ to the interval [0, 1]; it is used to constrain the fusion weights of different targets in the loss function. Finally, the saliency map $S$ is binarized to extract the salient region of the image. Specifically, the saliency map $S$ is normalized pixel by pixel; if the normalized value of a pixel is greater than the threshold $b$, the corresponding pixel of the mask $m$ takes the value 1, and otherwise it takes the value 0. In this paper, $b$ is set to 0.25. The mask $m$ is complete once all pixels have been processed. Figure 1 shows the saliency maps and masks of two typical infrared images. The saliency map detects the significant target of the infrared image, and the mask marks the significant area.
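The three steps described here (saliency value, weight map, mask) can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation. It assumes that the distance between pixels is the absolute gray-level difference summed over all pixels, that the normalization to [0, 1] is min-max scaling, and that the input is an 8-bit grayscale image. The function names `saliency_map`, `weight_map`, and `binary_mask` are hypothetical.

```python
import numpy as np

def saliency_map(ir):
    """Global-contrast saliency: S(k) = sum_i |I(k) - I(i)| (assumed distance)."""
    ir = np.asarray(ir, dtype=np.uint8)
    hist = np.bincount(ir.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256, dtype=np.float64)
    # dist[g] = sum over gray levels h of hist[h] * |g - h|, so each pixel's
    # saliency is a table lookup instead of a loop over all other pixels.
    dist = np.abs(levels[:, None] - levels[None, :]) @ hist
    return dist[ir]

def weight_map(s):
    """Min-max normalisation of the saliency map to [0, 1] (assumed form)."""
    s = np.asarray(s, dtype=np.float64)
    span = s.max() - s.min()
    return np.zeros_like(s) if span == 0 else (s - s.min()) / span

def binary_mask(w, b=0.25):
    """1 where the normalised saliency exceeds b, else 0 (b = 0.25 in the text)."""
    return (np.asarray(w) > b).astype(np.float64)

# Toy 4x4 "infrared" patch: one hot pixel on a dark background.
ir = np.zeros((4, 4), dtype=np.uint8)
ir[1, 2] = 200
w = weight_map(saliency_map(ir))
m = binary_mask(w)
```

In this toy patch the single hot pixel receives the largest saliency value, so it is the only pixel kept by the mask. The histogram lookup is a design choice for speed; a direct double loop over pixels computes the same values.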
Given an infrared image and a visible image, the goal of image fusion is to obtain a generator constrained by the source images. The fused image produced by the generator should retain the salient target of the infrared image and the texture details of the visible image at the same time.
This paper proposes a generative adversarial network with dual discriminators for infrared and visible image fusion based on saliency detection, named SDGAN. The entire procedure of the proposed SDGAN is shown in Figure 2. The infrared image and the visible image are input to the generator $G$ to obtain an initial fused image. However, it is difficult to obtain a satisfactory fused image with the generator $G$ alone; therefore, two discriminators $D_r$ and $D_v$ are introduced to establish adversarial games with the generator, through which $G$ learns to generate a better fused image. The discriminator $D_r$ distinguishes the salient regions of the fused image from those of the infrared image. The discriminator $D_v$ distinguishes the nonsalient regions of the fused image from those of the visible image. The salient region of an image is obtained by multiplying the image by the mask $m$ pixel by pixel, and the nonsalient region is obtained by multiplying the image by $(1 - m)$ pixel by pixel. Because the two discriminators operate on complementary regions of the source and fused images, they can complete their own tasks independently without conflict. The goal of the generator $G$ is to synthesize a fused image for which neither discriminator can tell whether its input comes from the fused image or from the source image. Mathematically, the generator $G$ is trained to minimize the objective in equation (4), and the discriminators $D_r$ and $D_v$ are trained to maximize it. In equation (4), $\odot$ represents the Hadamard product, $G$ represents the generator, $D_r$ and $D_v$ represent the two discriminators, $m$ represents the mask, which is used to extract the salient area of the image, and $(1 - m)$ is used to extract the nonsalient area of the image.
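Equation (4) is not reproduced in this text. For orientation only, a dual-discriminator objective consistent with the description above could take the standard GAN form below. The logarithmic form and the equal weighting of the two discriminator terms are assumptions, and the symbols $I_v$ (the visible image) and $I_f = G(I_r, I_v)$ (the fused image) are notation introduced here for the sketch.

```latex
\min_{G}\,\max_{D_r,\,D_v}\;
\mathbb{E}\big[\log D_r(m \odot I_r)\big]
+ \mathbb{E}\big[\log\big(1 - D_r(m \odot I_f)\big)\big]
+ \mathbb{E}\big[\log D_v\big((1 - m) \odot I_v\big)\big]
+ \mathbb{E}\big[\log\big(1 - D_v\big((1 - m) \odot I_f\big)\big)\big],
\qquad I_f = G(I_r, I_v)
```

Under this form, $D_r$ only ever sees pixels selected by $m$ and $D_v$ only sees pixels selected by $(1 - m)$, which is what lets the two discriminators work on complementary regions.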

Loss Function.
The original GAN is prone to artifacts, noise, and other incomprehensible results in the generated image because its training process is unstable. A common way to make training more stable is to introduce a content loss. To further improve the quality of the fused image, this paper introduces an enhancement loss $L_{enh}$ in addition to the content loss $L_{con}$. The generator loss $L_G$ therefore consists of three parts: the content loss $L_{con}$, the enhancement loss $L_{enh}$, and the adversarial loss $L_{adv}$, with $\lambda_1$ and $\lambda_2$ controlling the tradeoff among them.
The content loss $L_{con}$ constrains the similarity in content between the fused image and the source images. It consists of two parts, the gradient loss $L_{grad}$ and the intensity loss $L_{int}$, and the coefficient $c$ maintains the balance between them. The gradient loss $L_{grad}$ preserves the texture details of the source images in the fused image. In its definition, $\nabla$ represents the gradient operator, which extracts the gradient of an image, $\|\cdot\|_2$ represents the Euclidean norm, and $\xi_1$ balances the two terms. The intensity loss $L_{int}$ constrains the fused image to have an intensity distribution similar to that of the source images, with $\xi_2$ controlling the tradeoff between its terms.
The enhancement loss $L_{enh}$ is mainly used to enhance the highlighted targets and the texture details. In its definition, $w$ represents the weight map, which controls how strongly significant targets are retained in the fused image, $(1 - w)$ controls how strongly nonsignificant targets are retained, and $\xi_3$ balances the two terms.
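The loss equations themselves are missing from this text, so the NumPy sketch below shows only one plausible instantiation of the content and enhancement losses. Everything about the exact form is an assumption: the mean squared difference stands in for the squared Euclidean norm, the pairing of each term with the infrared or visible image is a guess, and the function names are hypothetical. The parameters `c`, `xi1`, `xi2`, and `xi3` mirror the balance coefficients named in the text, with placeholder default values.

```python
import numpy as np

def grad(img):
    """Gradient magnitude from central differences (stand-in for the nabla operator)."""
    gy, gx = np.gradient(np.asarray(img, dtype=np.float64))
    return np.hypot(gx, gy)

def sq_norm(x):
    """Mean squared value, a size-normalised squared Euclidean norm."""
    return float(np.mean(np.asarray(x, dtype=np.float64) ** 2))

def content_loss(fused, ir, vis, c=1.0, xi1=1.0, xi2=1.0):
    """L_con = L_grad + c * L_int (assumed combination)."""
    l_grad = sq_norm(grad(fused) - grad(vis)) + xi1 * sq_norm(grad(fused) - grad(ir))
    l_int = sq_norm(fused - ir) + xi2 * sq_norm(fused - vis)
    return l_grad + c * l_int

def enhancement_loss(fused, ir, vis, w, xi3=1.0):
    """Weight-map version: w emphasises infrared intensity, (1 - w) visible texture."""
    l_salient = sq_norm(w * (fused - ir))
    l_nonsalient = sq_norm((1.0 - w) * (grad(fused) - grad(vis)))
    return l_salient + xi3 * l_nonsalient
```

With a weight map that is 1 everywhere, `enhancement_loss` depends only on how closely the fused image matches the infrared intensities. How these terms are scaled by $\lambda_1$ and $\lambda_2$ in the full generator loss is not recoverable from the text and is left out of the sketch.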
The adversarial loss $L_{adv}$ comes from the game between the generator and the discriminators. In its definition, $m$ represents the mask, which is used to extract the significant region of the fused image, and $(1 - m)$ is used to extract the nonsignificant region of the fused image.
To make the generator converge smoothly, two discriminators $D_r$ and $D_v$ are used to construct the adversarial relationship between the generator and the discriminators. Each of the two discriminators $D_r$ and $D_v$ has its own loss function.

Generator Architecture.
As shown in Figure 3, a dual-encoder-single-decoder structure is used in the generator. Two encoders extract the features of the two source images, respectively. Each encoder path adopts a four-layer network architecture for feature extraction. All convolution kernel sizes are set to 3 × 3 and all strides are set to 1. Batch normalization and the ReLU activation function are used to avoid vanishing gradients and speed up network convergence. Moreover, dense connections are employed in each encoder path to realize feature reuse [33]. For the decoder, the outputs of the two encoders are concatenated as the input to reconstruct the fused image. The decoder also adopts a four-layer network architecture with a convolution kernel size of 3 × 3, and it contains batch normalization and LReLU activation functions.
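The effect of dense connections on layer widths can be illustrated with a small calculation. The text does not give channel counts, so the values below (one input channel, 16 output channels per layer) are hypothetical, as is the helper name `dense_encoder_channels`. The sketch also assumes that each layer receives the concatenation of the path input and all earlier layer outputs, and that the encoder output is the concatenation of its four layer outputs.

```python
def dense_encoder_channels(in_channels=1, growth=16, num_layers=4):
    """Input channel count of each layer in one densely connected encoder path.

    Layer k receives the path input plus the outputs of all earlier layers,
    so its input width grows by `growth` channels per layer.
    """
    inputs = [in_channels + k * growth for k in range(num_layers)]
    out_channels = num_layers * growth  # concatenated layer outputs (assumed)
    return inputs, out_channels

inputs, enc_out = dense_encoder_channels()
decoder_in = 2 * enc_out  # the two encoder outputs are concatenated
```

Because every stride is 1, the spatial size is unchanged along the encoder, which is what makes channel-wise concatenation between layers possible.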

Discriminator Architecture.
Two discriminators $D_r$ and $D_v$ are used to establish the adversarial game with the generator and push the generator to produce more realistic and detailed images by distinguishing their inputs. The discriminator $D_r$ judges whether the significant targets it receives come from the fused image or the infrared image, and the discriminator $D_v$ judges whether the nonsignificant targets it receives come from the fused image or the visible image. The two discriminators $D_r$ and $D_v$ have the same architecture, but they do not share parameters. The network architecture is shown in Figure 4. The first four layers are convolutional layers with a kernel size of 3 × 3 and the LReLU activation function. The last layer is a linear layer with the tanh activation function, which generates a scalar that estimates the probability that the input is from real data. The stride of all convolutions is set to 2.
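The spatial size of the feature maps in the discriminator can be traced with the standard convolution output formula. The padding is not stated in the text, so `padding=1` is an assumption, and the 120 × 120 input is the training patch size given in the training details.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 120  # side length of a training patch
sizes = []
for _ in range(4):  # four stride-2 convolutional layers
    size = conv_out(size)
    sizes.append(size)
# sizes == [60, 30, 15, 8]
```

Under these assumptions the linear layer would receive an 8 × 8 feature map per channel; with no padding the sizes would be smaller.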

Dataset and Training Details.
The training dataset comes from the public infrared and visible dataset TNO, which is the most commonly used dataset in infrared and visible image fusion tasks. 28 images are selected from TNO to train the model in this paper; however, 28 images alone are not enough to train a good model. Therefore, a cropping strategy is used to expand the training dataset, and each image is cropped into image patches of size 120 × 120. In total, 23364 image patches are available to train the model.
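A sliding-window crop of this kind can be sketched as follows. The stride is not given in the text, so `stride=14` and the 148 × 162 test image are hypothetical values chosen only to keep the example small, and `crop_patches` is a hypothetical helper name.

```python
import numpy as np

def crop_patches(img, patch=120, stride=14):
    """Slide a patch x patch window over a 2-D image with the given stride."""
    h, w = img.shape
    return np.stack([
        img[r:r + patch, c:c + patch]
        for r in range(0, h - patch + 1, stride)
        for c in range(0, w - patch + 1, stride)
    ])

img = np.zeros((148, 162))    # hypothetical image size
patches = crop_patches(img)   # 3 row offsets x 4 column offsets
```

A smaller stride yields more, heavily overlapping patches, which is how a few dozen source images can expand into tens of thousands of training samples.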
In the training process, the generator takes 32 pairs of infrared and visible image patches as input at a time. Next, 32 pairs of the salient areas of the infrared and fused image patches are used as the input of the discriminator $D_r$. Simultaneously, 32 pairs of the nonsalient areas of the visible and fused image patches are used as the input of the discriminator $D_v$. We first train the discriminators once and then train the generator, repeating until the maximum number of training iterations is reached. All the parameters of our model are updated by the Adam optimizer [34] with a learning rate of $10^{-4}$.
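How one batch is split into the inputs of the two discriminators can be sketched with element-wise products. The random arrays below are stand-ins for real patches and masks, and `discriminator_inputs` is a hypothetical helper name; only the batch size of 32 and the patch size of 120 × 120 come from the text.

```python
import numpy as np

def discriminator_inputs(ir, vis, fused, m):
    """Split a batch into the region pairs seen by D_r and D_v."""
    d_r = (m * ir, m * fused)                # salient regions: real, fake
    d_v = ((1 - m) * vis, (1 - m) * fused)   # nonsalient regions: real, fake
    return d_r, d_v

rng = np.random.default_rng(0)
ir, vis, fused = rng.random((3, 32, 120, 120))          # stand-in image batches
m = (rng.random((32, 120, 120)) > 0.5).astype(np.float64)  # stand-in binary masks
(d_r_real, d_r_fake), (d_v_real, d_v_fake) = discriminator_inputs(ir, vis, fused, m)
```

Because the mask is binary, the two fake inputs add up to the full fused image and never overlap, which reflects the complementary roles of the two discriminators.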

Compared Methods and Objective Indexes.
To evaluate the performance of our method, we make qualitative and quantitative comparisons with existing advanced methods. We compare our method with five existing methods, including three traditional methods, i.e., LP [35], DTCWT [36], and FPDE [37], and two deep learning-based methods, i.e., FusionGAN [14] and DenseFuse [24]. All traditional methods run on the same i7-7700K CPU, while the deep learning-based methods run on the same GTX 1080 Ti GPU. All comparison methods are implemented based on public code with default parameters.
Although qualitative comparison can measure the performance of a method to a certain extent, it is easily affected by human subjectivity. In this paper, quantitative comparisons are therefore also used to evaluate our method more comprehensively.
Entropy (EN) is a common statistical image feature that reflects the amount of information obtained from the infrared and visible images. It is computed from the histogram of the fused image, where $L$ denotes the number of gray levels of the image and $p_l$ is the normalized histogram value for the gray-scale value $l$ in the fused image. Standard deviation (SD) represents the dispersion of the image gray-scale values relative to the average gray-scale value. In its definition, $F(i, j)$ is the pixel value of the fused image $F$ in the $i$-th row and the $j$-th column, $M \times N$ denotes the size of the fused image $F$, and $\mu_F$ is the average pixel value of the fused image. Structural similarity (SSIM) models image loss and distortion from three aspects: loss of correlation, luminance distortion, and contrast distortion. The product of the three components is the evaluation result of the fused image. In its definition, $x$ and $y$ represent two source images, $\mu$ denotes the mean value, $\sigma$ represents the standard deviation or covariance, and $C_1$, $C_2$, and $C_3$ are parameters that keep the metric stable.
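The first two metrics can be sketched in NumPy from the quantities named in the text. The sketch assumes 8-bit images ($L = 256$), the usual base-2 logarithm for entropy, and a standard deviation normalized by the pixel count $M \times N$; the normalization is an assumption, since the defining equation is not reproduced here. The function names are hypothetical.

```python
import numpy as np

def entropy(img, levels=256):
    """EN = -sum_l p_l * log2(p_l) over the normalised gray-level histogram."""
    hist = np.bincount(np.asarray(img, dtype=np.uint8).ravel(), minlength=levels)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing (0 * log 0 is taken as 0)
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img):
    """SD = sqrt(mean((F - mu_F)^2)); the 1/(M*N) normalisation is assumed."""
    f = np.asarray(img, dtype=np.float64)
    return float(np.sqrt(np.mean((f - f.mean()) ** 2)))

# Toy image with four equally frequent gray levels.
img = np.array([[0, 64], [128, 192]], dtype=np.uint8)
en = entropy(img)             # 2 bits: four equally likely levels
sd = standard_deviation(img)
```

A constant image has zero entropy and zero standard deviation, so both metrics grow with the spread of gray levels, which is why the text reads higher values as more information and higher contrast.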

Qualitative Comparisons.
Qualitative comparisons mainly evaluate the performance of the method according to the human visual system. In this paper, three typical infrared and visible image pairs are used to evaluate the method. The experimental results are shown in Figure 5.
From top to bottom in Figure 5, the three rows are the fusion results of Kaptein_1654, Marne_02, and soldier_in_trench_2. From left to right, the first two columns in Figure 5 present the original infrared images and visible images. The last column shows the fusion results of the proposed SDGAN, and the remaining columns correspond to the fusion results of LP, DTCWT, FPDE, FusionGAN, and DenseFuse.
As shown in Figure 5, all methods can complete the fusion task, but each comparison method mainly retains the information of only one source image. For example, the fusion result of FusionGAN better retains the significant target of the infrared image but loses a lot of texture details of the visible image. Conversely, although the fusion results of LP, DTCWT, FPDE, and DenseFuse retain the texture details of the visible image, the significant targets of the infrared image are not so prominent. In contrast, the fusion results of our SDGAN highlight significant targets and retain texture details of visible images at the same time. For example, the human in Kaptein_1654 and in soldier_in_trench_2 is significantly brighter than in the results of the comparison methods, and the texture details on the wall in Marne_02 are clearer than in the other methods.

Quantitative Comparisons.
Qualitative comparisons can hardly avoid the influence of subjective human judgment. To evaluate our SDGAN more comprehensively, quantitative comparisons are also employed. Using only one objective metric to evaluate a fusion method is not comprehensive. Therefore, entropy (EN), standard deviation (SD), and structural similarity (SSIM) are used to evaluate the methods in this paper. Quantitative comparisons are performed on 32 image pairs. The results are shown in Table 1. The proposed SDGAN achieves the best results on all three metrics. The best entropy shows that the fusion result of our SDGAN obtains the most information from the source images, which indicates that the fusion method in this paper is effective and retains rich source image information. The largest standard deviation shows that the fused image of our method has higher contrast, which indicates that, through saliency detection, the fused image retains more intensity information of the infrared image in the significant area and more texture details of the visible image in the nonsignificant area. The best structural similarity shows a strong correlation between the fusion results of SDGAN and the source images, and the fused image does not have serious distortion, which indicates that the proposed fusion method retains more information from the source images. The average running times of LP, DTCWT, FPDE, FusionGAN, DenseFuse, and the proposed SDGAN are presented in Table 2. The average running time of the proposed method is second only to that of LP, indicating that the method improves the quality of the fused image without sacrificing efficiency.

Ablation Experiment.
To generate high-quality fused images, two discriminators are employed in our network. We conducted ablation experiments to verify the role of the two discriminators by removing all the discriminators. The comparison results are given in Figure 6. Our SDGAN better preserves the texture details of the visible image while preserving the significant targets of the infrared image; for example, it better preserves the details of the shrubs in the first row.

Conclusions
In this paper, an infrared and visible image fusion method based on saliency detection is proposed. The saliency map of the infrared image is extracted through saliency detection and is employed both in the loss function used to train the model and in the network architecture. On this basis, we construct a generative adversarial network with dual discriminators. The saliency map divides the image into significant regions and nonsignificant regions, and the dual discriminators are used to identify them, respectively. Through the adversarial game, the generator can generate more realistic fused images with highlighted targets and rich texture details. Qualitative and quantitative experiments show that the proposed SDGAN achieves the intended effect: the fused image retains both the salient target and rich texture details.

Data Availability
The dataset used to support the findings of this study is included within the open data collection at https://figshare.com/articles/TNO20Image20Fusion20Dataset/%201008029.

Conflicts of Interest
The authors declare that they have no conflicts of interest.