Study on the Method of Fundus Image Generation Based on Improved GAN

With the continuous development of deep learning, the performance of intelligent diagnosis systems for ocular fundus diseases has improved significantly. During training, however, problems such as the lack of fundus samples and uneven sample distribution (the number of disease samples is much smaller than the number of normal samples) have become increasingly prominent. To address these issues, this paper proposes a method for generating fundus images based on a “Combined GAN” (Com-GAN), which can generate both normal fundus images and fundus images with hard exudates, so that the sample distribution becomes more even while the fundus data are expanded. First, existing images are used to train a Com-GAN, which consists of two subnetworks: im-WGAN and im-CGAN. The trained model is then used to generate fundus images, which are evaluated qualitatively and quantitatively and added to the original image set to expand the datasets. Finally, the hard exudate detection system is trained on this expanded training set. The expanded datasets effectively improve the generalization ability of the system on the public datasets DIARETDB1 and e-ophtha EX, thereby verifying the effectiveness of the proposed method.


Introduction
With the continuous development of deep learning, it has been widely applied in the medical field, and the performance of the corresponding intelligent medical diagnosis systems has improved significantly, but many problems remain. A hard exudate detection system needs a large number of labelled images for training, but in practice fundus images are difficult to obtain (acquiring them requires professional medical cameras to photograph human eyes) and the sample distribution is uneven (the number of diseased samples is much smaller than the number of normal samples). For the problem of uneven sample distribution [1,2], data-level solution strategies can be roughly divided into three types. The first type is data augmentation, which includes traditional methods such as flipping, scaling, cropping, and adding noise. There are also advanced data augmentation methods [3][4][5] such as Sample Pairing [3], which uses two images to synthesize a new sample. This type of method can indeed alleviate the shortage of positive samples, but it is limited in scalability and relies too heavily on existing datasets. The second type is oversampling [6][7][8] and undersampling [9][10][11]. Oversampling expands the minority class samples (called positive samples) so as to raise their proportion to a normal level. Examples include random oversampling [6], the SMOTE method [7], and integrated oversampling [8]. These methods can improve sensitivity, but because the positive samples lack diversity, they can easily cause overfitting [12], so they are usually combined with data augmentation. Undersampling discards majority class samples (called negative samples). For example, Ng et al. [9] clustered negative samples to obtain their distribution information, so as to select representative samples and discard the others. This type of method has serious drawbacks: it not only loses some of the negative sample features but also often leaves too few samples for training. Finding an effective sampling strategy for different samples is also a challenging problem. The third type is artificial data synthesis, for example with VAE [13], which can generate low-resolution images. The Generative Adversarial Network (GAN) [14], since it was proposed by Ian Goodfellow, has shown strong generation capabilities in the field of image generation [15][16][17] and has been widely used to augment datasets. In theory, a GAN can learn the distribution of the existing data and then generate samples with the same distribution as the original data. The method proposed in this paper belongs to the third type.
Fundus image acquisition is expensive and involves patient privacy, which makes it a difficult subject for public research. To bring private medical data into the public domain and alleviate problems such as insufficient fundus image data and uneven sample distribution, many researchers have applied GANs to expand fundus image datasets. There have been many successful cases, but problems remain, such as loss of details, mode collapse [18,19], and unstable training [20,21]. These cases can be divided into two categories. The first is based on unsupervised GANs and their improved models [22][23][24][25]. In theory, such models can generate rich image data, but the actual training process is still very difficult and suffers from serious mode collapse; images can only be generated randomly, with poor controllability. Guibas et al. proposed a fundus angiography image generation method [26] that effectively improves the quality and diversity of the generated images but still suffers from loss of details and cannot generate images with corresponding labels. The other category modifies the unsupervised GAN into CGAN [27] and pix2pix [28]. The most representative example is the method for generating fundus angiography images with diseased tissues proposed by Appan et al. [29]. This method significantly improves the quality of image generation but relies too heavily on existing datasets, so it is difficult to generate rich fundus images with it.
To overcome the shortcomings of the previous methods, this paper proposes a fundus image generation method based on Com-GAN: first, im-WGAN is used to generate a vascular tree, and then im-CGAN is used to generate a complete image. Experiments show that the model combines the advantages of the two approaches and performs better than either alone. Compared with unsupervised GANs, the proposed approach is more controllable and the quality of the generated images is significantly improved; compared with the supervised CGAN model, its two-step generation improves sample diversity and produces richer images. The generated images are then added to the training set of the hard exudate detection [30,31] model. Compared with images generated by other methods, the images generated by the proposed method greatly improve the generalization ability of the model and effectively alleviate the problems of insufficient samples and uneven distribution. The main contributions of this paper are as follows:
(1) A two-step method for generating fundus images based on Com-GAN is proposed, which incorporates the advantages of two adversarial networks. Unlike direct generation, this method first uses im-WGAN to generate a vascular tree [32], which reduces the difficulty of fundus image generation, ensures quality, and increases diversity.
(2) The original CGAN network is improved by introducing two generating conditions in the generator and the discriminator. The improved network can not only generate high-quality fundus images but also control the category of the generated images.
(3) Pixelwise mean squared error (pMSE) [33] and perceptual loss [34] are added to the original loss function, so as to retain the characteristics of the original images and improve the visual satisfaction of the generated images.

Related Studies
2.1. GAN. GAN is composed of two parts: generator G and discriminator D.
The generator takes random noise $z$ as input and learns the distribution of the training data $x$; the discriminator is similar to a classifier and is used to distinguish real data $x$ from generated data $G(z)$. The two networks are trained alternately. When the discriminator can no longer correctly classify the source of a sample, the generator and the discriminator reach Nash equilibrium [35]. The objective function of GAN is
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{1}$$
where $p_{data}(x)$ is the probability distribution of the real data, $p_z(z)$ is the probability distribution of the random noise, and $\mathbb{E}$ is the mathematical expectation.
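As a rough numerical illustration of the GAN objective, the sketch below estimates the value function from a handful of discriminator outputs. The function name `gan_value` and the sample scores are made up for this example; they are not taken from the paper's implementation.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical scores from a discriminator that separates the two sources well.
confident = gan_value([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])

# When the discriminator cannot tell the sources apart it outputs 0.5
# everywhere, and the value function equals -2 * log(2).
equilibrium = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

The discriminator tries to push this value up, while the generator tries to push it down towards the equilibrium value.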

2.2. WGAN. An important reason for the difficulty of GAN training is gradient vanishing [36]. When the discriminator is close to optimal and the generated data and the real data have negligible overlap, optimizing the objective function is equivalent to optimizing the Jensen-Shannon divergence [37] between the generated data and the real data. In this case, the Jensen-Shannon divergence is approximately constant and can no longer guide the training process. WGAN [23] uses the Wasserstein distance instead of the original loss function to solve the gradient vanishing problem. The objective function of WGAN is
$$\min_G \max_{\|f\|_L \le 1} \mathbb{E}_{x \sim p_{data}(x)}[f(x)] - \mathbb{E}_{z \sim p_z(z)}[f(G(z))],$$
where $f(x)$ is the discriminator function, which needs to satisfy the Lipschitz constraint [38].
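The argument about non-overlapping distributions can be checked on a toy case. The sketch below is an illustrative assumption, not part of the paper's experiments: it places the "real" and "generated" data as point masses on a one-dimensional grid and compares the Jensen-Shannon divergence with the Wasserstein-1 distance as the generated mass moves away.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Wasserstein-1 distance on a unit-spaced 1-D grid: sum of |CDF_p - CDF_q|."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=10):
    """Distribution with all probability at one grid position."""
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0)
for shift in (1, 4, 8):
    fake = point_mass(shift)
    print(shift, round(js_divergence(real, fake), 4), wasserstein_1d(real, fake))
```

For every non-zero shift the Jensen-Shannon divergence stays at log 2, so it carries no information about how far apart the two distributions are, whereas the Wasserstein distance grows with the shift and can therefore still guide training.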

2.3. CGAN. Because the training process of the original GAN is unstable and poorly controllable, CGAN was introduced. CGAN adds a condition variable $y$ to the input of both the generator and the discriminator to guide the generation process. The objective function of CGAN is
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))],$$
where $p_{data}(x)$, $p_z(z)$, and $\mathbb{E}$ have the same meaning as in formula (1) and $y$ represents the introduced condition variable, which can take any form.

Method
In this section, the first part introduces the overall framework of Com-GAN, and the latter two parts introduce the two building blocks.

Overall Structural Framework of the Model.
The fundus angiography image generation method based on Com-GAN proposed in this paper consists of two networks: the im-WGAN network for generating a vascular tree and the im-CGAN network for generating a complete fundus angiography image. Both networks are modified from their original versions to better suit fundus image generation. The overall framework is shown in Figure 1.
The fundus image is generated in two steps, and each step improves the sample diversity. The training process can be divided into two stages, described as follows. In the first stage, an image segmentation technique [39] is used to segment vascular trees from the existing fundus image set, and an im-WGAN is trained on the segmented vascular trees. After the model converges, a large number of vascular trees are generated with the trained im-WGAN generator, thereby expanding the vascular tree image set. In the second stage, the vascular tree segmented from each real image and the corresponding complete fundus image form a vascular tree-complete fundus image pair, and these pairs are used to train the im-CGAN proposed in this paper. This network is an improvement on CGAN, and its generator and discriminator are trained alternately until the model converges. Based on the expanded vascular tree image set, the trained im-CGAN generator is then used to generate a complete fundus image pair, which includes a normal fundus image and a fundus image containing hard exudates. The generated fundus images are added to the existing fundus image set to further expand the fundus datasets.

Im-WGAN.
Compared with the original GAN, WGAN has better training stability and is suitable for the generation "from nothing" studied in this paper. The purpose of this network is to generate a vascular tree image with fine details. However, because the vascular tree structure is complex, a conventional approach would inevitably lead to too many parameters in the network, which increases the amount of computation and also raises the risk of overfitting. To address these problems, this paper improves the model structure on the basis of WGAN. Im-WGAN includes two generators and two discriminators, with the overall structure shown in Figure 2.
The generator $G_1$ takes random noise $z$ as input and outputs a low-resolution vascular tree, and the discriminator $D_1$ takes a real low-resolution vascular tree or one generated by $G_1$ as input and outputs the probability that it is a real vascular tree. The generator $G_2$ takes the low-resolution vascular tree generated by $G_1$ as input and outputs a reconstructed high-resolution vascular tree, and the discriminator $D_2$ takes a real high-resolution vascular tree or one generated by $G_2$ as input and outputs the probability that it is a real vascular tree. A low-resolution image is 128 × 128 pixels, and a high-resolution image is 256 × 256 pixels. The overall generation process can be divided into two stages: the first stage relies on the generator $G_1$, sketches the basic outline of the image, and generates a low-resolution image with a simple vascular structure; the second stage fills in the details of the low-resolution image and generates a more realistic high-resolution image. Experiments show that the two-stage generation approach enhances the stability of the training process and improves the quality and diversity of the generated images. The generator $G_1$ is an improved version of the DCGAN [22] generator structure. The improvements include increasing the number of deconvolution layers and changing the number of final output channels to 1. The structure of generator $G_2$ is based on U-net [40], which consists of downsampling encoders and upsampling decoders. The downsampling encoders extract image features, and the upsampling decoders combine the information from each encoder layer with the upsampled input to restore detailed information and gradually restore the image accuracy.
Therefore, the generator $G_2$ includes downsampling encoders, residual blocks [41], and upsampling decoders, and the BN layer is added after the convolutional layer [42]; the residual blocks are used to increase the network depth. The specific structure of $G_2$ is shown in Table 1.
This paper uses the physical significance of the matrix spectral norm [43] to make the discriminator of the im-WGAN satisfy the Lipschitz constraint globally. Here, the physical significance of the matrix spectral norm is that the length of any vector after a matrix transformation is less than or equal to the length of the vector multiplied by the spectral norm of the matrix. The formula is as follows:
$$\frac{\|W(x + \delta) - Wx\|_2}{\|\delta\|_2} = \frac{\|W\delta\|_2}{\|\delta\|_2} \le \sigma(W),$$
where $\sigma(W)$ represents the spectral norm of the weight matrix $W$, $x$ represents the input vector of the layer, and $\delta$ represents the amount of change in $x$.
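The spectral norm bound can be illustrated with a short, dependency-free sketch. It estimates the spectral norm of a small, made-up weight matrix by power iteration and divides the matrix by it. This is only a hedged illustration of the idea behind spectral normalization [43], not the authors' code; in PyTorch the same idea is available as `torch.nn.utils.spectral_norm`.

```python
import math
import random

def matvec(w, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def transpose(w):
    return [list(col) for col in zip(*w)]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def spectral_norm(w, iterations=200):
    """Estimate the largest singular value of w by power iteration on w^T w."""
    wt = transpose(w)
    rng = random.Random(0)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(w[0]))]
    for _ in range(iterations):
        v = matvec(wt, matvec(w, v))
        n = norm(v)
        v = [x / n for x in v]
    return norm(matvec(w, v))

# Hypothetical 2 x 2 layer weights; the exact spectral norm is sqrt(45).
w = [[3.0, 0.0], [4.0, 5.0]]
sigma = spectral_norm(w)

# Dividing by sigma gives a matrix with spectral norm 1, so the linear map
# can stretch any input change delta by at most a factor of 1.
w_normalized = [[x / sigma for x in row] for row in w]
```

Applying this normalization to every layer bounds the Lipschitz constant of the layer-wise linear maps, which is the property the WGAN discriminator needs.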

Im-CGAN.
The purpose of this network is to generate two types of complete fundus images: normal fundus images and fundus images with hard exudates.
First, a category label $y$ is established for each real image to mark whether it contains hard exudates. Then, the vascular tree segmented from the real image and the corresponding complete fundus image form a vascular tree-complete fundus image pair.
An advantage of CGAN is that it can use labels to control the generation, which makes it well suited to the generation process in this paper. Therefore, this paper improves the model structure based on CGAN.
The overall structure of im-CGAN is shown in Figure 3. During training, the generator $G$ takes the segmented vascular tree and the label $y$ as input; its main purpose is to generate a complete fundus image. The discriminator $D$ takes the vascular tree, the label $y$, and either the generated image or the corresponding real image as input; its main purpose is to guide the generation process. During generation, the segmented vascular tree (or one generated by im-WGAN) and the category label of the image to be generated are input into the generator to obtain a complete fundus image. The generator uses an encoder-decoder structure with U-net skip connections and specifically includes 4 convolutional layers, 9 residual blocks, and 3 deconvolution layers.
To generate a high-resolution fundus image, a discriminator with a large receptive field is needed. With a conventional approach, the network capacity would have to be increased, which not only consumes too much memory but also easily causes overfitting.
Therefore, a multiscale discriminator based on dilated convolution is introduced to expand the receptive field of the discriminator with the same number of parameters. An example of dilated convolution is shown in Figure 4, which shows the sizes of the receptive field when a 3 × 3 convolution kernel takes different dilation rates, where " * " represents a parameter point and the shaded part represents the receptive field. Figure 4(a) is a normal convolution with a dilation rate of 1; Figure 4(b) corresponds to a dilation rate of 2; and Figure 4(c) corresponds to a dilation rate of 3. As can be seen, the receptive field expands as the dilation rate increases.
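The growth of the receptive field with the rate can be made concrete with the standard formula for the area spanned by a single dilated kernel, k + (k - 1)(d - 1). The helper name below is chosen for this example only, and the calculation considers one layer in isolation.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the area spanned by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# A 3 x 3 kernel with rates 1, 2, and 3, as in the three panels of Figure 4.
for rate in (1, 2, 3):
    size = effective_kernel_size(3, rate)
    print(f"rate {rate}: {size} x {size}")
```

The kernel still has nine parameters in every case; only the spacing between the sampled positions changes, so the covered area grows from 3 × 3 to 5 × 5 to 7 × 7 without additional weights.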
The improved discriminator model proposed in this paper uses three discriminators at three scales, from coarse to fine. The coarsest discriminator $D_a$ uses dilated convolutions with a cyclic dilation rate of {1, 2, 7} and has the largest receptive field, so it is responsible for the global judgment of the fundus image. The medium-scale discriminator $D_b$ uses dilated convolutions with a cyclic dilation rate of {1, 2, 5} and is responsible for guiding the generator to generate a smooth image. The fine-scale discriminator $D_c$ uses a cyclic dilation rate of {1, 2, 3}; having the smallest receptive field and being more sensitive to details, it is responsible for guiding the generator to learn more realistic details. The multiscale structure is shown in Figure 5.
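To see why the three rate cycles give coarse, medium, and fine scales, the following sketch computes the receptive field of one cycle of stacked layers. It assumes 3 × 3 kernels, stride 1, and a single pass through each cycle; the paper does not state these details, so the absolute numbers are illustrative and only the ordering matters.

```python
def stacked_receptive_field(dilations, kernel_size=3):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    field = 1
    for rate in dilations:
        # Each layer widens the field by (kernel_size - 1) * rate pixels.
        field += (kernel_size - 1) * rate
    return field

# Rate cycles of the three discriminators described in the text.
cycles = {"D_a": (1, 2, 7), "D_b": (1, 2, 5), "D_c": (1, 2, 3)}
fields = {name: stacked_receptive_field(rates) for name, rates in cycles.items()}
```

Under these assumptions one cycle covers 21, 17, and 13 pixels per side for D_a, D_b, and D_c respectively, which matches the coarse-to-fine roles assigned to them.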
To ensure generation quality, retain the original image features, and improve visual satisfaction, this paper adds pixelwise mean squared error (pMSE) and perceptual loss to the original loss function. pMSE is defined as
$$L_{pMSE} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( I_{x,y} - G_{\theta}(I')_{x,y} \right)^2,$$
where $I_{x,y}$ and $I'_{x,y}$ represent the pixel values at position $(x, y)$ in the complete fundus image and the vascular tree, respectively; $W$ and $H$ represent the width and height of the image, both of which are 256 in this paper; and $\theta$ denotes the parameters of the generator $G_{\theta}$. Because pMSE calculates the loss pixel by pixel, it inevitably leads to overly smooth textures and poor visual perception. Therefore, this paper introduces perceptual loss to improve visual satisfaction. The perceptual loss is defined as
$$L_{pl} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I)_{x,y} - \phi_{i,j}(G_{\theta}(I'))_{x,y} \right)^2,$$
where $\phi_{i,j}$ represents the feature map before the $i$-th max-pooling layer and after the $j$-th convolutional layer in the pretrained VGG19 network [44]; $I$ and $I'$ represent the complete fundus image and the vascular tree, respectively; and $W_{i,j}$ and $H_{i,j}$ represent the dimensions of each feature map in the VGG network. The overall cost function is
$$L = L_{CGAN} + \alpha L_{pMSE} + \beta L_{pl},$$
where $L_{CGAN}$ is the adversarial loss function of CGAN, $L_{pMSE}$ is the pixelwise mean squared error, $L_{pl}$ is the perceptual loss, and $\alpha$ and $\beta$ are the hyperparameters that control the proportions, both of which are set to 0.1 in this paper.
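The way the three loss terms are combined can be sketched in plain Python. The example assumes the weighted-sum form L_CGAN + α·L_pMSE + β·L_pl with α = β = 0.1; the toy `horizontal_gradient` feature extractor is a hypothetical stand-in for the VGG19 feature map, and the adversarial loss value and the tiny images are made up.

```python
def pixelwise_mse(real, generated):
    """Mean squared error over all pixels of two equally sized 2-D images."""
    height, width = len(real), len(real[0])
    total = sum((real[r][c] - generated[r][c]) ** 2
                for r in range(height) for c in range(width))
    return total / (width * height)

def perceptual_loss(real, generated, features):
    """MSE between feature maps; `features` stands in for a VGG19 layer."""
    return pixelwise_mse(features(real), features(generated))

def total_loss(adversarial, real, generated, features, alpha=0.1, beta=0.1):
    """Weighted sum: adversarial + alpha * pMSE + beta * perceptual loss."""
    return (adversarial
            + alpha * pixelwise_mse(real, generated)
            + beta * perceptual_loss(real, generated, features))

def horizontal_gradient(image):
    """Toy feature extractor (hypothetical): differences of neighbouring pixels."""
    return [[row[c + 1] - row[c] for c in range(len(row) - 1)] for row in image]

# Made-up 2 x 3 "images" and a made-up adversarial loss value.
real = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]
generated = [[0.1, 0.5, 0.9], [0.0, 0.4, 1.0]]
loss = total_loss(adversarial=0.7, real=real, generated=generated,
                  features=horizontal_gradient)
```

With both weights at 0.1, the adversarial term dominates and the two reconstruction terms act as regularizers that pull the generated image towards the reference in pixel space and in feature space.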

Experiments and Analysis
This paper evaluates the effectiveness of the Com-GAN by comparing different generation methods in terms of image […]
The self-selected dataset is used as the training set for the hard exudate detection system and also as the training set for Com-GAN. The public datasets e-ophtha EX and DIARETDB1 are used as the test sets for the hard exudate detection system to test the system performance. To facilitate training and testing, all images in these datasets were resized to 256 × 256.
Evaluation criteria: this paper used different evaluation criteria for Com-GAN and the hard exudate detection system. For Com-GAN, the generation quality was evaluated qualitatively and quantitatively from both subjective and objective aspects. Subjectively, three observers were asked to independently perform visual assessment during the experiment; objectively, Structural Similarity Index (SSIM) [47] and Sharpness Difference (SD) were applied to measure the similarity between the generated image and the real image at the pixel level, and Inception Score (IS) [48] and Fréchet Inception Distance (FID) [49] were used to evaluate the generated image in the high-level feature space.
SSIM models the similarity between the generated image and the real image as a combination of three factors: brightness, contrast, and structure. The mean value is used as the brightness estimate, the standard deviation as the contrast estimate, and the covariance as the measure of structural similarity. The value can better reflect the subjective perception of human eyes, with the range being [0, 1]. The larger the value, the higher the similarity between images. The SSIM formula is
$$SSIM(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)},$$
where $\mu_X$ and $\mu_Y$ represent the mean values of the generated image $X$ and the real image $Y$, $\sigma_X$ and $\sigma_Y$ represent their standard deviations, $\sigma_{XY}$ represents their covariance, and $C_1$ and $C_2$ are constants introduced to prevent the denominator from being 0.
Sharpness Difference (SD) represents the difference in clarity between the generated image and the real image. The larger the SD value, the smaller the difference in sharpness between the images and the closer the generated image is to the real image. The SD between the generated image $X$ and the real image $Y$ is
$$SD(X, Y) = 10 \log_{10} \frac{MAX_Y^2}{grads_{X,Y}},$$
where $MAX_Y$ is the maximum pixel value of the image and $grads_{X,Y}$ is the gradient difference between the image $X$ and the image $Y$.
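The SSIM definition used in this section can be sketched with global image statistics. This is a simplified single-window version on flattened pixel lists, written for illustration only; the constants use the conventional choices C1 = (0.01·L)² and C2 = (0.03·L)² for a dynamic range L, which the paper does not specify, and practical implementations such as `skimage.metrics.structural_similarity` average the score over local windows instead.

```python
def ssim_global(x, y, max_value=255.0):
    """Single-window SSIM computed from global statistics of two pixel lists."""
    c1 = (0.01 * max_value) ** 2  # assumed conventional stabilizing constants
    c2 = (0.03 * max_value) ** 2
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Made-up flattened pixel values of a real image and a slightly different one.
real_pixels = [10.0, 50.0, 90.0, 130.0]
generated_pixels = [12.0, 48.0, 95.0, 120.0]
score = ssim_global(generated_pixels, real_pixels)
```

An image compared with itself scores 1, and the score drops as the means, variances, or covariance of the two images drift apart.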
Inception Score (IS) evaluates the generated image in terms of quality and diversity. In theory, the closer the image is to the real image, the higher the IS score will be. The calculation formula is
$$IS(G) = \exp\left( \mathbb{E}_{x \sim p_g} \, D_{KL}\left( p(y \mid x) \,\|\, p(y) \right) \right),$$
where $x$ represents an image generated by the generator, $y$ is the predicted label of $x$, and $D_{KL}$ is the KL divergence between $p(y \mid x)$ and $p(y)$.
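The Inception Score can be computed directly from the class probabilities that a classifier assigns to each generated image. The sketch below uses made-up two-class predictions (for example, normal versus hard exudates) and is only an illustration of the formula, not the scoring pipeline used in the paper.

```python
import math

def inception_score(probabilities):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    n = len(probabilities)
    classes = len(probabilities[0])
    marginal = [sum(p[k] for p in probabilities) / n for k in range(classes)]
    mean_kl = sum(
        sum(pk * math.log(pk / marginal[k]) for k, pk in enumerate(p) if pk > 0)
        for p in probabilities) / n
    return math.exp(mean_kl)

# Confident predictions spread evenly over both classes: quality and diversity.
confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
# Predictions that say nothing about the class of any image.
uninformative = [[0.5, 0.5], [0.5, 0.5]]

best = inception_score(confident_and_diverse)
worst = inception_score(uninformative)
```

With two classes the score ranges from 1 (uninformative or collapsed predictions) to 2 (every image classified confidently and both classes equally represented).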
Since the ImageNet dataset does not contain labelled fundus categories, this paper does not directly use the pretrained Inception model, but the AlexNet model [50] trained on the Kaggle dataset instead for scoring.
FID assumes that the abstract features of the generated samples and the real samples in the middle layer of the classifier follow multivariate Gaussian distributions, and FID is the Fréchet distance between these two Gaussian distributions. The smaller the FID value, the closer the two Gaussian distributions are to each other and the closer the generated image is to the real image. The FID calculation formula is
$$FID = \|\mu_x - \mu_g\|_2^2 + Tr\left( \Sigma_x + \Sigma_g - 2 (\Sigma_x \Sigma_g)^{1/2} \right),$$
where $\mu_g$ and $\Sigma_g$ are the mean and covariance of the Gaussian distribution of the generated samples, $\mu_x$ and $\Sigma_x$ are the mean and covariance of the Gaussian distribution of the real samples, and $Tr$ represents the trace of a matrix.
The function of the hard exudate detection system is to determine whether an image contains hard exudates. In this paper, an image containing hard exudates is marked as a positive sample, and accuracy (AC), sensitivity (SE), and specificity (SP) are used as performance evaluation indices. The calculation formulas are as follows:
$$AC = \frac{TP + TN}{TP + TN + FP + FN}, \quad SE = \frac{TP}{TP + FN}, \quad SP = \frac{TN}{TN + FP},$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Experimental parameters and environment: both the generator model and the discriminator model used the Adam optimizer [51]. The parameter β1 was set to 0.9, the parameter β2 to 0.99, and the learning rate to 0.001. This paper used the PyTorch platform for coding, as it can dynamically create new computation graphs, which facilitates experiment debugging. The server configuration is a CPU E5-2620 v4 @ 2.10 GHz and an NVIDIA Tesla V100 with 16 GB of memory.
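The accuracy, sensitivity, and specificity indices used for the hard exudate detection system follow directly from the confusion-matrix counts. The counts in this sketch are hypothetical and serve only to show the calculation.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # share of images with hard exudates detected
    specificity = tn / (tn + fp)  # share of normal images correctly rejected
    return accuracy, sensitivity, specificity

# Hypothetical test run: 100 positive and 100 negative images.
ac, se, sp = detection_metrics(tp=80, tn=90, fp=10, fn=20)
```

Because sensitivity depends only on the positive samples and specificity only on the negative ones, reporting both shows whether a gain in accuracy comes from better detection of diseased images or simply from the majority class.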

Experimental Results of the Com-GAN.
This paper used the trained im-WGAN to generate vascular trees and then denoised the resulting images, with the results shown in Figure 6.
The first row of Figure 6 shows vascular trees segmented from real fundus images. During training, they were input to the discriminator as the training set to guide the generator. The second row shows the generated vascular trees. The vascular trees generated by im-WGAN and the labels of the images to be generated were input into the trained im-CGAN generator to obtain the complete generated images. The results are shown in Figure 7. The first row shows generated fundus images without hard exudates, and the second row shows generated fundus images with hard exudates. The first column of Figure 7 shows the segmented vascular trees, which were input into the generator together with the labels during training. The second column shows the real fundus images, which were input into the discriminator together with the corresponding vascular trees and labels during training. The third column shows the vascular trees generated by im-WGAN. During testing, these vascular trees and the labels of the images to be generated were input into the generator to produce the complete fundus images shown in the fourth column.

Generation Quality Evaluation.
This section compares the proposed method with the current mainstream generation models to evaluate generation quality:
CGAN [27]: CGAN increases controllability by introducing labels into the original GAN, thereby generating images corresponding to the labels.
Pix2pix [28]: this model introduces the image x in the generator and the discriminator to guide the generator to generate the image y, where x and y represent images in different domains X and Y, respectively, so the function of pix2pix can also be understood as image translation from domain X to domain Y. In the experiment in this paper, the segmented vascular tree was used as the image x, which was then mapped to the complete image y using the pix2pix model.
Figure 6: Results of vessel tree generation.
Mathematical Problems in Engineering

Qualitative Evaluation.
From a visual point of view, this paper conducted a qualitative evaluation of the images generated by the above methods, as shown in Figure 8. Normal image samples are in the first row of Figure 8, and image samples with hard exudates are in the second row. The evaluations of the three observers are summarized as follows: the images generated by the CGAN model only show the fundus outline and part of the main blood vessels, with almost no clear details of the blood vessels and hard exudates; those generated by pix2pix exhibit clear details of blood vessels, but the optic disc and macular area are blurred, and no obvious hard exudate area is observed.
Compared with those generated by the previous two methods, the images generated by BiGAN, IntroVAE, and the proposed method are more realistic. The images generated by Com-GAN have more semantic details, and their sharpness is the closest to that of real images. As shown in Figure 9, BiGAN and IntroVAE fabricated some vascular details that do not conform to medical principles in order to deceive the discriminator, whereas the image generated by Com-GAN is based on a vascular tree, so its vascular details are more realistic.
All the other methods generate complete fundus images in one step. To control whether the generated image contains hard exudates, pix2pix, BiGAN, and IntroVAE all use reconstruction to generate an image with the same label as the input image, so there is a one-to-one relationship between the input image and the reconstructed one. CGAN and Com-GAN can control the image type through the label y, so there is a one-to-many relationship between the label and the generated images. As a result, the diversity of their generated images is better than that of the other methods.

Quantitative Evaluation.
This section presents a comparative analysis of the normal images, images with hard exudates, and vascular trees generated by the above models. The vascular trees under evaluation were segmented from the images generated by each model using the segmentation model, and FID was used to measure the similarity between these vascular trees and the real segmented ones. The experimental results are shown in Table 2. Larger SSIM, SD, and IS values and smaller FID values indicate higher similarity between the generated image and the real one, better generation quality, and richer diversity.
The experimental results show that, for complete image generation, Com-GAN and IntroVAE were very close in SSIM, SD, and FID and better than the other three methods. However, the IS score of Com-GAN was higher than those of the other four models and 22.20% higher than the average value of IntroVAE, the best among the other methods, indicating that Com-GAN is superior to the other models in terms of both generation quality and diversity. The evaluation results for vascular trees show that the vascular trees generated by Com-GAN were the closest to the real ones. Therefore, judging from the three aspects, the images generated by Com-GAN are more realistic than those generated by the other four models.

Performance Comparison of Intelligent Diagnostic Systems.
To verify the effectiveness of this method in practical applications, this paper applied the generated images to the training of the hard exudate detection system and tested the performance of the detection system on the data-enhanced test set. The original test set was a mixed dataset of DIARETDB1 and e-ophtha EX. The enhanced dataset contained 470 fundus images with hard exudates and 385 fundus images without hard exudates. The hard exudate detection system was implemented using the AlexNet model, and the final classification output was changed to two categories, that is, input images with hard exudates and those without hard exudates.
This paper used the image sets generated by the above models to train the hard exudate detection system, with the test results shown in Figure 10. In the figure, the horizontal axis represents the size of the training set. The initial set was a self-selected dataset containing 4000 real images. Then, 4000 images were added each time until the set reached 16000 images. The ratio of positive to negative samples was adjusted to 1 : 1 from the first expansion of the data. The vertical axis represents the evaluation index score. Figures 10(a)-10(f) show the evaluation results of the method combining oversampling with data augmentation, CGAN, pix2pix, BiGAN, IntroVAE, and Com-GAN, respectively.
It can be seen from Figure 10 that the method combining oversampling with data augmentation and the methods that directly generate images showed significant improvements in the first expansion of the dataset. However, when the training dataset was expanded to 12000 and 16000 images, no significant improvement in performance was observed. In contrast, with the dataset expanded by Com-GAN, the system performance improved in all of the last three evaluations. Moreover, after the third expansion of the data, the final SE, SP, and AC reached 0.787, 0.844, and 0.824, respectively, which were higher than the final results of the other models. Compared with those of the initial dataset, the indices increased by 0.213, 0.065, and 0.157, respectively, and compared with those of IntroVAE, the best among the other methods, the indices were higher by 0.053, 0.013, and 0.046, respectively. This verifies that the proposed method is superior to the other models in practical applications.
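The reported differences also fix the baseline values they were measured against. The short calculation below derives them by subtraction from the figures quoted in the text; the derived baselines and relative gains are implied by those figures rather than reported separately, so they should be read as a consistency check.

```python
# Final scores of Com-GAN and the gains quoted in the text.
final = {"SE": 0.787, "SP": 0.844, "AC": 0.824}
gain_over_initial = {"SE": 0.213, "SP": 0.065, "AC": 0.157}
gain_over_introvae = {"SE": 0.053, "SP": 0.013, "AC": 0.046}

# Baselines implied by the differences (derived, not reported directly).
initial = {k: round(final[k] - gain_over_initial[k], 3) for k in final}
introvae = {k: round(final[k] - gain_over_introvae[k], 3) for k in final}

# Relative improvement over the initial training set, in percent.
relative_gain = {k: round(100 * gain_over_initial[k] / initial[k], 1)
                 for k in final}
```

On these figures the largest gain is in sensitivity, which is consistent with the expansion mainly adding positive samples to a training set that was short of them.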

Conclusion
This paper proposes a new type of GAN, Com-GAN, for generating fundus images. Com-GAN divides the fundus image generation process into two stages. First, im-WGAN is used to generate a vascular tree, and then im-CGAN is used to generate a complete fundus image on the basis of the vascular tree. The proposed method alleviates the problem of uneven sample distribution in the fundus image training set and at the same time mitigates the problem of insufficient training samples. Qualitative and quantitative evaluation and application in the detection system show that, compared with current mainstream generation models, Com-GAN can generate high-quality fundus images, and that the generated images are highly diverse rather than simple repeats of the images in the training set. In addition, the proposed two-step generation method can be flexibly applied to expand other datasets. In the future, more research will be carried out to explore the application of this method in fields such as image style transfer and image translation.

Data Availability
All data included in this study are available upon request to the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.