Study on Optimal Generative Network for Synthesizing Brain Tumor-Segmented MR Images

Due to institutional and privacy constraints, medical imaging research faces serious data scarcity. Image synthesis using generative adversarial networks offers a generic solution to this lack of data. We synthesize high-quality brain tumor-segmented MR images, a problem that consists of two tasks: synthesis and segmentation. We performed experiments with two different generative networks: the first uses the ResNet model, which has significant advantages in style transfer, and the second uses the U-Net model, one of the most powerful models for segmentation. We compare the performance of the two models and propose the more robust one for synthesizing brain tumor-segmented MR images. Although ResNet produced better-quality images than U-Net for the same samples, it used a great deal of memory and took much longer to train. U-Net, meanwhile, segmented the brain tumors more accurately than ResNet.


Introduction
General characteristics of medical imaging data are as follows: it is difficult to obtain a large volume of data, and it is even more difficult to acquire the labelled data necessary for supervised learning. As shown in Figure 1, since the picture archiving and communication system (PACS) was introduced in hospitals, vast amounts of multimedia data have been stored in the medical imaging field. However, due to various institutional and privacy issues, external institutions have difficulty gaining access to such data. Additionally, using the accumulated data for learning requires preprocessing, which takes considerable time and effort.
In addition, medical imaging data have the following characteristics. A typical chest X-ray image contains 2,000 pixels horizontally and 2,500 vertically, a total of five million pixels, while the lesions usually occupy a relatively small part of the whole image. Magnetic resonance imaging (MRI) scans provide more detailed information about inner organs such as the brain, the skeletal system, and other organ systems than computerized tomography (CT) scans do. Although MRI has many advantages, it also has disadvantages such as a prolonged acquisition time (about 45 min), high cost, and limiting patient factors such as claustrophobia or metal devices in the body [3]. Because MRI scanners use strong magnetic fields and magnetic field gradients, scanning can be dangerous for a patient with nonremovable metal inside the body [4], and the acquired images can be blurred or abnormal. CT scans are combinations of X-ray images taken from different angles; they are fast, painless, and noninvasive, but they expose the patient to radiation, albeit at a relatively low dose. To minimize radiation exposure, CT scans are acquired at low dose, and as a result they tend to be severely degraded by excessive noise and streak artifacts. For these reasons, only a small amount of medical imaging data is available.
Generative adversarial networks (GANs) [5] provide a generic solution to the lack of medical imaging data. As shown in Figure 2, they can be applied to diverse tasks such as image synthesis [7,9], segmentation [6,11,12], reconstruction, and classification [8]. Figure 3 shows the statistics of GAN-related papers categorized by tasks and imaging modalities.
These statistics are based on the databases of PubMed, arXiv, and the proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), SPIE Medical Imaging, the IEEE International Symposium on Biomedical Imaging (ISBI), and the International Conference on Medical Imaging with Deep Learning (MIDL) [13]. The cutoff date of the search was July 30, 2018, and the number of GAN-related papers increased significantly in 2017 and 2018. As Figure 3 shows, about 70% of these papers study image synthesis and segmentation, and MR is the most-studied imaging modality in GAN-related publications.
In this paper, we use CycleGAN [14], which has shown strong performance in medical imaging, to synthesize brain tumor-segmented MR images. Generating brain tumor-segmented MR images consists of two tasks, namely, synthesis and segmentation. One entails image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another. The other involves locating and marking lesions on the images. To perform these two tasks, we conduct two experiments using two different generative networks. In the first experiment, we use a ResNet [15] model, which has significant advantages in style transfer. In the second, we use a U-Net [16] model, one of the most commonly used segmentation techniques. We compare the performance of each model and propose a more robust model for synthesizing brain tumor-segmented MR images, which could lead to high-quality multimedia data augmentation in the medical imaging field.

Figure 2: Example applications using GANs. (a) Organ (lung and heart) segmentation on a chest X-ray of an adult [6]. (b) Input MR image, synthesized CT image, and real CT image [7]. (c) Randomly generated skin lesions from random noise (a mixture of malignant and benign) [8]. (d) Ki67 synthetic image generated using the segmentation/annotation [9]. (e) Generated retinal fundus images and vessel maps [10].

Related Work
In a study on brain CT image synthesis from MR images by Wolterink et al. [7], training with unpaired images performed even better than training with paired images [13]. That is, the performance of CycleGAN exceeds that of Pix2pix [17] in cross-modality synthesis of medical images. In addition, as shown in Table 1, the use of CycleGAN in recent publications on medical image synthesis is increasing. In this section, we discuss CycleGAN, ResNet, and U-Net, which are used in our experiments.

2.1. CycleGAN. The problem that Pix2pix and CycleGAN solve is the translation of images from one domain to another. Pix2pix requires data pairs corresponding to both domains, whereas CycleGAN can solve this problem without such pairs [14]. CycleGAN is a GAN using two generators and two discriminators. We call one generator G; it converts images from the X domain to the Y domain. The other generator, F, converts images from the Y domain to the X domain. Each generator has a corresponding discriminator that attempts to tell its synthesized images apart from real ones. CycleGAN is not just a simple mapping technique: it considers the returning mapping and constrains the image to come back to its original state. As shown in Figure 4, not only the mapping from X to Y but also the mapping back from Y to the original X must be defined, and the same applies in the opposite direction. The reason is that X and Y are unpaired domains. When X goes to Y, the result is checked to look like Y, and the actual constraint is that the original shape is kept when it returns to X. That is, the shape of X does not change much; only its style is changed to that of Y, so it looks as if only the style has been transferred.
There are two components in the CycleGAN objective function, an adversarial loss and a cycle-consistency loss, and both are essential for successful results. The adversarial loss alone is not sufficient to produce high-quality images and leaves the model underconstrained [14]. In other words, it forces the generated output to belong to the appropriate domain but does not force the input and output to be recognizably the same. The cycle-consistency loss addresses this underconstrained problem. The full objective, obtained by putting these loss terms together and weighting the cycle-consistency loss with a hyperparameter λ, is defined as

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F). (1)

Pix2pix can rely heavily on its L1 loss, so the adversarial loss plays only a supplementary role there; in contrast, CycleGAN cannot learn at all without the adversarial loss.

Table 1 note: in the third column, * following a method denotes modifications either to the architecture of the network or to the employed losses.
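To make the role of the cycle-consistency constraint concrete, the following toy sketch (an illustration we add here, not code from the paper) treats the two generators as simple one-dimensional functions and measures the L1 round-trip error:

```python
import numpy as np

def cycle_consistency_loss(G, F, x):
    """Mean L1 distance between x and its round trip F(G(x))."""
    return np.mean(np.abs(F(G(x)) - x))

# Toy "generators": G maps domain X to Y, F maps Y back to X.
G = lambda x: 2.0 * x + 1.0           # X -> Y (style change stand-in)
F_good = lambda y: (y - 1.0) / 2.0    # exact inverse: shape is preserved
F_bad = lambda y: y                   # ignores the mapping back

x = np.array([0.0, 1.0, 2.0])
print(cycle_consistency_loss(G, F_good, x))  # 0.0: x is fully recovered
print(cycle_consistency_loss(G, F_bad, x))   # 2.0: the round trip distorts x
```

A generator pair with low cycle-consistency loss keeps the content of x intact while the adversarial loss pushes its style toward the target domain.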

ResNet.
The general CNN in Figure 5(a) receives the input x and yields the output H(x) through two weighted layers; this output is the input to the next layer. Figure 5(b) shows the architecture of ResNet, which uses a shortcut connection that connects the input of the layer directly to the output [15]. It is a simple network, but its performance is significantly high. As can be seen in Figure 5(b), the stacked layers output H(x) = F(x) + x, so they can be seen as learning the residual F(x) = H(x) − x; this is why the network is called ResNet. The block adds the result from the weight layers to the previous result and applies ReLU. ResNet learns in the direction in which F(x) becomes zero. In addition, since x is passed directly through the shortcut connection, there is no increase in computation, and it is possible to select which layers to include; for example, a fully connected layer as well as a convolution layer can be added [15].
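A minimal sketch of the residual block idea, using dense layers in place of convolutions for brevity (the weight shapes and values here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = ReLU(x + F(x)), where F(x) = ReLU(x @ W1) @ W2 is the residual branch."""
    fx = relu(x @ W1) @ W2   # two weighted layers compute the residual F(x)
    return relu(x + fx)      # identity shortcut: adds no extra parameters

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W1 = 0.01 * rng.standard_normal((8, 8))
W2 = 0.01 * rng.standard_normal((8, 8))
y = residual_block(x, W1, W2)
print(y.shape)  # (4, 8): the shortcut requires matching input/output shapes

# If the branch learns F(x) = 0, the block reduces to the identity (after ReLU).
zeros = np.zeros((8, 8))
print(np.allclose(residual_block(x, zeros, zeros), relu(x)))  # True
```

The check with zero weights illustrates the point made above: learning toward F(x) = 0 makes the block behave like a pass-through, which is why very deep residual stacks remain trainable.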

U-Net.
Segmentation is the process of partitioning an image into different meaningful segments [23]. In medical imaging, these segments often correspond to different tissue classes, organs, pathologies, or other biologically relevant structures [24]. In the past, there were few medical images, so experts could segment them directly. However, the need for automated segmentation has grown as the volume of medical images has increased exponentially. Analysing medical images can be difficult and time consuming, and deep neural networks can therefore help doctors make more rapid and more accurate diagnoses [25].
U-Net is one of the most preferred models for segmenting images. Figure 6(a) shows the general encoding and decoding process, and Figure 6(b) shows a U-Net model with a skip connection added to the encoder-decoder structure. If the image size is reduced (downsampling) and then raised again (upsampling), fine pixel information disappears. This is a big problem for image segmentation, which requires dense prediction on a pixel-by-pixel basis. The skip connection, which passes important information directly from the encoder to the decoder, results in a much clearer image at the decoder, allowing more accurate prediction.
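The loss of pixel detail and its recovery through a skip connection can be sketched as follows; the specific pooling and upsampling operators are illustrative assumptions, not the paper's exact layers:

```python
import numpy as np

def downsample(x):
    """2x2 average pooling: halves the spatial resolution."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling: doubles the spatial resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)   # toy encoder feature map
decoded = upsample(downsample(x))              # round trip blurs fine detail
# Skip connection: concatenate the original encoder features onto the decoder.
with_skip = np.stack([decoded, x], axis=-1)

print(np.abs(decoded - x).max() > 0)  # True: pixel-level detail was lost
print(with_skip.shape)                # (4, 4, 2): full detail passed along
```

The decoder that receives `with_skip` sees both the coarse decoded map and the unblurred original, which is exactly what dense per-pixel prediction needs.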

Materials.
The first CycleGAN used the U-Net model. The advantage of the skip connection is that it retains much more detail, but the disadvantage is that performance suffers when the two contents are similar. The other CycleGAN used the ResNet model, which yields good image quality but has the disadvantage of using a great deal of memory.
In this paper, we synthesize high-quality brain tumor-segmented MR images. This consists of two tasks, namely, synthesis and segmentation, and we therefore conduct experiments from two perspectives. One is to perform image-to-image translation, which synthesizes a novel image with the style of another image. The other is to locate and mark tumors in the brain MR image. We therefore perform two experiments with two different generative networks. In the first experiment, we use a ResNet model, which has significant advantages in style transfer. In the second experiment, we use a U-Net model, one of the most commonly used segmentation techniques. We compare the performance of each model and propose the more powerful model for synthesizing brain tumor-segmented MR images. Figure 7 shows the datasets of the source and target domains used in our experiments: Figure 7(a) represents the brain lesion images in the source domain, and Figure 7(b) the segmentation mask images in the target domain.

Architecture of Our Discriminative Model.
The configuration of the discriminative model in our experiments is shown in Figure 8. It consists of four convolution layers, and we use leaky ReLU as the activation function for each layer. In the first layers, we extract features from the image, and in the last, we decide which category these features belong to. For that, we add a final convolution layer that produces a one-dimensional output. Both the ResNet and U-Net generative models use this model as the discriminator in our experiments.
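Assuming PatchGAN-style convolutions with kernel size 4 and strides of 2, 2, 2, 1 (a common choice for CycleGAN discriminators; the paper does not state its kernel sizes or strides), the spatial sizes through the four convolution layers and the final one-channel convolution for a 256×256 input can be traced as:

```python
def conv_out(n, k=4, s=2, p=1):
    """Output spatial size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

size, sizes = 256, []
# Four feature-extraction convolutions, each followed by leaky ReLU.
for stride in [2, 2, 2, 1]:          # assumed PatchGAN-style strides
    size = conv_out(size, s=stride)
    sizes.append(size)
# Final convolution producing the single-channel (one-dimensional) output.
size = conv_out(size, s=1)
sizes.append(size)
print(sizes)  # [128, 64, 32, 31, 30]: a 30x30 map of real/fake decisions
```

Under these assumptions the output is a patch map rather than a single scalar, so each cell judges one region of the image as real or fake.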

Architecture of Our Generative Model Using ResNet.
The generator's job is to take an input image and perform the transformation that produces the target image. The architecture of our generative model using ResNet is shown in Figure 9. First, the encoding process consists of three convolution layers, with ReLU as the activation function for each layer. In the transformation process, nine residual blocks are constructed, and each block consists of a convolution layer, ReLU, and another convolution layer. The decoding process consists of two deconvolution layers, each using ReLU as the activation function. In the last decoding step, we add a final convolution layer.
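Using the kernel sizes and strides of the standard CycleGAN ResNet generator (an assumption on our part; the paper does not list them), the shape flow through encoding, the nine residual blocks, and decoding for a 256×256 input can be sketched as:

```python
def conv_out(n, k, s, p):
    """Output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def deconv_out(n, k, s, p, op=1):
    """Output size of a transposed convolution: (n - 1)s - 2p + k + output_padding."""
    return (n - 1) * s - 2 * p + k + op

size = 256
size = conv_out(size, 7, 1, 3)       # encoder conv 1 (kernel 7, stride 1)
size = conv_out(size, 3, 2, 1)       # encoder conv 2 (downsamples)
size = conv_out(size, 3, 2, 1)       # encoder conv 3 (downsamples)
encoded = size

for _ in range(9):                   # nine conv-ReLU-conv residual blocks
    size = conv_out(conv_out(size, 3, 1, 1), 3, 1, 1)
transformed = size                   # residual blocks preserve the shape

size = deconv_out(size, 3, 2, 1)     # decoder deconv 1 (upsamples)
size = deconv_out(size, 3, 2, 1)     # decoder deconv 2 (upsamples)
decoded = conv_out(size, 7, 1, 3)    # final convolution
print(encoded, transformed, decoded)  # 64 64 256: input resolution restored
```

The transformation stage works at one quarter of the input resolution, which is where the residual blocks' memory cost concentrates.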

Architecture of Our Generative Model Using U-Net.
As shown in Figure 10, the encoder-decoder structure of our generative model using U-Net is as follows. First, the encoding process consists of eight convolution layers, with leaky ReLU as the activation function for each layer. The decoding process consists of eight deconvolution layers; ReLU is used as the activation function for each layer, and 50% dropout is applied in the first to third decoding layers. We also use the concat function in the decoding process to perform a skip connection that passes important information directly from the encoder to the decoder.
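A hypothetical channel plan in the pix2pix U-Net style (the paper does not state its channel counts) illustrates how concatenating the mirrored encoder features enlarges the channel count entering each subsequent decoder stage:

```python
# Hypothetical channel plan (pix2pix-style); the paper does not list its counts.
enc = [64, 128, 256, 512, 512, 512, 512, 512]    # eight encoder convolutions
dec_out = [512, 512, 512, 512, 256, 128, 64, 3]  # eight decoder deconvolutions

stages = []
for i, out_c in enumerate(dec_out):
    c = out_c                        # channels produced by deconv i
    if i < len(dec_out) - 1:         # the last deconv emits the image: no skip
        c += enc[-(i + 2)]           # concat with the mirrored encoder features
    stages.append(c)
# Dropout (50%) would be applied after the first three decoder stages.
print(stages)  # [1024, 1024, 1024, 1024, 512, 256, 128, 3]
```

The doubling from concat is the structural cost of the skip connection: each decoder layer must process both its upsampled input and the encoder detail passed across.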

Methods for Synthesizing Brain Tumor-Segmented MR Images.
The models in our work include forward and backward cycles, just like the CycleGAN model proposed by Zhu et al. [14]. With these cycles, the novel synthesized image obtains only the style of the target image while retaining the shape of the original image.
As shown in Figure 11, the architecture of our forward and backward cycles is composed of two generators (Gen_A→B and Gen_B→A) and two discriminators (Dis_B and Dis_A).
The forward process is as follows: A → Gen_A→B(A) → Gen_B→A(Gen_A→B(A)) ≈ A. More specifically, it can be explained in three steps. First, generator Gen_A→B is trained to translate an input brain tumor domain image (A) into a segmentation mask domain image (B). Second, Dis_B is trained to discriminate the generated image B (Gen_A→B(A) ≈ B) from the real image B.
Third, Gen_B→A is trained to translate the generated image B back into the brain tumor MR image A (Gen_B→A(B) ≈ A). Likewise, the backward process is as follows: B → Gen_B→A(B) → Gen_A→B(Gen_B→A(B)) ≈ B. More specifically, it can also be defined in three steps. First, Gen_B→A is trained to translate an input segmentation mask domain image (B) into a brain tumor domain image (A). Second, Dis_A is trained to discriminate the generated image A (Gen_B→A(B) ≈ A) from the real image A. Third, Gen_A→B is trained to translate the generated image A back into the segmentation mask MR image B (Gen_A→B(A) ≈ B).
The goal of each discriminator is to distinguish the novel images generated by the two generators from real ones, and therefore the discriminative network is trained to minimize its final classification error. The goal of each generator, on the other hand, is to fool its discriminator, and therefore the generative network is trained to maximize that classification error. The two networks attempt to beat each other, and this competition makes them evolve with respect to their respective goals. Hence, the adversarial loss function that discriminator Dis_B aims to minimize and generator Gen_A→B aims to maximize is defined as

L_adv(Gen_A→B, Dis_B) = −E_b[log Dis_B(b)] − E_a[log(1 − Dis_B(Gen_A→B(a)))]. (2)

Next, the adversarial loss function that discriminator Dis_A aims to minimize and generator Gen_B→A aims to maximize is defined as

L_adv(Gen_B→A, Dis_A) = −E_a[log Dis_A(a)] − E_b[log(1 − Dis_A(Gen_B→A(b)))]. (3)

The adversarial losses alone cannot guarantee that the learned function maps an input domain A to a target domain B. To regularize the model and to transform the source distribution into the target and then back again, we introduce the cycle-consistency constraint into the model.

Figure 9: The architecture of our generative model using ResNet. It consists of encoding, transformation, and decoding processes.
An additional loss term, the cycle-consistency loss, is defined as

L_cyc(Gen_A→B, Gen_B→A) = E_a[‖Gen_B→A(Gen_A→B(a)) − a‖_1] + E_b[‖Gen_A→B(Gen_B→A(b)) − b‖_1]. (4)

Our full objective, obtained by putting these loss terms together and weighting the cycle-consistency loss with the hyperparameter λ, is defined as

L_full = L_adv(Gen_A→B, Dis_B) + L_adv(Gen_B→A, Dis_A) + λ L_cyc(Gen_A→B, Gen_B→A), (5)

where we set λ to 10, the optimal value introduced in CycleGAN.
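A numpy sketch of the full objective, assuming the least-squares adversarial form used by the public CycleGAN implementation (the generator outputs and discriminator scores below are random stand-ins, not outputs of our trained models):

```python
import numpy as np

def adv_loss_G(d_fake):
    """Generator's adversarial loss, least-squares form (assumed): fakes should score 1."""
    return np.mean((d_fake - 1.0) ** 2)

def cycle_loss(real, reconstructed):
    """L1 cycle-consistency loss between a batch and its round trip."""
    return np.mean(np.abs(reconstructed - real))

lam = 10.0                            # cycle-consistency weight, as in the paper

rng = np.random.default_rng(0)
A = rng.random((2, 8, 8))             # toy brain-tumor domain batch
B = rng.random((2, 8, 8))             # toy segmentation-mask domain batch

# Stand-ins for discriminator scores on fakes and for imperfect round trips.
disB_on_fakeB = rng.random((2, 1))
disA_on_fakeA = rng.random((2, 1))
rec_A, rec_B = A + 0.05, B - 0.05     # each reconstruction is off by 0.05

full = (adv_loss_G(disB_on_fakeB) + adv_loss_G(disA_on_fakeA)
        + lam * (cycle_loss(A, rec_A) + cycle_loss(B, rec_B)))
print(full)  # dominated by the weighted cycle term: lam * (0.05 + 0.05) = 1.0
```

With λ = 10, even a 0.05 per-pixel reconstruction error contributes as much to the objective as a completely fooled discriminator, which is what keeps the content anchored.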

Evaluation.
We use metrics such as mean-squared error and mean absolute error to evaluate the performance of both the ResNet model and the U-Net model. We also evaluate the performance of the discriminator Dis_B in each model using the following metrics: Dis_realB, the discriminator's loss for the real image B; Dis_fakeB, the discriminator's loss for the fake image B synthesized by the generator Gen_A→B; and Dis_fullB, the sum of the losses Dis_realB and Dis_fakeB. We also evaluate the performance of the generator Gen_A→B in each model using the following metrics: the forward and backward cycle-consistency loss, where the hyperparameter λ is set to 10, and the generator's loss for the fake image B synthesized by the generator Gen_A→B; the same holds for Dis_A and Gen_B→A. In addition to the above-mentioned evaluation metrics, we use human perception to judge the visual quality of the samples: we evaluate the quality of the generated images and whether the brain tumors are segmented well.
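The evaluation metrics can be sketched as follows; the least-squares form of the discriminator losses is our assumption (the paper's exact equations are not reproduced here), and the score vectors are toy values:

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def mae(a, b):
    return np.mean(np.abs(a - b))

# Discriminator losses in least-squares form (assumed): real images should
# score 1 and synthesized (fake) images should score 0.
def dis_real_loss(scores):
    return np.mean((scores - 1.0) ** 2)

def dis_fake_loss(scores):
    return np.mean(scores ** 2)

real_scores = np.array([0.9, 0.8])    # toy Dis_B outputs on real images B
fake_scores = np.array([0.2, 0.1])    # toy Dis_B outputs on fakes from Gen_A->B
dis_full = dis_real_loss(real_scores) + dis_fake_loss(fake_scores)
print(dis_full)                        # small when the discriminator is accurate

a, b = np.zeros(4), np.array([0.0, 1.0, 1.0, 0.0])
print(mse(a, b), mae(a, b))  # 0.5 0.5
```

A small Dis_full value, like U-Net's 4.890e−3 in Table 2, therefore corresponds to a discriminator that separates real from synthesized images almost perfectly.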

Results and Discussion
Our training process is as follows. We used 20 epochs, 100 steps, and the Adam optimizer with an initial learning rate of 0.0002 and a momentum term of 0.5. The training took 28.6 hours for the ResNet model and 6.9 hours for the U-Net model; that is, training the ResNet took four times as long as training the U-Net.

Figure 10: The architecture of our generative model using U-Net. We use concat in the decoding process to perform a skip connection that passes important information directly from the encoder to the decoder.

Table 2 shows all losses of the discriminator and the generator in each training process of the ResNet and the U-Net. The full loss is the sum of all losses, not only from the brain tumor domain to the segmentation mask domain but also from the segmentation mask domain to the brain tumor domain. As shown in Table 2, the full loss of the discriminator is 0.430 for the ResNet and 4.890e−3 for the U-Net, which indicates that the performance of the discriminator is significantly higher when using U-Net than when using ResNet. In contrast, the full loss of the generator is 1.338 for the ResNet and 2.572 for the U-Net; that is, the performance of the generator is slightly higher when using ResNet than when using U-Net.
Additionally, Figure 12 (A → B) and Figure 13 (B → A) show the novel images generated by the ResNet and the U-Net models for the same samples. As shown in Figure 12 (A → B), U-Net, the more robust model for segmentation, located and marked the tumors in the brain MR image and produced a brain tumor-segmented MR image similar to the ground truth. ResNet, in contrast, did not segment the exact location of the brain lesion. Figure 13 (B → A) shows the novel brain images generated from the segmentation mask domain. Both networks produced high-quality images, but the synthesis quality of ResNet was higher than that of U-Net (see Figures 14 and 15 for more details about the novel images generated by the ResNet and the U-Net models for the same samples).

Conclusions
In our work, we augmented brain tumor-segmented MR images, a task that consists of two subtasks: synthesis and segmentation.
Therefore, we conducted two experiments, one to perform image-to-image translation, namely, image style transfer, and the other to locate and mark tumors in the brain MR image, that is, image segmentation. We performed experiments with two different generative networks, the first using the ResNet model, which has great advantages in style transfer, and the second using the U-Net model, one of the most robust models for segmentation. The performance comparison between the ResNet and the U-Net generative models is as follows. When the generator used ResNet, its training loss was slightly lower than that of U-Net, and it produced better-quality images. However, ResNet is memory-intensive, took much longer to train, and did not segment the brain tumors as well as U-Net. On the other hand, when the generator used U-Net, the discriminator was better at judging whether a generated image was real or fake. Additionally, for the same samples, U-Net segmented the brain tumors more accurately than ResNet did; i.e., the segmented images generated from the brain tumor domain marked the exact location of the brain lesions.
The generative networks proposed in our paper will enable the synthesis not only of brain tumor-segmented images but also of the medical images in Figure 2, as well as novel images segmenting tumors of the breast, uterus, and other organs, depending on the intended application. In future work, we will apply a network that combines the advantages of the two networks; if we merge the two models, it should be possible to generate high-quality synthetic images with accurate segmentation. We will also increase the number of epochs and adjust hyperparameters such as the initial learning rate. High-quality multimedia data augmentation using GANs has a direct impact on radiology workflow and patient care. Although promising results have been reported, the adoption of GANs in medical imaging is still in its infancy, and there are no clinically adopted breakthrough applications yet. Therefore, more studies and more diverse attempts are needed.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.