Cross-Model Transformer Method for Medical Image Synthesis

Acquiring complementary information about tissue morphology from multimodal medical images is beneficial to clinical disease diagnosis, but multimodal acquisition cannot be widely used due to the cost of scans. Medical image synthesis has therefore become a popular research area. Recently, generative adversarial network (GAN) models have been applied to many medical image synthesis tasks and have shown superior performance, since they can capture structural details clearly. However, GANs still build their main framework on convolutional neural networks (CNNs), which exhibit a strong locality bias and spatial invariance through the use of shared weights across all positions. As a result, long-range dependencies are lost. To address this issue, we introduce a double-scale deep learning method for cross-modal medical image synthesis. More specifically, the proposed method captures local features via a CNN-based local discriminator and exploits long-range dependencies to learn global features through a transformer-based global discriminator. To evaluate the effectiveness of the double-scale GAN, we conduct a series of experiments on the standard benchmark IXI dataset, and the experimental results demonstrate the effectiveness of our method.


Introduction
Magnetic resonance imaging (MRI) is a versatile and noninvasive imaging technique widely used in clinical applications. Tailored MRI pulse sequences can capture specific characteristics of the underlying anatomy. For instance, T1-weighted brain images clearly depict gray matter and white matter tissue, while T2-weighted images depict the fluid in the cortical tissue. Hence, acquiring complementary information about tissue morphology from multimodal images can improve accuracy and confidence in clinical diagnosis [1]. Unfortunately, acquiring multimodal MR images is often challenging due to numerous factors, such as uncooperative patients, limited availability of scanning time, and the high cost of prolonged exams [2,3]. To address this issue, cross-modal medical image synthesis has been widely used, as it can synthesize the unacquired images of a multimodal protocol from the subset of available images [4][5][6][7].
Currently, deep learning-based synthesis demonstrates more promising performance than traditional registration-based methods [8,9] and intensity-transformation-based methods [10,11]. For the image synthesis task, convolutional neural network (CNN) architectures achieve strong performance by minimizing pixelwise losses between synthetic and real images. However, pixelwise losses ignore high-level features in the training step. Since generative adversarial networks (GANs) were introduced by Goodfellow et al. [12], this problem has gradually been addressed by adversarial loss functions, which set up a game-theoretic training strategy between generator and discriminator networks. In this way, GANs can capture high-frequency texture information of medical images. Therefore, GAN-based methods surpass traditional architectures in many synthesis tasks [13,14]. However, the generator and discriminator networks of a GAN deploy compact convolution filters, and CNNs impose spatial locality on the entire image through the sliding window. This causes the long-range dependencies between distant regions to be lost [15].
Moreover, CNNs exhibit not only a strong locality bias but also a bias towards spatial invariance through the use of shared weights across all positions [16]. This prevents the networks from fully understanding the local region of the input image. To guide networks towards critical image regions, Zhao et al. [17] proposed an attention mechanism that strengthens the features of important regions by learning a weight map and multiplying it with the feature map. However, conventional attention mechanisms still do not explicitly model long-range dependencies. Recently, transformer architectures, which were first applied to language tasks, have been increasingly adopted in other areas such as segmentation [18] and classification [19]. In contrast to the predominant convolutional vision architectures, transformer architectures are designed to learn complex relationships among their inputs, since they contain no built-in inductive prior on the locality of interactions, such as a sliding window. Hence, we incorporate a transformer into our model to capture more global information and gain a comprehensive understanding of the input [16].
In this paper, we propose a double-scale deep learning method for cross-modal medical image synthesis. Motivated by the fact that low-level image structure and high-level features are equally important to cross-modal medical image synthesis, we integrate into our model the ability of the transformer to efficiently model long-range interactions, which captures global features as complementary information for CNNs. To achieve this, we carefully design a double-scale discriminator GAN, which consists of a transformer-based global discriminator and a CNN-based local discriminator. The main contributions of this paper are listed as follows. (1) We introduce a double-scale discriminator GAN for medical image synthesis. (2) The global discriminator of our model is built on a vision transformer that exploits long-range dependencies between distant patches and captures global features.

Medical Image Synthesis.
Recently, GAN-based models have been successfully applied to various tasks, including data augmentation [20][21][22] and image synthesis [23][24][25]. For example, Nie et al. [5] utilized MR images to synthesize computed tomography (CT) images with a context-aware GAN model; Wolterink et al. [7] utilized a GAN to generate low-dose CT from routine-dose CT images. Nevertheless, as the traditional GAN has failed to meet increasingly demanding application requirements, pix2pix [26], which utilizes paired data to enhance the pixel-to-pixel similarity between the real and the synthesized images, has attracted the attention of researchers. Subsequently, Olut et al. [27] developed a CycleGAN-based method to synthesize MRA from T1-MRI and T2-MRI. These methods are unable to capture the features of critical image regions. Therefore, Zhao et al. [17] used self-attention in the generator of a GAN to enhance tumour features and improve tumour detection performance. Isola et al. [26] used a patch-based discriminator to refine feature extraction. However, these methods cannot overcome the strong positional prior introduced by the sliding window in the convolution operation, which destroys the modelling of long-distance dependencies, so that the information in the image cannot be fully captured.

The Transformer Architecture.
The transformer architecture is designed to handle complicated interactions between inputs regardless of their relative position to one another, by modelling those interactions solely through the attention mechanism. The transformer was originally applied to language tasks; for example, GPT (see Floridi and Chiriatti [28]) uses language modelling as its pretraining task. Recently, this architecture has also been used in computer vision. Esser et al. [16] proposed VQGAN, which represents images as a composition of perceptually rich image constituents and thereby avoids the infeasible quadratic complexity of modelling images directly in pixel space. However, the codebook of VQGAN requires a large amount of data to fit, which is impractical in the medical imaging field. Meanwhile, the increased expressivity of transformers comes with quadratically increasing computational costs, because all pairwise interactions are taken into account. For these reasons, our method is based on a vision transformer, which restricts the interactions between inputs to the level of nonoverlapping patches.
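To make the quadratic cost concrete, the following minimal sketch counts the query-key pairs that full self-attention would evaluate at pixel level and at patch level. It assumes the 256 × 256 slices and 32 × 32 patches used later in this paper; the function names are illustrative and are not part of the authors' code.

```python
def pairwise_interactions(num_tokens: int) -> int:
    """Number of query-key pairs evaluated by full self-attention."""
    return num_tokens * num_tokens


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of nonoverlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


IMAGE_SIZE = 256  # slice size used in the experiments
PATCH_SIZE = 32   # patch size of the global discriminator

pixel_tokens = IMAGE_SIZE * IMAGE_SIZE               # one token per pixel
patch_tokens = num_patches(IMAGE_SIZE, PATCH_SIZE)   # one token per patch

pixel_pairs = pairwise_interactions(pixel_tokens)
patch_pairs = pairwise_interactions(patch_tokens)

print(f"pixel-level: {pixel_tokens} tokens, {pixel_pairs} pairs")
print(f"patch-level: {patch_tokens} tokens, {patch_pairs} pairs")
print(f"reduction factor: {pixel_pairs // patch_pairs}")
```

Under these assumptions, working on patches reduces the number of attention pairs by roughly six orders of magnitude, which is what makes a transformer discriminator tractable on full slices.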

Overview of Our Method.
An overview of the double-scale GAN is illustrated in Figure 1. Our method comprises three main components: a generator network, a global discriminator network, and a local discriminator network. In the remainder of this section, we explain the detailed composition of each network component and the loss functions.

Generator Network.
The first component of our method is a deep encoder network that contains a series of convolutional layers to capture a hierarchy of localized features of the source images. To learn a meaningful and effective high-level representation, we adopt an autoencoder structure as our main framework. Deconvolution operations are used in place of upsampling layers. The details of the generator are illustrated in Figure 1. In the downsampling process, our method uses two convolutional layers with a kernel size of 3 and a stride of 2. In the upsampling process, our method uses two deconvolutional layers with a kernel size of 3 and a stride of 2. In addition, we apply instance normalization after each convolutional layer, followed by the ReLU activation function in both the encoder and the decoder. For spatial and depth feature extraction, our method also adds 9 ResNet blocks between downsampling and upsampling.
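As a quick check of how the spatial resolution evolves through such a generator, the sketch below applies the standard output-size formulas for strided convolution and transposed convolution. The padding of 1 and output padding of 1 are assumptions (the text specifies only kernel size 3 and stride 2), as is the 256 × 256 input size taken from the dataset description.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int, stride: int,
               padding: int, output_padding: int) -> int:
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def generator_sizes(input_size: int = 256) -> list:
    """Spatial size after each down/upsampling layer of the generator."""
    sizes = [input_size]
    size = input_size
    # Two downsampling convolutions: kernel 3, stride 2 (padding 1 assumed).
    for _ in range(2):
        size = conv_out(size, kernel=3, stride=2, padding=1)
        sizes.append(size)
    # The 9 ResNet blocks are assumed to preserve the spatial size.
    # Two upsampling deconvolutions: kernel 3, stride 2
    # (padding 1 and output_padding 1 assumed).
    for _ in range(2):
        size = deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1)
        sizes.append(size)
    return sizes


print(generator_sizes(256))
```

With these assumed paddings, the two downsampling layers halve the resolution twice and the two deconvolutions restore it exactly, so the synthesized image has the same size as the source image.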

Local Discriminator Network.
The local discriminator is based on a conditional PatchGAN architecture [26]. It receives as input the concatenation of the source and target contrast images [29] and then, through a sliding window, produces predictions for 30 × 30 overlapping patches of size 70 × 70, classifying each patch as real or fake. Although this patch-based discriminator is more robust than an image-based discriminator in extracting local detail features, the overlapping patches it extracts destroy long-range dependencies by introducing a strong positional prior, which hinders a comprehensive understanding of the input images.
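The 30 × 30 grid of 70 × 70 patches can be checked with a short receptive-field calculation. The layer configuration below (three 4 × 4 convolutions with stride 2 followed by two 4 × 4 convolutions with stride 1, all with padding 1) is the commonly used 70 × 70 PatchGAN layout and is an assumption here, since the text does not list the layers.

```python
# (kernel, stride) of each convolution in the assumed 70 x 70 PatchGAN.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def receptive_field(layers) -> int:
    """Side length of the input region seen by one output unit."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # growth scaled by the accumulated stride
        jump *= stride
    return field


def output_size(input_size: int, layers, padding: int = 1) -> int:
    """Side length of the prediction map for a square input."""
    size = input_size
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


patch_side = receptive_field(PATCHGAN_LAYERS)
grid_side = output_size(256, PATCHGAN_LAYERS)
print(f"each prediction sees a {patch_side} x {patch_side} patch")
print(f"prediction map is {grid_side} x {grid_side}")
```

Because neighbouring predictions are only 8 pixels apart in the input while each one covers 70 pixels, the patches overlap heavily, which is the overlap the paragraph above refers to.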

Global Discriminator Network.
In order to synthesize high-quality medical images, global and local features are equally important. Inspired by DeblurGAN-v2 [30], we use a pure transformer in place of a convolutional network to capture long-range dependencies for a comprehensive understanding of the input image. The details of the global discriminator network are depicted in Figure 2. The input image is first split into $N$ nonoverlapping patches of size 32 × 32 (the kernel size is equal to the stride), $P_1, P_2, \ldots, P_N$, where $P_i$ denotes the $i$-th patch of the input image; we set $N = 8^2$ to divide the input into 64 patches. Then, all patches are flattened and mapped to $D$ dimensions by a trainable linear projection. Similar to the class token in BERT [31], we also prepend a learnable embedding to the sequence of embedded patches. Position embeddings are added to the patch embeddings to retain positional information. Our method uses standard learnable 1D position embeddings, because studies have shown that more advanced 2D-aware position embeddings bring no significant gains [32]. The input sequence can therefore be formulated as follows:

$$Z_0 = [x_{\mathrm{class}};\ P_1 E;\ P_2 E;\ \ldots;\ P_N E] + E_{\mathrm{pos}},$$

where $Z_0$ denotes the input of the transformer encoder; $x_{\mathrm{class}}$ denotes the prepended learnable embedding; $E$ denotes the embedding projection, which maps each image patch to a vector; and $E_{\mathrm{pos}}$ denotes the learnable positional embedding that carries information about patch location.

The transformer encoder consists of two parts: multihead self-attention (MSA) and multilayer perceptrons (MLP). Thanks to its multiple heads, MSA can learn features at different levels. In addition, layer norm (LN) is applied before every block, and residual connections are applied after every block. At the end of these blocks, the output is passed to the classification head to produce the real/fake prediction. The output of the $l$-th layer in the transformer encoder can be formulated as

$$Z'_l = \mathrm{MSA}(\mathrm{LN}(Z_{l-1})) + Z_{l-1},$$

$$Z_l = \mathrm{MLP}(\mathrm{LN}(Z'_l)) + Z'_l,$$

where $Z_{l-1}$ represents the feature extracted from the previous layer and $Z'_l$ is the intermediate output of the MSA block.
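The patch-splitting step can be illustrated with a small dependency-free sketch. It splits a 256 × 256 image into nonoverlapping 32 × 32 patches in row-major order and flattens each patch; the function name and the synthetic image are illustrative assumptions, and the trainable projection to $D$ dimensions is omitted.

```python
def split_into_patches(image, patch_size: int):
    """Split a 2D image (list of rows) into flattened nonoverlapping patches.

    Patches are returned in row-major order; each patch is a flat list of
    patch_size * patch_size pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    # Stepping by patch_size makes the stride equal to the kernel size,
    # so the patches do not overlap.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Synthetic 256 x 256 image whose pixel value encodes its position.
image = [[float(row * 256 + col) for col in range(256)] for row in range(256)]

patches = split_into_patches(image, patch_size=32)
sequence_length = len(patches) + 1  # plus one prepended learnable embedding

print(len(patches), len(patches[0]), sequence_length)
```

Each flattened patch has 32 × 32 = 1024 values before projection, and the transformer encoder then operates on a sequence of 65 tokens (64 patches plus the prepended embedding).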

Loss Function.
The first component of the loss function in our method is a pixelwise L1 loss $\mathcal{L}_{L_1}$, inspired by the pix2pix architecture [26], which measures the difference between the target image $y$ and the image synthesized from the source image $x$. Unlike loss functions based on pixelwise differences, the perceptual loss $\mathcal{L}_{per}$ relies on differences in higher-level feature representations, which are often extracted from networks pretrained for more generic tasks [33]. A commonly used network is VGGNet trained on the ImageNet [34] dataset for object classification. Here, following [33], we extract feature maps right before the second max-pooling operation of VGG16 pretrained on ImageNet, where $V(\cdot)$ denotes the pretrained VGG16. The local discriminator is a conditional discriminator; its adversarial loss $\mathcal{L}_{Local}$ is computed on the source image paired with either the target image or $z$, where $z$ denotes the synthesized image from the generator. The global discriminator uses the hinge loss $\mathcal{L}_{Global}$ to optimize the generator. By aggregating all the above losses, we formulate our aggregate loss function as

$$\mathcal{L} = \lambda_{L_1}\mathcal{L}_{L_1} + \lambda_{per}\mathcal{L}_{per} + \lambda_{Local}\mathcal{L}_{Local} + \lambda_{Global}\mathcal{L}_{Global},$$

where $\lambda_{L_1}$ denotes the weight of the pixelwise loss; $\lambda_{per}$ denotes the weight of the perceptual loss; $\lambda_{Local}$ denotes the weight of the adversarial loss of the local discriminator; and $\lambda_{Global}$ denotes the weight of the adversarial loss of the global discriminator.
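For reference, the individual loss terms can be written out under standard formulations. The block below is a sketch rather than the authors' exact definitions: the symbols $G$ (generator), $D_L$ (local discriminator), and $D_G$ (global discriminator) are introduced here for illustration, and the choice of the L1 norm in the perceptual term and of the log-likelihood form in the local adversarial term are assumptions.

```latex
% Hedged reconstruction under standard pix2pix-style and hinge formulations.
% G, D_L and D_G are illustrative names; z = G(x) is the synthesized image.
\begin{align}
\mathcal{L}_{L_1} &= \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_1\big], \\
\mathcal{L}_{per} &= \mathbb{E}_{x,y}\big[\lVert V(y) - V(G(x)) \rVert_1\big], \\
\mathcal{L}_{Local} &= \mathbb{E}_{x,y}\big[\log D_L(x, y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D_L(x, z)\big)\big], \\
\mathcal{L}_{Global}^{D} &= \mathbb{E}_{y}\big[\max\big(0,\, 1 - D_G(y)\big)\big]
  + \mathbb{E}_{x}\big[\max\big(0,\, 1 + D_G(z)\big)\big], \\
\mathcal{L}_{Global}^{G} &= -\,\mathbb{E}_{x}\big[D_G(z)\big].
\end{align}
```

In this form, the hinge loss is split into a discriminator objective $\mathcal{L}_{Global}^{D}$ and a generator objective $\mathcal{L}_{Global}^{G}$; the latter is the term that would enter the aggregate generator loss with weight $\lambda_{Global}$.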

Experiments
In this section, we first describe the dataset used in our experiments and then introduce the implementation details. Finally, we present experimental results that compare our method with several state-of-the-art methods.

Dataset.
The data used in the evaluation come from the IXI dataset. Our experimental dataset comprises 40 subjects, each with corresponding T1-MRI and T2-MRI, of which 30 subjects were used for training and 10 for testing. Acquisition parameters were as follows. T1-weighted images: TE = 4.603 ms, TR = 9.813 ms, and spatial resolution = 0.94 × 0.94 × 1.2 mm³. T2-weighted images: TE = 100 ms, TR = 8178.34 ms, and spatial resolution = 0.94 × 0.94 × 1.2 mm³. Since the multicontrast images were unregistered, we use FSL [35] to register T1-MRI and T2-MRI. Finally, we zero-pad all axial cross-sections used in the experiments to a consistent size of 256 × 256.
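The zero-padding step can be sketched as follows. Centring the original slice inside the padded canvas is an assumption (the text only states that slices are zero-padded to 256 × 256), and the function name is illustrative.

```python
def zero_pad_to(slice_2d, target: int = 256):
    """Zero-pad a 2D slice (list of rows) to target x target.

    The slice is placed in the centre of the canvas (an assumption);
    when the margin is odd, the extra row or column goes to the
    bottom or right.
    """
    height, width = len(slice_2d), len(slice_2d[0])
    if height > target or width > target:
        raise ValueError("slice is larger than the target size")
    top = (target - height) // 2
    left = (target - width) // 2
    padded = [[0.0] * target for _ in range(target)]
    for offset, row in enumerate(slice_2d):
        # Overwrite the central columns of the canvas row with the slice row.
        padded[top + offset][left:left + width] = row
    return padded


# Small illustrative example: a 3 x 2 slice of ones padded to 6 x 6.
small = [[1.0, 1.0] for _ in range(3)]
padded_small = zero_pad_to(small, target=6)
for row in padded_small:
    print(row)
```

Padding rather than resampling keeps the original voxel spacing, so the registered T1 and T2 slices remain pixel-aligned after resizing.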

Implementation Details.
Our method is implemented in PyTorch. All methods were trained and tested on a single NVIDIA Tesla V100 GPU with 32 GB of memory. For training our method, we set the number of epochs to 100, the learning rate to 0.0002, and the batch size to 1, which brings the training time to 5 hours. Model training was performed via the Adam optimizer with β1 = 0.5 and β2 = 0.999. In the global discriminator, we use multihead attention with 4 heads and set D to 64. In each multihead attention block, we use GELU activation and set dropout to 0.1. Because medical image datasets are small, the global discriminator uses a model pretrained for object classification on the ImageNet database. All weights were initialized using a normal distribution with mean 0 and standard deviation 0.02. We set the hyperparameters in the aggregate loss function to λ_L1 = 1, λ_per = 1, λ_Local = 0.8, and λ_Global = 0.3. For the fairness of the experiment, we designed 4-fold cross-validation by randomly sampling nonoverlapping training, validation, and testing sets in each fold.
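A subject-level 4-fold split consistent with the 30/10 train/test division could look like the sketch below. The seed, the function name, and the omission of a separate validation subset (whose size is not stated) are assumptions made for illustration.

```python
import random


def k_fold_splits(subject_ids, num_folds: int = 4, seed: int = 0):
    """Return a list of (train_ids, test_ids) pairs with disjoint test folds."""
    ids = list(subject_ids)
    if len(ids) % num_folds:
        raise ValueError("number of subjects must be divisible by num_folds")
    # Shuffle once with a fixed seed so the split is random but reproducible.
    random.Random(seed).shuffle(ids)
    fold_size = len(ids) // num_folds
    splits = []
    for fold in range(num_folds):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test_ids = ids[start:stop]
        train_ids = ids[:start] + ids[stop:]
        splits.append((train_ids, test_ids))
    return splits


# 40 subjects give 30 training and 10 testing subjects in every fold.
splits = k_fold_splits(range(40))
for fold, (train_ids, test_ids) in enumerate(splits):
    print(f"fold {fold}: {len(train_ids)} train, {len(test_ids)} test")
```

Splitting by subject rather than by slice keeps all slices of one subject on the same side of the split, which avoids leakage between training and testing data.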

Comparison Methods.
To validate the effectiveness of the proposed synthesis method, we compare it with three state-of-the-art cross-modality synthesis methods:
(1) pix2pix [26]: this method is based on a convolutional GAN model with a UNet backbone, which synthesizes the whole image by focusing on pixelwise similarity.
(2) CycleGAN [27]: this method consists of two generators and two discriminators and uses a cycle consistency loss so that it can be trained with unpaired data. In our comparison, we use paired data to train both this method and our method.
(3) PGAN [29]: this method is based on a conditional GAN; its generator consists of an encoder, a decoder, and 9 ResNet blocks. This method has shown superior performance in many cross-modal image synthesis tasks.

Results and Analysis.
We employ two measurements to evaluate the synthesis performance of our method and the comparison methods: the structural similarity index measurement (SSIM) and the peak signal-to-noise ratio (PSNR). The data in all tables are reported as mean and standard deviation. Further details can be found in Tables 1 and 2.
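For readers unfamiliar with the two metrics, the sketch below computes PSNR and a simplified SSIM on small images stored as nested lists. The SSIM here uses global image statistics in a single window, which is a simplification: the standard SSIM averages the same expression over local windows, so its values will generally differ. Intensities are assumed to be scaled to [0, 1], and the function names are illustrative.

```python
import math


def _flatten(image):
    """Flatten a 2D image (list of rows) into a flat list of values."""
    return [value for row in image for value in row]


def psnr(reference, synthesized, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB."""
    ref, syn = _flatten(reference), _flatten(synthesized)
    if len(ref) != len(syn):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, syn)) / len(ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def global_ssim(reference, synthesized, max_value: float = 1.0) -> float:
    """Single-window SSIM from global means, variances, and covariance."""
    ref, syn = _flatten(reference), _flatten(synthesized)
    if len(ref) != len(syn):
        raise ValueError("images must have the same number of pixels")
    n = len(ref)
    mu_x, mu_y = sum(ref) / n, sum(syn) / n
    var_x = sum((a - mu_x) ** 2 for a in ref) / n
    var_y = sum((b - mu_y) ** 2 for b in syn) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(ref, syn)) / n
    # Stabilizing constants with the conventional K1 = 0.01 and K2 = 0.03.
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


target = [[0.0, 0.5], [0.5, 1.0]]
shifted = [[0.1, 0.6], [0.6, 1.1]]  # uniform intensity offset of 0.1
print(psnr(target, shifted), global_ssim(target, shifted))
```

Higher values indicate better agreement for both metrics: PSNR is unbounded above for identical images, while SSIM reaches 1 when the two images are identical.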
To demonstrate the effectiveness of our double-scale discriminator method with regard to subjective quality, a representative example is shown in Figure 3.

Conclusion
In this paper, we have proposed a double-scale discriminator GAN for cross-modal medical image synthesis. By combining a CNN and a transformer to design the double-scale discriminator, our method explicitly exploits both the localization power of CNNs and the sensitivity of vision transformers to global context. Experimental results have demonstrated the effectiveness of the proposed method. In the future, we will focus on medical image generation methods that integrate multiview and multimodal information through transformers, in order to address the problems that 2D medical image generation cannot exploit 3D information and that 3D medical image generation requires high computing power.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.