CMMCSegNet: Cross-Modality Multicascade Indirect LGE Segmentation on Multimodal Cardiac MR

Late-Gadolinium Enhancement (LGE) cardiac magnetic resonance (CMR) visualizes myocardial infarction, while the balanced Steady-State Free Precession (bSSFP) cine sequence captures cardiac motion and presents clear boundaries. Multimodal CMR segmentation therefore plays an important role in the assessment of myocardial viability and in clinical diagnosis, yet automatic and accurate CMR segmentation remains challenging due to the very small amount of labeled LGE data and the relatively low contrast of LGE. The main purpose of our work is to learn the real/fake bSSFP modality with ground truths in order to indirectly segment the LGE modality of cardiac MR, using a proposed cross-modality multicascade framework composed of a cross-modality translation network and an automatic segmentation network. In the segmentation stage, a novel multicascade pix2pix network is designed to segment the fake bSSFP sequence obtained from the cross-modality translation network. Moreover, we propose a perceptual loss that measures features, extracted by a pretrained VGG network, between the ground truth and the prediction in the segmentation stage. We evaluate the performance of the proposed method on the multimodal CMR dataset and verify its superiority over other state-of-the-art approaches under different network structures and different types of adversarial losses in terms of test dice accuracy. The proposed network is therefore promising for indirect cardiac LGE segmentation in clinical applications.


Introduction
Multimodal CMR imaging is an essential clinical tool for the screening and diagnosis of cardiac diseases. Different imaging modalities contain different sorts of useful information for the cardiac disease screening task, and the combination of different imaging modalities can overcome the limitations of any individual modality. For LGE MR imaging, the contrast agent is injected 10-20 minutes before acquisition; LGE images, with locally distinctive brightness compared with healthy tissue, can enhance myocardial necrosis or scarring, which is standard practice to evaluate cardiac structure, cardiac function, myocardial perfusion, and myocardial activity. Different from LGE images, bSSFP highlights high-signal fluid areas while presenting a uniform signal for other tissues; e.g., the large blood vessels and coronary arteries can be observed clearly in bSSFP because of the more obvious contrast between the heart muscle and the blood pool. T2-weighted MRI is effective in reducing false-positive results. Considering different MRI modalities is thus important for acquiring accurate cardiac information [1].
Segmentation of multimodal CMR images is a critical step for subsequent diagnosis and surgical planning. However, it takes about 20 minutes per case for an experienced doctor to manually segment the LGE images; manually identifying and delineating the corresponding cardiac structures is extremely time-consuming, and the result depends on the doctor's expertise and varies from person to person. Therefore, the development of automatic and reliable LGE image segmentation algorithms is of high clinical value for patients suffering from myocardial infarction.
Tao and Der Geest proposed a method for segmenting the LGE images using myocardial morphological information [2]. Popescu et al. used a mask SLIC clustering method and Otsu thresholding to segment LGE images [3]. In recent years, deep learning has achieved remarkable success in computer vision, and more and more image processing methods are based on the CNN model [4,5]. Chen et al. [6] proposed to use domain adaptation to fuse the features of unlabeled LGE images and then use the fused features to train the segmentation network. In addition, many approaches based on attention mechanisms [7,8] and multiview methods [9] have been developed recently for segmenting medical images. Yang et al. combined multiview and attention mechanisms to segment cardiac LGE images [10]. An automatic cardiac LGE segmentation algorithm based on a CNN is far more efficient and robust, and commonly more accurate, than traditional methods [11,12], which motivates automatic segmentation of the LGE images.
However, automatic LGE CMR segmentation is still arduous. Besides the great variations in the location and geometry of the heart region across patients, Zhuang [1] pointed out three major challenges related to the intensity distributions of the LGE CMR modality: (i) the intensity range of the myocardium in LGE imaging leads to indistinguishable boundaries with its adjacent organs; (ii) the pathologies result in heterogeneous intensity of the myocardium, making the assumption of a simple distribution such as a single-component Gaussian density invalid; and (iii) the preprocessing enhancement for the LGE CMR modality can be complex. It is thus more difficult to segment the LGE modality directly, especially with only a small amount of labeled LGE data.
GAN was first proposed by Goodfellow et al. [13] for image synthesis; it uses a generator network and a discriminator network, pitting one against the other (hence "adversarial") in order to generate synthetic instances that can pass for real data. Here, the generator produces a fake image from random noise, and the discriminator judges whether its input is real (coming from real data) or fake (coming from the generator's output). The aim of GANs is to learn the underlying distribution of the training data in order to generate data that the discriminator cannot distinguish. At the same time, the game between the generator and the discriminator reaches the Nash equilibrium, i.e., the generated data distribution p_g equals the real data distribution p_d. With the development of GANs [14], such models are widely used in image processing, including image and video generation [15], image segmentation [16], image synthesis [17], and image super resolution [18].
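As a concrete illustration of the min-max game described above, the following PyTorch sketch runs one discriminator step and one generator step. The tiny fully connected networks and the 2-D toy data are hypothetical stand-ins for illustration, not the networks used in this paper.

```python
import torch
import torch.nn as nn

# Generator: noise -> fake 2-D sample; Discriminator: sample -> real/fake logit.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(8, 2) + 3.0                 # stand-in for samples from p_d

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
fake = G(torch.randn(8, 16)).detach()
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: push D(G(z)) toward 1 so that fakes pass for real.
fake = G(torch.randn(8, 16))
loss_g = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Alternating these two steps is what drives p_g toward p_d at equilibrium.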
In this work, we propose a novel cross-modality multicascade framework for indirect LGE segmentation (CMMCSegNet), which is trained on multimodal cardiac MR data with a very small amount of LGE labels (for the LGE modality of the Multisequence Cardiac MR Segmentation Challenge 2019 datasets [1], only five patients are labeled). The main contributions of this work are as follows: (1) we develop a novel indirect LGE segmentation framework based on multimodal images, one of whose primary components translates the LGE modality, which needs to be segmented but has only a very small amount of labeled data, into the bSSFP modality, which is easy to segment with our proposed method; (2) we propose a multicascade pix2pix network for image segmentation, in which the generator is formed by cascading multiple subnetworks and segmentation is regarded as a translation process from the original image to the segmentation target; (3) we employ a perceptual loss that uses a pretrained VGG19 network to compare feature differences between the labels and the generations during the training of the proposed multicascade pix2pix network. The rest of this work is organized as follows. We give some preliminaries in Section 2, describe our CMMCSegNet in detail in Section 3, present experimental results in Section 4, and conclude in Section 5.

Related Works
Tissue or organ segmentation plays an important role in the field of medical image processing. Medical image segmentation has been explored extensively; however, challenges in generality, robustness, and efficiency still remain. For brevity, we only focus below on the most closely related works.

Cascade Structure.
A cascading network is to connect multiple subnetworks together to form a multilevel network. The cascading method has been effectively used in many vision applications like classification [19], image translation [20], detection [21], super resolution [22], and semantic segmentation [23]. For example, Cui et al. proposed a deep cascade network for image super resolution [22]. Cai and Vasconcelos proposed the use of cascade structure for object detection [21]. Zhao et al. proposed the recursive cascaded networks for medical image registration [24]. Armanious et al. proposed the use of cascaded generator network for image translation [20]. Havaei et al. proposed a new cascade architecture for brain tumor segmentation [23]. Li et al. [25] proposed to classify easy regions in a shallow network and train deeper networks to deal with hard regions. Lin et al. [26] proposed a top-down architecture with lateral connections to propagate deep semantic features to shallow layers.
Different from previous cascade networks, the multicascade pix2pix network proposed in this paper is a multiple U-net cascade structure for image segmentation, which allows an innovative way to supervise each generator individually for pix2pix GANs. To the best of our knowledge, this is an early and original attempt to adopt a cascade architecture in pix2pix GAN-based medical image segmentation. We introduce more details in Section 3.

Multimodal Cardiac MR Image Segmentation.
Recent literature suggests two main approaches to multimodal CMR image segmentation. One popular approach is the GAN strategy based on cross-modality image translation, which translates images of modality X into images of modality Y and plays an increasingly important role in computer vision. Isola et al. [18] proposed the use of a conditional GAN to implement paired image-to-image translation. Ben-Cohen et al. used CT images to synthesize PET images based on the pix2pix network [27]. Cycle-GAN [28] was proposed for unpaired image-to-image translation. BiCycle-GAN [29] solved the translation from a single image to multicategory images. In addition, some GAN networks including DualGAN [30] and UNIT [31] were also proposed for unpaired image-to-image translation. In the CMR datasets [1], MR images of the different modalities are not strictly matched, so classical unpaired image-to-image translation [32] can be applied to cross-modality CMR segmentation. Chen et al. [33] proposed to use UNIT to translate bSSFP images into LGE images and then train the segmentation network, where the LGE images are provided by the translation architecture. Campello et al. also proposed to use Cycle-GAN to translate bSSFP images into LGE images but trained the U-net network [34] for LGE segmentation. Tao et al. [35] proposed to integrate the translation network (Cycle-GAN) with the segmentation network to achieve LGE image segmentation.
Another promising approach is the strategy based on image registration. Roth et al. proposed to register LGE images with ground truths to LGE images without ground truths; after multiatlas label fusion by majority voting, they obtained noisy LGE labels and then trained an LGE segmentation network [36]. Liu et al. proposed a registration method with histogram matching to achieve augmentation of the LGE images [37].

Proposed Cross-Modality SegNet
The goal of this work is to achieve cardiac segmentation for the LGE modality, where only a small number of samples are labeled. Our CMMCSegNet (https://github.com/wangyu719/CmmcSegNet) framework is designed to facilitate indirect segmentation of multimodal CMR images. The overall framework is shown in Figure 1, including a training architecture and a testing architecture.
Our datasets are from the Multisequence Cardiac MR Segmentation Challenge 2019 datasets (MS-CMRSeg 2019) [1]. In this work, we use the LGE modality with 45 patients and the bSSFP modality with 35 annotated patients (see Figure 2 for more details). Only five ground truth annotations are available for the LGE modality of the MS-CMRSeg 2019 datasets; hence, it is difficult to directly segment the LGE modality using deep CNN-based methods. Figure 2 shows the differences between LGE and bSSFP images from the same patient. Furthermore, the bSSFP modality has a more obvious contrast than the LGE modality, so we believe that bSSFP is easier to segment. Besides, the bSSFP modality has a large number of images (35 patients) with ground truth annotations, so training the bSSFP modality with a deep learning-based method is not difficult.
3.1. Cross-Modality Image Translation. One of the primary components of the training architecture is the cross-modality translation network, which can be trained end-to-end with unpaired modalities. Before segmenting the bSSFP images to achieve indirect segmentation of the LGE images, we first present a Cycle-GAN architecture for translating LGE into bSSFP images.
Inspired by the knowledge distillation between unpaired image-to-image translation networks [32], we employ Cycle-GAN to achieve cross-modality image translation on the CMR datasets. Let X and Y be two image domains representing the LGE and bSSFP modalities, respectively. G_A^t : X → Y and G_B^t : Y → X are the two generators of the cross-modality translation network, and they are inverse mappings of each other; that is, G_B^t(G_A^t(x)) ≈ x and G_A^t(G_B^t(y)) ≈ y for x ∈ X and y ∈ Y. D_A^t and D_B^t are the discriminators of the cross-modality translation network, which distinguish whether the discriminator input is real or fake.
The Cycle-GAN architecture implementing cross-modality image translation for the unpaired LGE/bSSFP datasets consists of two cycles: an LGE cycle and a bSSFP cycle. In the LGE cycle, the first generator G_A^t is trained to transform the LGE modality into a fake bSSFP modality, the second generator G_B^t is trained to transform the generated fake bSSFP modality back to the original LGE modality, and the discriminator D_A^t discriminates between real and synthesized bSSFP modalities. In fact, enlightened by activation-based attention transfer strategies, the discriminator D_A^t is designed to extract the supervision information that modulates the learning of the generator G_A^t. In the bSSFP cycle, real bSSFP is transformed to fake LGE by the generator G_B^t, the generator G_A^t transforms the generated LGE back to the original bSSFP, and the discriminator D_B^t discriminates between the real and the fake LGE modality. The network framework is shown in Figure 1. The overall training loss of our translation network is defined as

$$L_{trans}(G_A^t, G_B^t, D_A^t, D_B^t) = L_{gan}(G_A^t, D_A^t, X, Y) + L_{gan}(G_B^t, D_B^t, X, Y) + \lambda_1 L_{cyc}(G_A^t, G_B^t),$$

where $L_{gan}(G_A^t, D_A^t, X, Y)$ and $L_{gan}(G_B^t, D_B^t, X, Y)$ are two adversarial losses defined by

$$L_{gan}(G_A^t, D_A^t, X, Y) = \mathbb{E}_{y \sim Y}\left[\log D_A^t(y)\right] + \mathbb{E}_{x \sim X}\left[\log\left(1 - D_A^t(G_A^t(x))\right)\right],$$
$$L_{gan}(G_B^t, D_B^t, X, Y) = \mathbb{E}_{x \sim X}\left[\log D_B^t(x)\right] + \mathbb{E}_{y \sim Y}\left[\log\left(1 - D_B^t(G_B^t(y))\right)\right],$$

and the generation similarity (cycle-consistency loss) $L_{cyc}(G_A^t, G_B^t)$ is defined by

$$L_{cyc}(G_A^t, G_B^t) = \mathbb{E}_{x \sim X}\left[\left\| G_B^t(G_A^t(x)) - x \right\|_1\right] + \mathbb{E}_{y \sim Y}\left[\left\| G_A^t(G_B^t(y)) - y \right\|_1\right],$$

where $\lambda_1$ is the weight parameter balancing the contributions of the generation loss $L_{cyc}(G_A^t, G_B^t)$ and the two adversarial losses.
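The translation losses above can be sketched as follows. The single-convolution "generators" and "discriminators" are hypothetical stand-ins for the real Cycle-GAN networks, and the least-squares adversarial form is one common choice rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

conv = lambda: nn.Conv2d(1, 1, 3, padding=1)      # stand-ins for G/D networks
G_A, G_B, D_A, D_B = conv(), conv(), conv(), conv()
mse, l1 = nn.MSELoss(), nn.L1Loss()
lam1 = 10.0                                        # cycle-consistency weight λ1

x = torch.randn(1, 1, 64, 64)                      # real LGE image
y = torch.randn(1, 1, 64, 64)                      # real bSSFP image

fake_y, fake_x = G_A(x), G_B(y)                    # LGE->bSSFP and bSSFP->LGE
# Adversarial terms: each generator tries to make its discriminator say "real".
loss_gan_A = mse(D_A(fake_y), torch.ones_like(fake_y))
loss_gan_B = mse(D_B(fake_x), torch.ones_like(fake_x))
# Cycle consistency: G_B(G_A(x)) ≈ x and G_A(G_B(y)) ≈ y.
loss_cyc = l1(G_B(fake_y), x) + l1(G_A(fake_x), y)
loss_trans = loss_gan_A + loss_gan_B + lam1 * loss_cyc
```

In practice the two discriminators are trained with the opposite targets on the same pairs, alternating with the generator updates.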

Multicascade pix2pix Segmentation.
Recently, GAN-based frameworks have been proposed to segment retinal vessels [38]. We treat image segmentation as a paired image-to-image translation (from an original image to a predicted segmentation result); hence, we propose a new image segmentation method combining a multicascade technique with the pix2pix structure, which we call a multicascade pix2pix network.

Multicascade Network.
Our multicascade pix2pix segmentation network, shown in Figure 1(c), is based on the GAN architecture and consists of multiple generators G_k^s (k = 1, ⋯, n) and a shared discriminator D^s. Recall that LGE enhances myocardial necrosis or scarring and can effectively assess cardiac structure, cardiac function, myocardial perfusion, and myocardial activity, while bSSFP clearly highlights the large blood vessels and coronary arteries thanks to the stronger contrast between the heart muscle and the blood pool. To better adapt to cardiac structure segmentation, we build the cross-modality translation network on Cycle-GAN (Figure 1(b)).

The generator G_1^s : Y → S translates I_Y to I_S^1, where the original input I_Y ∈ Y is a 1 × 256 × 256 real or fake bSSFP image and the first generation I_S^1 ∈ S is a prediction of the corresponding label. The other generators G_k^s : S → S (k = 2, ⋯, n) further improve the previous predicted probability I_S^{k−1} to obtain a more optimal prediction I_S^k, where I_Y and I_S^k have the same size. In this work, G_k^s is formed by the U-net [5] or ResNet [39] network for more accurate segmentation; in the experimental evaluation, we compare the effects of different generator networks on the segmentation results. The purpose of this network is to obtain the final segmented result I_S^f of the original input I_Y, which is also the result of the LGE segmentation. Therefore, the generated prediction obtained from the multicascade pix2pix segmentation network can be denoted as

$$I_S^f = I_S^n = G_n^s\left(G_{n-1}^s\left(\cdots G_2^s\left(G_1^s(I_Y)\right)\cdots\right)\right).$$

The discriminator D^s is a binary classifier based on pixels or image patches, which provides a network-learning-based stopping criterion during generation. For the discriminator D^s in our multicascade pix2pix segmentation network, we employ a convolutional Patch-GAN [18] to distinguish real from fake between the prediction I_S^k and the ground truth I_L: I_S^k is divided into overlapping ℓ × ℓ patches, each patch is discriminated against the corresponding patch of the ground truth I_L, and finally a 2D probability map is obtained as the discriminator output.
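The cascade forward pass can be sketched as below; the small convolutional networks are hypothetical stand-ins for the U-net/ResNet generators, chosen only so that each prediction keeps the 1 × 256 × 256 shape and can be fed to the next cascade.

```python
import torch
import torch.nn as nn

def make_generator():
    # Hypothetical stand-in for a U-net/ResNet generator; it preserves the
    # 1 x 256 x 256 shape so the prediction can enter the next cascade block.
    return nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 1, 3, padding=1))

n = 3
gens = nn.ModuleList(make_generator() for _ in range(n))

I_Y = torch.randn(1, 1, 256, 256)   # real or fake bSSFP input
preds, out = [], I_Y
for G_k in gens:                    # I_S^k = G_k(I_S^{k-1}), with I_S^0 = I_Y
    out = G_k(out)
    preds.append(out)
I_S_final = preds[-1]               # I_S^f, the final segmentation prediction
```

Keeping every intermediate prediction in `preds` is what allows each generator to be supervised individually by the shared discriminator.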
To train an optimal segmentation network, the measures between I_S^k and the target label I_L are estimated and minimized to update the discriminator D^s, which is enforced to discriminate the generation from the ground truth. The segmentation network we propose is a conditional version of the pix2pix GAN with a multicascade architecture, so the adversarial input to D^s is mainly composed of three components: the first is the source image I_Y used as the condition, and the others are the generation I_S^k and the ground truth I_L. At the same time, each generator G_k^s is optimized to generate domain-invariant representations I_S^k that confuse the discriminator D^s. The loss function is essential to the training and evaluation of the image segmentation task. CNNs trained for image segmentation are usually optimized by minimizing a weighted cross-entropy. In this work, we employ a specially designed loss function L_s that simultaneously measures the generation similarity and the adversarial error, containing three types of loss functions: the adversarial loss L_gan, the L_1 loss, and the perceptual loss L_vgg.

Loss Functions in Segmentation.
The original adversarial loss (Vanilla GAN loss) is given by the Kullback-Leibler (KL) divergence score as

$$L_{gan} = \sum_{k=1}^{n} \omega_k^{L_g} \left( \mathbb{E}\left[\log D^s(I_Y, I_L)\right] + \mathbb{E}\left[\log\left(1 - D^s(I_Y, I_S^k)\right)\right] \right),$$

where $\omega_k^{L_g}$ (k = 1, ⋯, n) are the given weights enforcing the trade-off between the n cascade cross-entropy losses and I_Y is the condition input of each convolutional Patch-GAN in our multicascade pix2pix segmentation network. Recently, the most commonly used adversarial losses are WGAN-GP [40] and LSGAN [41]. In the next section, we compare the performances of the three different adversarial losses in our experiments.
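The weighted conditional adversarial term can be sketched as follows; BCE-with-logits stands in for the log-likelihood terms, the two-channel input concatenates the condition I_Y with the image being judged, and the tiny convolutional discriminator and the weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
# Patch-style discriminator stand-in: takes (condition, image) as 2 channels
# and emits a 2D map of per-patch logits.
D_s = nn.Sequential(nn.Conv2d(2, 4, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(4, 1, 3, padding=1))

I_Y = torch.rand(1, 1, 16, 16)                        # condition (bSSFP input)
I_L = torch.rand(1, 1, 16, 16)                        # ground-truth label
preds = [torch.rand(1, 1, 16, 16) for _ in range(3)]  # cascade outputs I_S^k
w = [1/3, 1/2, 1/6]                                   # cascade weights ω_k

real_logits = D_s(torch.cat([I_Y, I_L], dim=1))
loss_gan = sum(
    w_k * (bce(real_logits, torch.ones_like(real_logits)) +
           bce(D_s(torch.cat([I_Y, p], dim=1)), torch.zeros_like(real_logits)))
    for w_k, p in zip(w, preds))
```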
The L_1 loss is a weighted sum of the absolute distances between the calculated output I_S^k of the k-th cascade block and the ground truth I_L, which pushes the segmentation results closer to the real results [18]; it is defined by

$$L_1 = \sum_{k=1}^{n} \omega_k^{L_1}\, \mathbb{E}\left[ \left\| I_L - I_S^k \right\|_1 \right],$$

where $\omega_k^{L_1}$ (k = 1, ⋯, n) are weight constants. Without loss of generality, we take $\omega_k^{L_g} = \omega_k^{L_1}$ for all k = 1, ⋯, n in our experiments.
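In code, the weighted L_1 term reduces to a few lines; the tensor sizes and weights below are illustrative only.

```python
import torch

preds = [torch.rand(1, 1, 8, 8) for _ in range(3)]  # cascade outputs I_S^k
I_L = torch.rand(1, 1, 8, 8)                        # ground-truth label
w = [1/3, 1/2, 1/6]                                 # weights ω_k (illustrative)

# Weighted sum of mean absolute errors over the n cascade predictions.
loss_l1 = sum(w_k * (p - I_L).abs().mean() for w_k, p in zip(w, preds))
```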
Besides, we also employ the perceptual loss in our multicascade pix2pix segmentation network, which is computed with a pretrained VGG19 network and was first proposed for image super resolution [42]. The perceptual loss compares feature maps of the output data and the ground truth [43]. It can be computed by

$$L_{vgg} = \sum_{k=1}^{n} \sum_{i=1}^{M} \sum_{j=1}^{N_i} \frac{1}{w_{ij} h_{ij}} \sum_{p=1}^{w_{ij}} \sum_{q=1}^{h_{ij}} S^{ij,k}_{pq}, \qquad S^{ij,k}_{pq} = D\!\left( \varphi_{i,j}(I_L)_{pq},\ \varphi_{i,j}(I_S^k)_{pq} \right),$$

where $\varphi_{i,j}$ represents the feature map of the j-th feature channel of the i-th feature layer (after activation) [42], $N_i$ is the number of feature channels in the i-th feature layer, M is the number of convolution layers, and $w_{ij}$ and $h_{ij}$ give the size of the feature map in the VGG19 network.

Table 2: Indirect segmentation performance comparisons between CMMCSegNet models based on U-net and ResNet generator blocks using different training losses, where only one cascade generation block is used, "P-L_cosine" means adding the cosine similarity perceptual loss to the training loss, and "P-L_manh" means adding the L_manh perceptual loss to the training loss.

Here, D(X, Y) is the error measure, computed either with the Manhattan distance (L_manh) or with the cosine similarity (L_cosine), where X and Y are feature maps.
The total proposed segmentation model is trained by jointly minimizing the total loss L_s over the three parts:

$$L_s = \lambda_{gan} L_{gan} + \lambda_l L_1 + \lambda_{vgg} L_{vgg},$$

where $\lambda_{gan}$, $\lambda_l$, and $\lambda_{vgg}$ are given weight parameters.
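Numerically, with the weights reported in Section 4 (λ_gan = 1, λ_l = 100, λ_vgg = 1) and purely illustrative loss values, the combination is:

```python
# Illustrative per-step loss values (not measured results from the paper).
loss_gan, loss_l1, loss_vgg = 0.7, 0.02, 0.1
lam_gan, lam_l, lam_vgg = 1.0, 100.0, 1.0

loss_s = lam_gan * loss_gan + lam_l * loss_l1 + lam_vgg * loss_vgg  # ≈ 2.8
```

The large λ_l keeps the pixelwise L_1 term dominant, with the adversarial and perceptual terms acting as regularizers.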

Results and Discussion
The proposed CMMCSegNet framework is implemented in PyTorch. The experiments are conducted on a single GeForce RTX 2080Ti GPU with 11 GB of memory. To justify the model design, we performed several ablation experiments, described as follows.

Dataset and Experimental Setting.
To demonstrate our CMMCSegNet framework, we use the MS-CMRSeg 2019 datasets [1], which contain three modalities: LGE with 45 patients (only 5 of them labeled), bSSFP with 35 annotated patients, and T2-weighted. The goal of the CMR segmentation challenge is LGE image segmentation. Since there are few T2-weighted slices per patient in the dataset (about 3-7 slices each), we only use the bSSFP and LGE modalities in our experiments. The cross-modality translation network is trained for 200 epochs, and the model that performs best on the validation set is selected for the LGE-to-bSSFP translation in the proposed CMMCSegNet framework. The dataset for training the segmentation network contains two parts: most images are real annotated bSSFP images (slices from the 25 patients), and a small number are fake bSSFP images translated from the annotated LGE images (slices from about two patients) by the Cycle-GAN translation network.
We also train the segmentation network for 200 epochs. Both models are trained using Adam optimization with a minibatch size of 1, a decayed learning rate with an initial value of 1.0e-2, a patch size n_D = 70 in the Patch-GAN-based discriminator, and the weight hyperparameters λ_1 = 10, λ_gan = 1, λ_l = 100, and λ_vgg = 1.
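The optimizer setup above can be sketched as follows; the linear decay to zero over the 200 epochs is an assumption, since the exact decay rule is not specified.

```python
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)            # stand-in for one generator
opt = torch.optim.Adam(model.parameters(), lr=1.0e-2)  # minibatch size 1 in training
# Hypothetical linear learning-rate decay over the 200 training epochs.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda epoch: max(0.0, 1.0 - epoch / 200))
```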

Performance of Cross-Modality Translation.
We first use Cycle-GAN to achieve translation between the LGE and bSSFP modalities. We employ three evaluation metrics, Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mutual Information (MI), to evaluate the performance of the Cycle-GAN translation network, tested on the whole set of LGE and bSSFP images. Several randomly chosen results from the translated (fake) LGE or bSSFP modalities are shown in Figure 3. As Table 1 shows, our translation model leads to a comparable synthesis quality between the LGE and bSSFP modalities on the whole datasets, where A_r, A_f, B_r, and B_f denote real LGE, fake LGE, real bSSFP, and fake bSSFP, respectively.
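For reference, PSNR, one of the three metrics, can be computed directly from the mean squared error:

```python
import numpy as np

def psnr(a, b, data_range=1.0):
    # Peak Signal-to-Noise Ratio (dB) between two images in [0, data_range].
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

a = np.zeros((4, 4))
b = np.full((4, 4), 0.1)   # constant error of 0.1 -> MSE = 0.01 -> 20 dB
```

SSIM and MI are typically taken from an image-processing library rather than reimplemented.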

Comparisons for Different Choices of Adversarial Loss and Perceptual Loss.
After the cross-modality translation, two fake bSSFP patients with annotated masks (obtained from the cross-modality translation of two LGE patients with ground truth) and the fully labeled real bSSFP patients (35 patients) are used to train our proposed segmentation network. Next, we performed several comparison experiments for the segmentation evaluation of fake bSSFP images without annotated data (obtained from the cross-modality translation). Table 2 shows the dice scores of cardiac LGE segmentation using different adversarial losses (Vanilla GAN, LSGAN, and WGAN-GP), different CMMCSegNet generator blocks (U-net and ResNet), and with/without perceptual loss (L_manh or L_cosine). We can see that the overall segmentation performance of the U-net generator is slightly better than that of the ResNet generator over the 6 different losses in terms of the LV (left ventricle), MYO (myocardium), and RV (right ventricle). For the U-net generator, the model using the LSGAN loss yields better diagnostic performance than those using the Vanilla GAN and WGAN-GP losses. Besides, adding the L_manh or L_cosine perceptual loss for feature comparisons helps the network learn relevant high-level and content features, which improves the segmentation results for Vanilla GAN and LSGAN.

Table 3: Performance comparisons for the number of cascade generators in the multicascade pix2pix segmentation network, where "P-L_manh" means using the L_manh perceptual loss and "simple" means using cascade generators with the simplified U-net version (the number of upsampling/downsampling layers in the middle part of the U-net generators is reduced from (8, 8, 8) to (2, 4, 5) for generators (G_2^s, G_3^s, G_4^s), respectively).
However, the dice scores of LV and RV segmentation decrease slightly when WGAN-GP is used with the L_manh perceptual loss, while in the ResNet generation network, the models with the perceptual loss (L_manh or L_cosine) achieve higher segmentation performance in all three terms and outperform those without it. From Table 3, we can see that as the number of cascades increases from one to four, the dice values of some terms drop slightly for the models with/without perceptual loss. The reason may be that increasing the number of cascades causes much edge information of the original fake bSSFP images to be lost. As we can see from Figure 1, once the first segmentation network G_1^s obtains the segmentation result of the input fake bSSFP image, if the original fake bSSFP image I_Y is not used as a conditional input to the later G_{k+1}^s that modifies the previous result I_S^k, then G_{k+1}^s extracts fewer features than G_1^s. To optimize the computational cost, starting from the second generator we reduce the number of upsampling/downsampling layers in the middle part of the U-net generators from (8, 8, 8) to (2, 4, 5) for generators (G_2^s, G_3^s, G_4^s), respectively. From Table 3, we observe that the proposed network with the simplified U-net versions can improve the segmentation results. Figure 4 shows the original LGE images, the translated bSSFP images, the corresponding ground truths, and the prediction results for varying numbers of cascades.
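The dice score reported throughout these comparisons can be computed per class as below; the 2 × 2 masks are a toy illustration.

```python
import numpy as np

def dice(pred, gt, label):
    # Dice overlap 2|P∩G| / (|P| + |G|) for one class label.
    p, g = pred == label, gt == label
    denom = p.sum() + g.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(p, g).sum() / denom

gt = np.array([[1, 1], [0, 0]])
pr = np.array([[1, 0], [0, 0]])   # one of two label-1 pixels recovered -> 2/3
```

In the tables, the score is averaged per structure (LV, MYO, RV) over the test slices.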

Figure 6: (a) Direct segmentation, from left to right: LGE, ground truths with zoom-in views, and prediction results with zoom-in views using FCNs, U-net, U-net++, and Attention U-net for segmentation on the real LGE modality; (b) indirect segmentation, from left to right: LGE, fake bSSFP, ground truths with zoom-in views, and prediction results with zoom-in views using FCNs, U-net, U-net++, Attention U-net, and our CMMCSegNet for segmentation on the fake bSSFP modality translated from the LGE modality.

As shown in Table 4, the model with the LSGAN adversarial loss and the vgg perceptual loss, optimized with the generation-loss weights (1/3, 1/2, 1/6), achieves the best results on the evaluation dice of I_S^2. Due to the efficiency of the multicascade technique, the proposed segmentation network automatically improves multilevel image features, which benefits the segmentation performance. Figure 5 shows the results of the different generators in the multicascade pix2pix network with different weights; G_2^s can further refine the details of I_S^1, making the output closer to the ground truth. Table 5 benchmarks the performance of the proposed framework against direct and indirect LGE segmentation networks. First, we compare the performance of four direct segmentation methods, FCNs [4], U-net [5], U-net++ [44], and Attention U-net [45], obtained by directly training a segmentation network on the small number of annotated LGE images. As reported in Table 5, although U-net performs better than the others, it produces a low dice value. Figure 6(a) visualizes the segmentation results of the direct methods. We also compare the performance of five indirect segmentation methods, FCNs, U-net, U-net++, Attention U-net, and the proposed CMMCSegNet, obtained by indirectly training the networks on the small number of annotated fake bSSFP images together with the fully annotated real bSSFP images. As shown in Table 5, the proposed technique provides the highest dice scores for LV and MYO and a fair value for RV.
This means that our proposed CMMCSegNet outperforms the other techniques. Figure 6(b) further illustrates a more detailed comparison between the proposed and the other techniques; our CMMCSegNet has an obvious advantage in that it more easily learns the location information of the target area.

Conclusion
In this work, we proposed a CMMCSegNet framework based on multimodal cardiac MR images for indirect LGE segmentation. First, we utilized Cycle-GAN to translate the LGE modality into the bSSFP modality and then segmented the translated (fake) bSSFP images to achieve indirect segmentation of the LGE images. The advantage of this method is that only a small number of annotated LGE images are required to achieve accurate segmentation of LGE, by exploiting the many annotated bSSFP images. This indirect approach also alleviates the low contrast of the LGE images themselves. Compared with the direct segmentation of LGE images, the indirect segmentation method has better segmentation performance.
For the multicascade pix2pix network, we regard segmentation as a translation from image to ground truth; the purpose of the multicascade architecture is to improve the previous prediction through several generators. We also compared different adversarial losses: the experimental results show that the LSGAN loss is better than Vanilla GAN and WGAN-GP, and that the WGAN-GP loss is not significantly better than the Vanilla GAN loss. To improve the training of the model, perceptual losses based on the L_manh and L_cosine measures are also used to optimize the features of each feature layer. In addition, we investigated the influence of the generation-loss weights of the multicascade structure, where the optimal weight coefficients are (1/3, 1/2, 1/6) for the 3-cascade generation network.
We also demonstrated the effectiveness of the proposed CMMCSegNet by comparing with FCNs, U-net, U-net++, and Attention U-net. In the future, we will consider the end-to-end segmentation method to segment the multimodal cardiac MR, combining the translation and segmentation together.

Data Availability
The dataset is obtained from the Multisequence Cardiac MR Segmentation Challenge (MS-CMRSeg 2019; https://zmiclab.github.io/mscmrseg19/). This challenge is aimed at creating an open and fair competition for various research groups to test and validate their methods, particularly for multisequence ventricle and myocardium segmentation. Also refer to publication [1].

Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.