HybridCTrm: Bridging CNN and Transformer for Multimodal Brain Image Segmentation

Multimodal medical image segmentation is always a critical problem in medical image segmentation. Traditional deep learning methods utilize fully CNNs for encoding given images, thus leading to deficiency of long-range dependencies and bad generalization performance. Recently, a sequence of Transformer-based methodologies emerges in the field of image processing, which brings great generalization and performance in various tasks. On the other hand, traditional CNNs have their own advantages, such as rapid convergence and local representations. Therefore, we analyze a hybrid multimodal segmentation method based on Transformers and CNNs and propose a novel architecture, HybridCTrm network. We conduct experiments using HybridCTrm on two benchmark datasets and compare with HyperDenseNet, a network based on fully CNNs. Results show that our HybridCTrm outperforms HyperDenseNet on most of the evaluation metrics. Furthermore, we analyze the influence of the depth of Transformer on the performance. Besides, we visualize the results and carefully explore how our hybrid methods improve on segmentations.


Introduction
Medical image segmentation is an essential area in medical image analysis and is necessary for diagnosis and treatment, which aims to label each pixel in images. Motivated by the recent success of deep learning, researchers in this field have also attempted to apply deep learning-based approaches to medical image segmentation, including U-Nets [1][2][3][4][5][6] and fully CNNs [7][8][9][10][11][12][13]. ese methods have achieved superior performance compared to traditional methods in the medical image segmentation task. In order to obtain more accurate segmentation for advanced diagnosis, using multimodal medical images has a growing popularity. Compared with single images, multimodal images help to extract features from different views and bring additional information, contributing to diverse data representation and discriminative power of the network. Previous works usually follow a fully CNN architecture, which suffers from instinctive defects of CNNs. ese disadvantages include a deficiency in extracting nonlocal features and bad generalization. With the development of Transformers in language processing, these Transformer-based methods also attract much attention in image processing [14][15][16][17][18], containing classification, detection, and segmentation. ese successful applications show the great ability of nonlocal feature extraction for the Transformer-based architecture in images. However, it cannot be directly utilized on multimodal medical image segmentation yet. ese works have pretrained with various kinds of images and generated prior knowledge and representations. Pretraining is a self-supervised and auxiliary step and is crucial for the Transformer-based methods since these approaches cannot easily extract enough meaningful representation through the simple task and therefore need transcendental representations from pretraining. Nevertheless, the multimodal medical segmentation task lacks images and therefore cannot follows the pretraining strategy.
In order to tackle these problems, we design a hybrid architecture named HybridCTrm for combining advantages from CNNs and Transformers. Specifically, these networks encode images from CNN and Transformer, two parallel and independent paths, and then integrate representations for decoding and segmentation. In this way, CNNs generate local representations and control gradient descent for rapid convergence, while Transformers extract nonlocal features and avoid overfitting and local optimum in the postperiod of the training. Aiming at fully testing our method, we present two fusion strategies, namely, single-path strategy and multipath strategy, and apply our method on these two different strategies. Experiments and results show that our method can effectively overcome the disadvantages of both CNNs and Transformers, and we show a great improvement in the performance of multimodal medical image segmentation task. Besides, we visualize the results and carefully explore how our hybrid methods improve on segmentations.

Multimodal Medical Image Segmentation Based on CNNs.
Multimodal medical image segmentation derives from the medical image segmentation task, both aiming to achieve pixel-level classification for an input image. Traditional medical image segmentation methods generally follow an encoder-decoder architecture, such as U-Net [5] and fully CNN [12]. In these structures, an encoder is usually used to extract features while a decoder is to restore extracted features and output the final segmentation predictions. e U-Net [5] has been widely used for medical image segmentation, consisting of convolution, pooling, and skipconnection. Çiçek et al. [2] extended U-Net architecture to the application of 3D images and proposed 3D U-Net. Milletari et al. [4] proposed V-Net, with residual connections for a deeper network. Similarly, Yu et al. [1] proposed VoxResNet, Lee et al. [3] presented 3DRUNet, and Xiao et al. [6] proposed Res-UNet. In the multimodal medical image segmentation field, researchers generally apply a fully CNN architecture. To effectively employ information from different modalities, Nie et al. [11] proposed a new fully CNN architecture for the multimodal infant brain tissue segmentation. Kamnitsas et al. [10] trained three fully CNNs separately and then averaged the confidence of each network. Chen et al. [7] proposed a dual-pathway fully CNN multimodal brain tumor segmentation network. Wang et al. [13] proposed a cascaded anisotropic convolution network. Dolz et al. [19] proposed a 3D fully CNN based on Den-seNets [8].

Image Processing Based on Transformers.
Transformers were first applied to natural language processing tasks. Vaswani et al. [20] proposed an attention-pure architecture named Transformer for machine translation and sentence parsing. Devlin et al. [21] proposed BERT, a bidirectional Transformer for two-step training with pretraining and fine-tuning. Brown et al. [22] trained a larger Transformer. Recently, a sequence of Transformer-based methods emerged in the image processing field [14][15][16][17][18]. Among them, Vision Transformer [16] and Detection Transformer [14] are of the most importance. Detection Transformer formulated the detection task as a sequential prediction. Vision Transformer (ViT) cropped an image into a sequence of small patches, which aims to fit the structure of the original Transformer. ViT proved its power on longrange dependencies and showed great performance and therefore was treated as a strong backbone.

Methodology
3.1. Overview Architecture. In multimodal medical image segmentation, the goal is to assign labels for each pixel of the given input images from different modalities. We propose two hybrid architectures for multimodal medical image segmentation as shown in Figure 1. Specifically, we present two models based on two different strategies: single-path strategy and multipath strategy. Figure 1(a) describes how a single-path strategy works. We take MRI-T1 and MRI-T2 as input modalities. ese two modalities x 1 , x 2 are combined as a multichannel image x 1,2 and then the image is encoded using convolutions with m layers and Transformers with n layers. e representations are generated from these two encoders separately and independently and integrated for subsequent decoding. Multipath strategy is quite similar to the single-path one, with the way of input different and Figure 1(b) telling the difference. MRI-T1 (x 1 ) and MRI-T2 (x 2 ) are encoded with independent encoders and representations are collected afterward. A multipath network can effectively combine and fully use the information and features from different modalities, while the single-path one focuses more on how different modalities interact with each other. e most key part of our work is that we use Transformers and convolutions as two separate encoders and we will carefully describe the encoders in the rest of the section.

Convolution Encoder.
To avoid gradient vanishing and explosion, DenseNets [8] apply skip-connections for directly adding each layer to the follow-up layers. Inspired by the mentioned points, we keep this idea on our convolution encoder.
As shown in Figure 2, a single convolution layer is composed of four parts, batch normalization, PReLU, a 3 * 3 * 3 convolution kernel, and a skip-connection. It is calculated as follows: where x l is the output of the l-th layer of convolution and x l−1 is the input of this layer. Particularly, when l represents the first layer l � 1, then x 0 is the input of the images. For multipath strategy, it represents one modality x i where i is not greater than the modality number n. Similarly, x 0 represents a combination of input modalities x 1,2,...,n when it comes to the single-path model.

Transformer Encoder.
To avoid overfitting and deficiency of nonlocal dependencies generated from fully CNNs, we apply Transformers to encode multimodal images and advance the generalization of the model. Transformers [20] were firstly applied to natural language processing and recently a novel Transformer architecture named Vision Transformer (ViT) [16] was proposed. ViT performed quite well on the image classification task and was quickly accepted as common backbones for image processing.
Inspired by ViT, we modify and apply Transformers for multimodal medical image segmentation as shown in Figure 3. Firstly, we reconstruct a 3D image x 0 ∈ R H×W×D×C into a series of flattened 3D cubes x p ∈ R N×(P 2 ·C) , where N is the length of the sequence: where H, W, D, C are the height, width, depth, and channel of the origin 3D image x 0 . For the multipath model, C � 1 since each input image represents a single modality. For the single-path model, C is the modality number n. P is the size of the cube. We obtained a patch sequence x p from the above equations. Transformers apply a constant size D for the dimension of each hidden layer. erefore, we linearly project x p to D dimension and add a position embedding for patch embedding:  Journal of Healthcare Engineering where W is a learnable projection parameter and x pos is a learnable position vector. Position embedding x pos has the same dimension as x p . It is of great importance in image processing since Transformers do not generate position information and need supplementary inputs. e output, patch embedding x 0 , is treated as the input of the first layer of Transformer.
Afterward, x 0 is put into a multilayered Transformer: where x l and x l−1 are the output and input of the l-th layer of Transformer encoder. LN (Layer Normalization) is a common design in Transformers and FFN (Feedforward Network) consists of two linear projections and a nonlinear activation function PReLU. MHSA (Multihead Self-Attention) is the crucial part of Transformers and is carefully described in the following. Different from common Transformers in the image processing field like ViT [16], we modify the position of LN. Xiong et al. [23] proved that the training is more stable when normalization blocks appeared in residual blocks. In our work, multimodal medical image segmentation cannot utilize a pretraining procedure like ViT and thus resulting in much fluctuation when training. erefore, we apply this modified Transformer architecture to our networks.
MHSA (Multihead Self-Attention) is the core of Transformers and it can be treated as a stack of several simple attention networks. A simple attention mechanism can be calculated as follows: where the input is copied multiple times and then separated as independent Q (queries), K (keys), and V (values) with their dimension d q , d k , and d v , respectively. Multiple stacking with simple attention can focus on different representations from different subspaces.
erefore, MHSA projects Q, K, V into different subspaces and conducts attention independently with outputs stacked: where the projections are parameter matrices en, we integrate the features into one matrix x f : where ⊕ represents concat operation. en, feature x f is sent to a decoder consisting of l layers with batch normalization, PReLU, and 1 * 1 * 1 convolution kernel, which is calculated as follows:   Journal of Healthcare Engineering and the output of the last layer x f l is sent to a Softmax function p(x) for the final segmentation.

Loss Function and Learning Rate Decay.
We apply a common cross-entropy function as our cost function. Let θ denote the network parameters and y v s the label of pixel v in the s-th image segment. We optimize the following: where p v c (x s ) is the softmax output of the network for pixel v and class c, when the input segment is x s .
We utilize cosine learning rate decay as our decay strategy: where η t , η max , η min represent learning rate, maximum learning rate, and minimum learning rate, respectively. T cur is the iteration number after recurrence and T i is the current iteration number.

Results and Discussion
We conduct our experiment on two benchmark datasets, MRBrainS [24] and iSEG-2017 [25]. Both are multimodal medical image segmentation datasets and focus on segmenting three types of brain tissue, including white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). Between them, MRBrainS is a triple-modal segmentation dataset and iSEG-2017 is a double-modal one.  [19,26,27]. We talked about the cropping and flattening of a given image in the previous sections. Among them, P is the size of a cube and is crucial for computation costs and performance. When P � 1, MHSA can calculate with each pixel. However, it will cause high computation costs and go beyond limited GPU resources. With a big P, Transformers cannot effectively capture features. Based on these two points, we set P � 3. For Transformers, we set the head to be 4, hidden dimension to be 128, and depth to be 4 according to the experiment performance. To initialize the weights of the convolution path, we adopt the strategy proposed in [28], which yields fast convergence for very deep architectures. In this strategy, a zero-mean Gaussian distribution of standard deviation ��� � 2/n l is used to initialize the weights in layer l, where n l denotes the number of connections to the units in that layer. e initial learning rate (η max ) is set to be 1e − 4. For cosine decay, η min is 5e − 5 and T cur is 50. Considering limited GPU resources, we set the batch size to be 32.
We compare our modal with several fully CNN-based architectures, including fCNN [12], CNN small [27], CNN small − MS [27], and HyperDenseNet [19]. fCNN [12] was the first method that applied CNNs to segmentation tasks. en, Dolz et al. [27] set a smaller kernel for convolutions and presented two kinds of methods including traditional one and multiscaled one. HyperDenseNet [19] was a commonly used fCNN-based method in multimodal image segmentation tasks. ese models can be effectively compared with our hybrid model, HybridCTrm. To fairly compare with baselines, we conduct experiments with two strategies, single-path and multipath separately, and set a postfix, like -Multi and -Single, for each method.

Results on MRBrainS
We apply leave-one-out-cross-validation for a five-sample dataset, MRBrainS, that is, four samples for training and one for testing. We report DSC on three tissues (CSF, GM, and WM) and their average. Table 1 shows general results of our method and baselines on MRBrainS and the highest score on each column is bold. A general view shows that our hybrid network achieves the highest score on each metric. To have a clear comparison, we firstly compare two simple-path models. We mostly compare our method with HyperDenseNet, since this model performs the best among CNN-based architectures on each strategy.

Main Results.
Compared with HyperDenseNet-Single, HybridCTrm improves by about 3%, 1%, and 26% in WM, GM, and CSF, respectively. At the same time, the average score outperformance is 10%. For multipath models, HybridCTrm-Multi improves by 4% and 18% on CSF and WM tissue compared with HyperDenseNet-Multi. en, we focus on the performance of the same model on different strategies. For segmenting GM and CSF tissue, multipath strategy is more fitting, while single-path one performs better on WM. is may infer that, for segmenting GM and CSF information from different modalities, it is needed to encode independently first and then integrate with each other, while for WM it is the opposite.

Analysis of Hyperparameters.
In order to better analyze the influence of Transformers on performance, we conduct experiments with different depths of Transformers settings. Figure 4 shows how depth influences performance on single-path and multipath models. For HybridCTrm-Single, the average DSC reaches the peak at the depth of 4. e curve increases firstly and then drops with the increase of depth from 1 to 5. For HybridCTrm-Multi, the peak reaches the depth of 5 and the trendy of the curve is quite similar, from rising to declining. is similar trend is probably because different depths capture different representations and a suitable depth can extract the most effective information for segmentation.

Ablation Study.
Our hybrid architecture consists of a CNN encoder and Transformer encoder to capture local and nonlocal features for segmentation. To research how CNN and Transformer branches influence our hybrid model, we conduct an ablation study. As shown in Table 2, we separate the CNN branch and Transformer branch, respectively, based on the single-path model and multipath model. After removing the CNN branch, the mean DSC drops by about 48% and 46% for single and multistrategy, respectively, while it drops by about 16% and 13% with removing the Transformer branch. It shows that compared with the Transformer branch, the CNN branch is more important because nonlocal features extracted by the Transformer encoder play an auxiliary role in the segmentation.

Auxiliary Experiments with Dual Modality.
To further analyze the importance of segmentation of different modalities, we choose two modalities and conduct auxiliary experiments with dual modality.
As shown in Table 3, the results of three modalities perform better compared with the two modalities. However, a careful comparison clearly shows that the results based on the T1 and T2-FLAIR modalities have the lowest performance degradation (only 2%-3%), indicating a strong complementarity between these two modalities. e other two modal combinations, T1 and T1-IR, and T1-IR and T2-FLAIR, show significant performance degradation (10%-13%), indicating that the T1-IR modality does not bring more information. A further comparison of the performance of CSF, GM, and WM tissues shows that the performance of the combination of T1 and T2-FLAIR on CSF and WM is comparable to that of the three modalities under the corresponding strategies. And the former is even 0.6% higher than the latter under the multipath strategy, which indicates that the information in these two modalities is sufficient to segment the CSF and WM tissues.
is indicates that the information in these two modalities is sufficient to segment CSF and WM tissues. On the contrary, the performance of segmentation on GM tissues is best only when the information in the three modalities is complete, and there is a 5%-6% degradation in the results of the other two modalities.

Results on iSEG-2017.
We divide the iSEG-2017 dataset with 10 samples into a training set (train), a development set (dev), and a test set (test) in the ratio of 8 : 1 : 1. e detailed results are presented in Table 4, and the maximum value of each item is bold.

Main
Results. First, our HybridCTrm network outperforms in seven out of eight metrics among all these methods. Compared with HyperDenseNet-Single, HybridCTrm-Single outperforms in all metrics.
is is because the information in the single-path strategy is integrated at the beginning, and the Transformer structure can fully model this complex information. In the multipath strategy, HybridCTrm-Multi also outperforms HyperDenseNet-Multi in seven metrics, further validating the modeling capability of Transformer. However, the gain is not as great as the single-path model. is is probably because the information content of each individual modality is small, so the Transformer cannot learn enough features like the CNN with strong inductive bias.

Analysis of Generalization and Stability.
Having already concluded that hybrid structure-based models work better than pure fCNN models, we are also concerned about the stability and generalization performance of the models. Figure 5 shows the performance of the two models on the development set and the validation set under the two strategies after every 5 epochs. e left panel shows the performance of the two models under the single-path strategy. e red line represents HybridCTrm and the blue  e stability of Hyper-DenseNet is slightly better than that of HybridCTrm in the middle and late stages of training. is may be because the model has learned enough good features in the middle and late stages of training, and further learning is much difficult for Transformers to utilize each modal information separately.
erefore, there may be some bad features in the layers of the Transformer, resulting in a decrease in the stability of the model. For HyperDenseNet, the features of the CNN are more stable and therefore do not produce large fluctuations. On the other hand, the slight instability in the later stages also allows the model to learn more randomly and avoid falling into local minima that lead to poor global generalization performance.

Visualization.
To further explore the advantages and disadvantages of different models and preferences for segmentation content, segmentation results from a partial test set are shown for visual analysis. Figure 6(a) shows a segmented 2D section. In the frame region, the CSF tissue (blue part) shows a striped distribution, while HyperDenseNet performs poorly on CSF under both the single-path strategy and multipath strategy. On the contrary, HybridCTrm performs well in CSF segmentation, especially in the single-path strategy, where the continuous bar distribution of CSF is reflected. For GM tissue (green part), HyperDenseNet misclassifies a large part    of the gray matter as WM (yellow part) and generates a lot of "adhesions" in the WM region. e HybridCTrm, on the other hand, has a good performance on the GM and better reflects the contour structure, indicating that the hybrid structure fully exploits the long-distance dependence. Figure 6(b) shows another segmentation profile, in which we focus on the forebrain part: the boxed area in the figure. We can see that there is a clear section of the GM pathway in the target region, which is missegmented as WM tissue in the HyperDenseNet, but better segmented in the HybridCTrm. Meanwhile, on the rightmost side of the box, there is a small piece of CSF tissue, which is not segmented at all under the HyperDenseNet-Single and is minimally segmented under the HyperDenseNet-Multi. In the HybridCTrm, this part is also well segmented.

Conclusion
is paper discusses the application of hybrid networks based on CNNs and Transformers in the field of multimodal medical segmentation and proposes a novel multimodal medical segmentation architecture, HybridCTrm. Two multimodal segmentation strategies are proposed, namely, single-path strategy and multipath strategy. In the experiment, the HybridCTrm architecture is tested on two benchmark datasets under single-path and multipath strategies and compared with HyperDenseNet, an architecture based entirely on the fully CNNs. HybridCTrm outperforms HyperDenseNet in most of the metrics.

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this article.