Multiple Sclerosis Lesion Segmentation in Brain MRI Using Inception Modules Embedded in a Convolutional Neural Network

Multiple sclerosis (MS) is a chronic autoimmune disease that forms lesions in the central nervous system. Quantitative analysis of these lesions has proved to be very useful in clinical trials for therapies and in assessing disease prognosis. However, the efficacy of these quantitative analyses greatly depends on how accurately the MS lesions have been identified and segmented in brain MRI. This is usually carried out by radiologists who label 3D MR images slice by slice using commonly available segmentation tools. However, such manual practices are time consuming and error prone. To circumvent this problem, several automatic segmentation techniques have been investigated in recent years. In this paper, we propose a new framework for automatic brain lesion segmentation that employs a novel convolutional neural network (CNN) architecture. To segment lesions of different sizes, a filter of a specific size, such as 3 × 3 or 5 × 5, has to be chosen, and it is often hard to decide which filter will produce the best results. GoogLeNet solved this problem by introducing the inception module, which applies 3 × 3, 5 × 5, 1 × 1, and max pooling filters in parallel. Results show that incorporating inception modules in a CNN improves the performance of the network in the segmentation of MS lesions. We compared the results of the proposed CNN architecture for two loss functions, binary cross entropy (BCE) and structural similarity index measure (SSIM), using the publicly available ISBI-2015 challenge dataset. With the BCE loss function, a score of 93.81 is achieved, which is higher than that of the human rater.


Introduction
Multiple sclerosis (MS) is a chronic disease that damages the nerves in the spinal cord, brain, and optic nerves. Axons in the brain are covered with a myelin sheath. Demyelination is a process in which the myelin sheaths start falling off, developing lesions in brain nerves. Millions of people are affected by MS worldwide, and it is mainly found in young people between 20 and 50 years of age. The symptoms caused by this disease include fatigue, memory and concentration problems, weakness, loss of balance, loss of vision, and many others. Diagnosing and treating this disease is very challenging because of the variability in its clinical expression. These lesions can be traced in magnetic resonance imaging (MRI) using different sequences. Features such as lesion volume and location are very important biomarkers for tracking the progression of the disease. Manual segmentation of these lesions by expert radiologists is the most common practice in clinics, but it is tiresome, time consuming, and error prone. Figure 1 shows the manual segmentation of MS lesions by two raters in one slice of a brain MRI.
In recent years, automatic segmentation of MS lesions using convolutional neural networks (CNNs) has been investigated [1][2][3][4][5]. CNNs learn subtle features from the raw image data to facilitate 2D pixel (or 3D voxel) classification that ultimately leads to image segmentation. However, there is no one-size-fits-all CNN model that works for every classification problem or dataset. Expert knowledge has to be incorporated during the design phase of the CNN model based on the nature of the application and the data. Complex problems such as MS lesion segmentation require careful selection of the CNN architecture and training model for an optimum solution. In addition, automatic segmentation of MS lesions in MRI may be challenging due to the following: (i) the lesion size and location are highly variable; (ii) the edges between anatomical objects are not well defined in MR images due to low contrast; (iii) MR images of clinical quality may have imaging artifacts such as noise and inhomogeneity. In this work, we propose a novel CNN architecture for MS lesion segmentation. MS lesions vary tremendously in size and shape and are sometimes difficult to detect in brain MR images. To address this particular challenge, inception modules, originally introduced by Google in GoogLeNet, are added to the CNN model [6]. The significance of the inception module lies in using multiple kernels of different sizes in parallel in an efficient way. This approach captures features of varying magnitude in the input data without overburdening the network with additional computations. The proposed model is trained for two loss functions, binary cross entropy (BCE) and structural similarity index measure (SSIM). The BCE loss function tries to maximize the difference of the probability distribution between two classes, in this case, lesion and nonlesion voxels [7]. SSIM, on the other hand, is a perception-based loss function that quantifies the similarity between two images [8].
The proposed solution for MS lesion segmentation in brain MRI offers the following attributes: (i) introduction of inception modules embedded in the CNN architecture for the segmentation of MS lesions with different shapes and sizes; (ii) comparison of MS lesion segmentation results using BCE and SSIM loss functions; (iii) improvement of the performance of the proposed architecture in terms of the Dice coefficient, positive predicted value, lesion-wise true positive rate, and volume difference of the segmented lesions compared to the gold standard. 1.1. Literature Review. In the past decade, deep neural networks have shown promising results in the segmentation of MS lesions in brain MR images. In [9], a novel architecture was proposed for segmenting MS lesions in magnetic resonance images using a deep 3D convolutional encoder with shortcut connections in the pathways. The method was evaluated on publicly available data from the ISBI-2015 [10] and MICCAI-2008 [11] challenges. The authors compared their method with five other available approaches for MS lesion segmentation. The final results show that their method outperformed the existing methods for MS lesion segmentation. In [12], the authors used a fully automatic multiview CNN approach for segmenting multiple sclerosis lesions in longitudinal MRI data and tested it on the ISBI-2015 dataset. Various deep learning techniques for medical image analysis are presented in [13].
Valverde et al. proposed a novel architecture for the segmentation of white matter (WM) lesions in multiple sclerosis (MS) using a small amount of imaging data [14]. This approach uses a cascaded CNN model working on 3D MRI patches from FLAIR and T1w modalities. In this method, the output of the first network is retrained on a second network in series to reduce misclassifications from the first network. The proposed model was evaluated on the publicly available MICCAI-2008 dataset and outperformed all the participant approaches. Roy et al. proposed a fully convolutional neural network (FCNN) to segment WM lesions in multicontrast MR images using multiple convolutional pathways [15]. The first pathway of the CNN contains dual convolutional filters for two image modalities. In the second pathway, convolutional filters are applied in parallel to the outputs of the first pathway, which are then concatenated. This method was evaluated on the ISBI-2015 dataset. A novel approach using a fully 2D CNN to segment MS lesions in MR images is proposed in [16]. Maleki et al. proposed a model for the detection and segmentation of MS lesions [17].
In recent studies, multimodal MRI datasets have shown promising results in tissue segmentation. In a recent work on brain tumor segmentation, a deep multitask learning framework was evaluated on multiple BraTS datasets [18]. The authors claimed improvement over the traditional V-Net framework by using a structure of two parallel decoder branches. The original decoder performs segmentation, and the newly added decoder performs the auxiliary task of distance estimation to produce more accurate segmentation boundaries. A total loss function is introduced to combine the two tasks with a gamma factor to reduce the focus on the background area and set different weights for each type of label to alleviate the problem of category imbalance. Zhang et al. proposed the ME-Net model and obtained promising results on the BraTS 2020 dataset [19]. Four encoder structures for the four modal images of brain tumor MRI were employed with skip connections. The combined feature map was given as input to the decoder. The authors also introduced a new loss function, Categorical Dice, and set different weights for different masks. In another study, a 3D supervoxel-based learning method was proposed that demonstrated promising results in the segmentation of brain tumors [20]. The features added from multimodal MRI images greatly increased the segmentation accuracy. In an earlier study, Gabor texton features, fractal analysis, curvature, and statistical intensity features from superpixels were used to segment tumors in multimodal brain MR images using extremely randomized trees (ERTs) [21]. The experimental results demonstrated the high detection and segmentation performance of the proposed method. Soltaninejad et al. proposed a method that combined machine-learned features from fully convolutional networks (FCNs) with texton-based histograms as hand-crafted features [22].
The random forest (RF) classifier was then employed for the automated segmentation of brain tumors in the BraTS 2017 dataset.
Segmentation results can be greatly affected by the quality of the MRI images. Low resolution, intensity variations, and image acquisition noise hamper the accuracy of a segmentation task. Jin et al. proposed a deep framework for the segmentation of prostate cancer [23]. They showed that the segmentation results were greatly improved by using bicubic interpolation and an improved version of 3D V-Net. The bicubic interpolation of the input data helped in enhancing the relevant features required for prostate segmentation. Recently, attention-based methods have gained traction in the segmentation of small but discrete objects in MR images. In one study, a dilated attention network was used for the enhancement of left atrium scars [24]. The proposed approach improved the accuracy of the scar segmentation to 87%. Liu et al. proposed a spatially attentive Bayesian deep learning network for the automatic segmentation of the peripheral zone and transition zone of the prostate with uncertainty estimation [25].
This method outperformed the state-of-the-art methods. The heterogeneity of MS lesions poses a challenge for their detection and segmentation in MR images. An attention-based fully convolutional network has also been used in the segmentation of prostate zones [26]. The authors of this work proposed a novel feature pyramid attention mechanism to cope with heterogeneous prostate anatomy. Raschke et al. developed a statistical method to analyze the heterogeneity of brain tumors in multimodal MRI [27]. The approach presented in the paper makes no assumption about the probability distribution of the MRI data or prior knowledge of the location of tumors. This, according to the authors, gives an advantage for the segmentation of tumors of varying sizes and spatial locations. The proposed method consists of two deep subnetworks: the first is an encoding network responsible for extracting feature maps, and the second is a decoding network responsible for upsampling the feature maps. The proposed FCNN was evaluated on the ISBI-2015 dataset.

Proposed Methodology
As mentioned earlier, the shape and size of MS lesions vary dramatically, and detecting these lesions using machine learning techniques is a challenging task. In the proposed methodology, a CNN model with inception modules is employed to automatically segment MS lesions in brain MRI. Filters of multiple sizes used in the inception modules capture features of MS lesions of different sizes. Prior to CNN model training, the images in the dataset are first preprocessed to remove image noise, intensity inhomogeneity, variability in intensity ranges, and nonbrain tissues. In this work, preprocessed ISBI-2015 image data have been used.

Dataset.
The proposed algorithm uses the dataset of the ISBI-2015 challenge [10], which is grouped into two categories, training and testing data. The training data are named ISBI-21 and are publicly available, with 21 MRI images from 5 patients. In the training set, MR brain images of four patients with 4 time points and one patient with 5 time points, with a gap of approximately a year between time points, are gathered. The test data are named ISBI-61; they are not publicly available and comprise 14 subjects with 61 images. Each subject in the testing set has 4-5 time points, and each time point has a gap of approximately a year. These images contain longitudinal scans of all five patients, as shown in Figure 2. During training, we used 80 percent of the total patches of size 100 × 100 for training and the remaining 20 percent for validation.
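The patch preparation described above can be sketched as follows. The 100 × 100 patch size and the 80/20 train/validation split come from the text; the non-overlapping tiling, the function names, and the fixed random seed are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def extract_patches(mri_slice, size=100, stride=100):
    """Tile a 2-D MRI slice into size x size patches (non-overlapping when stride == size)."""
    rows, cols = mri_slice.shape
    patches = [
        mri_slice[i:i + size, j:j + size]
        for i in range(0, rows - size + 1, stride)
        for j in range(0, cols - size + 1, stride)
    ]
    return np.stack(patches)

def train_val_split(patches, val_fraction=0.2, seed=0):
    """Shuffle the patches and hold out val_fraction (here 20 percent) for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(patches))
    n_val = int(len(patches) * val_fraction)
    return patches[order[n_val:]], patches[order[:n_val]]
```

In practice the same split would be applied to the corresponding label patches so that image/mask pairs stay aligned.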

Proposed Deep Network Architecture.
In a CNN architecture, the kernel sizes and types of filters have to be selected carefully so that the network can learn all the features that are useful in the classification of objects. Generally, filters of different sizes and pooling schemes are employed in different CNN layers in order to learn the most prominent features in the data. The inception module, however, uses multiple kernels in each layer in parallel and then pools the features [28]. In the proposed framework, we have investigated the efficacy of inception modules embedded in the CNN model for the segmentation of MS lesions.

CNN Model.
In the proposed method for the segmentation of multiple sclerosis lesions, we incorporated three inception modules in our CNN model. Each module consists of 1 × 1, 3 × 3, and 5 × 5 convolutions along with max pooling and average pooling operations. The CNN model consists of two convolution layers with 64 feature maps, followed by the inception modules, and then three convolution layers. The final layer has one feature map for the prediction of lesion and nonlesion voxels. Figure 3 shows the complete architecture with inception modules embedded in the CNN layers. The model is trained with two different loss functions, i.e., binary cross entropy (BCE) and structural similarity index measure (SSIM). BCE is a measure of the difference between two probability distributions for a given random variable or set of events and is used in binary classification tasks, whereas SSIM is a perceptual metric that quantifies image quality degradation caused by losses in data compression. For highly similar images, the value of BCE is low and the value of SSIM is high.

Inception Module.
The fundamental idea behind GoogLeNet is the introduction of inception modules, or inception blocks, in the CNN architecture. In a CNN, the feature maps learned by the previous layer are given as input to the next layer. The inception module takes the previous layer's output and passes it to four different filter operations in parallel, as shown in Figure 4. The feature maps from all the filters are then concatenated to form the final output. The purpose of the 1 × 1 kernel in the inception module is to shrink the depth of the feature maps [29]. The 1 × 1 convolutions reduce the number of channels while preserving the spatial dimensions of the feature maps. This strategy lowers the dimensionality of the feature maps, which in turn reduces the computational cost.
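The parallel-then-concatenate structure described above can be illustrated with a minimal single-channel sketch in plain NumPy. This is not the paper's Keras implementation: the function names, the "same" zero padding, and the single-channel simplification are assumptions made for clarity.

```python
import numpy as np

def conv2d_same(x, kernel):
    """2-D cross-correlation (as used in CNNs) with zero padding so the output keeps x's shape."""
    kh, kw = kernel.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    h, w = x.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_same(x, size=3):
    """Max pooling with stride 1 and 'same' padding."""
    p = size // 2
    xp = np.pad(x, p, constant_values=-np.inf)
    h, w = x.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + size, j:j + size].max()
    return out

def inception_block(x, k1, k3, k5):
    """Run 1x1, 3x3, and 5x5 filters plus max pooling in parallel,
    then concatenate the resulting feature maps along a channel axis."""
    branches = [conv2d_same(x, k1), conv2d_same(x, k3),
                conv2d_same(x, k5), max_pool_same(x, 3)]
    return np.stack(branches, axis=-1)  # shape (H, W, 4)
```

A real inception module operates on multichannel feature maps and places 1 × 1 convolutions before the larger kernels to reduce depth; this sketch keeps one channel per branch to make the parallel-then-concatenate idea explicit.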

Loss Functions.
The proposed model is trained for two loss functions, binary cross entropy (BCE) and structural similarity index measure (SSIM). The BCE loss function tries to maximize the difference of the probability distribution between two classes, in this case, lesion and nonlesion voxels. It measures the performance of a classification model whose output is a probability between 0 and 1, i.e., the output of sigmoid activation. Mathematically, the BCE loss for a target y with predicted probability p can be computed as

BCE(y, p) = −[y log(p) + (1 − y) log(1 − p)].

SSIM is a perception-based loss function that quantifies the similarity between two images using a statistical model. Let μx and μy be the means, σx² and σy² be the variances, and σxy be the covariance of the two images x and y; then,

SSIM(x, y) = [(2 μx μy + C1)(2 σxy + C2)] / [(μx² + μy² + C1)(σx² + σy² + C2)],

where C1 and C2 are regularization constants.
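The two formulas above can be sketched directly in NumPy. This is an illustrative sketch rather than the paper's implementation: the constants C1 and C2 are placeholder values, and the SSIM here is the global single-window form rather than the sliding-window variant commonly used in practice.

```python
import numpy as np

def bce_loss(y_true, p, eps=1e-7):
    """Binary cross entropy averaged over voxels; p is a sigmoid output in (0, 1)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def ssim(x, y, C1=1e-4, C2=9e-4):
    """Global SSIM computed from the means, variances, and covariance of x and y."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
```

When used as a training loss, SSIM is typically minimized as 1 − SSIM so that higher similarity gives a lower loss, whereas BCE is minimized directly.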

Model Implementation.
The CNN model is implemented in Python using Keras [30] with the TensorFlow library [31]. All the experiments were performed on an Nvidia GeForce RTX 2080 GPU. The deep network is trained end to end using patches. During the training phase of the CNN model, the patches are extracted from each slice of the MR images. The training set is divided into two subsets, one for training the network and the other for validating the results. The optimization technique employed to update the parameters of the model is the Adam method [32], which shows good convergence in neural network parameter optimization. The hyperparameters used during network training include a fixed learning rate of 0.0001 for 50 epochs. This parameter setting produced sufficient convergence to optimal network parameters without overfitting the data. The size of the minibatch is set to 64, and each minibatch includes a random selection of patches. The best model on the validation set is selected at the 24th epoch; training takes 48 hours on the GPU.

Performance Metrics.
Standard performance metrics were employed for the assessment of the proposed CNN model. The Dice similarity coefficient measures the reproducibility of segmentation as a statistical validation of manual annotation. A related metric is the Jaccard similarity index, which measures the overlap between the machine segmentation and the ground truth. The positive predicted value is the probability that a voxel predicted as lesion indeed belongs to a lesion. The portion of positive voxels in the ground truth that is also identified as positive in the automatic segmentation is captured by the true positive rate. The lesion-wise true/false positive rate is the number of lesions that overlap/do not overlap between the automatic segmentation and the ground truth. The difference in volume between the automatic segmentation and the ground truth is another important metric for the assessment of the performance of the CNN model. The Pearson correlation coefficient computes the correlation between the automatic segmentation and the ground truth. The overall score gives the average of the combined effect of all these performance metrics in a single number. Table 1 shows the formulas for these performance metrics.
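Several of the voxel-wise metrics above can be sketched for binary masks as follows. The function names are assumptions; the lesion-wise rates and the overall score are omitted because they depend on connected-component analysis and the challenge's specific weighting.

```python
import numpy as np

def dice(seg, gt):
    """Dice similarity coefficient between binary masks: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * np.logical_and(seg, gt).sum() / (seg.sum() + gt.sum())

def jaccard(seg, gt):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    return np.logical_and(seg, gt).sum() / np.logical_or(seg, gt).sum()

def ppv(seg, gt):
    """Positive predicted value: fraction of predicted lesion voxels that are true lesions."""
    return np.logical_and(seg, gt).sum() / seg.sum()

def tpr(seg, gt):
    """True positive rate: fraction of ground-truth lesion voxels found by the segmentation."""
    return np.logical_and(seg, gt).sum() / gt.sum()

def volume_difference(seg, gt):
    """Absolute volume difference relative to the ground-truth volume."""
    return abs(int(seg.sum()) - int(gt.sum())) / gt.sum()
```

The masks are boolean arrays of equal shape; for 3D volumes the same functions apply unchanged since the reductions run over all voxels.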

Feature Learning by Inception Modules.
As suggested by the literature, the proposed CNN model is trained on T1w, T2w, and FLAIR sequences of the MRI data (see Table 2).
The performance metrics observed for the proposed CNN model significantly outperform those of the existing techniques, as shown in Table 3. Kernels of different sizes used in the inception modules help in extracting discriminative features for the automatic segmentation of MS lesions and background tissues in brain MRI. The most prominent features are ultimately pooled using max pooling and average pooling at various stages of the inception modules. The number of inception modules used in the CNN model is also very crucial in the architecture design. Using too many inception modules in MS lesion segmentation degraded the results due to overfitting the model to the data. Poor results were also obtained when the number of inception modules was lowered, which may correspond to underfitting the CNN model for the segmentation of MS lesions. Experiments have also confirmed that a mix of average pooling and max pooling works better, keeping the most prominent features in the high-level feature maps and averaging them in the low-level feature maps. The authors suggest that, for a specific application, the number and placement of inception modules, the filter sizes, and the pooling strategy have to be selected accordingly. In our experiments, the BCE loss function seems to work better than SSIM. This is intuitive, as BCE tries to evaluate the difference in the maximum likelihood between the predictions and ground truths. SSIM, on the other hand, quantifies the perceptual differences between the predictions and the ground truths; it uses luminance, contrast, and structure features to compute the similarity between two images. One reason why the BCE loss function works better than SSIM is that the choice of loss function also depends on the activation function used in the output layer. For sigmoid activation, the literature suggests that the BCE loss function is the natural choice due to its accuracy and efficiency.
The automatic MS lesion segmentation using BCE and SSIM loss functions is illustrated in Figure 5.

Comparison with Existing Techniques
The proposed methodology is compared with different published techniques for MS lesion segmentation using the ISBI-2015 dataset. The comparison of the results is shown in Table 3.

Limitations in Real Clinical Studies
The proposed work is an attempt to prove the efficacy of AI-based techniques in medical applications. In recent years, AI has gained traction in automating tedious routine work in clinical settings. However, the diversity and inadequacy of the patient data available for training a deep network have hampered the practical use of AI-based techniques in clinics. As more data become available and deep neural networks become more efficient, the practicability of these techniques will definitely improve.

Conclusions and Future Works
In this work, a CNN model with inception modules is investigated for the automatic segmentation of MS lesions in MRI.
The CNN model with inception modules picks up MS lesions of different sizes and shapes more successfully. The key advantage of inception modules is the use of different kernels, such as 1 × 1, 3 × 3, and 5 × 5, that tend to extract salient features of varying sizes from the input. This improves the Dice coefficient, PPV, and VD of the segmentation when compared to the existing techniques. The value of LTPR was the only metric that was worse than in Valverde's and Aslani's models. In the present study, we have also found that the BCE loss function works better than the SSIM loss function. The intuition behind this behavior of the model is that BCE tries to maximize the differences between the probability distributions of the predictions and ground truths. SSIM, on the other hand, seems to converge to local minima while quantifying the error loss. Another important reason is the sigmoid activation function used in the output layer for binary classification, which the authors believe naturally supports the BCE loss function in producing more accurate and efficient results. In the future, this work can be extended by integrating inception modules into different architectures, such as the residual network (ResNet), UNet, parallel CNN, and cascaded CNN, on multiple publicly available datasets. The incorporation of event-driven processing can improve the performance of the suggested solution in terms of computational efficiency and compression [33][34][35][36]. Investigation along this axis is another prospect.

Conflicts of Interest
The authors declare no conflicts of interest.