A Deep-Learning Model with Learnable Group Convolution and Deep Supervision for Brain Tumor Segmentation

The segmentation of brain tumors in medical images is a crucial step of clinical treatment. Manual segmentation is time consuming and labor intensive, and existing automatic segmentation methods suffer from issues such as numerous parameters and low precision. To resolve these issues, this study proposes a learnable group convolution-based segmentation method that replaces convolution in the feature extraction stage with learnable group convolution, thereby reducing the number of convolutional network parameters and enhancing communication between convolution groups. To improve utilization of the feature maps, we added a skip connection structure between learnable group convolution modules, which increased segmentation precision. We used deep supervision to combine output images in the network output stage to reduce overfitting and enhance the recognition capabilities of the network. We tested the proposed algorithm model using the open BraTS 2018 dataset. The experiment results revealed that the proposed model is superior to 3D U-Net and DMFNet and has better segmentation results for tumor cores than No New-Net and NVDLMED, the winning methods in the BraTS 2018 challenge. The segmentation precision of the proposed method with regard to whole tumors, enhancing tumors, and tumor cores was 90.25%, 80.36%, and 86.20%. Furthermore, the proposed method uses fewer parameters and a less complex model.


Introduction
Early diagnosis is crucial for the surgical treatment of brain tumors. This has been aided by recent advances in medical imaging technology. Magnetic resonance imaging (MRI) technology can display brain tissue information in great detail and is widely used for the diagnosis of brain tumors. Four types of MRI modes are used: T1 weighted, T2 weighted, postcontrast T1 weighted, and fluid-attenuated inversion recovery (FLAIR). Each of these reflects different aspects of brain tissue. T1-weighted scans highlight tumor contours, T2-weighted scans show distinct tumor regions, and FLAIR scans can distinguish edema from cerebrospinal fluid. The accurate segmentation of brain tumors in medical images is a critical step before treatment. Manual segmentation is time consuming and labor intensive, and as a result, efficient and accurate automatic segmentation methods have become a popular research topic in recent years. Brain tumor segmentation methods can generally be divided into three categories: manual segmentation, semiautomatic segmentation, and fully automatic segmentation. The semiautomatic and fully automatic methods can be further divided into two categories: unsupervised segmentation and supervised segmentation [1]. Depending on the segmentation principle, unsupervised segmentation includes threshold-based segmentation [2][3][4], region-based segmentation [5][6][7][8][9], graphic-element classification-based segmentation [10][11][12][13], and model-based segmentation [14,15]. The disadvantages of unsupervised methods are that the number of segmentation regions must be fixed in advance and that the MRI images must first undergo intensity nonuniformity correction and skull stripping. Supervised methods are based on graphic-element classification, including conventional machine learning and convolutional neural networks (CNNs).
Segmentation methods using conventional machine learning include support vector machines [16][17][18][19][20][21][22], conditional random fields (CRFs) [23,24], and random forests (RFs) [25,26]. In conventional machine learning methods, the features must be manually selected, in which boundary and tumor region details can be easily overlooked.
CNN-based methods include CNN models, fully convolutional neural network (FCNN) models, and U-Net models. CNN models include the CNN structure with small kernels proposed by Pereira [27] and the cascade CNN model proposed by Havaei [28]. FCNN-based models include the residual module-containing FCNN model structure presented by Chen [29] and the model structure integrating FCNNs and CRFs proposed by Zhao [30]. Cicek [31][32][33] examined 3D convolution operations, upgraded U-Net from 2D to 3D, and proposed a 3D U-Net for the segmentation of 3D medical images. Models based on U-Net include the 3D U-Net structure used by Sherman [34], in which residual structures were added between convolutions in the same layer. Nuechterlein and Mehta [35] developed 3D-ESPNet, which applies the pointwise convolution of semantic segmentation to medical image processing to reduce the number of network parameters; however, the resulting precision of segmentation is lower. Kao et al. [36] employed an ensemble comprising seven 3D U-Nets with different parameters and training strategies for brain tumor image segmentation; the higher number of models resulted in longer training time. In the BraTS 2018 challenge, Isensee et al. [37] made minor structural modifications to a 3D U-Net and obtained No New-Net. With additional training data and a simple postprocessing technique, this approach won second place in the challenge. Myronenko [38] proposed an encoder-decoder architecture network, NVDLMED, added another decoder pathway to recover input images, and imposed additional constraints. This approach won first place in the BraTS 2018 challenge.
CNN-based methods all involve a large amount of computation and highly complex models, and room for improvement in segmentation precision remains. To reduce the number of segmentation network parameters, Chen et al. [39] proposed a dilated multifiber network (DMFNet), which replaces regular convolutions with group convolutions, greatly reducing the number of parameters while maintaining the precision of the segmentation network. Group convolutions were first proposed for AlexNet [40] and were later successfully applied in ResNeXt [41]; they are currently popular in network design. However, in standard group convolutions, each group processes information independently, and there is no communication between groups, which limits their feature representation capabilities. Zhang et al. [42] presented a dynamic group convolution, which can learn the number of convolution groups from the training data and improves the information flow between groups, thereby achieving better performance than regular group convolutions.
Skip connections can accelerate network convergence and increase the precision of the segmentation network. Deep convolutional neural networks exhibit better performance than shallow networks, but the vanishing gradient problem may arise. Residual connections were, thus, introduced in ResNet [43] to solve this degradation issue. DenseNet [44] presented densely connected layers with more shortcut connections and used the cascade strategy to combine the feature maps of the first few layers. Residual connections and dense connections both use information from the previous convolutional layers and are added to networks in the form of skip connections. DenseNet achieves better performance, but as the number of input channels increases, the network consumes more memory.
The idea underlying deep supervision is to directly supervise the hidden layers rather than just pay attention to the output layer. In GoogLeNet [45], supervising two hidden layers of a 22-layer network achieves better effects. Dou et al. [46] applied deep supervision to segment 3D liver CT scans. After the features of the lower and middle layers were deconvolved in the convolutional network, they were combined with the output layer, reducing training and validation errors and granting the network better convergence. Chen et al. [47] utilized three classifiers to classify features in intermediate layers. The outputs of the classifiers serve as moderators during training, and the network combines multilevel contextual information for deep supervision, thereby enhancing the recognition capabilities of the network.
Regarding the problems of too many parameters and low segmentation precision in conventional CNNs, we modified DMFNet, a group convolution network with a smaller number of parameters. To enhance the communication between groups in DMFNet, we replaced the regular group convolution with learnable group convolution. To make full use of the feature maps, we added a skip connection structure between the learnable group convolution modules and introduced deep supervision to merge output images in the network output stage, thereby enhancing the segmentation precision of the network. We refer to this lightweight brain tumor image segmentation algorithm with learnable grouping as DLSDNet.

Lightweight Brain Tumor Image Segmentation Network with Learnable Grouping
We modified DMFNet for our segmentation network, replacing the regular group convolution in the feature extraction stage of DMFNet with learnable group convolution. We also added a skip connection and introduced deep supervision to the network output stage to merge the network outputs.

DMFNet.
DMFNet is a lightweight 3D convolutional neural network. The network structure is similar to that of U-Net. It divides complex neural networks into lightweight networks or fiber sets, replaces regular convolution with group convolution, and uses a multiplexer to exchange information.
The multiplexer consists of two 1 × 1 × 1 convolutions to promote information flow between fibers [39]. To expand the receptive field and capture the multiscale 3D spatial correlations of the brain tumors, we added dilated group convolution to the fiber units in the encoder stage. The feature extraction stage involves multiple dilated multifiber units (DMFunits). In the output stage, we used a regular group convolution multifiber unit (MFunit), as shown in Figure 1.
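The parameter saving that motivates group convolution can be illustrated with a short count. This is a minimal sketch rather than the DMFNet implementation: the helper `conv3d_params` and the channel and group numbers are illustrative assumptions, and bias terms are ignored.

```python
def conv3d_params(c_in, c_out, kernel=3, groups=1):
    """Weight count of a 3D convolution with cubic kernels (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each group connects c_in/groups input channels to c_out/groups output channels.
    return groups * (c_in // groups) * (c_out // groups) * kernel ** 3


# Illustrative sizes only (not taken from the paper).
regular = conv3d_params(32, 32)             # one dense 3 x 3 x 3 convolution
grouped = conv3d_params(32, 32, groups=16)  # same layer split into 16 groups
print(regular, grouped, regular // grouped)
```

With G groups the weight count shrinks by a factor of G, which is also why the groups stop exchanging information unless a mechanism such as the multiplexer is added.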

Learnable Grouping.
To further enhance communication for group convolution and the feature extraction capabilities of convolutions, we replaced the group convolution in DMFNet with learnable group convolution (LGConv).
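Suppose that a convolutional feature map is

$$F \in \mathbb{R}^{N \times C_{in} \times H \times W}, \tag{1}$$

where $N$, $C_{in}$, $H$, and $W$, respectively, denote the number of samples in a mini-batch, the number of channels, and the height and width of the channels. If a convolution with $k \times k$ kernels is applied to $F$, the output feature map is $O \in \mathbb{R}^{N \times C_{out} \times H \times W}$, where each output unit is $o_{ij} \in \mathbb{R}^{N \times C_{out}}$. Learnable group convolution (LGConv) can be defined as follows:

$$o_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} f_{(i+m)(j+n)} \left( U \odot \omega_{mn} \right), \tag{2}$$

where $f_{(i+m)(j+n)} \in \mathbb{R}^{N \times C_{in}}$ represents a hidden unit of the input feature map $F$, $\omega_{mn} \in \mathbb{R}^{C_{in} \times C_{out}}$ denotes the convolution weight, and $\odot$ indicates the element-wise product.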
LGConv is an extension of group convolution. It uses the binary relation matrix $U \in \{0, 1\}^{C_{in} \times C_{out}}$ to learn the grouping. Many convolutions can be regarded as special cases of learnable grouping.
Let $U = \mathbf{1}$, the all-ones matrix, which gives $\mathbf{1} \odot \omega_{mn} = \omega_{mn}$ and represents a regular convolution, as shown in Figure 2(a). Let $U = I$, where $I$ is an identity matrix. $I \odot \omega_{mn}$ then becomes a diagonal matrix in which all off-diagonal elements are 0, as shown in Figure 2(b), and LGConv becomes the depth-wise convolution used in depth-wise separable convolution [48]. If $U$ is a binary block diagonal matrix as shown in Figure 2(c), then $U \odot \omega_{mn}$ divides the channels into groups. If each diagonal block of $U$ is a constant matrix of ones, then LGConv represents regular group convolution in which adjacent channels are grouped together. If $U$ is an arbitrary binary matrix as shown in Figure 2(d), the result is an unstructured convolution. Thus, appropriately constructing the binary relation matrix $U$ can produce various convolutional operations.
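The special cases above can be checked on a toy weight slice. The following is a minimal sketch in plain Python: the 4 × 4 size, the toy weights, and the helper names (`hadamard`, `all_ones`, `identity`, `block_diagonal`, `active_weights`) are illustrative assumptions, not part of the proposed implementation.

```python
def hadamard(u, w):
    """Element-wise product U ⊙ w of two equally sized matrices."""
    return [[a * b for a, b in zip(u_row, w_row)] for u_row, w_row in zip(u, w)]


def all_ones(n):
    return [[1] * n for _ in range(n)]


def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def block_diagonal(n, groups):
    """Binary block-diagonal matrix: adjacent channels share a group."""
    size = n // groups
    return [[int(i // size == j // size) for j in range(n)] for i in range(n)]


def active_weights(u):
    """Number of weights that survive the mask U."""
    return sum(sum(row) for row in u)


n = 4
w = [[i * n + j + 1 for j in range(n)] for i in range(n)]  # toy weight slice w_mn

regular_conv = hadamard(all_ones(n), w)         # U = 1: every weight kept
depthwise = hadamard(identity(n), w)            # U = I: only the diagonal kept
group_conv = hadamard(block_diagonal(n, 2), w)  # two groups of adjacent channels

print(active_weights(all_ones(n)), active_weights(identity(n)),
      active_weights(block_diagonal(n, 2)))
```

The mask only decides which input-output channel pairs are connected; the surviving weights themselves are left unchanged.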
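To reduce the complexity of $U$, we decompose it into $K$ submatrices:

$$U = U_1 \otimes U_2 \otimes \cdots \otimes U_K, \quad U_k \in \{0, 1\}^{C_{in}^{k} \times C_{out}^{k}}, \tag{3}$$

where $\otimes$ indicates the Kronecker product. Thus, $\prod_{k=1}^{K} C_{in}^{k} = C_{in}$ and $\prod_{k=1}^{K} C_{out}^{k} = C_{out}$. Through a series of Kronecker products, we can decompose matrix $U$ into a set of submatrices [49].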
To construct each submatrix $U_k$, let $C_{in} = C_{out}$. This is a general setting in ResNet and ResNeXt. To reduce the parameters in the convolutional operations, we use a single binary variable to express $U_k$:

$$U_k = \tilde{g}_k \mathbf{1} + (1 - \tilde{g}_k) I, \tag{4}$$

where $\mathbf{1}$ denotes a 2 × 2 constant matrix whose elements are all 1, $I$ represents a 2 × 2 identity matrix, $\tilde{g}_k$ is the $k$th component of $\tilde{g}$, $g \in \mathbb{R}^{K}$ is a learnable gate vector of continuous values, $\tilde{g} = \operatorname{sign}(g) \in \{0, 1\}^{K}$ is a binary gate vector output from $g$, and $\operatorname{sign}(\cdot)$ denotes a sign function:

$$\operatorname{sign}(x) = \begin{cases} 0, & x < 0, \\ 1, & x \geq 0. \end{cases} \tag{5}$$

We can combine equation (4) with equation (3) as follows:

$$U = \left( \tilde{g}_1 \mathbf{1} + (1 - \tilde{g}_1) I \right) \otimes \cdots \otimes \left( \tilde{g}_K \mathbf{1} + (1 - \tilde{g}_K) I \right). \tag{6}$$

With such a structural relationship, the parameter in need of optimization becomes $g$, so the number of parameters in $U$ reduces from $C_{in} \cdot C_{out}$ to $\log_2 C_{in}$. By construction, all of the diagonal elements of $U$ are 1. When $K = 3$ and $\tilde{g}_1 = 1$, $\tilde{g}_2 = 1$, $\tilde{g}_3 = 0$, equation (6) becomes $\mathbf{1} \otimes \mathbf{1} \otimes I$, which is an 8 × 8 matrix with two groups, as shown in Figure 3(a). When $\tilde{g}_1 = 0$, $\tilde{g}_2 = 1$, $\tilde{g}_3 = 0$, equation (6) becomes $I \otimes \mathbf{1} \otimes I$, which is an 8 × 8 matrix with four groups, as shown in Figure 3(b). These examples show that LGConv can group nonadjacent channels. In this way, only three continuous parameters $g_1$, $g_2$, and $g_3$ are needed to generate $\tilde{g}_1$, $\tilde{g}_2$, and $\tilde{g}_3$ and to learn the original large 8 × 8 matrix, in which 64 parameters would otherwise require training.
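The two 8 × 8 examples can be reproduced with a few lines of plain Python. This is a minimal sketch under stated assumptions: the helper names (`kron`, `binarize`, `relation_matrix`, `channel_groups`) and the continuous gate values are hypothetical, and the binarization threshold at zero is assumed. How gradients pass through the binarization during training is not covered here.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
             for j in range(len(a[0]) * cols_b)]
            for i in range(len(a) * rows_b)]


ONES = [[1, 1], [1, 1]]  # 2 x 2 constant matrix of ones
EYE = [[1, 0], [0, 1]]   # 2 x 2 identity matrix


def binarize(x):
    """Binary gate from a continuous gate (threshold at 0 is an assumption)."""
    return 1 if x >= 0 else 0


def relation_matrix(gates):
    """Build U as a chain of Kronecker products of 2 x 2 submatrices."""
    u = [[1]]
    for g in gates:
        u = kron(u, ONES if binarize(g) else EYE)
    return u


def channel_groups(u):
    """Each distinct row pattern of U is one group of connected channels."""
    return sorted({tuple(j for j, v in enumerate(row) if v) for row in u})


# Hypothetical continuous gates that binarize to (1, 1, 0) and (0, 1, 0).
two_groups = channel_groups(relation_matrix([0.7, 0.2, -0.5]))
four_groups = channel_groups(relation_matrix([-0.3, 0.2, -0.5]))
print(two_groups)
print(four_groups)
```

In both cases the grouped channels are interleaved rather than adjacent, which is the property the paragraph above attributes to LGConv.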
This study replaced the regular group convolution in the network with learnable group convolution to enhance the agility of the network and increase the precision of the segmentation network. The network units following the replacement are as shown in Figure 4.

Skip Connection Unit.
This study proposed a novel skip connection unit to extract early feature maps and enhance feature reuse. Features are transferred between key layers so that the radiation range of early features expands to deeper levels, thereby enhancing the global integration of information flow. The neural network mapping function of the novel skip connection can be expressed as follows:

$$x_n = F_n\left( \left[ x_{n_i}, x_{n_{ii}}, \ldots, x_{n_{i \cdots ii}} \right] \right), \tag{7}$$

where $F_n(\cdot)$ is the nonlinear transform after each level and $n$ indicates level $n$. The output of layer $n$ is expressed as $x_n$, and $[x_{n_i}, x_{n_{ii}}, \ldots, x_{n_{i \cdots ii}}]$ refers to the cascade of feature maps generated by the selected layers $n_i, n_{ii}, \ldots, n_{i \cdots ii}$.
Downsampling of the feature maps in the upper layers is first conducted, and then, cascading is used to merge the feature maps with the posterior layer features. Each input includes the features selected from the first layer of the current block and the last layer of the previous block. e structural schematics with the newly added skip connection are displayed in Figure 5.
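The downsample-then-cascade step can be sketched on toy data. This is an illustration only, with assumptions made for brevity: feature maps are lists of one-dimensional channels instead of 3D volumes, max pooling with stride 2 stands in for the downsampling, and cascading is read as channel-wise concatenation. The helper names `max_pool` and `skip_concat` are hypothetical.

```python
def max_pool(channel, stride=2):
    """Non-overlapping max pooling of one channel (1D for brevity)."""
    return [max(channel[i:i + stride])
            for i in range(0, len(channel) - stride + 1, stride)]


def skip_concat(earlier_features, current_features):
    """Pool the earlier feature maps, then concatenate along the channel axis."""
    pooled = [max_pool(channel) for channel in earlier_features]
    expected = len(current_features[0])
    if any(len(channel) != expected for channel in pooled):
        raise ValueError("spatial sizes do not match after pooling")
    return pooled + current_features


earlier = [[1, 3, 2, 0], [5, 4, 7, 6]]  # 2 channels at the finer resolution
current = [[9, 8]]                      # 1 channel at the coarser resolution
merged = skip_concat(earlier, current)
print(merged)
```

The merged map has three channels, so the next level sees both its own features and the pooled earlier ones.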

Deep Supervision.
Deeper networks encode higher-level features. In the training of deep neural networks, deep supervision helps to reduce overfitting, extract more meaningful features, promote network convergence, and solve the problem of vanishing gradients [46,50]. By adopting deep supervision in every stage of the decoder, the outputs of each intermediate stage can be used for supervision. Via upsampling, the output of each decoder stage can be adjusted to have the same dimensions as the final output segmentation map. The outputs of these intermediate stages are merged into the final output segmentation map, and then softmax is used to derive the probability map. Losses can be calculated using the ground truths and the softmax outputs. In this way, the intermediate stages and the final output both take part in the loss and in gradient backpropagation, and the outputs of the intermediate stages also gradually approach the ground truths. Figure 5 presents the structure of the network following the inclusion of deep supervision. We refer to this network model (which includes LGConv, a skip connection, and deep supervision) as DLSDNet.
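The merge of intermediate outputs can be sketched as follows. This is a toy illustration under explicit assumptions: the stage outputs are one-dimensional logit maps, nearest-neighbor upsampling replaces the trilinear interpolation used in the network, and "merging" is assumed to be an element-wise sum, which the text does not specify. The helper names are hypothetical.

```python
import math


def upsample_nearest(values, factor):
    """Repeat every value `factor` times (stand-in for trilinear upsampling)."""
    return [v for v in values for _ in range(factor)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_stage_outputs(stages, target_len):
    """Upsample each stage to the final size, sum the logits (assumed merge),
    and apply a per-voxel softmax over the classes."""
    n_classes = len(stages[0])
    merged = [[0.0] * target_len for _ in range(n_classes)]
    for stage in stages:
        factor = target_len // len(stage[0])
        for c, class_map in enumerate(stage):
            for i, v in enumerate(upsample_nearest(class_map, factor)):
                merged[c][i] += v
    return [softmax([merged[c][i] for c in range(n_classes)])
            for i in range(target_len)]


# Three decoder stages, two classes, final resolution of four voxels.
coarse = [[2.0], [0.0]]
middle = [[1.0, -1.0], [0.0, 1.0]]
final = [[0.5, 0.5, -0.5, -0.5], [0.0, 0.0, 0.0, 1.0]]
probabilities = merge_stage_outputs([coarse, middle, final], target_len=4)
labels = [p.index(max(p)) for p in probabilities]
print(labels)
```

Because the loss is computed on the merged probability map, every stage that contributes to the sum receives a gradient.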

Dataset and Evaluation Indices.
In our experiment, we employed the BraTS 2018 dataset [51,52], which contains multimodal MRI scans from multiple institutions and has served as the official dataset in a challenge.
This dataset comprises four types of MRI sequences: T1 weighted, T2 weighted, postcontrast T1 weighted, and FLAIR. The dimensions of the data are 240 × 240 × 155. The dataset contains a training set and a validation set.
The training set provides 285 sets of data for training with ground truth, while the validation set contains 66 sets of data with no ground truth. The objective of the BraTS 2018 challenge was to segment the data images into background, necrotic and nonenhancing tumors, edemas, and enhancing tumors. The researchers had to submit their validation results to an online evaluation platform to validate the effectiveness of their algorithms.
Segmentation accuracy was gauged using the Dice similarity coefficient and the Hausdorff distance. The former indicates the degree of similarity between the experimental segmentation results and the ground truth, with a higher value indicating greater segmentation precision. The latter calculates the maximum distance between the contours of the segmentation results and the ground truths to indicate the segmentation quality of the tumor boundaries, and a smaller value represents better segmentation performance. The number of model parameters (Parameters) represents the computer memory consumed by the model, and the amount of calculation (FLOPs) represents the computational cost of running the model, expressed as a number of floating-point operations. For a convolutional layer, the amount of calculation is determined by $k_h$, $k_w$, and $k_d$, the height, width, and depth of the convolution kernel; by $C_{in}$ and $C_{out}$, the number of input and output channels; and by $h$, $w$, and $d$, the height, width, and depth of the input data.
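The indices in this paragraph can be sketched on toy data. This is an illustration under stated assumptions rather than the evaluation code of the challenge platform: segmentations are represented as sets of voxel coordinates, the Hausdorff distance is the plain maximum version described above, and the cost function counts one operation per multiply-accumulate with stride 1 and no bias. The exact FLOPs expression is assumed, since it is built only from the symbols listed in the text.

```python
import math


def dice(pred, truth):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff(a, b):
    """Maximum symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))


def conv3d_cost(k_h, k_w, k_d, c_in, c_out, h, w, d):
    """Parameters and multiply-accumulate count of one 3D convolution
    (assumed form: stride 1, same-size output, bias ignored)."""
    params = k_h * k_w * k_d * c_in * c_out
    flops = params * h * w * d
    return params, flops


truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
pred = {(0, 0, 0), (0, 0, 1), (0, 0, 3)}
print(dice(pred, truth), hausdorff(pred, truth))
print(conv3d_cost(3, 3, 3, 32, 32, 128, 128, 128))
```

In the toy case the two masks share two of three voxels, and the stray voxel at (0, 0, 3) alone sets the Hausdorff distance, which shows why this index is sensitive to boundary outliers.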
For the loss function, we adopted generalized Dice loss (GDL), which was proposed to cope with data imbalance issues. Using Dice loss is disadvantageous for the detection of small targets, so GDL combines the Dice terms of multiple classes and uses a weight for each class to balance the segmentation results:

$$\mathrm{GDL} = 1 - 2 \, \frac{\sum_{l=1}^{L} w_l \sum_{n=1}^{N} g_{ln} p_{ln}}{\sum_{l=1}^{L} w_l \sum_{n=1}^{N} \left( g_{ln} + p_{ln} \right)}.$$

In the formula given above, $g_{ln}$ denotes the ground truth of voxel $n$ in class $l$, $p_{ln}$ is the corresponding predicted value, $N$ is the total number of voxels, $L$ is the total number of classes, and $w_l$ is the weight of each class:

$$w_l = \frac{1}{\left( \sum_{n=1}^{N} g_{ln} \right)^{2}}.$$
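A direct transcription of the generalized Dice loss into plain Python may help when checking an implementation. This is a minimal sketch: the function name, the nested-list layout (classes by voxels), and the small `eps` term added for numerical stability are assumptions that do not appear in the text.

```python
def generalized_dice_loss(truth, pred, eps=1e-8):
    """GDL for `truth` and `pred` given as lists over classes of lists over voxels.

    `eps` is an added stabilizer (assumption) that avoids division by zero
    when a class is absent from the ground truth.
    """
    numerator = 0.0
    denominator = 0.0
    for g_l, p_l in zip(truth, pred):
        w_l = 1.0 / (sum(g_l) ** 2 + eps)  # inverse squared class volume
        numerator += w_l * sum(g * p for g, p in zip(g_l, p_l))
        denominator += w_l * sum(g + p for g, p in zip(g_l, p_l))
    return 1.0 - 2.0 * numerator / (denominator + eps)


# Two classes, four voxels; the second class is a small target.
truth = [[1, 1, 1, 0], [0, 0, 0, 1]]
perfect = [[1, 1, 1, 0], [0, 0, 0, 1]]
inverted = [[0, 0, 0, 1], [1, 1, 1, 0]]
loss_perfect = generalized_dice_loss(truth, perfect)
loss_inverted = generalized_dice_loss(truth, inverted)
print(loss_perfect, loss_inverted)
```

Because each class is weighted by the inverse of its squared volume, the single-voxel class contributes as much to the loss as the three-voxel class.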

Experiment Environment and Preprocessing.
We used the deep-learning platform PyTorch to implement the proposed model. We employed two NVIDIA GeForce RTX 2080 Ti GPUs for 500 epochs of training. While training the model, we used the Adam optimizer with a self-adjusting learning rate. The initial learning rate was 0.0001. L2 regularization was applied to the model, with a weight decay rate of $10^{-5}$. Due to the different imaging equipment and protocols, data artifacts were present within the MRI images [53], so we used the N4 bias field correction algorithm to correct the bias in the T1-weighted, T2-weighted, and postcontrast T1-weighted modes.
Due to GPU memory limitations, we expanded the data using the following techniques: random cropping of the original images to 128 × 128 × 128, random mirroring of the axial, coronal, and sagittal views with a probability of 0.5, random rotation of the images between angles [−10°, 10°], random changes to intensities between [−0.1, 0.1], and scaling of images between [0.9, 1.1].
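The random choices behind these techniques can be sketched as a parameter sampler. This is a hedged illustration: the function and key names are hypothetical, uniform sampling within each stated range is assumed, and applying the sampled parameters to the image volumes (resampling, interpolation) is left out.

```python
import random

VOLUME_SHAPE = (240, 240, 155)  # original image dimensions
CROP_SHAPE = (128, 128, 128)    # size of the random crop


def sample_augmentation(rng):
    """Draw one set of augmentation parameters from the ranges in the text."""
    return {
        # Corner of the crop, chosen so that the crop stays inside the volume.
        "crop_origin": tuple(rng.randint(0, full - crop)
                             for full, crop in zip(VOLUME_SHAPE, CROP_SHAPE)),
        # One independent coin flip per axis, each with probability 0.5.
        "flip_axes": tuple(rng.random() < 0.5 for _ in range(3)),
        "rotation_deg": rng.uniform(-10.0, 10.0),
        "intensity_change": rng.uniform(-0.1, 0.1),
        "scale": rng.uniform(0.9, 1.1),
    }


rng = random.Random(0)  # fixed seed so that runs are repeatable
samples = [sample_augmentation(rng) for _ in range(100)]
print(samples[0])
```

Sampling the parameters separately from applying them keeps the augmentation reproducible and makes it easy to apply the same transform to the four MRI modes and the label map.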

Implementation of Experiment.
Our model uses residual units as building blocks. The overall structure is similar to that of an encoder-decoder. The inputs are four channels of data corresponding to the four modes of MRI data. During the feature encoding stage, the residual units with learnable groups are used, and the modified skip connection enhances the multiscale representation capabilities. During the decoding stage, the high-resolution features from the encoder are cascaded with the decoder features to supplement lost information. Upsampling is performed using trilinear interpolation. After each convolution block, Batch Normalization and Rectified Linear Unit (ReLU) activation are executed.
At the encoder end of the network, the skip connection was modified for information links between stages. Before encoding, max pooling is used for the downsampling of the high-level features in the encoder to match the scale on lower levels; in other words, max pooling is applied to the output of the previous level. Previous features can be accessed directly for the inputs of each stage to enhance feature reuse.
In the decoding stage, the outputs of each stage can guide the final segmentation results, adjusting the outputs of each decoder so that they have the same dimensions as the final output segmentation map, thereby producing three different outputs that are combined and then subjected to the softmax operation to derive the final segmentation map.

Experiment Results and Analysis.
To verify the effectiveness of the proposed network, we trained and validated the proposed network and the original DMFNet using the same training set and validation set. Table 1 presents the experiment results of the original DMFNet model, a network using learnable group convolution (DMFNet + LG), and the proposed DLSDNet (DMFNet + LG + Skip + DS). Performance in terms of brain tumor image segmentation was compared using the Dice score and the Hausdorff distance. Wt, Et, and Tc denote whole tumors, enhancing tumors, and tumor cores, respectively.
A comparison of the first and second rows in Table 1 shows that using learnable grouping can improve the Dice score by 1.2% for Tc, thereby indicating that learnable grouping facilitates the detection of small targets. A comparison of the second and third rows in Table 1 shows that the addition of the skip connection significantly improves the [...]
Table 2 compares the proposed network model and typical brain tumor image segmentation networks. As shown in Table 2, the proposed network model exhibits better segmentation performance than 3D U-Net [31] [...]. Compared with the winning methods in the BraTS 2018 challenge (No New-Net and NVDLMED) [37,38], DLSDNet achieved the best Dice score and Hausdorff distance on Tc with 86.20% and 5.74 mm, respectively. The proposed network model DLSDNet involved a smaller number of network model parameters, was less complex, and occupied fewer resources due to fewer FLOPs, as calculated in [39].
The visualized segmentation results are as shown in Figure 6. As can be seen, the original DMFNet can already roughly segment the contours of the tumor region. However, in some details, such as smaller tumor core targets, the segmentation performance is poorer. Figure 6(b) displays the results using the network model with learnable grouping. The segmentation performance of this model with regard to tumor cores was superior to that of the original DMFNet. Figure 6(c) shows the segmentation results using the network model with a skip connection and deep supervision. The segmentation performance of this model with regard to whole tumors, enhancing tumors, and tumor cores improved further, and the results were even closer to the ground truth. As can be seen, the proposed DLSDNet model is better at segmenting small targets.

Conclusions
Brain tumors vary significantly in intensity and are irregular in shape. This study modified DMFNet to use fewer parameters and introduced LGConv to the feature extraction stage so that it can flexibly choose the number of groups based on dataset and network characteristics. This facilitates adaptation to more complex data features and has wider applicability. We added a skip connection between LGConv blocks to enable thorough utilization of multiscale information and to enhance feature reuse. We added deep supervision to the network output stage to merge the outputs of different stages and to reconstruct outputs with the same dimensions for the extraction of more distinctive features and the enhancement of segmentation accuracy. Experiments using the BraTS 2018 dataset revealed that the proposed model is superior to networks with conventional U-Net structures and has greater precision than other lightweight brain tumor image segmentation methods. The segmentation precision of the proposed network with regard to whole tumors, enhancing tumors, and tumor cores was 90.25%, 80.36%, and 86.20%. Compared with the method that won first place in the BraTS 2018 challenge (NVDLMED), the Parameters and FLOPs are reduced by factors of 7 and 40, respectively. The proposed network, thus, has significantly greater precision in the segmentation of enhancing tumors and tumor cores than the original DMFNet and also offers strong competition against the methods that won first and second place in the BraTS 2018 challenge. Furthermore, the proposed method uses fewer parameters and is a less complex model.

Data Availability
The data used to support the findings of this study can be obtained from the corresponding author on request.
Mathematical Problems in Engineering