Bayesian Fully Convolutional Networks for Brain Image Registration

The purpose of medical image registration is to find geometric transformations that align two medical images so that the corresponding voxels on two images are spatially consistent. Nonrigid medical image registration is a key step in medical image processing, such as image comparison, data fusion, target recognition, and pathological change analysis. Existing registration methods only consider registration accuracy but largely neglect the uncertainty of registration results. In this work, a method based on the Bayesian fully convolutional neural network is proposed for nonrigid medical image registration. The proposed method can generate a geometric uncertainty map to calculate the uncertainty of registration results. This uncertainty can be interpreted as a confidence interval, which is essential for judging whether the source data are abnormal. Moreover, the proposed method introduces group normalization, which is conducive to the network convergence of the Bayesian neural network. Some representative learning-based image registration methods are compared with the proposed method on different image datasets. Experimental results show that the registration accuracy of the proposed method is better than that of the methods, and its antifolding performance is comparable to that of fast image registration and VoxelMorph. Furthermore, the proposed method can evaluate the uncertainty of registration results.


Introduction
Image registration is an image-processing process that aligns two or more images of the same scene captured at different times and different perspectives or by using different sensors [1,2]. Nonrigid medical image registration is a key step in medical image processing. In clinical diagnosis, it can judge a patient's progress by aligning the brain magnetic resonance images of the patient with Alzheimer's disease at different periods [3,4]. In tumor surgery, rapid medical image registration can aid doctors in surgical navigation [5][6][7]. In demography research, image registration is helpful for studying differences in the brain tissue structures of people from different countries.
However, traditional registration methods face real-time challenges. Large amounts of input data must be processed when performing nonrigid registration modeling on 3D data with high resolution, a step that requires a long time. e optimization part usually uses an iterative algorithm, thereby further increasing the total time needed to obtain the final result [35].
With the development of deep learning, several researchers have proposed deep neural networks to learn features of unregistered images. Registration methods based on deep learning can be supervised [36,37] or unsupervised [38][39][40]. Most supervised registration methods rely on anatomical labels. However, marking anatomical labels is difficult, a step that not only consumes a lot of time of experts but also sometimes hardly guarantees accuracy. In practice, supervised registration methods are restricted. In their place, some scholars have proposed unsupervised medical image registration methods.
Several unsupervised medical image registration methods have been proposed. VoxelMorph, a recently proposed unsupervised learning-based method for deformable medical image registration, has a better registration accuracy and a faster speed than other registration methods [41]. Some researchers combined the advantages of classical methods and learning-based methods to produce a probabilistic generative model and derive a diffeomorphic inference algorithm [42]. A registration method called fast image registration (FAIM) for 3D medical image registration has been proposed. Compared with the registration network based on U-net, FAIM has fewer trainable parameters to obtain a higher registration accuracy. In addition, FAIM has less irreversible regions because of the penalty loss for negative Jacobian determinants [43]. Some scholars recently proposed the Probab-Mul registration method, which is a feature-level probability model that can perform regularization on the hidden layers of two deep convolutional neural networks [44]. e focus of the present study is mainly on the accuracy of registration methods and barely on the uncertainty of their registration results. e uncertainty of registration results is very important in clinical applications as it can be used to judge whether the registration result is meaningful. For example, if a model is modeling normal human brain images, it will never see abnormal brain images that have brain tumors, malformations, and edema. When the uncertainty of a registration result is higher than a certain threshold, the source image can be judged as an abnormal brain image. During testing, the Bayesian neural network can obtain the uncertainty of results. Bayesian neural networks are used in autonomous driving, classification tasks, and segmentation tasks. Some researchers recently applied Bayesian neural networks to image registration. Deshpande et al. employed a Bayesian deep learning approach for deformable medical image registration. ey reported that this approach has a better performance than existing state-ofthe-art approaches [45]. Khawaled  eir approach provided better estimates of the deformation field, thereby improving registration accuracy [46]. However, these aforementioned methods do not sufficiently consider and discuss the uncertainty of registration results. Furthermore, they are suitable for 2D images only.
In this paper, a method based on Bayesian fully convolutional networks is proposed for image registration. e proposed method generates a geometric uncertainty map to measure the uncertainty of registration results. us, when the source image obtained is abnormal data, the model will provide a hint that the source image is problematic instead of immediately accepting the registration result of the model. Group normalization (GN) is also added in networks. GN groups channel similar features into one group. Hence, GN can make the model easier to optimize and converge to improve registration accuracy. e performance of the registration model in evaluating uncertainty is determined.
is paper is organized as follows. Section 2 introduces the principle of the proposed method. Section 3 describes the experimental setup. Section 4 discusses the experimental results. Finally, Section 5 summarizes the results of the study and considers directions for future work. Figure 1 presents an overview of the proposed method. We used CNN to model the function g c (S, T) � u, where c is the parameter of the convolutional layers, S is the source image, T is the target image, and u is the displacement field between the source image and the target image. S and T are defined over a 3D spatial domain Ω ⊂ R 3 . For each voxel p ∈ Ω, u(p) is the displacement, where the map ϕ � Id + u is formed using an identity transform and u. e network takes S and T as the input and uses a set of parameters c to calculate ϕ. We used a spatial transformation function to warp S to S ∘ ϕ and evaluate the similarity between S ∘ ϕ and T. During testing, given the images T and S of the test set, we obtained the registration field by evaluating g c (S, T).

Architecture.
In this section, the architecture of the convolutional neural network used in the proposed method is described in detail ( Figure 2). During training, the moving image and the target image are stacked together as the input fed into the Bayesian fully convolutional network module (BFCNM) [43]. e first layer is inspired by Google's inception module. e purpose of this layer is to compare and capture information on different spatial scales of later registration. Parametric rectified linear unit [47] activation is utilized at the end of each convolution block, and linear activation is employed in the last layer to generate the displacement field. Instead of inserting max-pooling layers, a kernel stride of 2 is used to reduce image size. ree "add" skip connections are present in downsampling and upsampling [43]. e "add" skip connection is conducive to the fusion of upsampling information and its corresponding downsampling information. During the upsampling phase, two Bayesian blocks are used.
e Bayesian blocks are composed of a transposed convolutional layer, a convolutional layer, PReLU, a group normalization layer, and a Dropout layer. e detail of the Bayesian block is shown in Figure 2  Dropout) is introduced; it is interpreted as a Bayesian approximation of Gaussian processes.

Spatial Transformation Function.
e spatial transformation function uses the ϕ generated by BFCNM to resample S and obtain the warped image S ∘ ϕ. e proposed method learns the optimal parameter values by minimizing the difference between S ∘ ϕ and T. A differentiable operation is constructed on the basis of the spatial transformer network [41,48] via the standard gradient-based method to calculate S ∘ ϕ. For each voxel p, a voxel position p′ � p + u(p) is calculated in the source image. e image values are only defined in integer positions.
us, linear interpolation is performed at eight adjacent voxels: where Z(p ′ ) is the voxel neighbors of p′ and d is iterated in the dimension of Ω. Errors can be backpropagated during the optimization process because gradients or subgradients can be calculated.

Loss Function.
e total training loss is the sum of an image dissimilarity term L image and the regularization terms, as shown in Table 1. e loss function [43] is defined as follows: e main loss L image with cross-correlation (CC) in this paper is for the similarity between the warped source image and the target image. e definition of CC is as follows: where t(x) is the grey value of the target image, t(x) is the average grey value of the target image, s ∘ ϕ(x) is the grey value of the warped image, and s ∘ ϕ(x) is the average grey value of the warped image. e first regularization term R 1 regularizes the overall smoothness of the predicted displacements. e parameter of the regular term is α, and its value is always 1. e purpose of the second regularization is to penalize transformations that have many negative Jacobian determinants. e parameter of the regular term is β. e transformations of all nonnegative Jacobian determinants will not be penalized. If the Jacobian determinant is negative, then the transformation result will be folded, which is physically unrealistic.

Group Normalization.
Group normalization (GN) is a feature-normalization technique that is inserted into the architecture of deep neural networks as a trainable process. e purpose of GN is to reduce internal covariant shifts. With training iterations, the distribution of features often continuously changes. Under this condition, the parameters in the convolutional layer must be continuously updated to adapt to the changes in distribution. GN normalizes the feature to a fixed distribution (mean value is zero, and the standard deviation is 1) and then adjusts the feature to an ideal distribution, which is learned in the training process [48].
Here, x is the feature computed by a layer, and i is an index. In the case of 3D images, i � (i N , i C , i D , i H , i W ) is a 5D vector indexing the features in (N, C, D, H, W) order, where N is the batch axis; C is the channel axis; and D, H, and W are the spatial depth, height, and width axes, respectively.
Formally, the group normalization layer must compute for mean µ and standard deviation σ in a set S i . S i is a group and defined as follows: where G is the number of groups, which is a predefined hyperparameter; C/G is the number of channels in each group; ⌊ · ⌋ represents floor operation; k C /(C/G) � i C / (C/G) means that the indexes i and k are in the same group of channels, assuming that each group of channels is stored in sequential order along the C axis; and S i contains all voxels along the (D, H, W) axes and along with a group of (C/G) channels. e mean μ i and standard deviation σ i of S i are computed as follows: where ε is a small constant, and m is the size of set S i . GN then performs the following computation: GN learns a per-channel linear transform to compensate for the possible loss of representational ability: where c and β are trainable scale and shift, respectively. Given the S i in (4), the GN layer is defined by equations (5)- (7). Specifically, the voxels in the same group are normalized by the same μ i and σ i . GN also learns the c and β of each channel.

Bayesian Neural Network.
In this section, the registration network based on Bayesian inference is introduced. e credibility of the results is important in solving medical Table 1: Loss and regularization functions used.
Journal of Healthcare Engineering problems. Several researchers proposed a Bayesian neural network by studying the uncertainty of deep learning [49,50]. e Bayesian neural network is a statistical model derived from the perspective of probability. e parameters in the model are initialized by an a priori distribution, and the parameters are further optimized by Bayesian inference.
In the Bayesian neural network, given the data set D and weight W, the dataset D contains data X and label Y. e goal of Bayesian neural network training is to optimize the parameters, that is, to seek the posterior distribution of weight W. According to Bayesian criterion, the posterior distribution of weight W is written as where x and y are the data in the training set and the corresponding label, respectively, and p(W) is the initial value of the parameter (i.e., the prior distribution). e posterior distribution of the labels predicted by the Bayesian neural network can then be calculated as follows [51]: In equation (9), the weight parameter W in the network is used to predict the unknown distribution of label y * . In the Bayesian neural network model, the solution of the posterior distribution p(W|D) of the parameters is the key to the entire model. However, this solution is computationally intractable for neural networks of any size. erefore, many researchers use approximate methods to obtain the solution [52,53].
A common approach is to use variational inference to approximate the posterior distribution of the weights. is method introduces the variational distribution q θ (W) of weight w, which is parameterized on θ. e approximate posterior distribution q θ (W) is obtained by minimizing the Kullback-Leibler (KL) divergence between q θ (W) and the true posterior distribution p(W|D).
Minimizing KL divergence is equivalent to minimizing the Negative Evidence Lower Bound (NELBO): with respect to the variational parameter θ. e first term (commonly referred to as the expected log-likelihood) encourages q θ (W) to place its mass on the configurations of the latent variable that explains the observed data. However, the second term (referred to as prior KL) encourages q θ (W) to be similar to the prior distribution p(W), preventing the model from overfitting. e goal is to develop an explicit and accurate approximation for the expectation.
Our approach uses Bernoulli approximating variational inference and Monte Carlo sampling [54]. In practice, Dropout is used for Bayesian neural network approximation.
When Dropout [55] is applied to the output of a layer, the output can be written as where, for a single K i−1 dimensional input v, the i th layer of a neural network with K i units would output a K i dimensional activation vector; w i is the K i × K i−1 weight matrix; σ(·) is the nonlinear activation function; ⊙ signifies the Hadamard product; z i is a K i dimensional binary vector with its elements drawn independently from z (k) i ∼ Bernoulli(p i ) k � 1,. . ., K i ; and p i is the probability of keeping the output activation.
e solution of the posterior distribution p(W|D) of the parameters is further improved after introducing the Bernoulli distribution into the weight parameters of our model. e Monte Carlo sampling method is used to estimate the first item in (11): log p y n |x n , W n , where W ∧ n is not the maximum posterior estimation but the random variable realizations from the Bernoulli distribution; and W ∧ n ∼ q θ (W), which is the same as applying Dropout to the weights of the network. For the second item in equation (11) (i.e., KL term), the approximate solution is given in the literature [56]. e KL term has been shown to be equivalent to L i�1 ‖W i ‖ 2 2 . us, equation (11) can be rewritten as Equation (14) is the unbiased estimation of equation (11). Interestingly, it is the same as the loss function used in standard neural networks with L 2 weight regularization, and Dropout is applied to all weights of the network. erefore, training such a neural network with stochastic gradient descent has the same effect as minimizing the KL term in (10). is scheme is similar to a Bayesian neural network and can generate a set of parameters that can best explain the observed data while preventing overfitting. Predictions in this model follow (9) replacing the posterior p(W|D) with the approximate posterior q θ (W). e integral can be approximated with Monte Carlo integration [51,54]: Journal of Healthcare Engineering p y * |x * , D ≈ p y * |x * , W q θ (W)dW where W ∧ t ∼ q θ (W), which means that, at test time, the Dropout layers are kept active to keep the Bernoulli distribution over the network weights. is integration is referred to as the Monte Carlo Dropout.
e Monte Carlo Dropout reflects the need to conduct multiple forward propagation processes on the same input. In this manner, the output of "different network structures" can be obtained under the action of Dropout during testing. e prediction results and the uncertainty of the model can be obtained by calculating the average and statistical variance of these outputs. e advantage of Bayesian deep learning is that Monte Carlo Dropout can give a prediction value and the confidence of the predicted value.

Measuring Model Uncertainty.
Uncertainties in a network are a measure of how certain the model is with its prediction. In general, Bayesian modeling has two types of uncertainty. Model uncertainty, also known as Epistemic uncertainty, measures what the model does not know owing to the lack of training data. is uncertainty can be reduced with more data. During testing, model uncertainty can measure whether the testing data exists in the distribution of the training data. Aleatoric uncertainty measures the noise inherent in the observation data and cannot be reduced by collecting more data [51].
By computing the result of stochastic forward passes of the Bayesian neural network, the model's confidence of its output can be estimated. In this paper, the mean μ and the standard deviation σ of all displacements produced by Monte Carlo sampling are calculated. e mean μ is used in the registration image, whereas the standard deviation σ provides an estimate of the uncertainty of registration results. e mean μ of the displacement fields is calculated as follows: where M represents the number of Monte Carlo sampling (M � 48 in this paper). y i represents the displacement field sampled by ith. After calculating the mean value of the displacement fields, the standard variance of the displacement fields can be calculated as follows: where σ can be expressed as the uncertainty of registration results.

Experimental Setup.
e dataset we adopted herein was created by Arno et al. who based it on a collection of 101 T1weighted MRIs from healthy subjects [57]. In this paper, we used brain images from the four subsets of Mindboggle101, namely, NKI-RS-22, NKI-TRT-20, MMRR-21, and OASIS-TRT-20, for a total of 83 images. ese images are already warped to MNI152 space. Each image had a dimension of 182 × 218 × 182, each of which we truncated to 144 × 180 × 144. In the preprocessing stage, we utilized the FMRIB Software Library (FSL) to perform affine registration on NKI-RS-22, NKI-TRT-20, MMRR-21, and OASIS-TRT-20. We initially normalized the voxel intensity of each brain image and then normalized voxel intensity to 0-255. Finally, we performed a registration test on the five main anatomical regions of the cerebral cortex.

Dice Scores.
If the registration field ϕ represents an accurate correspondence, then the corresponding anatomical regions in T and S ∘ ϕ should overlap well. erefore, we evaluated registration accuracy by using the Dice score. e Dice score is defined as follows [43]:

Regularized Penalty Folding.
We also evaluated the regularity of deformation fields. Specifically, the Jacobian matrix captures the local properties of ϕ around voxel p. We counted all nonbackground voxels where the Jacobian determinant det(∇ϕ) < 0 is negative [43]: where δ(·) indicates that if it is true, then the return value is 1.

Uncertainty Evaluation
Metrics. We adopted the method proposed in the literature to evaluate uncertainty performance [51]. We used metrics that incorporate the ground truth label, model prediction, and uncertainty value to evaluate the performance of such models in estimating uncertainty. Figure 3 shows the required processing steps to prepare these quantities for our metrics in a registration example. We computed the map of correct and incorrect values by matching the ground truth labels and the model predictions. We converted the uncertainty map into a map of certain and uncertain predictions by setting the uncertainty threshold T, which varies between the minimum and the maximum uncertainty values in the entire test set. e following indicators can reflect the characteristics of a good uncertainty estimator. Negative predictive value (NPV): in the output of certain results by the model, NPV is the percentage of voxels that is correctly predicted and can be written as a conditional probability: True positive rate (TPR): if a model is making an incorrect prediction, then the proportion of uncertain voxels is called TPR. TPR can be written as a conditional probability: Uncertainty accuracy (UA): UA is the overall accuracy of uncertainty estimation and can be measured as the ratio of the desired cases explained above (TP and TN) over all possible cases: UA � P(correct, certain) + P(uncertain, incorrect) P(correct) + P(incorrect) Clearly, in all the metrics proposed above, higher values indicate that the model performs better. e values of these metrics depend on the uncertainty threshold.

Baseline Methods.
In the comparative study, we used FSL, a comprehensive library of analytical tools for fMRI, MRI, and diffusion tensor imaging brain imaging data, as the baseline to perform an affine registration experiment with 12 degrees of freedom on the test set. We used the second baseline symmetric normalization (SyN) with mutual information as a similarity measure in the publicly available Advanced Normalization Tools (ANTs) software package [58]. We also tested the recently developed CNN-based methods, namely, VoxelMorph [41], FAIM [43], and Probab-Mul [44], and compared their performance with that of the proposed method. e hyperparameters of the CNNbased methods were consistent. Finally, we adopted various methods for ablation study. e method that only adds GN was denoted as Our-GN, and the method that only adds Dropout was labeled as Our-DO. ese two methods were consistent with our method in terms of hyperparameter settings.

Implementation.
We divided the data set into training and test image sets. e training set consisted of all ordered brain image pairs from the union of the NKI-RS-22, NKI-TRT-20, and MMRR-21 subsets, which comprised 3906 pairs in total. e test set consisted of all ordered pairs from the OASIS-TRT-20 subset with 380 pairs in total. We trained FAIM, VoxelMorph, Probab-Mul, and our method on all pairs of images from the training set and then examined their predicted deformations by using the pairs of images from the testing set.

Results and Discussion
In this experiment, we separately trained the proposed networks with different β values. We optimized the parameters by the validation set and reported results in our test set. e predicted deformation field could not guarantee diffeomorphism; therefore, the transformation of irreversible regions caused an image to "fold" on itself. In these regions, the determinant of the Jacobian matrix of the deformation field was negative ( Figure 4). However, spatial folding is physically impossible; hence, this phenomenon causes registration errors in clinical applications. e frequency of such errors limits the application of neural networks in image registration. e mean Dice scores of the different methods across all predicted labels with their corresponding target labels are shown in Table 2. We selected five scales of regularization strength β from 0 to 10 −2 . Results showed that FSL was not suitable for fine registration because of its few registration parameters in the affine registration with 12 degrees of freedom. ANT (SyN) is a nonrigid registration method, and its registration accuracy was found to be higher than that of affine registration. e Dice scores of FAIM slightly decreased as β increased, and its Dice scores were higher than those of VoxelMorph. e registration accuracy of Probab-Mul was slightly better than that of FAIM. e proposed method achieved the highest registration accuracy under all β values.
e results of the ablation study revealed that the Dice score of Our-GN was higher than FAIM by 2.8% on average ( Table 2). Experimental results showed that GN could improve the accuracy of registration. Moreover, the registration accuracy of Our-DO was slightly lower than that of FAIM ( Table 2), illustrating that adding a Dropout layer had little impact on registration accuracy.
When β takes 0, 10 − 5 , 10 − 4 10 − 3 , and 10 − 2 , the Dice scores of our method were higher than those of FAIM by 2.63%, 2.80%, 2.84%, 2.50%, and 3.37%, respectively, higher than those of Probab-Mul by 2.31%, 2.49%, 2.55%, 2.05%, and 2.88%, respectively, and higher than those of Vox-elMorph by 4.78%, 5.17%, 5.27%, 5.21%, and 5.90%, respectively ( Table 2). is result implied that inserting GN layers into the network architecture could indeed enable the network to learn better parameters and could make the network easier to optimize and converge, thereby improving registration accuracy. During training, we set epochs � 10, and each epoch took about 100 min to perform 2900 iterations. Figure 5 presents the boxplot of the Dice scores of the five main anatomical regions of the cerebral cortex when β � 10 −3 .
e Dice scores of ANTs in each label were quite different, indicating that ANTs were unstable. e flatness of the boxplots indicated that the stability of the proposed method was comparable to that of other deep learning methods. e proposed method achieved the highest registration accuracy in the five regions of the cerebral cortex. Figure 6 shows the mean Dice scores corresponding to the different β values of all methods. e accuracy of the proposed method was relatively consistent with different β values and was higher than that of the other methods. Figure 7 visualizes the effects of the second regularization term R 2 (u), which directly penalized "foldings" during training. β � 0 means the regularization was not used, and multiple locations were visible in the transformation whose Jacobian determinants were negative. e number of "foldings" voxels greatly reduced when β � 10 −5 . Only several "folding" voxels were observed when β � 10 −4 . e number of "folding" voxels was almost eliminated when β � 10 −3 .

Data-Regularized Penalty Folding.
We listed the mean values of the number of voxels of different methods whose Jacobian determinants were negative in Table 3. e proposed method had a lower number of "foldings" in the predicted deformations as β increased.

Uncertainty Measure.
In this section, the performance of the registration model in estimating uncertainty was evaluated. Figure 8 shows the results of uncertainty evaluation by the proposed method. In the experiment, the Dropout rate of the Dropout layer was set to 0.5. During the test, the Dropout layer was always on, and 48 Monte Carlo samplings were performed. e threshold T of the uncertainty map had an impact on the uncertainty measure. We set the threshold between 0 and 1 with an interval of 0.1. As the threshold increased, the proportion of the uncertain part   Journal of Healthcare Engineering of the uncertainty map decreased; hence, the TPR curve gradually decreased. e maximum value of 0.756 in the NPV measurement was obtained at the threshold of 0.1. is value slightly decreased as the threshold further increased but still greater than 0.72. If the model was certain about its prediction, then the accuracy of the prediction was higher. Uncertainty accuracy was also the largest (0.77) when the threshold was 0.1 and slightly declined as the threshold further increased. It remained greater than 0.712. Uncertainty accuracy is the overall accuracy of uncertainty measurements. It shows the ratio of the cases we desired in all possible cases. e uncertainty accuracy of our model was relatively high (Figure 8).

Conclusion
We developed an unsupervised 3D medical image registration method that uses Bayesian fully convolutional networks for registration. e proposed method introduces probability distributions for network weights and obtains the uncertainty of registration results. We introduced GN into the neural network architecture, which is conducive to the optimization and convergence of the neural network.
e experimental results showed that the proposed method can obtain higher registration Dice scores than other stateof-the-art models and achieve an antifolding performance comparable to that of FAIM and VoxelMorph. e proposed method can also estimate the uncertainty of registration results. Although penalty folding can reduce the irreversible area of registration result, it cannot guarantee that the irreversible area is zero. us, the nonrigid registration of diffeomorphism with high accuracy is one of our research directions in the future.

Data Availability
All datasets used to support the findings of this study were supplied by the publicly available Mindboggle101 database. e URL to access this data is https://osf.io/yhkde/.

Conflicts of Interest
e authors declare that they have no conflicts of interest regarding the publication of this work.    Figure 8: Quantitative uncertainty estimation performance for registration task using the evaluation metrics. e abscissa is the threshold (T). e ordinate is negative predictive value (NPV), true positive rate (TPR), and uncertainty accuracy (UA), respectively.