Uncertainty Estimation Using Variational Mixture of Gaussians Capsule Network for Health Image Classification

Capsule Networks have shown great promise in image recognition due to their ability to recognize the pose, texture, and deformation of objects and object parts. However, the majority of existing capsule networks are deterministic and have limited ability to express uncertainty. Many of them tend to be overconfident on out-of-distribution data, which makes them less trustworthy and reduces their suitability for practical adoption in safety-critical areas such as health and self-driving cars. In this work, we propose a capsule network based on a variational mixture of Gaussians that trains distributions over network weights, as opposed to a single set of weights, and enables the model to express its predictive uncertainty on out-of-distribution data. Training distributions over weights has the added advantage of avoiding overfitting on smaller datasets, which are common in health and other fields. Although Bayesian neural networks are known to exhibit slow training and convergence, experimental results show that the proposed model retrieves only relevant features, converges faster, is less computationally complex, effectively expresses its predictive uncertainties, and achieves performance comparable to state-of-the-art models. This is an indication that CapsNets can exhibit the transparency, credibility, reliability, and interpretability required for practical adoption.


Introduction
Recently, there has been an upsurge in the adoption of Deep Learning (DL) to perform complex tasks such as Visual Question Answering [1] and plant disease detection [2], among others, due to its excellent performance in terms of speed and accuracy compared to humans. Capsule Networks [3,4], for example, have demonstrated the ability to recognize the pose, texture, and deformation of an object and its parts. They have thus been proposed for use in sensitive areas such as health [5,6] and agriculture [7,8], among others. Irrespective of the sensitivity of the application area, capsule networks (just like many other deep learning models) do not incorporate uncertainties in their predictions. The inability to model uncertainties leads to model over- or underconfidence [9]. We propose a Bayesian Capsule Network (BCN) motivated by [10,11] and by the fact that the Bayesian framework provides the capability for modeling uncertainties in neural network predictions [12]. Bayesian Neural Networks (BNNs) estimate uncertainties by defining a distribution over the network weight parameters, whose posterior p(· | x) permits the BNN to capture the prediction uncertainties.
BNNs are known to have a longer convergence time during training [11], since training acts on a larger set of distribution parameters rather than on the single point estimates of deterministic models. However, the choice of appropriate normalization and weight initialization schemes can allow the network to converge faster. Since Bayesian models replace the fixed weights with probability distributions, they are capable of training on smaller datasets without overfitting. This work, therefore, proposes a Variational Mixture of Gaussians-based capsule network (CapsNet) that will contribute to solving problems such as those caused by the lack of huge datasets in critical areas (e.g., in health). Additionally, we aim at reducing model complexity, reducing convergence time, and improving accuracy on difficult datasets that are small and imbalanced. These are difficult targets to achieve for a Bayesian model, which is known for its complexity and slow convergence. We also aim to leverage the ability of the BNN to model uncertainties and introduce some form of reliability in the predictions of the model on input images. The motive is to enable such models to gain the confidence of the practitioner for practical adoption in safety-critical areas such as autonomous cars and medicine. The lack of sufficient training data is a major limiting factor to the adoption of deep learning in areas such as health due to concerns related to overfitting.
This work, therefore, uses Bayesian NNs to elegantly avoid this problem by acting on distributions of weights, as opposed to deterministic models, which train on a single set of weights. For instance, the parameter θ of a distribution on the weights p(w | θ) is learned by Variational Inference, leading to the minimization of the Kullback-Leibler (KL) divergence. This method provides a principled framework for the usage of model components, leading to better monitoring of model complexity and avoiding its associated problems such as overfitting. In addition, regularization is natural to BNNs: the regularization parameters get consistent treatment in the Bayesian setting, thus eliminating the need for techniques such as cross-validation [13]. Perhaps one of the main benefits of our method to health and other critical sectors is the model's ability to avoid overconfident predictions in regions of sparse data.
Experimental results show that our proposed Variational Mixture of Gaussians Routing (VMGs-Routing) achieves a significant reduction in model complexity while achieving competitive results compared to the state-of-the-art models. Our routing algorithm improves upon similar existing routing algorithms by training and learning faster to achieve convergence within a few epochs (approximately 100 epochs).
This method further reduces the infinite likelihood and zero variance problem inherent in Maximum Likelihood solutions, which is caused by Gaussian clusters that try to take sole possession of data points (also known as polarization in capsules). The contributions of this paper can be summarized as follows:
(1) We propose a routing method from a variational mixture of Gaussians that relies on the maximization of the evidence lower bound (ELBO) to activate a capsule.
(2) We provide empirical results that are comparable to state-of-the-art previous works on Bayesian and deterministic capsules to demonstrate that our approach does not result in the loss of any of the inherent strengths of capsules, such as viewpoint invariance and robustness.
(3) We show that our proposed Bayesian CapsNet is not overconfident and is reliable, as evidenced by the high uncertainty it expresses on out-of-distribution data.
(4) The proposed model is less computationally complex and performs comparably to deep Bayesian CapsNet models from the literature in terms of accuracy, uncertainty estimation, and prediction. Our model also achieves better speedup during training and testing without performance degradation.
(5) We provide extensive visualizations of layer activation maps and predictive uncertainty plots, among others, in an attempt to increase the interpretability of our model, which is presumed (as a Bayesian model) to be a complex probabilistic 'black box' model.
The rest of the paper is organized as follows: Section 2 presents the related works in the literature, followed by Section 3, which discusses the Bayesian methods adopted for this work. Section 4 presents the experiments and experimental results, after which the paper is concluded in Section 5.

Related Work
Some works in the literature have relied on variational inference to propose capsules to solve varied problems. Smith et al. [14] proposed a probabilistic capsule network (CapsNet) to encode the capsule assumptions and separate the generative and inference parts from each other. They showed that their model can generalize well on out-of-distribution data, but did not express the uncertainty of their model. Ribeiro et al. [11] proposed a Bayesian CapsNet routing algorithm based on a mixture of transforming Gaussians to address the variance collapse problem and to model the uncertainty of the pose parameters. However, experimental results on the uncertainty of the pose parameters were not provided. In this implementation, a parent capsule j is activated if there is an agreement between the votes of adjacent capsules. The agreement is measured by the entropy of the multivariate Gaussian distribution. A conditional variational CapsNet [15] was proposed to detect classes that are not known during training as a contribution to the open set recognition problem. To this end, the authors adopted the variational autoencoder approach, enabling similar features to assume the shape of a Gaussian, such that each unique feature assumed a different Gaussian. A flow-based model with a long flow structure is capable of finding the approximate posterior probability, in contrast to utilizing a simple family of distributions to approximate the intractable posterior. However, as the data increase in dimensionality, this solution gives rise to huge computational complexity and variance. To address this shortcoming, Hua et al. [16] utilized a dynamic routing flow with variational inference to achieve a shorter flow structure and a significant improvement in precision and accuracy. To introduce routing uncertainties in CapsNet, Ribeiro et al. [17] proposed a global view of the local iterative routing between capsules of adjacent layers, enabling them to capture the uncertainty in the assignment of parts to objects. Compared to the two previous works mentioned earlier, this partial Bayesian CapsNet produced results on out-of-distribution predictive entropies that were consistent with uncertainties of model predictions. To avoid the singularity problem caused by maximum likelihood estimation (MLE), a variational routing CapsNet [18] has been proposed to utilize the variational distribution and integrate the prior distribution for automatic determination of the class of data and avoid overfitting. A Bayesian capsule encoder [19] was proposed to regulate the standard deviation and mean in latent space. The authors argue that it is a better approach for the retrieval of relevant features and image reconstruction from latent space. To demonstrate that deep variational CapsNets can achieve better performance on image synthesis and analysis, Huang et al. [20] proposed a variational model in which the divergence between a capsule and a given prior distribution defines the presence of different entities in an object.
Traditionally, uncertainty is modeled with probability theory, and it is becoming increasingly relevant due to the adoption of deep learning (DL) models in practical and safety-critical applications such as medicine and self-driving cars. This type of modeling uses a single probability distribution to capture the required knowledge and struggles to express the two types of uncertainties in a DL model [21]. Aleatoric uncertainty arises from the element of randomness due to the variability of the outcome of events, while epistemic uncertainty measures the modeler's inability to design the best model for the task at hand. In the literature, Bayesian networks with latent variables have been proposed [22] to measure both the predictive aleatoric and epistemic uncertainties. This approach played a significant role in the interpretability of the model, which, like other neural network models, is perceived to be a "black box." With the inherent advantages of CapsNets over other neural networks, our work proposes variational mixture of Gaussians routing-based capsules to effectively capture the predictive uncertainty on in- and out-of-distribution data to improve reliability, interpretability, and model confidence for safety-critical applications.

Proposed Methods
In this section, we outline a brief introduction to the concepts of Variational Inference and Gaussian mixture models on which our routing algorithm is based.

Bayesian Mixture of Gaussians.
Suppose X assumes a Gaussian distribution; a linear combination of such Gaussians forms the basis for a mixture of probabilistic (Gaussian) models known as a mixture of Gaussians [10].
This convex combination makes it possible to adjust the means, covariances, and coefficients so as to approximate any continuous density function to arbitrary accuracy. Considering a superposition of K Gaussian densities taking the form of the joint probability $p(x, z) = p(x \mid z)\,p(z)$, z can be marginalized out to give
$$p(x) = \sum_{z} p(x \mid z)\,p(z).$$
Realizing that the mixing coefficient $\pi_z = p(z) = 1/K$ (z is a one-hot vector) is the probability of choosing one cluster out of K clusters, the marginal probability can be rewritten in the form of a Gaussian Mixture Model (GMM), shown in equation (1):
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k). \tag{1}$$
Each Gaussian density (also called a component) in the above expression has its own mean $\mu_k$ and covariance $\Sigma_k$.
Since routing in capsules operates on the concept of clustering, they can naturally be modeled via a mixture of transforming Gaussians [11].
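As an illustration of the mixture density and of the clustering view that routing relies on, the following minimal sketch evaluates a univariate Gaussian mixture and the posterior probability of each component for a given point. It is not the implementation used in this work: the function names, the one-dimensional setting, and the two-component example values are assumptions chosen for brevity.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Mixture density p(x) = sum_k pi_k * N(x | mu_k, var_k)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def responsibilities(x, weights, means, variances):
    """Posterior probability p(z = k | x) of each component k."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical mixture: K = 2 components with equal mixing coefficients 1/K.
weights = [0.5, 0.5]
means = [-2.0, 2.0]
variances = [1.0, 1.0]
density_at_zero = gmm_pdf(0.0, weights, means, variances)
resp_at_two = responsibilities(2.0, weights, means, variances)
```

A point lying on the mean of the second component is assigned almost entirely to that component, which is the soft cluster assignment that the routing procedure generalizes to capsule votes.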

Variational Bayes.
Bayesian algorithms perform inference on unknown random variables by finding a posterior probability density [23]. In situations where the posterior is intractable to compute, approximate inference using Variational Inference (VI) provides a reasonable approximation, in contrast to Markov Chain Monte Carlo (MCMC) methods, which are asymptotically exact but slow to converge.
Using Bayes' theorem, the posterior probability density can be computed as follows:
$$p(\vartheta \mid x) = \frac{p(x, \vartheta)}{\int_{\vartheta} p(x, \vartheta)\, d\vartheta},$$
where $\int_{\vartheta} p(x, \vartheta)\, d\vartheta$ is the marginal probability (also called the evidence). This term is intractable, requiring the use of approximate solutions such as VI. VI does this by searching a family of distributions Q for the distribution q that is closest to the posterior p(· | x). The distance between the variational ("nice") distribution q and the true posterior p(· | x) is measured by the Kullback-Leibler (KL) divergence.
Therefore, minimization of the KL over q now becomes maximization of the Evidence Lower Bound (ELBO) to avoid the intractability issues of the true posterior p(ϑ | x).
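The equivalence between minimizing the KL divergence and maximizing the ELBO can be made explicit with the standard decomposition of the log evidence. The block below is a sketch of that textbook identity, written in the notation of this section; the symbol $\mathcal{L}[q]$ for the ELBO follows the usage later in the text.

```latex
\ln p(x) = \mathcal{L}[q] + \mathrm{KL}\big[\,q(\vartheta)\,\|\,p(\vartheta \mid x)\,\big],
\qquad
\mathcal{L}[q] = \int q(\vartheta)\,\ln\frac{p(x,\vartheta)}{q(\vartheta)}\,d\vartheta,
\qquad
\mathrm{KL}\big[q\,\|\,p\big] = -\int q(\vartheta)\,\ln\frac{p(\vartheta \mid x)}{q(\vartheta)}\,d\vartheta \;\ge\; 0 .
```

Because $\ln p(x)$ does not depend on q, any increase in $\mathcal{L}[q]$ is matched by an equal decrease in the KL term, and $\mathcal{L}[q]$ only involves the tractable joint $p(x, \vartheta)$.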
Computational Intelligence and Neuroscience
To maximize the ELBO, the vector of hidden random variables $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ (distributed according to the variational distribution q) is assumed to be made up of independent random variables, allowing their joint distribution to be obtained from the product of their marginal distributions, $q(\theta) = \prod_i q_i(\theta_i)$.
This mean-field (MF) approximation makes it possible to obtain a free-form optimization of the ELBO $\mathcal{L}[q]$ with respect to all the distributions $q_i(\theta_i)$ by optimizing each of the factors in turn. When $\mathcal{L}[q]$ is fully described by the MF distribution, every data point described by a variational distribution will have its own free parameters. The task is then to find the free parameters that maximize $\mathcal{L}[q]$.
In this study, it is assumed that data points, which are the realization of the random variables $X_1, \ldots, X_N$, are taken from the D-dimensional Euclidean space $\mathbb{R}^D$.
Thus, the dataset $X = (X_1, \ldots, X_N)$ is a vector with $\mathbb{R}^D$-valued random coordinates that are to be classified into K clusters with random centroids $H_1, \ldots, H_K$ that are multinormally distributed, $H_k \sim \mathcal{N}(\mu_k, \Lambda_k^{-1})$. In what follows, $f_k$ will be written for the density of $\mathcal{N}(\mu_k, \Lambda_k^{-1})$. Whenever the random variable $X_n$ is in the k-th cluster, it assumes the distribution of the centroid of that cluster. Thus, each data point $X_n$ is distributed according to $\mathcal{N}(\mu_k, \Lambda_k^{-1})$ for some $k = 1, \ldots, K$. In the sequel we denote by $C_n$ the cluster label of the random variable $X_n$, for $n = 1, \ldots, N$. To each data point $X_n$ corresponds a latent variable $Z_n$ that is a 1-of-K binary vector, with $\pi_k$ being the probability that $Z_{nk} = 1$, for $k = 1, \ldots, K$. Therefore, $\pi = (\pi_1, \ldots, \pi_K)$, called the vector of mixing coefficients, is a probability vector, and $Y = (Y_1, \ldots, Y_K) = Z_1 + Z_2 + \ldots + Z_N$ is a random vector with K nonnegative coordinates that sum up to N. In fact, Y is multinomially distributed with parameters N and π. Observe that for any $n = 1, \ldots, N$, the probability that $Z_n = z_n$ is given by the following equation:
$$p(Z_n = z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}.$$
Putting $\theta = (Z, \pi, \mu, \Lambda)$, with $Z = (Z_1, \ldots, Z_N)$, $\mu = (\mu_1, \mu_2, \ldots, \mu_K)$ and $\Lambda = (\Lambda_1, \Lambda_2, \ldots, \Lambda_K)$, the joint distribution of X and θ can be written as follows:
$$p(X, \theta) = p(X \mid \theta)\,p(Z \mid \pi, \mu, \Lambda)\,p(\pi \mid \mu, \Lambda)\,p(\mu, \Lambda) = p(X \mid \theta)\,p(Z \mid \pi)\,p(\pi)\,p(\mu, \Lambda). \tag{7}$$
The second equality of equation (7) uses that $p(Z \mid \pi, \mu, \Lambda) = p(Z \mid \pi)$ and $p(\pi \mid \mu, \Lambda) = p(\pi)$. We assume further that, conditioning on θ, the components of X are independent. Similarly, given π and Λ, the components of Z and μ are respectively independent. Furthermore, the components of Λ are also independent. In addition to the above prescription, we use the plate notation (directed graph) [10,24] to derive our priors and put the problem in a Bayesian setting.
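Thus, using the conjugate priors of μ, Λ and π (a Dirichlet prior on π and a Gaussian-Wishart prior on each pair $(\mu_k, \Lambda_k)$) together with the above, the conditional distributions of the latent variables and of the data are
$$p(Z \mid \pi) = \prod_{n=1}^{N}\prod_{k=1}^{K} \pi_k^{z_{nk}}, \qquad p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N}\prod_{k=1}^{K} f_k(x_n)^{z_{nk}}.$$
From the joint distribution in (7), we identify the posterior and variational ('nice') distributions as $p(Z, \mu, \Lambda, \pi \mid X)$ and $q(Z, \mu, \Lambda, \pi)$, i.e., $p(\theta \mid X)$ and $q(\theta)$ respectively, providing the ingredients for the computation of $\mathrm{KL}[q(Z, \mu, \Lambda, \pi)\,\|\,p(Z, \mu, \Lambda, \pi \mid X)]$. Accordingly, the variational distribution (VD) is factorized based on the MF approximation method to obtain $q(Z, \mu, \Lambda, \pi) = q(Z)\,q(\mu, \Lambda, \pi)$. Meanwhile, from the MF approximation, it can be shown that the best distribution $q_j$ for maximizing the ELBO is $q_j^*(\cdot \mid x)$, satisfying $\ln q_j^*(z_j \mid x) = \mathbb{E}_{i \neq j}[\ln p(z, x)] + \text{constant}$. We consequently model the joint distribution in (7) according to the aforementioned best variational distribution. Initial calculations involve the determination of $q^*(z \mid x)$ followed by $q^*(\pi, \mu, \Lambda)$. In other words,
$$\ln q^*(z \mid x) = \mathbb{E}_{\pi, \mu, \Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}. \tag{15}$$
Pushing the variables not dependent on z (i.e., $p(\pi)\,p(\mu, \Lambda)$) into the constant, we obtain the following equation:
$$\ln q^*(z \mid x) = \mathbb{E}_{\pi}[\ln p(Z \mid \pi)] + \mathbb{E}_{\mu, \Lambda}[\ln p(X \mid Z, \mu, \Lambda)] + \text{const}.$$
Substituting (9) and (10) into the expression for $\ln q^*(z \mid x)$ produces
$$\ln q^*(z \mid x) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const},$$
where
$$\ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \tfrac{1}{2}\,\mathbb{E}[\ln |\Lambda_k|] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\,\mathbb{E}_{\mu_k, \Lambda_k}\big[(x_n - \mu_k)^{\top} \Lambda_k (x_n - \mu_k)\big].$$
Exponentiating $\ln q^*(z \mid x)$ and normalizing $\rho_{nk}$ so that it sums to 1 over all the values of k produces
$$q^*(z \mid x) = \prod_{n=1}^{N}\prod_{k=1}^{K} r_{nk}^{z_{nk}},$$
where
$$r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}.$$
The best $q^*(z \mid x)$, therefore, is a product of categorical distributions, one for each latent variable, having $r_{nk}$ for $k = 1, 2, \ldots, K$ as parameters.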
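On the other hand, the best variational distribution $q^*(\pi, \mu, \Lambda)$ can be divided into two components, $q^*(\pi)$ and $q^*(\mu, \Lambda)$. It follows from the product rule and the deductions leading to equations (15), (9) and (10) that
$$\ln q^*(\pi) = (\alpha_0 - 1)\sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K}\sum_{n=1}^{N} r_{nk} \ln \pi_k + \text{const}.$$
Taking exponentials of both sides of the above expression and taking care of the normalizing term results in
$$q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha),$$
where $\alpha_k = \alpha_0 + y_k$ and $y_k = \sum_{n=1}^{N} r_{nk}$.
Upon some computations, the variational distribution $q^*(\mu, \Lambda)$ for the joint distribution $q(\mu, \Lambda)$ takes the form
$$q^*(\mu, \Lambda) = \prod_{k=1}^{K} f_k^0(\mu_k \mid \Lambda_k)\,\mathrm{Wi}(\Lambda_k),$$
where $f_k^0$ and Wi are respectively the Gaussian and Wishart densities (see equations (12) and (13)) with parameters $m_k$, $\beta_k$, $W_k$, and $v_k$. These parameters are obtained from the prior hyperparameters and the responsibility-weighted statistics of the data, following the standard variational updates for a mixture of Gaussians [10]. To evaluate $r_{nk}$, the quantities in $\rho_{nk}$ are expressed as follows:
$$\mathbb{E}_{\mu_k, \Lambda_k}\big[(x_n - \mu_k)^{\top} \Lambda_k (x_n - \mu_k)\big] = \frac{D}{\beta_k} + v_k\,(x_n - m_k)^{\top} W_k\,(x_n - m_k),$$
$$\mathbb{E}[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\!\left(\frac{v_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|,$$
$$\mathbb{E}[\ln \pi_k] = \psi(\alpha_k) - \psi(\hat{\alpha}),$$
where ψ is the digamma function, i.e., the logarithmic derivative of the gamma function.
After the substitutions, $\rho_{nk}$ becomes
$$\ln \rho_{nk} = \psi(\alpha_k) - \psi(\hat{\alpha}) + \frac{1}{2}\left[\sum_{i=1}^{D} \psi\!\left(\frac{v_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|\right] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\left[\frac{D}{\beta_k} + v_k\,(x_n - m_k)^{\top} W_k\,(x_n - m_k)\right],$$
where $\hat{\alpha} = \sum_{i=1}^{K} \alpha_i$. There is a circular dependency between these variational parameters, requiring iterative updates that ensure the algorithm converges to an approximate posterior.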
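The normalization of $\rho_{nk}$ into responsibilities $r_{nk}$ is usually carried out in log space, because the exponentials involved can underflow. The sketch below shows one common way to do this (the log-sum-exp shift); the function name and the example values are assumptions for illustration and are not taken from the implementation used in this work.

```python
import math

def normalize_log_rho(log_rho_row):
    """Turn unnormalized log responsibilities ln(rho_nk) of one data point n
    into responsibilities r_nk that sum to 1 over k.

    Subtracting the maximum before exponentiating avoids underflow."""
    peak = max(log_rho_row)
    shifted = [math.exp(value - peak) for value in log_rho_row]
    total = sum(shifted)
    return [s / total for s in shifted]

# Hypothetical ln(rho_nk) values for one lower-level capsule and K = 3 parents.
log_rho = [-1050.0, -1049.0, -1052.0]
r = normalize_log_rho(log_rho)
```

With these values a direct `math.exp(-1050.0)` evaluates to zero, so normalizing the raw exponentials would divide by zero, whereas the shifted version returns well-defined responsibilities.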
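Using equation (7), the ELBO for a VGM model is obtained as follows:
$$\mathcal{L} = \mathbb{E}_q[\ln p(X, Z, \pi, \mu, \Lambda)] - \mathbb{E}_q[\ln q(Z, \pi, \mu, \Lambda)].$$
Applying the product rule, we obtain the following equation:
$$\mathcal{L} = \mathbb{E}[\ln p(X \mid Z, \mu, \Lambda)] + \mathbb{E}[\ln p(Z \mid \pi)] + \mathbb{E}[\ln p(\pi)] + \mathbb{E}[\ln p(\mu, \Lambda)] - \mathbb{E}[\ln q(Z)] - \mathbb{E}[\ln q(\pi)] - \mathbb{E}[\ln q(\mu, \Lambda)].$$
Substituting the closed-form expressions for each of these expectations, which involve $H[q(\Lambda_k)]$, the entropy of the Wishart distribution, $\mathcal{L}$ becomes the objective function to maximize, given by equation (32). In this paper, we implement the maximization of equation (32) through the iterative updates of the GMM parameters mentioned earlier.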

Variational Mixture of Gaussians (VMGs) Routing-Based Capsule Network.
Motivated by [10,11] and [4], and based on the discussions in Sections 3.1 and 3.2, we let $L_n$ and $L_k$, respectively, represent capsules at the lower and higher-level layers. Let the matrix $X_{k|n} \in \mathbb{R}^{4 \times 4}$ represent the similarity between the features of a lower-level capsule n and a higher-level capsule k, with $x_{k|n} \in \mathbb{R}^{D}$ as its vectorized version (i.e., $x_{k|n}$ is a flattened vector of the matrix $X_{k|n}$ with D = 16). A higher-level capsule's pose matrix $M_k \in \mathbb{R}^{4 \times 4}$ is flattened to obtain capsule k's pose vector $\mu_k \in \mathbb{R}^{D}$. For ease of computation, we use the precision matrix $\Lambda_k$ instead of the covariance matrix Σ, and use $\lambda_k \in \mathbb{R}^{D}$ to represent the diagonal entries of $\Lambda_k$. As mentioned earlier, $r_{nk}$ represents the vector form of the routing responsibilities, while $\pi_k$ is the mixing coefficient used for a single one-hot-vector representation (1/K) necessary for indicating the choice of a cluster (capsule). On a larger scale, z is a latent variable that serves as a collection of one-hot vectors with similar features, signifying the preference of each lower-level capsule feature for a corresponding higher-level capsule Gaussian cluster of features. Finally, we compute the activation probability $a_k$, which represents the likelihood that cluster k is activated, by computing the ELBO (equation (32)) and paying a fixed cost of $\beta_a$ as indicated in [4]. Based on the above discussion, we derive Algorithm 1 as the routing procedure between capsules.
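To make the flattened, diagonal-precision representation concrete, the sketch below flattens a 4 × 4 matrix into a D = 16 vector and evaluates the log density of a vote under a parent capsule's Gaussian with diagonal precision entries $\lambda_k$. It only illustrates the data layout described above; the function names and the identity-matrix example are assumptions, and the full routing procedure involves further steps that are not shown here.

```python
import math

def flatten_pose(pose_matrix):
    """Flatten a 4 x 4 pose (or vote) matrix into a D = 16 vector, row by row."""
    return [value for row in pose_matrix for value in row]

def diag_gaussian_log_density(x, mu, lam):
    """ln N(x | mu, diag(lam)^-1), where lam holds the diagonal precision entries."""
    d = len(x)
    log_det_precision = sum(math.log(entry) for entry in lam)
    quadratic = sum(l * (xi - mi) ** 2 for xi, mi, l in zip(x, mu, lam))
    return 0.5 * log_det_precision - 0.5 * d * math.log(2.0 * math.pi) - 0.5 * quadratic

# Hypothetical example: the vote equals the parent pose, with unit precisions.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x_vote = flatten_pose(identity)
mu_parent = flatten_pose(identity)
lam_parent = [1.0] * 16
log_density = diag_gaussian_log_density(x_vote, mu_parent, lam_parent)
```

Working with the diagonal entries keeps the determinant and the quadratic form as simple sums over the 16 dimensions, which is the computational convenience that motivates the precision-based parameterization.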

Uncertainty Estimation.
Aleatoric and epistemic uncertainties are common with neural network models. Randomness is a property that characterizes aleatoric uncertainty [21]. For this type of uncertainty, there is sufficient variability in the outcome of events as a result of a random phenomenon. Epistemic uncertainty, on the other hand, expresses the uncertainty resulting from the designer's lack of knowledge of the best design choices leading to the development of the best model. Both uncertainties together form the total uncertainty of the model. Several other methods exist for finding the total uncertainty of a model, but there is no consensus on which method is the best [25].
In this work, we experimentally determine the aleatoric and epistemic uncertainties of our model on some of the datasets. Since a deterministic model has no epistemic uncertainty [25], we determine its aleatoric uncertainty on the in and out-of-distribution data. For our Bayesian model, we determine both uncertainties.

Experiments
The experiments in this work were carried out using the PyTorch 1.7 GPU version on a 64-bit Windows machine with an NVIDIA GeForce GTX 1060. Each model was trained for 100 epochs using a learning rate of 0.001, 3 routing iterations, and a patience of 10,000. During training, the best model is saved to be used for inference. The code used in our implementation is a modification of the code in [11], which can be found in [26].
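The combination of a patience value and best-model saving can be sketched as follows. This is a generic illustration rather than the training loop used in this work: the function name, the per-epoch unit of the patience counter, and the example accuracies are all assumptions (the unit of the patience value reported above is not specified).

```python
def track_best(val_accuracies, patience):
    """Return (best_epoch, best_accuracy, stopped_epoch) for a list of
    per-epoch validation accuracies.

    Training stops once `patience` consecutive epochs pass without improvement."""
    best_epoch, best_acc, waited = -1, float("-inf"), 0
    stopped_epoch = len(val_accuracies) - 1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            # In a real loop, the model checkpoint would be saved here.
            best_epoch, best_acc, waited = epoch, acc, 0
        else:
            waited += 1
            if waited >= patience:
                stopped_epoch = epoch
                break
    return best_epoch, best_acc, stopped_epoch

# Hypothetical validation accuracies over six epochs.
history = [0.61, 0.72, 0.70, 0.69, 0.75, 0.74]
result_large_patience = track_best(history, patience=10_000)
result_small_patience = track_best(history, patience=2)
```

With a very large patience the loop effectively never stops early, so every epoch is run and only the checkpointing of the best model has an effect.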

Loss Function.
We adopted the spread loss in [4] as well as the negative log-likelihood loss as used in [11].

Model Architecture.
Our model begins with a convolutional layer with 2 × 2 filters that performs convolutions on a 32 × 32 × 1 input image with a stride of 2. This layer precedes three capsule layers and the ensuing VMG routing layers before the final class capsule layer, which produces one capsule for each class. Each capsule layer converts its respective filters into $p_i$ capsule types, each with a 4 × 4 pose matrix and an activation. The final layer broadcasts its weight matrices to produce $p_4$ capsules, one for each category in the dataset. Taking the filters f and the capsule types $p_i$ produced by each capsule layer into consideration, the network for the model can be represented as [f, $p_1$, $p_2$, $p_3$, $p_4$]. The complete architecture is shown in Figure 1.
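The spatial size produced by the first convolutional layer follows from the usual convolution arithmetic. The short sketch below computes it for the stated 32 × 32 input, 2 × 2 filter, and stride of 2; the helper name is hypothetical and zero padding is an assumption, since the padding is not stated above.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# 32 x 32 input, 2 x 2 filter, stride 2, no padding (assumed).
first_layer_size = conv_output_size(32, kernel_size=2, stride=2)
```

Under these assumptions the first layer halves each spatial dimension, so the capsule layers operate on 16 × 16 feature maps.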

Datasets and Data Preprocessing.
Three popular computer vision datasets and one health-related dataset were adopted to experimentally evaluate the methods proposed in this paper. MNIST [27] is a handwritten-digit dataset consisting of 70,000 28 × 28 grayscale images, commonly partitioned into 60,000 training and 10,000 test images. This dataset is comparatively less complex, but it is effective and very popular for testing the performance of computer vision algorithms. Fashion-MNIST [28] is another dataset, consisting of 70,000 grayscale images of fashion products. The original partition into training and test sets is similar to that of MNIST. This dataset is more complex than MNIST. The third and most complex dataset among the three is CIFAR-10 [29]. This dataset is very challenging for most computer vision algorithms due to the presence of background as well as background objects. Each of the aforementioned datasets is made up of ten classes and was partitioned into 55,000 training, 5,000 validation, and 10,000 test images. The fourth dataset is a COVID-19 Radiography dataset [30][31][32] collected from four countries by a team of doctors. It consists of three classes of infected chest X-ray images and one class of healthy X-rays. This dataset is highly imbalanced and, for the purposes of this work, was partitioned into 16,952 training, 2,000 validation, and 4,227 test images. Even though the performance of some machine vision algorithms largely depends on extensive preprocessing to obtain highly informative image data, we did not employ any such preprocessing, despite the fact that digital images contain Gaussian noise introduced by the limitations of the acquisition sensor or camera during image capture. Techniques exist to reduce its effect [33]. However, we evaluated the model on the raw images, enabling us to understand the actual extent to which the model can recognize real-life digital images (such as the COVID-19 images) without human interference.

Algorithm 1: Variational mixture of Gaussians routing.
(1) function VMG ROUTING($a_n$, $x_{k|n}$)
(2) Initialize weights ∀n, k: $r_{nk}$ ← 1/size[$L_k$]
(3) Initialize priors ∀k: $\alpha_0$, $m_0$, $\beta_0$, $S_0$, $v_0$
(4) for i iterations do
(5)     $r_{nk}$ ← $r_{nk}$ ⊙ $a_n$
(6)     UPDATE BEST q(π, μ, Λ)

Experimental Results.
The results presented in this section are from the implementation of our model (Variational Mixture of Gaussians Routing model, VMG-Routing), the baseline Multilane LBP-Gabor Capsule (ML) network [32], and the VB-Routing [11] {64, 8, 16, 16, #c} architecture, where #c is the number of output classes. However, our GPU device could not run the larger architectures of the other VB-Routing models; consequently, for those models, we report the results from the work in [11].

Model Learning and Convergence.
The training and validation curves in Figure 2 show the proposed model's ability to learn and converge faster. For less complex images such as MNIST and Fashion-MNIST, the model converges as early as epoch 30. For relatively complex and imbalanced images such as CIFAR-10 and COVID-19 Radiography, the model attains an accuracy approximately equal to the final accuracy at epoch 90. Our VMG-Routing learns faster than the models in [11], which only show stability beginning from epoch 150. Fast learning and convergence are desirable attributes for image recognition systems applied in critical areas such as self-driving cars, where every passing minute counts.
The difference in accuracy on CIFAR-10 between the proposed VMG-Routing CapsNet and the largest model is only 1.07%, with our model having the added advantage of being less computationally complex.

Model Complexity.
The VMG-Routing CapsNet has fewer parameters than its counterparts in the literature, as can be seen in Table 2. This makes the VMG-Routing model less computationally complex and increases its potential for implementation on embedded and mobile devices, which naturally have limited memory. In addition, model complexity poses a threat of overfitting [34] that ultimately leads to poor performance.

Inference.
To test the models' generalizability on unseen images, we used the trained (saved) models to perform inference on 10,000 sample images from each of MNIST, CIFAR-10, and Fashion-MNIST, and on 4,227 sample images from the COVID-19 Radiography dataset. A comparison of the test accuracies is reported in Table 3. The average time for each model to perform inference on the sample images is also reported in Table 3. It can be observed that the VMG-Routing model produced results that compare favorably with the results of other state-of-the-art models.
We further performed inference on individual in-distribution images for both models to determine the level of confidence/certainty each model places on its prediction probabilities. Figure 3 shows that the deterministic model is overconfident in its predictions (column 3) while the VMG-Routing CapsNet exercises some caution in the confidence it imposes on its predictions (column 2).

Model Uncertainty.
Daily scenarios involve decision-making influenced by the level of uncertainty or certainty prevailing at the time. Depending on the field under consideration, uncertainty estimation can be a critical part of the decision-making process. For instance, the reliability and efficacy of a deep learning model for medical applications such as Artificial Intelligence (AI) assisted surgery depend on the uncertainty with which it identifies the medical condition correctly. Bayesian methods have advantages over other neural networks as they provide the avenue to effectively model uncertainty [12]. The inability of machine learning applications to provide reliable uncertainty estimates is a potential limiting factor in their acceptability and widespread adoption for critical tasks.
To demonstrate the reliability of the uncertainty estimates of our VMG-Routing model, we present a comparison of experimental results from the prediction of both in-distribution (Figure 4) and out-of-distribution (Figure 5) images for the VMG-Routing model and the baseline deterministic ML-LBP capsule model.
We use $p_{\mathrm{out}}$ to express the aleatoric uncertainty shown by the distribution across the classes for the deterministic model. This uncertainty assumes a value of zero if a class gets a probability of one and all other classes obtain a probability of zero. Since deterministic CapsNets have fixed weights, they cannot express epistemic uncertainties [25] and will produce the same output when inference is carried out on the same input image N times. The output of the SoftMax layer $p_{\mathrm{out}}$ (see Figure 3) sums up to one and measures the certainty (certainty = $p_{\mathrm{out}}$) of the model in its predictions. We obtain the aleatoric uncertainty of the deterministic CapsNet from the same quantity $p_{\mathrm{out}}$ by computing the negative log likelihood (NLL) or the entropy of the predictions:
$$\mathrm{NLL} = -\log\Big(\max_{i}\, p_{\mathrm{out},i}\Big), \qquad H = -\sum_{i=0}^{\#c-1} p_{\mathrm{out},i} \log p_{\mathrm{out},i},$$
where 0 ≤ i < #c and #c is the number of classes in the dataset under consideration.
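The two measures for a deterministic model can be computed directly from a single vector of class probabilities. The sketch below is a minimal illustration; the function names are hypothetical, and taking the NLL of the predicted (arg-max) class is an assumption about how the measure is evaluated at inference time.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy -sum_i p_i * log(p_i); terms with p_i = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def nll_of_prediction(probs):
    """Negative log of the probability assigned to the predicted (arg-max) class."""
    return -math.log(max(probs))

# Hypothetical SoftMax outputs for a four-class problem.
one_hot = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
entropy_one_hot = predictive_entropy(one_hot)
entropy_uniform = predictive_entropy(uniform)
```

The one-hot output gives zero entropy and zero NLL, matching the statement that the uncertainty vanishes when one class receives a probability of one, while the uniform output attains the maximum entropy log(#c).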
On the other hand, our VMG-Routing CapsNet replaces the fixed weights with Gaussian distributions, giving it the ability to express both epistemic and aleatoric uncertainties in its predictions. The aleatoric uncertainty is expressed in the distribution across the classes, as for the deterministic CapsNets, except that it is based on average prediction probabilities. Meanwhile, the epistemic uncertainty is measured by the spread of the inference probabilities and is zero for a zero spread. For this scenario, N different multinomial conditional probability distributions $p(y \mid x, w_n)$, conditioned on weights $w_n$ drawn from the weight distribution, are obtained from N predictions on the same input image. The mean probability $p^*_{\mathrm{out}}$ is computed for each class i, and the class with the maximum mean conditional probability is chosen as the predicted class of the input image:
$$p^*_{\mathrm{out},i} = \frac{1}{N}\sum_{n=1}^{N} p(y = i \mid x, w_n), \qquad p^*_{\mathrm{inf}} = \max_{i}\, p^*_{\mathrm{out},i}. \tag{35}$$
The averaging in equation (35) ensures that the epistemic uncertainty in the model is captured. Subsequently, $\mathrm{NLL}^* = -\log(p^*_{\mathrm{inf}})$ can be computed. In addition, the uncertainty based on the entropy and total variance obtained from the averaging naturally follows from the following expressions:
$$H^* = -\sum_{i=0}^{\#c-1} p^*_{\mathrm{out},i} \log p^*_{\mathrm{out},i},$$
$$V_{\mathrm{tot}} = \sum_{i=0}^{\#c-1} \frac{1}{N}\sum_{n=1}^{N}\big(p(y = i \mid x, w_n) - p^*_{\mathrm{out},i}\big)^2. \tag{37}$$
Figure 5 shows the uncertainty of both models on the respective out-of-distribution images. The spread of the prediction probabilities of a given class expresses the epistemic uncertainty, while the distribution across the different classes epitomizes the aleatoric uncertainty of the models [25].
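The averaging over N stochastic predictions and the uncertainty measures derived from it can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used in this work: the function name, the dictionary layout, the definition of total variance as the sum of per-class variances, and the three example runs are all hypothetical.

```python
import math

def summarize_runs(runs):
    """Summarize N predictions on the same image.

    runs: list of N lists, each holding the class probabilities of one run."""
    n, num_classes = len(runs), len(runs[0])
    # Mean probability per class across the N runs.
    mean = [sum(run[i] for run in runs) / n for i in range(num_classes)]
    predicted = max(range(num_classes), key=lambda i: mean[i])
    nll = -math.log(mean[predicted])
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    # Spread of the runs around the mean, summed over classes.
    total_variance = sum(
        sum((run[i] - mean[i]) ** 2 for run in runs) / n for i in range(num_classes)
    )
    return {"mean": mean, "predicted": predicted, "nll": nll,
            "entropy": entropy, "total_variance": total_variance}

# Hypothetical runs: identical outputs (deterministic) versus varying outputs.
identical = [[0.7, 0.2, 0.1]] * 3
varied = [[0.9, 0.05, 0.05], [0.5, 0.3, 0.2], [0.7, 0.2, 0.1]]
stats_identical = summarize_runs(identical)
stats_varied = summarize_runs(varied)
```

Identical runs give a total variance of essentially zero, which corresponds to a model that cannot express epistemic uncertainty, whereas runs that differ produce a positive spread.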
Even though both models produce wrong predictions for the out-of-distribution images, the VMG-Routing CapsNet produces predictive probabilities ( Figure 5, column 2) that significantly vary in the distribution and spread of the N � 100 predictive runs. e VMG-Routing CapsNet, therefore, can express both uncertainties. On the contrary, the deterministic model cannot express epistemic uncertainty since performing N � 100 predictive runs on the same input image produces the same probabilities ( Figure 5, column 3).
The ability of a model to express its uncertainty is a desirable property since it can be shown that models that produce higher uncertainties are likely to produce accurate predictions [25]. Finally, the shape of the VMG-Routing CapsNet's predictive probability distribution bears some resemblance to that of the Gaussian distribution, which may be attributed to the model being driven by a variational mixture of Gaussians.

Model's Ability to Extract Relevant Features.
To enable us to understand and tune the VMG-Routing model for further performance improvement, we investigated the ability of the layers in the model to extract the relevant features.
Through experimentation via this approach, redundant layers were eliminated, resulting in a reduction in the model size/complexity, convergence time, and excessive oscillations during training. More specifically, we visualized the output (feature maps) of the layers by feeding an input image into the trained (best saved) model. The feature maps for the various layers are shown in Figure 6. It can be observed that the layers of the model can extract the most relevant features from the input images.
4.4.6. Threats to Validity. Deep Learning (DL) is capable of learning and modeling real-life scenarios when extreme care is taken, during the design and development stages, to consider all the factors that could prevent the model from achieving optimal performance. For instance, the choice of hyperparameters and their values has a direct impact on the validity of the model outputs. For stochastic gradient descent (SGD)-based methods and their variants, the fraction of the dataset used for training is organized into batches whose size affects the computation of the gradient. In practice, larger batch sizes reduce the quality of the model's generalization [35]. This work, therefore, used batch sizes of 16-32 data points for the experiments. We also avoided sorting the dataset and randomized the batches to prevent a given batch from containing only samples with the same label. In addition, the learning rate controls how much the model is modified in response to the error each time the model weights are updated. We chose a small learning rate to allow the model to learn the optimal set of weights, even though this has the potential to increase training time and the risk of overfitting. Other methods for addressing this include a learning rate decay function, which returns an updated learning rate that drops by half every n epochs. Furthermore, nonlinear activation functions are useful for DL to effectively model real-life scenarios, which are nonlinear. The choice of activation function affects the speed of the computations during training, the likelihood of vanishing gradients, and the resulting performance [36].
To introduce nonlinearity and activate the capsule, we adopted the Sigmoid activation function since it maps any input in (−∞, +∞) to a value between 0 and 1 and encourages unambiguous predictions close to 1 or 0.
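Two of the training choices discussed above, the halving learning rate decay and the randomized batches, can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the names `step_decay` and `shuffled_batches`, the `seed` argument, and the example values are all hypothetical.

```python
import random


def step_decay(initial_lr, epoch, epochs_per_drop, factor=0.5):
    """Return the learning rate for `epoch`, multiplied by `factor`
    (one half by default) once every `epochs_per_drop` epochs."""
    return initial_lr * factor ** (epoch // epochs_per_drop)


def shuffled_batches(samples, batch_size, seed=None):
    """Yield batches drawn in random order, so that a dataset sorted by
    label does not produce batches containing a single label."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]


# Hypothetical usage: 50 samples, batch size 16, rate halved every 10 epochs.
batches = list(shuffled_batches(list(range(50)), batch_size=16, seed=0))
learning_rates = [step_decay(0.01, epoch, epochs_per_drop=10) for epoch in range(30)]
```

The integer division in `step_decay` keeps the rate constant within each block of epochs and applies the drop only at block boundaries.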
Another scenario that poses a threat to the validity of the Bayesian model outputs is covariate shift, where the distributions of the training and target data are different [37]. Covariate shift may also occur due to pixelate-corrupted test data, spurious correlations, and domain shift. This problem is pronounced in Bayesian models that use an unconstrained Λ (covariance matrix) and is worsened when there is linear dependence in the features. In this work, we employed mean-field variational inference (MFVI), which constrains Λ to be a diagonal matrix, limiting the effect of linear dependence in the features [38] and hence the impact of covariate shift.
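The diagonal constraint on Λ can be made concrete with a small sketch of sampling weights from a mean-field Gaussian. It assumes a common parameterization in which each standard deviation is obtained as softplus(ρ) and each weight is drawn independently; the paper does not state its parameterization, so the names `softplus`, `sample_mean_field_weights`, and `covariance_param_count` are hypothetical.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps standard deviations positive."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def sample_mean_field_weights(mu, rho, rng):
    """Draw one weight vector from N(mu, diag(sigma^2)), sigma = softplus(rho).

    Because the covariance is diagonal, each weight is sampled independently
    as mu + sigma * eps with eps ~ N(0, 1); no cross-weight terms are needed.
    """
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]


def covariance_param_count(d, diagonal=True):
    """Number of free covariance parameters for d weights:
    d for a diagonal matrix, d(d+1)/2 for an unconstrained symmetric one."""
    return d if diagonal else d * (d + 1) // 2


# Hypothetical usage: two weights, one stochastic draw.
rng = random.Random(0)
weights = sample_mean_field_weights([1.0, -2.0], [0.0, 0.0], rng)
```

Beyond removing cross-weight covariance terms, the diagonal form reduces the number of covariance parameters from d(d+1)/2 to d.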

Conclusion and Future Work
In this work, we proposed a capsule network based on variational mixture of Gaussians routing to express the uncertainties associated with performing predictions on out-of-distribution data. The results show that a Bayesian capsule can be less computationally complex, converge faster, and outperform both the state-of-the-art deterministic and probabilistic models during inference. Furthermore, our work demonstrates that Bayesian capsules may have advantages over their deterministic counterparts since they have a bigger potential to exhibit the transparency, credibility, reliability, and interpretability required to gain the confidence of industry players.
In the future, we intend to carry out a full investigation into Bayesian capsule interpretability in a quest to unravel the "black box" concept.