Bayesian Inference for Nonnegative Matrix Factorisation Models

We describe nonnegative matrix factorisation (NMF) with a Kullback-Leibler (KL) error measure in a statistical framework, with a hierarchical generative model consisting of an observation and a prior component. Omitting the prior leads to the standard KL-NMF algorithms as special cases, where maximum likelihood parameter estimation is carried out via the Expectation-Maximisation (EM) algorithm. Starting from this view, we develop full Bayesian inference via variational Bayes or Monte Carlo. Our construction retains conjugacy and enables us to develop more powerful models while retaining attractive features of standard NMF such as monotonic convergence and easy implementation. We illustrate our approach on model order selection and image reconstruction.


Introduction
In machine learning, nonnegative matrix factorisation (NMF) was introduced by Lee and Seung [1] as an alternative to k-means clustering and principal component analysis (PCA) for data analysis and compression (also see [2]). In NMF, given a W × K nonnegative matrix X = {x ν,τ }, where ν = 1 : W, τ = 1 : K, we seek positive matrices T and V such that where i = 1 : I. We will refer to the W × I matrix T as the template matrix, and I × K matrix V the excitation matrix.
The key property of NMF is that T and V are constrained to be positive matrices. This is in contrast with PCA, where there are no positivity constraints or k-means clustering where each column of V is constrained to be a unit vector. Subject to the positivity constraints, we seek a solution to the following minimisation problem: (T, V ) * = arg min T,V>0 D(X TV).
Here, the function D is a suitably chosen error function. One particular choice for D, on which we will focus here, is the information (Kullback-Leibler) divergence, which we write as Using Jensen's inequality [3] and concavity of log x, it can be shown that D(·) is nonnegative and D(X||Λ) = 0 if and only if X = Λ. The objective in (2) could be minimised by any suitable optimisation algorithm. Lee and Seung [1] have proposed a very efficient variational bound minimisation algorithm that has attractive convergence properties and which has been successfully applied in various applications in signal analysis and source separation, for example, [4][5][6]. The interpretation of NMF, like singular value decomposition (SVD), as a low rank matrix approximation is sufficient for the derivation of a useful inference algorithm; yet this view arguably does not provide the complete picture about assumptions underlying the statistical properties of X. Therefore, we describe NMF from a statistical perspective as a hierarchical model. In our framework, the original nonnegative multiplicative update equations of NMF appear as an expectation-maximisation (EM) algorithm for maximum likelihood estimation of a conditionally Poisson model via data augmentation. Starting from this view, we develop Bayesian extensions that facilitate more powerful modelling and allow more sophisticated inference, such as Bayesian 2 Computational Intelligence and Neuroscience model selection. Inference in the resulting models can be carried out easily using variational (structured mean field) or Markov Chain Monte Carlo (Gibbs sampler). The resulting algorithms outperform existing NMF strategies and open up the way for a full Bayesian treatment for model selection via computation of the marginal likelihoods (the evidence), such as estimating the dimensions of the template matrix or regularising overcomplete representations via automatic relevance determination.

The Statistical Perspective
The interpretation of NMF as a low-rank matrix approximation is sufficient for the derivation of an inference algorithm; yet this view arguably does not provide the complete picture. In this section, we describe NMF from a statistical perspective. This view will pave the way for developing extensions that facilitate more realistic and flexible modelling as well as more sophisticated inference, such as Bayesian model selection.
Our first step is the derivation of the information divergence error measure from a maximum likelihood principle. We consider the following hierarchical model: Here, P O(s; λ) denotes the Poisson distribution of the random variable s ∈ N 0 with nonnegative intensity parameter λ, where P O(s; λ) = exp(s log λ − λ − log Γ(s + 1)) (6) and Γ(s + 1) = s! is the gamma function. The priors p(T | ·) and p(V | ·) will be specified later. We call the variables S i = {s ν,i,τ } latent sources. We can analytically marginalise out the latent sources S = {S 1 · · · S I } to obtain the marginal likelihood This result follows from the well-known superposition property of Poisson random variables [7], namely, when s i ∼ P O(s i ; λ i ) and x = s 1 + s 2 + · · · + s I , then the marginal probability is given by p(x) = P O(x; i λ i ). The maximisation of this objective in T and V is equivalent to the minimisation of the information divergence in (3). In the derivation of original NMF in [8], this objective is stated first; the S variables are introduced implicitly later during the optimisation on T and V . In the sequel, we show that this algorithm is actually equivalent to EM, ignoring the priors p(T | ·) and p(V | ·).

Maximum Likelihood and the EM Algorithm.
The loglikelihood of the observed data X can be written as where q(S) is an instrumental distribution, that is arbitrary provided that the sum on the right exists; q can only vanish at a particular S only when p does so. Note that this defines a lower bound to the log-likelihood. It can be shown via functional derivatives and imposing the normalisation condition S q(S) = 1 via Lagrange multipliers that the lower bound is tight for the exact posterior of the latent sources, that is, Hence the log-likelihood can be maximised iteratively as follows: Here In the E step, we compute the posterior distribution of S. This defines a lower bound on the likelihood For many models in the exponential family, which includes (5), the expectation on the right depends on the sufficient statistics of q(S) (n) and is readily available; in fact calculating q(S) should be literally taken as calculating the sufficient statistics of q(S). The lower bound is readily obtained as a function of these sufficient statistics, and maximisation in the M Step yields a fixed point equation.

The E
Step. To derive the posterior of the latent sources, we observe that For the model in (5), we have Computational Intelligence and Neuroscience 3 It follows from (5), (12), (13), and (7) log p(S | X, T, V ) where Here, M denotes a multinomial distribution defined by where s = {s 1 , s 2 , . . . , s I }, p = {p 1 , p 2 , . . . , p I }, and p 1 + p 2 + · · · + p I = 1. Here, p i , i = 1 · · · I are the cell probabilities, and x is the index parameter where s 1 + s 2 + · · · + s I = x. The Kronecker delta function is defined by δ(x) = 1 when x = 0, and δ(x) = 0 otherwise. It is a standard result that the marginal mean is that is, the expected value of each source s i is a fraction of the observation, where the fraction is given by the corresponding cell probability.

The M
Step. It is indeed a good news that the posterior has an analytic form. Since now the M step can be calculated easily as follows: Fortunately, for maximisation with respect to T and V , the last two difficult terms are merely constant, and we need only to maximise the simpler objective where we only need the expected value of the sources given by the previous values of the templates and excitations: Maximisation of the objective Q and substituting s ν,i,τ (n) give the following fixed point equations: Equation (20) is identical to the multiplicative update rules of [8]. However, our derivation via data augmentation obtains the same result as an EM algorithm. It is interesting to note that in literature, NMF is often described as EM-like; here, we show that it is actually just an EM algorithm. We see that the efficiency of NMF is due to the fact that the W × I × K object S needs not to be explicitly calculated as we only need its marginal statistics (sums across τ or ν).
We note that our model is valid when X is integer valued. See [9] for a detailed discussion about consequences of this issue. Here, we assume that for nonnegative real valued X, we only consider the integer part, that is, we let X = X + E, where E is a noise matrix with entries uniformly drawn in [0, 1). In practice, this is not an obstacle when the entries of X are large.
The interpretation of NMF as a maximum likelihood method in a Poisson model is mentioned in the original NMF paper [1] and discussed in more detail by [5,10]. The equivalence of NMF and probabilistic latent sematic analysis is shown in [11]. Kameoka in [5] focuses on the optimisation and gives an equivalent description using auxiliary function maximisation. In contrast, the auxiliary variables can be viewed as model variables (the sources s) that are analytically integrated out [10]. A general framework is described in [12]. Prior structures are placed on conditionally Gaussian NMF models to enforce sparsity in [13]. However, all of these approaches are based on regularisation, that is, aim at calculating a maximum a posteriori estimate. In contrast, we provided in this article a full Bayesian treatment where the templates and excitations are integrated out.

Hierarchical Prior Structure.
Given the probabilistic interpretation, it is possible to propose various hierarchical prior structures to fit the requirements of an application. Here we will describe a simple choice where we have a conjugate prior as follows:  Each source element s ν,i,τ is Poisson distributed with intensity t ν,i v i,τ . The observations are given by x ν,τ = i s ν,i,τ . In matrix notation, we write X = S i . We can analytically integrate out over S. Due to superposition property of Poisson distribution, intensities add up, and we obtain X = TV. Given X, the NMF algorithm is shown to seek the maximum likelihood estimates of the templates T and excitations V . In our Bayesian treatment, we further assume that elements of T and V are Gamma distributed with hyperparameters Θ.
Here, G denotes the density of a gamma random variable x ∈ R + with shape a ∈ R + and scale b ∈ R + defined by The primary motivation for choosing a Gamma distribution is computational convenience: Gamma distribution is the conjugate prior to Poisson intensity. The indexing highlights the most general case where there are individual parameters for each element t ν,i and v i,τ . Typically, we do not allow many free hyperparameters but tie them depending upon the requirements of an application. See Figure 1 for an example. As an example, consider a model where we tie the hyperparameters such as a t a v , and b v i,τ = b v for i = 1 · · · I, ν = 1 · · · W, and τ = 1 · · · K. This model is simple to interpret, where each component of the templates and the excitations is drawn independently from the Gamma family shown in Figure 2. Qualitatively, the shape parameter a controls the sparsity of the representation. Remember that G(x; a, b/a) has the mean b and standard deviation b/ √ a. Hence, for large a, all coefficients will have more or less the same magnitude b, and typical representations will be full. In contrast, for small a, most of the coefficients will be very close to zero, and only very few will be dominating, hence favouring a sparse representation. The scale parameter b is adapted to give the expected magnitude of each component.
To model missing data, that is, when some of the x ν,τ are not observed, we define a mask matrix M = {m ν,τ }, the same size as X where m ν,τ = 0, if x ν,τ is missing and 1 otherwise (see Appendix A.4 for details). Using the mask variables, the observation model with missing data can be written as The hierarchical model in (21) is more powerful than the basic model of (5), in that it allows a lot of freedom for more realistic modelling. First of all, the hyperparameters can be estimated from examples of a certain class of source to capture the invariant features. Another possibility is Bayesian model selection, where we can compare alternative models in terms of their marginal likelihood. This enables one to estimate the model order, for example, the optimum number of templates to represent a source.

Full Bayesian Inference
Below, we describe various interesting problems that can be cast to Bayesian inference problems. In signal analysis and feature extraction with NMF, we may wish to calculate the posterior distribution of templates and excitations, given data and hyperparameters Θ ≡ (Θ t , Θ v ). Another Computational Intelligence and Neuroscience 5 (1) Initialise: important quantity is the marginal likelihood (also known as the evidence), where The marginal likelihood can be used to estimate the hyperparameters, given examples of a source class or to compare two given models via Bayes factors This latter quantity is particularly useful for comparing different classes of models. Unfortunately, the integrations required cannot be computed in closed form. In the sequel, we will describe the Gibbs sampler and variational Bayes as approximate inference strategies.

Variational Bayes.
We sketch here the Variational Bayes (VB) [3,14] method to bound the marginal log-likelihood as where q = q(S, T, V ) is an instrumental distribution, and H[q] is its entropy. The bound is tight for the exact posterior q(S, T, V ) = p(S, T, V | X, Θ) but as this distribution is complex, we assume a factorised form for the instrumental distribution by ignoring some of the couplings present in the exact posterior as follows: where α ∈ C = {{S}, {T}, {V }} denotes set of disjoint clusters. Hence, we are no longer guaranteed to attain the exact marginal likelihood L X (Θ). Yet, the bound property is preserved, and the strategy of VB is to optimise the bound. Although the best q distribution respecting the factorisation is not available in closed form, it turns out that a local optimum can be attained by the following fixed point iteration: where q ¬α = q/q α . This iteration monotonically improves the individual factors of the q distribution, that is, given an initialisation q (0) . The order is not important for convergence; one could visit blocks in arbitrary order. However, in general, the attained fixed point depends upon the order of the updates as well as the starting point q (0) (·). We choose the following update order in our derivations:

Variational Update Equations and Sufficient Statistics.
The expectations of log p(X, S, T, V | Θ) are functions of the sufficient statistics of q (see the expression in the Appendix A.2). The update equation for the latent sources (30) leads to the following: These equations are analogous to the multinomial posterior of EM given in (14); only the computation of cell probabilities is different. The excitation and template distributions and their sufficient statistics follow from the properties of the gamma distribution: 3.3. Efficient Implementation. One of the attractive features of NMF is easy and efficient implementation. In this section, we derive that the update equations of Section 3.2 in compact matrix notation are to illustrate that these attractive properties are retained for the full Bayesian treatment. A subtle but key point in the efficiency of the algorithm is that we can avoid explicitly storing and computing the W × I × K object S , as we only need the marginal statistics during optimisation. Consider (33). We can write Here, the denominator has to be nonzero. In the last line, we have represented the expression in compact notation where we define the following matrices: The matrices subscripted with t are in R W×I + and with v are in R I×K + . For notational convenience, we define . * and ./ as elementwise matrix multiplication and division, respectively, and 1 W as a W × 1 vector of ones. After straightforward substitutions, we obtain the variational nonnegative matrix factorisation algorithm, that can compactly be expressed as in panel Algorithm 1.
Similarly, an iterative conditional modes (ICM) algorithm can be derived to compute the maximum a posteriori (MAP) solution (see Appendix A.4): Note that when the shape parameters go to zero, that is, A t , A v → 0, we obtain the maximum likelihood NMF algorithm.

Markov Chain Monte Carlo, the Gibbs Sampler. Monte
Carlo methods [15,16] are powerful computational techniques to estimate expectations of form Computational Intelligence and Neuroscience 7 (1) Initialize: The Markov Chain Monte Carlo (MCMC) techniques generate subsequent samples from a Markov chain defined by a transition kernel T , that is, one generates x (i+1) conditioned on x (i) as follows: Note that the transition kernel T is not needed explicitly in practice; all is needed is a procedure to sample a new configuration, given the previous one. Perhaps surprisingly, even though subsequent samples are correlated, provided that T satisfies certain ergodicity conditions, (39) remains still valid, and estimated expectations converge to their true values when number of samples N goes to infinity [15]. To design a transition kernel T such that the desired distribution is the stationary distribution, that is, , many alternative strategies can be employed [16]. One particularly convenient and simple procedure is the Gibbs sampler where one samples each block of variables from full conditional distributions. For the NMF model, a possible Gibbs sampler is Note that this procedure implicitly defines a transition kernel T (· | ·). It can be shown [15] that the stationary distribution of T is the exact posterior p(S, T, V | X, Θ). Eventually, the Gibbs sampler converges regardless of the order that the blocks are visited, provided that each block is visited infinitely often in the limit n → ∞. However, the rate of convergence is very difficult to assess as it depends upon the order of the updates as well as the starting configuration (T (0) , V (0) , S (0) ). It is instructive to contrast above (41) with the variational update of (30)-(32): algorithmically the two approaches are quite similar. The pseudo-code is given in Algorithm 2.

Marginal Likelihood Estimation with Chib's Method.
The marginal likelihood can be estimated from the samples generated by the Gibbs sampler using a method proposed by Chib [17]. Suppose we have run the block Gibbs sampler until convergence and have N samples as follows: The marginal likelihood is (omitting hyperparameters Θ) This equation holds for all points (V , T, S). We choose a point in the configuration space; provided that the distribution is unimodal, a good candidate is a configuration near the mode ( T, V , S). The numerator in (43) is easy to evaluate. The denominator is The first term is full conditional, so it is available for the Gibbs sampler. The third term is 8

Computational Intelligence and Neuroscience
The second term is trickier The first term here is full conditional. However, the original Gibbs run gives us only samples from p( The idea is to run the Gibbs sampler for M further iterations where we sample from (V (m) Chib's method estimates the marginal likelihood as follows:

Simulations
Our goal is to illustrate our approach in a model selection context. We first illustrate that the variational approximation to the marginal likelihood is close to the one obtained from the Gibbs sampler via Chib's method. Then, we compare the quality of solutions we obtain via Variational NMF and compare them to the original NMF on a prediction task. Finally, we focus on reconstruction quality in the overcomplete case where the standard NMF is subject to overfitting.

Model Order Determination.
To test our approach, we generate synthetic data from the hierarchical model in (21) with W = 16, K = 10, and the number of sources I true = 5. The inference task is to find the correct number of sources, given X. The hyperparameters of the true model are set to In the first experiment, the hyperparameters are assumed to be known and in the second are jointly estimated from data, using hyperparameter adaptation. We evaluate the marginal likelihood for models with the number of templates I = 1 · · · 10, with the Gibbs sampler using Chib's method and variational lower bound B via variational Bayes. We run the Gibbs sampler for MAXITER = 10 000 steps following a burn-in period of 5000 steps; then we clamp the sources S and continue the simulation for a further 10 000 steps to estimate quantities required by Chib's method. We run the variational algorithm until convergence of the bound or 10 000 iterations, whichever occurs first. In Figure 3 via Chib's method. We observe, that both methods give consistent results. In Figure 4, we show the lower bound as a function of model order I, where for each I, the bound is optimised independently by jointly optimising hyperparameters a t , b t , a v , and b v using the equations derived in the appendix. We observe, that the correct model order can be inferred even when the hyperparameters are unknown a priori. This is potentially useful for estimation of model order from real data.  As real data, we use a version of the Olivetti face image database (K = 400 images of 64 × 64 pixels available at http://www.cs.toronto.edu/∼roweis/data/olivettifaces.mat). We further downsampled to 16 × 16 or 32 × 32 pixels, hence our data matrix X is 16 2 × 400 or 32 2 × 400. We use a model with tied hyperparameters as a t ν,i = a t , b t ν,i = b t , a v i,τ = a v , and b v i,τ = b v , where all hyperparameters are jointly estimated. In Figure 4, bottom, we show results of model order determination for this dataset with joint hyperparameter adaptation. Here, we run the variational algorithm for each model order I = 1 · · · 100 independently and evaluate the lower bound after optimising the hyperparameters. The Gibbs sampler is not found practical and is omitted here. The lower bound behaves as is expected from marginal likelihood, reflecting the tradeoff between too many and too few templates. Higher resolution implies more templates, consistent with our intuition that detail requires more templates for accurate representation.
We also investigate the nature of the representations (see Figure 5). Here, for each independent run, we fix the values of shape parameters to (a t , a v ) = [(10, 10), (0.1, 0.1), (10, 0.2), (10, 0.5)] and only estimate b t and b v . This corresponds to enforcing sparse or nonsparse t and v. Each column shows I = 36 templates estimated from the dataset conditioned on hyperparameters. The middle image is the same template image above weighted with the excitations corresponding to the reconstruction (the expected value of the predictive distribution) below. Here, we clearly see the effect of the hyperparameters. In the first condition (a t , a v ) = (10, 10), the prior does not enforces sparsity to the templates and excitations. Hence, for the representation of a given image, there are many active templates. In the second condition, we try to force both matrices to be sparse with (a t , a v ) = (0.1, 0.1). Here, the result is not satisfactory as isolated components of the templates are zeroed, giving a representation that looks like one contaminated by "salt-and-pepper" noise. The third condition ((a t , a v ) = (10, 0.2)) forces only the excitations to be sparse. Here, we observe that the templates correspond to some average face images. Qualitatively, each image is reconstructed using a superposition of a few of these templates. In the final representation, we enforce sparsity in the templates but not in the excitations. Here, our estimate finds templates that correspond to parts of individual face images (eyebrows, lips, etc.). This solution, intuitively corresponding to a parsimonious representation, also is the best in terms of the marginal likelihood. With proper initialisation, our variational procedure is able to find such solutions.
Prediction. We now compare variational Bayesian NMF with the maximum likelihood NMF on a missing data prediction task.
To illustrate the self regularisation effect, we set up an experiment in which we select a subset of the face data consisting of 50 images. From half of the images, we remove the same patch ( Figure 6) and predict the missing pixels. This is a rather small dataset for this task, as we have only 10 images for each of the 5 different persons, and half of these images have missing data at the same spot. We measure the quality of the prediction in terms of signal-to-noise ratio (SNR). The missing values are reconstructed using the mean of the predictive distribution X pred ≡ X P O(X;T * V * ) = T * V * , where T * and V * are point estimates of the template and excitation matrix. We compare our variational algorithm with the classical NMF. For each algorithm, we test two different versions. The variational algorithms differ in how we estimate T * and V * . In the first variational algorithm, we use a crude estimate of T * and V * as the mean of the approximating q distribution. In the second condition, after convergence of hyperparameters via VB, we reinitialise T and V randomly and switch to an ICM algorithm (see (38)). This strategy finds a local mode (T * , V * ) of the exact posterior distribution. In NMF, we test two initialisation strategies: in the first condition, we initialise the templates randomly; in the second, we set them equal to the images in the dataset with random perturbations.
In Figure 6, we show the reconstruction results for a typical run, for a model with I = 100 templates. Note that this is an overcomplete model, with twice as many templates as there are images. To characterise the nature of the estimated template and excitation matrices, we use the sparseness criteria [18] of an m × n matrix X, defined as . This measure is 1 when the matrix X has only a single nonzero entry and 0 when all entries are equal. We see that the variational algorithms are superior in this case in terms of SNR as well as the visual quality of the reconstruction. This is perhaps not surprising, since with maximum likelihood estimation; if the model order is not carefully chosen, generalisation performance is poor: the "noise" in the observed data is fitted but the prediction quality drops on new data. An interesting observation is that highly sparse solutions (either in templates or excitations) do not give the best result; the solution that balances both seems to be the best in this setting. This example illustrates that sparseness in itself may not be necessarily a good criteria to optimise; for model selection, the marginal likelihood should be used as the natural quantity. On the same face dataset, we compare the prediction error in terms of the SNR for varying model order I. Our goal is to compare the prediction performance of the full Bayesian approach with the ML-NMF for a range of conditions (under-complete, complete, and overcomplete). The results shown in Figure 7 are averages of several runs with hyperparameter adaptation and different hyperparameter tying. In the simulations, the shape parameters are tied always as a v i,τ = a v (and a t ν,i = a t ). The scale parameters are untied or tied as and jointly optimised. Regardless of the hyperparameter tying structure, the results were quite similar. The best SNR values are attained with untied scale parameters for both excitations and templates. We observe that, due to the implicit self-regularisation in the Bayesian approach, the prediction performance is not very sensitive to the model order and is immune to overfitting. In contrast, the ML-NMF with random initialisation is prone to overfitting, and prediction performance drops with increasing model order. Interestingly, when we initialise the ML-NMF algorithm to true data points with small perturbations, the prediction performance in terms of SNR improves. Note that this strategy would not be possible for data where the pixels were truly missing. However, visual inspection shows that the interpolation can still be "patchy" (see Figure 6).
We observe that hyperparameter adaptation is crucial for obtaining good prediction performance. In our simulations, results for VB without hyperparameter adaptation were occasionally poorer than the ML estimates. Good initialisation of the shape hyperparameters seems to be also important. We obtain best results when initialising the shape hyperparameters asymmetrically, for example, a v < 1 and a t > 10 (see 3rd and 4th panels from left in Figure 5). When the shape hyper-parameters are initialised to small a v , a t 1, the EM seems to get stuck in a local minima more often. Consequently, the prediction results are poorer. We have also carried out tests with more undercomplete representations when the model order is low I < 10. For these simulations, while the marginal likelihood was in favour of the VB solutions, we have not observed statistically significant differences between VB and ML in terms of SNR. The SNR improvement of VB over ML was on average about 0.1 dB only.

Discussion and Conclusions
In this paper, we have investigated KL-NMF from a statistical perspective. We have shown that KL minimisation formulation the original algorithm can be derived from a probabilistic model where the observations are superposi-tion of I independent Poisson-distributed latent sources. Here, the template and excitation matrices turn out to be latent intensity parameters. The interpretation of NMF as a maximum likelihood method in a Poisson model is mentioned in the original NMF paper [1] and discussed in more detail by [5,10], and [5] focuses on the optimisation and gives an equivalent description using auxiliary function maximisation. In contrast, [10] illustrates that the auxiliary variables can be viewed as model variables (the sources s) that are analytically integrated out. The relationship between KL divergence and the Poisson distribution is not just a lucky coincidence. There exists a duality between divergence functions and exponential family distributions. If a cost function is a Bregman divergence, there exists a regular exponential family where minimising the cost corresponds to maximum likelihood parameter estimation [19]; also see [12] it in the context of matrix factorisation models.
The novel observation in the current article is the exact characterisation of the approximating distribution q(S) or full conditionals p(S | T, V , X) as a product of multinomial distributions, leading to a richer approximation distribution than a naive mean field or single site Gibbs (which would freeze due to deterministic p(X | S)). This conjugate form leads to significant simplifications in full Bayesian integration. Apart from the conditionally Gaussian case, NMF with KL objective seems to be unique in this respect. For several other distance metrics D(· ·), we find that full Bayesian inference not as practical as p(S | T, V , X) is not standard.
We have also shown that the standard KL-NMF algorithm with multiplicative update rules is in fact an EM algorithm with data augmentation. Extending upon this observation, we have developed an hierarchical model with conjugate Gamma priors. We have developed a variational Bayes algorithm and a Gibbs sampler for inference in this hierarchical model. We have also developed methods for estimating the marginal likelihood for model selection. This is an additional feature that is lacking in existing NMF approaches with regularisation, where only MAP estimates are obtained, such as [13,18,20].
Our simulations suggest that the variational bound seems to be a reasonable approximation to the marginal likelihood and can guide model selection for NMF. The computational requirements are comparable to the ML-NMF. A potentially time-consuming step in the implementation of the variational algorithm is the evaluation of the Ψ function but this step can also be replaced by a simple piecewise polynomial approximation since exp(Ψ(x)) ≈ x − 0.5 for x > 5.
We first compare the variational inference with a Gibbs sampler. In our simulations, we observe that both algorithms give qualitatively very similar results, both for inference of templates and excitations as well as model order selection. We find the variational approach somewhat more practical as it can be expressed as simple matrix operations, where both the fixed point equations as well as the bound can be compactly and efficiently implemented using matrix computation software. In contrast, our Gibbs sampler is computationally more demanding, and the calculation of marginal likelihood is somewhat more tricky. With our implementation of both algorithms, the variational method is faster by a factor of around 13. Reference implementations of both algorithms in Matlab are available from the following url: http://www.cmpe.boun.edu.tr/∼cemgil/bnmf/.
In terms of computational requirements, the variational procedure has several advantages. First, we circumvent sampling from multinomial variables, which is the main computational bottleneck with the Gibbs sampler. Whilst efficient algorithms are developed for multinomial sampling [21], the procedure is time consuming when the number of latent sources I is large. In contrast, the variational method estimates the expected sufficient statistics directly by elementary matrix operations. Another advantage is hyperparameter estimation. In principle, it is possible to maximise the marginal likelihood via a Monte Carlo EM procedure [22,23]; yet this potentially requires many more iterations. In contrast, the evaluation of the derivatives of the lower bound is straightforward and can be implemented without much additional computational cost.
The efficiency of the Gibbs sampler could be improved by working out the distribution of the sufficient statistics of sources directly (namely, quantities τ s ν,i,τ or ν s ν,i,τ ) to circumvent multinomial sampling. Unfortunately, for the sum of binomial random variables with different cell probability parameters, the sum does not have a simple form but various approximations are possible [24].
Inference based on VB is easy to implement but at the end of the day, the fixed point iteration is just a gradient-based lower bound optimisation procedure, and second order Newton methods can provide more efficient alternatives. For NMF models, there exist many conditional independence relations, hence the Hessian matrix has a special block structure [12]. It is certainly interesting to develop efficient inference methods that make use of the special block structure of the Hessian matrix. However, as our primary goal was a practical full Bayesian treatment, we have not investigated this path yet. Another approach in this direction is using alternative deterministic integration techniques such as expectation propagation (EP) [25]. Those techniques work directly on an approximation of the true marginal likelihood rather than a bound. A related approach known as expectation consistent (EC) inference is used with success in related source separation problems [26].
From a modelling perspective, our hierarchical model provides some attractive properties. It is easy to incorporate prior knowledge about individual latent sources via hyperparameters, and one can easily capture variability in the templates and excitations that is potentially useful for developing robust techniques. The prior structure here is qualitatively similar to an entropic prior [20,27], and we find qualitatively similar representations to the ones found by NMF reported earlier by [1,18]. However, none of the above mentioned methods provide an estimate of the marginal likelihood, which is useful for model selection. Our generative model formulation can be extended in various ways to suit the specific needs of particular applications. For Computational Intelligence and Neuroscience 13 example, one can enforce more structured prior models such as chains or fields [10]. As a second possibility, the Poisson observation model can be replaced with other models such as clipped Gaussian, Gamma, or Gaussians which lead to alternative source separation algorithms. For example, the case of Gaussian sources where the excitations and templates correspond to the variances is discussed in [28].
Our main contribution here is the development of a principled and practical way to estimate both the optimal sparsity criteria and model order, in terms of marginal likelihood. By maximising the bound on marginal likelihood, we have a method where all the hyperparameters can be estimated from data, and the appropriate sparseness criteria is found automatically. We believe that our approach provides a practical improvement to the highly popular KL-NMF algorithm without incurring much additional computational cost.

A. Standard Distributions in Exponential Form, Their Sufficient Statistics and Entropies
Here, s = {s 1 , s 2 , . . . , s I }, p = {p 1 , p 2 , . . . , p I }, and p 1 + p 2 + · · · + p I = 1. Here, p i , i = 1 · · · I are the cell probabilities, and x is the index parameter where s 1 + s 2 + · · · + s I = x. The entropy is given as follows: A closed form expression for the entropy is not known due to log Γ(s + 1) terms but asymptotic expansions exist [29,30]. Computationally efficient sampling from a multinomial distribution is not trivial; see [21] for a comparison of various methods and detailed discussion of tradeoffs.
A.1. Summary of the Generative Model. We have the following. Indices: i = 1 · · · I, source index; ν = 1 · · · W, Row (frequency bin) index; t ν,i : template variable at νth row of the ith source s ν,i,τ : source variable of ith source at νth row (frequency bin) and τth column (time frame) x ν,τ : observation at νth row (frequency bin) and τth column (time frame) A. 3. The Variational Bound. The variational bound in (27) can be written as where the energy term is given by the expectation of the expression in Appendix A.2, and H[q] denotes the entropy of the variational approximation distribution q where the individual entropies are defined in Appendix A: One potential problem is that this expression requires the entropy of a multinomial distribution for which there is no known simple expression. This is due to terms of form log Γ(s + 1) where only asymptotic expansions are known. Fortunately, the difficult terms in the energy term can be canceled by the corresponding terms in the entropy term, and one obtains the following expression that only depends on known sufficient statistics: (A.12) After some careful manipulations, the following expression is obtained where log L denotes here elementwise logarithm of matrix L: (A.13)

A.4. Handling Missing Data and MAP Estimation.
When there is missing data, that is, when some of the x ν,τ are not observed, computation is still straightforward in our framework and can be accomplished by a simple modification to the original algorithm. We first define a mask matrix M = {m ν,τ }, same size as X, where m ν,τ = ⎧ ⎨ ⎩ 0, x ν,τ is missing, 1, otherwise. (A.14) Using the mask variables, the observation model with missing data can be written as follows: p(X | S)p(S | T, V ) Consequently, it is easy to see that By a derivation analogous to one detailed in Section 3.3, we see that the excitation update equations in Algorithm 1, line 4, can be written using matrix notation as follows: The update rules for the templates are similar. Note that when there is no missing data, we have M = 1 W 1 K which gives the original algorithm. The bound in (A.13) can also be easily modified for handling missing data. We merely replace X ← M . * X and the first term E t E v ← M . * E t E v . We conclude this subsection by noting that the standard NMF update equations, given in (20), can be also rewritten to handle missing data as follows: Here, the denominator has to be nonzero. Similarly, an iterative conditional modes (ICM) algorithm can be derived to compute the maximum a posteriori (MAP) solution as follows: Note that when the shape parameters go to zero, that is, A t , A v → 0, we obtain the maximum likelihood NMF algorithm. It is well known that Newton iterations can diverge if started away from the root. Occasionally, we observe that a can become negative. If this is the case, we set Δ (n) ← Δ (n) /2, and try again. The digamma Ψ function and its derivative Ψ are available in numeric computation libraries (e.g., in Matlab as psi(0, a) and psi (1, a), resp.). The derivation of the hyperparameter update equations is straightforward: (A.23) Tying parameters across τ as a v i = a v i,τ and b (A. 25)