1. Introduction

CIN

Computational Intelligence and Neuroscience

1687-52731687-5265

Hindawi Publishing Corporation

785152

10.1155/2009/785152

785152

Research Article

Bayesian Inference for Nonnegative Matrix Factorisation Models

Cemgil

Ali Taylan

Cruces-Alvarez

Department of Computer Engineering

Boğaziçi University

34342 Bebek

Istanbul

Turkey

boun.edu.tr

2009

27052009

20092908200814022009

2009

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We describe nonnegative matrix factorisation (NMF) with a Kullback-Leibler (KL) error measure in a statistical framework, with a hierarchical generative model consisting of an observation and a prior component. Omitting the prior leads to the standard KL-NMF algorithms as special cases, where maximum likelihood parameter estimation is carried out via the Expectation-Maximisation (EM) algorithm. Starting from this view, we develop full Bayesian inference via variational Bayes or Monte Carlo. Our construction retains conjugacy and enables us to develop more powerful models while retaining attractive features of standard NMF such as monotonic convergence and easy implementation. We illustrate our approach on model order selection and image reconstruction.

1. Introduction

In machine learning, nonnegative matrix factorisation (NMF) was introduced by Lee and Seung [1] as an alternative to k-means clustering and principal component analysis (PCA) for data analysis and compression (also see [2]). In NMF, given a W×K nonnegative matrix X={xν,τ}, where ν=1:W, τ=1:K, we seek positive matrices T and V such that(1)xν,τ≈[TV]ν,τ=∑itν,ivi,τ,where i=1:I. We will refer to the W×I matrix T as the template matrix, and I×K matrix V the excitation matrix. The key property of NMF is that T and V are constrained to be positive matrices. This is in contrast with PCA, where there are no positivity constraints or k-means clustering where each column of V is constrained to be a unit vector. Subject to the positivity constraints, we seek a solution to the following minimisation problem:(2)(T,V)*=arg min T,V>0 D(X∥TV).Here, the function D is a suitably chosen error function. One particular choice for D, on which we will focus here, is the information (Kullback-Leibler) divergence, which we write as(3)D(X∥Λ)=−∑ν,τ(xν,τ logλν,τxν,τ−λν,τ+xν,τ).Using Jensen's inequality [3] and concavity of log x, it can be shown that D(·) is nonnegative and D(X||Λ)=0 if and only if X=Λ. The objective in (2) could be minimised by any suitable optimisation algorithm. Lee and Seung [1] have proposed a very efficient variational bound minimisation algorithm that has attractive convergence properties and which has been successfully applied in various applications in signal analysis and source separation, for example, [4–6].

The interpretation of NMF, like singular value decomposition (SVD), as a low rank matrix approximation is sufficient for the derivation of a useful inference algorithm; yet this view arguably does not provide the complete picture about assumptions underlying the statistical properties of X. Therefore, we describe NMF from a statistical perspective as a hierarchical model. In our framework, the original nonnegative multiplicative update equations of NMF appear as an expectation-maximisation (EM) algorithm for maximum likelihood estimation of a conditionally Poisson model via data augmentation. Starting from this view, we develop Bayesian extensions that facilitate more powerful modelling and allow more sophisticated inference, such as Bayesian model selection. Inference in the resulting models can be carried out easily using variational (structured mean field) or Markov Chain Monte Carlo (Gibbs sampler). The resulting algorithms outperform existing NMF strategies and open up the way for a full Bayesian treatment for model selection via computation of the marginal likelihoods (the evidence), such as estimating the dimensions of the template matrix or regularising overcomplete representations via automatic relevance determination.

2. The Statistical Perspective

The interpretation of NMF as a low-rank matrix approximation is sufficient for the derivation of an inference algorithm; yet this view arguably does not provide the complete picture. In this section, we describe NMF from a statistical perspective. This view will pave the way for developing extensions that facilitate more realistic and flexible modelling as well as more sophisticated inference, such as Bayesian model selection.

Our first step is the derivation of the information divergence error measure from a maximum likelihood principle. We consider the following hierarchical model: (4)T~p(T∣Θt), V~p(V∣Θv),(5)sν,i,τ~𝒫𝒪(sν,i,τ;tν,ivi,τ), xν,τ=∑isν,i,τ. Here, 𝒫𝒪(s;λ) denotes the Poisson distribution of the random variable s∈ℕ0 with nonnegative intensity parameter λ, where(6)𝒫𝒪(s;λ)=exp(s log λ−λ−log Γ(s+1))and Γ(s+1)=s! is the gamma function. The priors p(T∣·) and p(V∣·) will be specified later. We call the variables Si={sν,i,τ}latent sources. We can analytically marginalise out the latent sources S={S1⋯SI} to obtain the marginal likelihood(7)log p(X∣T,V)=log ∑Sp(X∣S)p(S∣T,V)=log ∏ν,τ𝒫𝒪(xν,τ;∑itν,i,vi,τ)=∑ν ∑τ(xν,τ log[TV]ν,τ−[TV]ν,τ −log Γ(xν,τ+1)).This result follows from the well-known superposition property of Poisson random variables [7], namely, when si~𝒫𝒪(si;λi) and x=s1+s2+⋯+sI, then the marginal probability is given by p(x)=𝒫𝒪(x;∑i λi). The maximisation of this objective in T and V is equivalent to the minimisation of the information divergence in (3). In the derivation of original NMF in [8], this objective is stated first; the S variables are introduced implicitly later during the optimisation on T and V. In the sequel, we show that this algorithm is actually equivalent to EM, ignoring the priors p(T∣·) and p(V∣·).

2.1. Maximum Likelihood and the EM Algorithm

The log-likelihood of the observed data X can be written as(8)ℒX(T,V)≡log ∑S p(X∣S)p(S∣T,V)≥∑Sq(S) logp(X,S∣T,V)q(S)≡ℬEM[q],where q(S) is an instrumental distribution, that is arbitrary provided that the sum on the right exists; q can only vanish at a particular S only when p does so. Note that this defines a lower bound to the log-likelihood. It can be shown via functional derivatives and imposing the normalisation condition ∑Sq(S)=1 via Lagrange multipliers that the lower bound is tight for the exact posterior of the latent sources, that is,(9)arg max q(S) ℬEM[q]=p(S∣X,T,V).Hence the log-likelihood can be maximised iteratively as follows:

(10)E Step q(S)(n)=p(S∣X,T(n−1),V(n−1)),M Step (T(n),V(n))=arg max T,V 〈 log p(S,X∣T,V)〉q(S)(n). Here, 〈f(x)〉p(x)=∫ p(x)f(x)dx, the expectation of some function f(x) with respect to p(x). In the E step, we compute the posterior distribution of S. This defines a lower bound on the likelihood(11)ℬ(n)(T,V∣T(n−1),V(n−1))=〈 log p(S,X∣T,V)〉q(S)(n).For many models in the exponential family, which includes (5), the expectation on the right depends on the sufficient statistics of q(S)(n) and is readily available; in fact calculating q(S) should be literally taken as calculating the sufficient statistics of q(S). The lower bound is readily obtained as a function of these sufficient statistics, and maximisation in the M Step yields a fixed point equation.

2.1.1. The E Step

To derive the posterior of the latent sources, we observe that(12)p(S∣X,T,V)=p(S,X∣T,V)p(X∣T,V).For the model in (5), we have(13)logp(S,X∣T,V) =∑ν ∑τ(∑i(−tν,ivi,τ+sν,i,τ log(tν,ivi,τ)−log Γ(sν,i,τ+1)) +log δ(xν,τ−∑isν,i,τ)).It follows from (5), (12), (13), and (7)(14)log p(S∣X,T,V) =∑ν ∑τ(∑i(sν,i,τ log(tν,ivi,τ∑i'tν,i'vi',τ)−log Γ(sν,i,τ+1)) +log Γ(xν,τ+1)+log δ(xν,τ−∑isν,i,τ)) =∑ν ∑τlog ℳ(sν,1,τ,…,sν,I,τ;xν,τ,pν,1,τ,…,pν,I,τ),where pν,i,τ≡tν,ivi,τ/∑i'tν,i'vi',τ are the cell probabilities. Here, ℳ denotes a multinomial distribution defined by(15)ℳ(s;x,p)=(xs1 s2 ⋯ sI)p1s1p2s2⋯pIsIδ(x−∑isi)=δ(x−∑isi)x!∏i=1Ipisisi!,where s={s1,s2,…,sI}, p={p1,p2,…,pI}, and p1+p2+⋯+pI=1. Here, pi, i=1⋯I are the cell probabilities, and x is the index parameter where s1+s2+⋯+sI=x. The Kronecker delta function is defined by δ(x)=1 when x=0, and δ(x)=0 otherwise. It is a standard result that the marginal mean is(16)〈si〉=xpi,that is, the expected value of each source si is a fraction of the observation, where the fraction is given by the corresponding cell probability.

2.1.2. The M Step

It is indeed a good news that the posterior has an analytic form. Since now the M step can be calculated easily as follows:(17)〈log p(S,X∣T,V)〉p(S∣X,T,V)=∑ν∑τ(∑i(−tν,ivi,τ+〈sν,i,τ〉log(tν,ivi,τ)−〈log Γ(sν,i,τ+1)〉) +〈log δ(xν,τ−∑isν,i,τ)〉).Fortunately, for maximisation with respect to T and V, the last two difficult terms are merely constant, and we need only to maximise the simpler objective(18)Q(T,V)=∑ν∑τ(∑i(−tν,ivi,τ+〈sν,i,τ〉(n) log(tν,ivi,τ))),where we only need the expected value of the sources given by the previous values of the templates and excitations:(19)〈sν,i,τ〉(n)=xν,τtν,i(n)vi,τ(n)∑i'tν,i'(n)vi',τ(n).Maximisation of the objective Q and substituting 〈sν,i,τ〉(n) give the following fixed point equations:(20)∂Q∂tν,i=−∑τvi,τ(n)+∑τ〈sν,i,τ〉(n)tν,i,tν,i(n+1)=∑τ〈sν,i,τ〉(n)∑τvi,τ(n)=tν,i(n)∑τxν,τvi,τ(n)/∑i'tν,i'(n)vi',τ(n)∑τvi,τ(n),∂Q∂vi,τ=−∑νtν,i(n)+∑ν〈sν,i,τ〉(n)vi,τ,vi,τ(n+1)=∑ν〈sν,i,τ〉(n)∑νtν,i(n)=vi,τ(n)∑νtν,i(n)xν,τ/∑i'tν,i'(n)vi',τ(n)∑νtν,i(n). Equation (20) is identical to the multiplicative update rules of [8]. However, our derivation via data augmentation obtains the same result as an EM algorithm. It is interesting to note that in literature, NMF is often described as EM-like; here, we show that it is actually just an EM algorithm. We see that the efficiency of NMF is due to the fact that the W×I×K object 〈S〉 needs not to be explicitly calculated as we only need its marginal statistics (sums across τ or ν).

We note that our model is valid when X is integer valued. See [9] for a detailed discussion about consequences of this issue. Here, we assume that for nonnegative real valued X˜, we only consider the integer part, that is, we let X˜=X+E, where E is a noise matrix with entries uniformly drawn in [0,1). In practice, this is not an obstacle when the entries of X are large.

The interpretation of NMF as a maximum likelihood method in a Poisson model is mentioned in the original NMF paper [1] and discussed in more detail by [5, 10]. The equivalence of NMF and probabilistic latent sematic analysis is shown in [11]. Kameoka in [5] focuses on the optimisation and gives an equivalent description using auxiliary function maximisation. In contrast, the auxiliary variables can be viewed as model variables (the sources s) that are analytically integrated out [10]. A general framework is described in [12]. Prior structures are placed on conditionally Gaussian NMF models to enforce sparsity in [13]. However, all of these approaches are based on regularisation, that is, aim at calculating a maximum a posteriori estimate. In contrast, we provided in this article a full Bayesian treatment where the templates and excitations are integrated out.

2.2. Hierarchical Prior Structure

Given the probabilistic interpretation, it is possible to propose various hierarchical prior structures to fit the requirements of an application. Here we will describe a simple choice where we have a conjugate prior as follows:(21)tν,i~𝒢(tν,i;aν,it,bν,itaν,it), vi,τ~𝒢(vi,τ;ai,τv,bi,τvai,τv).Here, 𝒢 denotes the density of a gamma random variable x∈ℝ+ with shape a∈ℝ+ and scale b∈ℝ+ defined by(22)𝒢(x;a,b)=exp((a−1) log x−xb−log Γ(a)−a log b).The primary motivation for choosing a Gamma distribution is computational convenience: Gamma distribution is the conjugate prior to Poisson intensity. The indexing highlights the most general case where there are individual parameters for each element tν,i and vi,τ. Typically, we do not allow many free hyperparameters but tie them depending upon the requirements of an application. See Figure 1 for an example. As an example, consider a model where we tie the hyperparameters such as aν,it=at, bν,it=bt, ai,τv=av, and bi,τv=bv for i=1⋯I, ν=1⋯W, and τ=1⋯K. This model is simple to interpret, where each component of the templates and the excitations is drawn independently from the Gamma family shown in Figure 2. Qualitatively, the shape parameter a controls the sparsity of the representation. Remember that 𝒢(x;a,b/a) has the mean b and standard deviation b/a. Hence, for large a, all coefficients will have more or less the same magnitude b, and typical representations will be full. In contrast, for small a, most of the coefficients will be very close to zero, and only very few will be dominating, hence favouring a sparse representation. The scale parameter b is adapted to give the expected magnitude of each component.

(a) A schematic description of the NMF model with data augmentation. (b) Graphical model with hyperparameters. Each source element sν,i,τ is Poisson distributed with intensity tν,ivi,τ. The observations are given by xν,τ=∑isν,i,τ. In matrix notation, we write X=∑ Si. We can analytically integrate out over S. Due to superposition property of Poisson distribution, intensities add up, and we obtain 〈X〉=TV. Given X, the NMF algorithm is shown to seek the maximum likelihood estimates of the templates T and excitations V. In our Bayesian treatment, we further assume that elements of T and V are Gamma distributed with hyperparameters Θ.

(a)(b)

Figure 2

(Left) The family of densities p(v;a,b=1)=𝒢(v;a,b/a) with the same mean 〈v〉=b=1. Small values of a (for a<1) enforce sparser representations, and large values of a≈100 tie all values to be close to a nonzero mean (nonsparse representation).

To model missing data, that is, when some of the xν,τ are not observed, we define a mask matrix M={mν,τ}, the same size as X where mν,τ=0, if xν,τ is missing and 1 otherwise (see Appendix A.4 for details). Using the mask variables, the observation model with missing data can be written as(23)p(X∣S)p(S∣T,V) =∏ν,τ(p(xν,τ∣sν,1:I,τ)p(sν,1:I,τ∣tν,1:I,v1:I,τ))mν,τ.

The hierarchical model in (21) is more powerful than the basic model of (5), in that it allows a lot of freedom for more realistic modelling. First of all, the hyperparameters can be estimated from examples of a certain class of source to capture the invariant features. Another possibility is Bayesian model selection, where we can compare alternative models in terms of their marginal likelihood. This enables one to estimate the model order, for example, the optimum number of templates to represent a source.

3. Full Bayesian Inference

Below, we describe various interesting problems that can be cast to Bayesian inference problems. In signal analysis and feature extraction with NMF, we may wish to calculate the posterior distribution of templates and excitations, given data and hyperparameters Θ≡(Θt,Θv). Another important quantity is the marginal likelihood (also known as the evidence), where(24)p(X∣Θ)=∫dT dV∑Sp(X∣S)p(S∣T,V)p(T,V∣Θ).The marginal likelihood can be used to estimate the hyperparameters, given examples of a source class(25)Θ*=argmaxΘ p(X∣Θ)or to compare two given models via Bayes factors(26)l(Θ1,Θ2)=p(X∣Θ1)p(X∣Θ2).

This latter quantity is particularly useful for comparing different classes of models. Unfortunately, the integrations required cannot be computed in closed form. In the sequel, we will describe the Gibbs sampler and variational Bayes as approximate inference strategies.

3.1. Variational Bayes

We sketch here the Variational Bayes (VB) [3, 14] method to bound the marginal log-likelihood as(27)ℒX(Θ)≡log p(X∣Θ)≥∑S∫d(T,V)q logp(X,S,T,V∣Θ)q=〈log p(X,S,V,T∣Θ)〉q+H[q]≡ℬVB[q],where q=q(S,T,V) is an instrumental distribution, and H[q] is its entropy. The bound is tight for the exact posterior q(S,T,V)=p(S,T,V∣X,Θ) but as this distribution is complex, we assume a factorised form for the instrumental distribution by ignoring some of the couplings present in the exact posterior as follows:(28)q(S,T,V)=q(S)q(T)q(V)=(∏ν,τq(sν,1:I,τ))(∏ν,iq(tν,i))(∏i,τq(vi,τ))≡∏α∈𝒞qα,where α∈𝒞={{S},{T},{V}} denotes set of disjoint clusters. Hence, we are no longer guaranteed to attain the exact marginal likelihood ℒX(Θ). Yet, the bound property is preserved, and the strategy of VB is to optimise the bound. Although the best q distribution respecting the factorisation is not available in closed form, it turns out that a local optimum can be attained by the following fixed point iteration:(29)qα(n+1)∝exp(〈log p(X,S,T,V∣Θ)〉q¬α(n)),where q¬α=q/qα. This iteration monotonically improves the individual factors of the q distribution, that is, ℬ[q(n)]≤ℬ[q(n+1)] for n=1,2,… given an initialisation q(0). The order is not important for convergence; one could visit blocks in arbitrary order. However, in general, the attained fixed point depends upon the order of the updates as well as the starting point q(0)(·). We choose the following update order in our derivations:(30)q(S)(n+1)∝exp(〈log p(X,S,T,V∣Θ)〉q(T)(n)q(V)(n)),(31)q(T)(n+1)∝exp(〈log p(X,S,T,V∣Θ)〉q(S)(n+1)q(V)(n)),(32)q(V)(n+1)∝exp(〈log p(X,S,T,V∣Θ)〉q(S)(n+1)q(T)(n+1)).

3.2. Variational Update Equations and Sufficient Statistics

The expectations of 〈log p(X,S,T,V∣Θ)〉 are functions of the sufficient statistics of q (see the expression in the Appendix A.2). The update equation for the latent sources (30) leads to the following:(33)q(sν,1:I,τ)∝exp(∑i(sν,i,τ(〈log tν,i〉+〈log vi,τ〉) −log Γ(sν,i,τ+1)))δ(xν,τ−∑isν,i,τ)∝ℳ(sν,1,τ,…,sν,i,τ,…,sν,I,τ; xν,τ,pν,1,τ,…,pν,i,τ,…,pν,I,τ),pν,i,τ=exp(〈log tν,i〉+〈log vi,τ〉)∑i exp(〈log tν,i〉+〈log vi,τ〉),〈sν,i,τ〉=xν,τpν,i,τ. These equations are analogous to the multinomial posterior of EM given in (14); only the computation of cell probabilities is different. The excitation and template distributions and their sufficient statistics follow from the properties of the gamma distribution:(34)q(tν,i)∝exp((aν,it+∑τ〈sν,i,τ〉−1)log(tν,i) −(aν,itbν,i+∑τ〈vi,τ〉)tν,i)∝𝒢(tν,i;αν,it,βν,it),αν,it≡aν,it+∑τ〈sν,i,τ〉, βν,it≡(aν,itbν,it+∑τ〈vi,τ〉)−1,exp(〈logtν,i〉)=exp(Ψ(αν,it))βν,it,〈tν,i〉=αν,itβν,it,q(vi,τ)∝exp((ai,τv+∑ν〈sν,i,τ〉−1)log vi,τ −(ai,τvbi,τv+∑ν〈tν,i〉)vi,τ)∝𝒢(vi,τ;αi,τv,βi,τv),αi,τv≡ai,τv+∑ν〈sν,i,τ〉, βi,τv≡(ai,τvbi,τv+∑ν〈tν,i〉)−1,exp(〈log vi,τ〉)=exp(Ψ(αi,τv))βi,τv,〈vi,τ〉=αi,τvβi,τv.

3.3. Efficient Implementation

One of the attractive features of NMF is easy and efficient implementation. In this section, we derive that the update equations of Section 3.2 in compact matrix notation are to illustrate that these attractive properties are retained for the full Bayesian treatment. A subtle but key point in the efficiency of the algorithm is that we can avoid explicitly storing and computing the W×I×K object 〈S〉, as we only need the marginal statistics during optimisation. Consider (33). We can write(35)∑τ〈sν,i,τ〉=∑τxν,τpν,i,τ=exp(〈log tν,i〉) ×∑τ(xν,τ(∑i'exp(〈log tν,i'〉)exp(〈log vi',τ〉))) ×exp(〈log vi,τ〉),Σt=Lt .*((X./(LtLv))Lv⊤).Here, the denominator has to be nonzero. In the last line, we have represented the expression in compact notation where we define the following matrices:

(36)Et={〈tν,i〉}, Lt={exp(〈log tν,i〉)},Σt={∑τ〈sν,i,τ〉}, At={aν,it},Bt={bν,it}, αt={αν,it}, βt={βν,it},Ev={〈vi,τ〉}, Lv={exp(〈log vi,τ〉)},Σv={∑ν〈sν,i,τ〉}, Av={ai,τv},Bv={bi,τv}, αv={αi,τv}, βv={βi,τv}. The matrices subscripted with t are in ℝ+W×I and with v are in ℝ+I×K. For notational convenience, we define .* and ./ as elementwise matrix multiplication and division, respectively, and 1W as a W×1 vector of ones. After straightforward substitutions, we obtain the variational nonnegative matrix factorisation algorithm, that can compactly be expressed as in panel Algorithm 1.

<bold>Algorithm 1: </bold>Variational nonnegative matrix factorisation.

(1) Initialise:

Lt(0)=Et(0)~𝒢(·;At,Bt./At) Lv(0)=Ev(0)~𝒢(·;Av,Bv./Av)

(2) forn=1… MAXITER do

(3) Source sufficient statistics

Σt(n):=Lt(n−1) .*(((X .*M)./(Lt(n−1)Lv(n−1)))Lv(n−1)⊤)

Σv(n):=Lv(n−1) .*(Lt(n−1)⊤((X .*M)./(Lt(n−1)Lv(n−1))))

(4) Means

Et(n):=αt(n) .*βt(n) αt(n)=At+Σt(n) βt(n)=1./(At./Bt+MEv(n−1)⊤)

Ev(n):=αv(n) .*βv(n) αv(n)=Av+Σv(n) βv(n)=1./(Av./Bv+Et(n)⊤M)

(5) Optional: Compute Bound (See Appendix, (A.13)

(6) Means of Logs

Lt(n)=exp(Ψ(αt(n))) .*βt(n) Lv(n)=exp(Ψ(αv(n))) .*βv(n)

(7) Optional: Update Hyperparameters (See Appendix, Section A.5)

(8) end for

Similarly, an iterative conditional modes (ICM) algorithm can be derived to compute the maximum a posteriori (MAP) solution (see Appendix A.4):(37)V:=(Av+V .*(T⊤((M .*X)./(TV)))) ./(Av./Bv+T⊤M),(38)T:=(At+T .*(((M .*X)./(TV))V⊤)) ./(At./Bt+MV⊤).Note that when the shape parameters go to zero, that is, At,Av→0, we obtain the maximum likelihood NMF algorithm.

3.4. Markov Chain Monte Carlo, the Gibbs Sampler

Monte Carlo methods [15, 16] are powerful computational techniques to estimate expectations of form(39)E=〈f(x)〉p(x)≈1N∑n=1Nf(x(i))=E˜N,where x(i) are independent samples drawn from p(x). Under mild conditions on f, the estimate E˜N converges to the true expectation for N→∞. The difficulty here is obtaining independent samples {x(i)}i=1⋯N from complicated distributions.

The Markov Chain Monte Carlo (MCMC) techniques generate subsequent samples from a Markov chain defined by a transition kernel𝒯, that is, one generates x(i+1) conditioned on x(i) as follows:(40)x(i+1)~𝒯(x∣x(i)).Note that the transition kernel 𝒯 is not needed explicitly in practice; all is needed is a procedure to sample a new configuration, given the previous one. Perhaps surprisingly, even though subsequent samples are correlated, provided that 𝒯 satisfies certain ergodicity conditions, (39) remains still valid, and estimated expectations converge to their true values when number of samples N goes to infinity [15]. To design a transition kernel 𝒯 such that the desired distribution is the stationary distribution, that is, p(x)=∫ dx′𝒯(x∣x′)p(x′), many alternative strategies can be employed [16]. One particularly convenient and simple procedure is the Gibbs sampler where one samples each block of variables from full conditional distributions. For the NMF model, a possible Gibbs sampler is(41)S(n+1)~p(S∣T(n),V(n),X,Θ),T(n+1)~p(T∣V(n),S(n+1),X,Θ),V(n+1)~p(V∣S(n+1),T(n+1),X,Θ). Note that this procedure implicitly defines a transition kernel 𝒯(·∣·). It can be shown [15] that the stationary distribution of 𝒯 is the exact posterior p(S,T,V∣X,Θ). Eventually, the Gibbs sampler converges regardless of the order that the blocks are visited, provided that each block is visited infinitely often in the limit n→∞. However, the rate of convergence is very difficult to assess as it depends upon the order of the updates as well as the starting configuration (T(0),V(0),S(0)). It is instructive to contrast above (41) with the variational update of (30)–(32): algorithmically the two approaches are quite similar. The pseudo-code is given in Algorithm 2.

<bold>Algorithm 2: </bold>Gibbs sampler for nonnegative matrix factorisation.

(1) Initialize:

T(0)=~𝒢(·;At,Bt) V(0)~𝒢(·;Av,Bv)

(2) forn=1… MAXITER do

(3) Sample Sources

(4) forτ=1⋯K,ν=1⋯Wdo

(5) pν,1:I,τ(n)=T(n−1)(ν,1:I) .*V(n−1)(1:I,τ)⊤./(T(n−1)(ν,1:I)V(n−1)(1:I,τ))

(6) S(n)(ν,1:I,τ)~ℳ(sν,1:I,τ;xν,τ,pν,1:I,τ(n))

(7) end for

Σt(n)=∑τSν,i,τ(n) Σv(n)=∑νSν,i,τ(n)

(8) Sample Templates

αt(n)=At+Σt(n) βt(n)=1./(At./Bt+1W(V(n−1)1K)⊤)

T(n)~𝒢(T;αt(n),βt(n))

(9) Sample Excitations

αv(n)=Av+Σv(n) βv(n)=1./(Av./Bv+(1W⊤T(n−1))⊤1K⊤)

V(n)~𝒢(V;αv(n),βv(n))

(10) end for

3.4.1. Marginal Likelihood Estimation with Chib's Method

The marginal likelihood can be estimated from the samples generated by the Gibbs sampler using a method proposed by Chib [17]. Suppose we have run the block Gibbs sampler until convergence and have N samples as follows:(42){T(n)}n=1 : N, {V(n)}n=1 : N, {S(n)}n=1 : N.The marginal likelihood is (omitting hyperparameters Θ)(43)p(X)=p(V,T,S,X)p(V,T,S∣X).This equation holds for all points (V,T,S). We choose a point in the configuration space; provided that the distribution is unimodal, a good candidate is a configuration near the mode (T˜,V˜,S˜). The numerator in (43) is easy to evaluate. The denominator is(44)p(V˜,T˜,S˜∣X)=p(V˜∣T˜,S˜,X)p(T˜∣S˜,X)p(S˜∣X)=p(V˜∣T˜,S˜)p(T˜∣S˜)p(S˜∣X).The first term is full conditional, so it is available for the Gibbs sampler. The third term is(45)p(S˜∣X)=∫dV dT p(S˜∣V,T,X)p(V,T∣X)≈1N∑n=1Np(S˜∣V(n),T(n),X).The second term is trickier(46)p(T˜∣S˜)=∫dV p(T˜∣V,S˜)p(V∣S˜).The first term here is full conditional. However, the original Gibbs run gives us only samples from p(V∣X), not p(V∣S˜). The idea is to run the Gibbs sampler for M further iterations where we sample from (VS˜(m),TS˜(m))~p(V,T∣S=S˜), that is, with S clamped at S˜. The resulting estimate is(47)p(T˜∣S˜)≈1M∑m=1Mp(T˜∣VS˜(m),S˜). Chib's method estimates the marginal likelihood as follows:(48)log p(X∣Θ)=log p(V˜,T˜,S˜,X∣Θ)−log p(V˜,T˜,S˜∣X,Θ)≈log p(V˜,T˜,S˜,X∣Θ)−log p(V˜∣T˜,S˜,Θ) −log ∑m=1Mp(T˜∣VS˜(m),S˜,Θ) −log∑n=1Np(S˜∣V(n),T(n),X,Θ)+log(MN).

4. Simulations

Our goal is to illustrate our approach in a model selection context. We first illustrate that the variational approximation to the marginal likelihood is close to the one obtained from the Gibbs sampler via Chib's method. Then, we compare the quality of solutions we obtain via Variational NMF and compare them to the original NMF on a prediction task. Finally, we focus on reconstruction quality in the overcomplete case where the standard NMF is subject to overfitting.

Model Order Determination

To test our approach, we generate synthetic data from the hierarchical model in (21) with W=16, K=10, and the number of sources Itrue=5. The inference task is to find the correct number of sources, given X. The hyperparameters of the true model are set to aν,it=at=10, bν,it=bt=1, ai,τv=av=1, and bi,τv=bv=100. In the first experiment, the hyperparameters are assumed to be known and in the second are jointly estimated from data, using hyperparameter adaptation. We evaluate the marginal likelihood for models with the number of templates I=1⋯10, with the Gibbs sampler using Chib's method and variational lower bound ℬ via variational Bayes. We run the Gibbs sampler for MAXITER=10 000 steps following a burn-in period of 5000 steps; then we clamp the sources S and continue the simulation for a further 10 000 steps to estimate quantities required by Chib's method. We run the variational algorithm until convergence of the bound or 10 000 iterations, whichever occurs first. In Figure 3(a), we show a comparison of the variational estimate with the average of 5 independent runs obtained via Chib's method. We observe, that both methods give consistent results. In Figure 4, we show the lower bound as a function of model order I, where for each I, the bound is optimised independently by jointly optimising hyperparameters at,bt,av, and bv using the equations derived in the appendix. We observe, that the correct model order can be inferred even when the hyperparameters are unknown a priori. This is potentially useful for estimation of model order from real data.

Model selection results. (a) Comparison of model selection by variational bound (squares) and marginal likelihood estimated by Chib's (circles) method. The hyperparameters are assumed to be known. (b) Box-plot of marginal likelihood estimated by Chib's method using 5000, 10 000, and 10 000 iterations for burn-in, free, and clamped sampling. The boxes show the lower quartile, median, and upper quartile values. (c) Model selection by variational bound when hyperparameters are unknown and jointly estimated.

(a)(b)(c)

Model selection using variational bound with adapted hyperparameters on face data 16×16 with I*=27 (a) and 32×32 with I*=42 (b).

(a)(b)

As real data, we use a version of the Olivetti face image database (K=400 images of 64×64 pixels available at http://www.cs.toronto.edu/~roweis/data/olivettifaces.mat). We further downsampled to 16×16 or 32×32 pixels, hence our data matrix X is 162×400 or 322×400. We use a model with tied hyperparameters as aν,it=at, bν,it=bt, ai,τv=av, and bi,τv=bv, where all hyperparameters are jointly estimated. In Figure 4, bottom, we show results of model order determination for this dataset with joint hyperparameter adaptation. Here, we run the variational algorithm for each model order I=1⋯100 independently and evaluate the lower bound after optimising the hyperparameters. The Gibbs sampler is not found practical and is omitted here. The lower bound behaves as is expected from marginal likelihood, reflecting the tradeoff between too many and too few templates. Higher resolution implies more templates, consistent with our intuition that detail requires more templates for accurate representation.

We also investigate the nature of the representations (see Figure 5). Here, for each independent run, we fix the values of shape parameters to (at,av)=[(10,10),(0.1,0.1),(10,0.2),(10,0.5)] and only estimate bt and bv. This corresponds to enforcing sparse or nonsparse t and v. Each column shows I=36 templates estimated from the dataset conditioned on hyperparameters. The middle image is the same template image above weighted with the excitations corresponding to the reconstruction (the expected value of the predictive distribution) below. Here, we clearly see the effect of the hyperparameters. In the first condition (at,av)=(10,10), the prior does not enforces sparsity to the templates and excitations. Hence, for the representation of a given image, there are many active templates. In the second condition, we try to force both matrices to be sparse with (at,av)=(0.1,0.1). Here, the result is not satisfactory as isolated components of the templates are zeroed, giving a representation that looks like one contaminated by “salt-and-pepper” noise. The third condition ((at,av)=(10,0.2)) forces only the excitations to be sparse. Here, we observe that the templates correspond to some average face images. Qualitatively, each image is reconstructed using a superposition of a few of these templates. In the final representation, we enforce sparsity in the templates but not in the excitations. Here, our estimate finds templates that correspond to parts of individual face images (eyebrows, lips, etc.). This solution, intuitively corresponding to a parsimonious representation, also is the best in terms of the marginal likelihood. With proper initialisation, our variational procedure is able to find such solutions.

Figure 5

Templates, excitations for a particular example, and the reconstructions obtained for different hyperparameter settings. B is the lower bound for the whole dataset.

Prediction

We now compare variational Bayesian NMF with the maximum likelihood NMF on a missing data prediction task.

To illustrate the self regularisation effect, we set up an experiment in which we select a subset of the face data consisting of 50 images. From half of the images, we remove the same patch (Figure 6) and predict the missing pixels. This is a rather small dataset for this task, as we have only 10 images for each of the 5 different persons, and half of these images have missing data at the same spot. We measure the quality of the prediction in terms of signal-to-noise ratio (SNR). The missing values are reconstructed using the mean of the predictive distribution Xpred≡〈X〉𝒫𝒪(X;T*V*)=T*V*, where T* and V* are point estimates of the template and excitation matrix. We compare our variational algorithm with the classical NMF. For each algorithm, we test two different versions. The variational algorithms differ in how we estimate T* and V*. In the first variational algorithm, we use a crude estimate of T* and V* as the mean of the approximating q distribution. In the second condition, after convergence of hyperparameters via VB, we reinitialise T and V randomly and switch to an ICM algorithm (see (38). This strategy finds a local mode (T*,V*) of the exact posterior distribution. In NMF, we test two initialisation strategies: in the first condition, we initialise the templates randomly; in the second, we set them equal to the images in the dataset with random perturbations.

Results of a typical run. (a) Example images from the dataset. (b) Comparison of the reconstruction accuracy of different methods in terms of SNR (in dB), organised according to the sparseness of the solution. (c) (from left to right). The ground truth, data with missing pixels. The reconstructions of VB, VB + ICM, and ML-NMF with two initialisation strategies (1 = random, 2 = to image).

(a)(b)(c)

In Figure 6, we show the reconstruction results for a typical run, for a model with I=100 templates. Note that this is an overcomplete model, with twice as many templates as there are images. To characterise the nature of the estimated template and excitation matrices, we use the sparseness criteria [18] of an m×n matrix X, defined as Sparseness (X)=(mn−(∑i,j|Xi,j|)/(∑i,jXi,j2)1/2)/(mn−1). This measure is 1 when the matrix X has only a single nonzero entry and 0 when all entries are equal. We see that the variational algorithms are superior in this case in terms of SNR as well as the visual quality of the reconstruction. This is perhaps not surprising, since with maximum likelihood estimation; if the model order is not carefully chosen, generalisation performance is poor: the “noise” in the observed data is fitted but the prediction quality drops on new data. An interesting observation is that highly sparse solutions (either in templates or excitations) do not give the best result; the solution that balances both seems to be the best in this setting. This example illustrates that sparseness in itself may not be necessarily a good criteria to optimise; for model selection, the marginal likelihood should be used as the natural quantity.

On the same face dataset, we compare the prediction error in terms of the SNR for varying model order I. Our goal is to compare the prediction performance of the full Bayesian approach with the ML-NMF for a range of conditions (under-complete, complete, and overcomplete). The results shown in Figure 7 are averages of several runs with hyperparameter adaptation and different hyperparameter tying. In the simulations, the shape parameters are tied always as ai,τv=av (and aν,it=at). The scale parameters are untied or tied as (bτv,bνt) (across sources) or biv,bit (different for each source) and jointly optimised. Regardless of the hyperparameter tying structure, the results were quite similar. The best SNR values are attained with untied scale parameters for both excitations and templates.

Figure 7

Average SNR results for model orders I=1,25,50, 75,100 covering undercomplete, complete, and overcomplete cases. Comparison of VB, VB + ICM, and ML-NMF with two initialisation strategies (1 = random, 2 = to image).

We observe that, due to the implicit self-regularisation in the Bayesian approach, the prediction performance is not very sensitive to the model order and is immune to overfitting. In contrast, the ML-NMF with random initialisation is prone to overfitting, and prediction performance drops with increasing model order. Interestingly, when we initialise the ML-NMF algorithm to true data points with small perturbations, the prediction performance in terms of SNR improves. Note that this strategy would not be possible for data where the pixels were truly missing. However, visual inspection shows that the interpolation can still be “patchy” (see Figure 6).

We observe that hyperparameter adaptation is crucial for obtaining good prediction performance. In our simulations, results for VB without hyperparameter adaptation were occasionally poorer than the ML estimates. Good initialisation of the shape hyperparameters seems to be also important. We obtain best results when initialising the shape hyperparameters asymmetrically, for example, av<1 and at>10 (see 3rd and 4th panels from left in Figure 5). When the shape hyper-parameters are initialised to small av,at≪1, the EM seems to get stuck in a local minima more often. Consequently, the prediction results are poorer. We have also carried out tests with more undercomplete representations when the model order is low I<10. For these simulations, while the marginal likelihood was in favour of the VB solutions, we have not observed statistically significant differences between VB and ML in terms of SNR. The SNR improvement of VB over ML was on average about 0.1 dB only.

5. Discussion and Conclusions

In this paper, we have investigated KL-NMF from a statistical perspective. We have shown that KL minimisation formulation the original algorithm can be derived from a probabilistic model where the observations are superposition of I independent Poisson-distributed latent sources. Here, the template and excitation matrices turn out to be latent intensity parameters. The interpretation of NMF as a maximum likelihood method in a Poisson model is mentioned in the original NMF paper [1] and discussed in more detail by [5, 10], and [5] focuses on the optimisation and gives an equivalent description using auxiliary function maximisation. In contrast, [10] illustrates that the auxiliary variables can be viewed as model variables (the sources s) that are analytically integrated out. The relationship between KL divergence and the Poisson distribution is not just a lucky coincidence. There exists a duality between divergence functions and exponential family distributions. If a cost function is a Bregman divergence, there exists a regular exponential family where minimising the cost corresponds to maximum likelihood parameter estimation [19]; also see [12] it in the context of matrix factorisation models.

The novel observation in the current article is the exact characterisation of the approximating distribution q(S) or full conditionals p(S∣T,V,X) as a product of multinomial distributions, leading to a richer approximation distribution than a naive mean field or single site Gibbs (which would freeze due to deterministic p(X∣S)). This conjugate form leads to significant simplifications in full Bayesian integration. Apart from the conditionally Gaussian case, NMF with KL objective seems to be unique in this respect. For several other distance metrics D(·∥·), we find that full Bayesian inference not as practical as p(S∣T,V,X) is not standard.

We have also shown that the standard KL-NMF algorithm with multiplicative update rules is in fact an EM algorithm with data augmentation. Extending upon this observation, we have developed an hierarchical model with conjugate Gamma priors. We have developed a variational Bayes algorithm and a Gibbs sampler for inference in this hierarchical model. We have also developed methods for estimating the marginal likelihood for model selection. This is an additional feature that is lacking in existing NMF approaches with regularisation, where only MAP estimates are obtained, such as [13, 18, 20].

Our simulations suggest that the variational bound seems to be a reasonable approximation to the marginal likelihood and can guide model selection for NMF. The computational requirements are comparable to the ML-NMF. A potentially time-consuming step in the implementation of the variational algorithm is the evaluation of the Ψ function but this step can also be replaced by a simple piecewise polynomial approximation since exp(Ψ(x))≈x−0.5 for x>5.

We first compare the variational inference with a Gibbs sampler. In our simulations, we observe that both algorithms give qualitatively very similar results, both for inference of templates and excitations as well as model order selection. We find the variational approach somewhat more practical as it can be expressed as simple matrix operations, where both the fixed point equations as well as the bound can be compactly and efficiently implemented using matrix computation software. In contrast, our Gibbs sampler is computationally more demanding, and the calculation of marginal likelihood is somewhat more tricky. With our implementation of both algorithms, the variational method is faster by a factor of around 13. Reference implementations of both algorithms in Matlab are available from the following url: http://www.cmpe.boun.edu.tr/~cemgil/bnmf/.

In terms of computational requirements, the variational procedure has several advantages. First, we circumvent sampling from multinomial variables, which is the main computational bottleneck with the Gibbs sampler. Whilst efficient algorithms are developed for multinomial sampling [21], the procedure is time consuming when the number of latent sources I is large. In contrast, the variational method estimates the expected sufficient statistics directly by elementary matrix operations. Another advantage is hyperparameter estimation. In principle, it is possible to maximise the marginal likelihood via a Monte Carlo EM procedure [22, 23]; yet this potentially requires many more iterations. In contrast, the evaluation of the derivatives of the lower bound is straightforward and can be implemented without much additional computational cost.

The efficiency of the Gibbs sampler could be improved by working out the distribution of the sufficient statistics of sources directly (namely, quantities ∑τsν,i,τ or ∑νsν,i,τ) to circumvent multinomial sampling. Unfortunately, for the sum of binomial random variables with different cell probability parameters, the sum does not have a simple form but various approximations are possible [24].

Inference based on VB is easy to implement but at the end of the day, the fixed point iteration is just a gradient-based lower bound optimisation procedure, and second order Newton methods can provide more efficient alternatives. For NMF models, there exist many conditional independence relations, hence the Hessian matrix has a special block structure [12]. It is certainly interesting to develop efficient inference methods that make use of the special block structure of the Hessian matrix. However, as our primary goal was a practical full Bayesian treatment, we have not investigated this path yet. Another approach in this direction is using alternative deterministic integration techniques such as expectation propagation (EP) [25]. Those techniques work directly on an approximation of the true marginal likelihood rather than a bound. A related approach known as expectation consistent (EC) inference is used with success in related source separation problems [26].

From a modelling perspective, our hierarchical model provides some attractive properties. It is easy to incorporate prior knowledge about individual latent sources via hyperparameters, and one can easily capture variability in the templates and excitations that is potentially useful for developing robust techniques. The prior structure here is qualitatively similar to an entropic prior [20, 27], and we find qualitatively similar representations to the ones found by NMF reported earlier by [1, 18]. However, none of the above mentioned methods provide an estimate of the marginal likelihood, which is useful for model selection. Our generative model formulation can be extended in various ways to suit the specific needs of particular applications. For example, one can enforce more structured prior models such as chains or fields [10]. As a second possibility, the Poisson observation model can be replaced with other models such as clipped Gaussian, Gamma, or Gaussians which lead to alternative source separation algorithms. For example, the case of Gaussian sources where the excitations and templates correspond to the variances is discussed in [28].

Our main contribution here is the development of a principled and practical way to estimate both the optimal sparsity criteria and model order, in terms of marginal likelihood. By maximising the bound on marginal likelihood, we have a method where all the hyperparameters can be estimated from data, and the appropriate sparseness criteria is found automatically. We believe that our approach provides a practical improvement to the highly popular KL-NMF algorithm without incurring much additional computational cost.

AppendixA. Standard Distributions in Exponential Form, Their Sufficient Statistics and Entropies

(i)

Gamma (A.1)𝒢(λ;a,b)≡exp(+(a−1)logλ−1bλ−logΓ(a)−alogb),〈λ〉𝒢=ab 〈log λ〉𝒢=Ψ(a)+log(b),H𝒢[λ]≡−〈log 𝒢〉𝒢=−(a−1)Ψ(a)+log b+a+log Γ(a).Here, Ψ denotes the digamma function defined as Ψ(a)≡dlogΓ(a)/da.

(ii)

Poisson (A.2)𝒫𝒪(s;λ)=exp(s log λ−λ−log Γ(s+1)),〈s〉𝒫𝒪=λ.

(iii)

Multinomial (A.3)ℳ(s;x,p)=δ(x−∑isi)exp(log Γ(x+1)+∑i=1I(si log pi−log Γ(si+1))),〈si〉ℳ=xpi. Here, s={s1,s2,…,sI}, p={p1,p2,…,pI}, and p1+p2+⋯+pI=1. Here, pi, i=1⋯I are the cell probabilities, and x is the index parameter where s1+s2+⋯+sI=x. The entropy is given as follows:(A.4)Hℳ[sν,1:I,τ]=−log Γ(xν,τ+1)−∑i=1I〈sν,i,τ〉log pν,i,τ +∑i=1I〈log Γ(sν,i,τ+1)〉−〈log δ(xν,τ−∑isν,i,τ)〉.

A closed form expression for the entropy is not known due to 〈logΓ(s+1)〉 terms but asymptotic expansions exist [29, 30]. Computationally efficient sampling from a multinomial distribution is not trivial; see [21] for a comparison of various methods and detailed discussion of tradeoffs.

A.1. Summary of the Generative Model

We have the following. Indices:

i=1⋯I, source index;

ν=1⋯W, Row (frequency bin) index;

τ=1⋯K, Column (time frame) index;

tν,i: template variable at νth row of the ith source(A.5)tν,i~𝒢(tν,i;aν,it,bν,itaν,it);vi,τ: excitation variable of the ith source at τth column(A.6)vi,τ~𝒢(vi,τ;ai,τv,bi,τvai,τv);sν,i,τ: source variable of ith source at νth row (frequency bin) and τth column (time frame)(A.7)sν,i,τ~𝒫𝒪(sν,i,τ;tν,ivi,τ);xν,τ: observation at νth row (frequency bin) and τth column (time frame)(A.8)xν,τ~∑isν,i,τ.

A.2. Expression of the Full Joint Distribution

Here, ϕ≡p(X,S,T,V∣Θ)=p(X∣S)p(S∣T, V)p(V∣Θv)p(V∣Θt),(A.9)log ϕ=∑ν ∑i ∑τ(−tν,ivi,τ+sν,i,τ log(tν,ivi,τ)−log Γ(sν,i,τ+1)) +∑ν∑τlog δ(xν,τ−∑isν,i,τ) +∑ν ∑i+(aν,it−1)log tν,i−aν,itbν,ittν,i−log Γ(aν,it) −aν,it log (bν,itaν,it) +∑τ ∑i+(ai,τv−1)log vi,τ−ai,τvbi,τvvi,τ−log Γ(ai,τv) −ai,τv log(bi,τvai,τv).

A.3. The Variational Bound

The variational bound in (27) can be written as(A.10)ℒX(Θ)≡log p(X∣Θ)≥〈log ϕ〉q+H[q]=ℬVB,where the energy term is given by the expectation of the expression in Appendix A.2, and H[q] denotes the entropy of the variational approximation distribution q where the individual entropies are defined in Appendix A:(A.11)H[q]=−〈log q〉=∑ν ∑τHℳ[sν,1:I,τ]+∑ν ∑iH𝒢[tν,i]+∑i ∑τH𝒢[vi,τ].One potential problem is that this expression requires the entropy of a multinomial distribution for which there is no known simple expression. This is due to terms of form 〈logΓ(s+1)〉 where only asymptotic expansions are known. Fortunately, the difficult terms in the energy term can be canceled by the corresponding terms in the entropy term, and one obtains the following expression that only depends on known sufficient statistics:(A.12)ℬ=−∑ν ∑τ ∑i〈tν,i〉〈vi,τ〉 +∑ν ∑i〈log tν,i〉(aν,it−1+∑τ〈sν,i,τ〉) +∑τ ∑i〈log vi,τ〉(ai,τv−1+∑ν〈sν,i,τ〉) +∑ν ∑i−aν,itbν,i〈tν,i〉−log Γ(aν,it)−aν,it log (bν,iaν,it) +∑τ ∑i−ai,τvbi,τv〈vi,τ〉−log Γ(ai,τv)−ai,τv log(bi,τvai,τv) +∑ν ∑τ(−log Γ(xν,τ+1)−∑i=1I〈sν,i,τ〉log pν,i,τ) +∑ν ∑i(−(αν,it−1)Ψ(αν,it)+log βν,it+αν,it+log Γ(αν,it)) +∑i ∑τ(−(αi,τv−1)Ψ(αi,τv)+log βi,τv+αi,τv+log Γ(αi,τv)).After some careful manipulations, the following expression is obtained where log L denotes here elementwise logarithm of matrix L:(A.13)ℬ=∑ν ∑τ(−EtEv−log Γ(X+1)) +∑ν ∑τ−X .*(((Lt .* log(Lt))Lv+Lt(Lv .* log(Lv))) ./(LtLv)−log(LtLv)) +∑ν ∑i−(At./Bt) .*Et−log Γ(At)+At .*log(At./Bt) +∑ν ∑iαt .*(log βt+1)+log Γ(αt) +∑i ∑τ−(Av./Bv) .*Ev−log Γ(Av)+Av .*log(Av./Bv) +∑i ∑ταv .*(log βv+1)+log Γ(αv).

A.4. Handling Missing Data and Map Estimation

When there is missing data, that is, when some of the xν,τ are not observed, computation is still straightforward in our framework and can be accomplished by a simple modification to the original algorithm. We first define a mask matrix M={mν,τ}, same size as X, where(A.14)mν,τ={0,xν,τ is missing,1,otherwise.Using the mask variables, the observation model with missing data can be written as follows:(A.15)p(X∣S)p(S∣T,V) =∏ν,τ(p(xν,τ∣sν,1:I,τ)p(sν,1:I,τ∣tν,1:I,v1:I,τ))mν,τ.The prior is not affected. Hence, we merely replace the first two lines of the expression for the full joint distribution (given in the Appendix A.2) as follows:(A.16)log ϕ=∑ν∑τmν,τ∑i(−tν,ivi,τ+sν,i,τ log(tν,ivi,τ) −log Γ(sν,i,τ+1)) +∑ν∑τmν,τ log δ(xν,τ−∑isν,i,τ)+⋯.Consequently, it is easy to see that(A.17)q(vi,τ)∝exp((ai,τv+∑νmν,τ〈sν,i,τ〉−1)logvi,τ −(ai,τvbi,τv+∑νmν,τ〈tν,i〉)vi,τ)∝𝒢(vi,τ;αi,τv,βi,τv),αi,τv≡ai,τv+∑νmν,τ〈sν,i,τ〉,βi,τv≡(ai,τvbi,τv+∑νmν,τ〈tν,i〉)−1.By a derivation analogous to one detailed in Section 3.3, we see that the excitation update equations in Algorithm 1, line 4, can be written using matrix notation as follows:(A.18)Σv=Lv .*(Lt⊤((M .*X)./(LtLv)))αv=Av+Σv, βv=1./(Av./Bv+Et⊤M).The update rules for the templates are similar. Note that when there is no missing data, we have M=1W1K⊤ which gives the original algorithm. The bound in (A.13) can also be easily modified for handling missing data. We merely replace X←M .*X and the first term EtEv←M .*EtEv.

We conclude this subsection by noting that the standard NMF update equations, given in (20), can be also rewritten to handle missing data as follows:(A.19)V(n+1)=V(n) .*(T⊤((M .*X)./(TV)))./(T⊤M),T(n+1)=T(n) .*(((M .*X)./(TV))V⊤)./(MV⊤).Here, the denominator has to be nonzero. Similarly, an iterative conditional modes (ICM) algorithm can be derived to compute the maximum a posteriori (MAP) solution as follows:(A.20)V(n+1)=(Av+V(n) .*(T⊤((M .*X)./(TV)))) ./(Av./Bv+T⊤M),T(n+1)=(At+T(n) .*(((M .*X)./(TV))V⊤)) ./(At./Bt+MV⊤).Note that when the shape parameters go to zero, that is, At,Av→0, we obtain the maximum likelihood NMF algorithm.

A.5. Hyperparameter Optimisation

The hyperparameters Θ=(Θt,Θv) can be estimated by maximising the bound in 13. Below, we will derive the results for the excitations; the results for templates are similar. The solution for shape parameters involves finding the zero of a function f(a)−c, where(A.21)f(a)=log a−Ψ(a)+1,a*=f−1(c).The solution can be found by Newton's method by iteration of the following fixed point equation:(A.22)a(n+1)=a(n)−f(a(n))−cf′(a(n))=a(n)−log(a(n))−Ψ(a(n))+1−c1/a(n)−Ψ′(a(n))=a(n)−Δ(n).It is well known that Newton iterations can diverge if started away from the root. Occasionally, we observe that a can become negative. If this is the case, we set Δ(n)←Δ(n)/2, and try again. The digamma Ψ function and its derivative Ψ′ are available in numeric computation libraries (e.g., in Matlab as psi(0,a) and psi(1,a), resp.).

The derivation of the hyperparameter update equations is straightforward:(A.23)∂ℬ∂ai,τv=〈log vi,τ〉−1bi,τv〈vi,τ〉−Ψ(ai,τv) −log bi,τv+log ai,τv+1=0,ci,τ=log ai,τv−Ψ(ai,τv)+1,ci,τ≡〈vi,τ〉bi,τv−(〈log vi,τ〉−log bi,τv),∂ℬ∂bi,τv=ai,τv(bi,τv)2〈vi,τ〉−ai,τv1bi,τv=0,bi,τv=〈vi,τ〉. Tying parameters across τ as aiv=ai,τv and biv=bi,τv yields(A.24)∂ℬ∂aiv=∑τ〈logvi,τ〉−∑τ1bi,τv〈vi,τ〉−KΨ(aiv) −∑τlog bi,τv+K=0,ci=log aiv−Ψ(aiv)+1,ci=1K∑τ(〈vi,τ〉bi,τv−(〈log vi,τ〉−log bi,τv)),∂ℬ∂biv=∑τai,τv(biv)2〈vi,τ〉−1biv∑τai,τv,biv=∑τ ai,τv〈vi,τ〉∑τ ai,τv. Tying parameters across τ and i, av=ai,τv, and bv=bi,τv yields(A.25)c=log av−Ψ(av)+1,c=1KI∑τ ∑i(〈vi,τ〉bi,τv−(〈log vi,τ〉−log bi,τv)),∂ℬ∂bv=∑i ∑τai,τv(bv)2〈vi,τ〉−1biv∑i ∑τai,τv,bv=∑i ∑τ ai,τv〈vi,τ〉∑i ∑τ ai,τv. The derivation of the template parameters is exactly analogous. We can express the update equations once again in compact matrix notation as follows:(A.26)Z⟵Ev(n)./Bv(n)−log(Lv(n)./Bv(n)),C⟵{Z,Not tied,(Z1K)K,Tie columns (over τ),(1I⊤Z)I,Tie rows (over i),(1I⊤Z1K)(KI),Tie all (over τ and i),Av(n+1)⟵SolveByNewton (Av(n),C),Bv(n+1)⟵{Ev(n),Not tied,(((Av(n) .*Ev(n))1K) ./(Av(n)1K))1K⊤,Tie columns (over τ),1I((1I⊤(Av(n) .*Ev(n))) ./(1I⊤Av(n))),Tie rows (over i),1I((1I⊤(Av(n).*Ev(n))1K) ./(1I⊤Av(n)1K))1K⊤,Tie all (over τ and i). Here, we assume SolveByNewton(A0,C) is a matrix-valued function that finds root Ci,j=f(Ai,j) for each element of A, starting from the initial matrix A0. If C is a scalar or vector, it is repeated over the missing index to implement parameter tying. For example, if C is a I×1 vector and A0 is I×K, we assume Ci=ci,τ for all τ=1⋯K, and the output is the same size as A0. This is only a notational convenience; an actual implementation can be achieved more efficiently. Again, the implementation of the template parameters is exactly analogous; merely replace above the subscripts as v←t, (i,τ)←(ν,i) and (I,K)←(W,I).

Acknowledgments

The author would like to thank Nick Whiteley, Tuomas Virtanen, and Paul Peeling for fruitful discussion and for their comments on earlier drafts of this paper. This research is funded by the Engineering and Physical Sciences Research Council (EPSRC) under the grant EP/D03261X/1 and by the Turkish Science, Technology Research Council grant TUBITAK 107E021 and Boğaziçi University Research Fund BAP 09A105P. The research is carried out while the author was with the Signal Processing and Comms. Lab, Department of Engineering, University of Cambridge, UK and Department of Computer Engineering, Boğaziçi University, Turkey.

Lee

D. D.

Seung

H. S.

Learning the parts of objects with nonnegative matrix factorization

Nature1999401788791

10.1038/44565

Paatero

Tapper

Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values

Environmetrics199452111126

10.1002/env.3170050203

Bishop

C. M.

Pattern Recognition and Machine Learning2006

New York, NY, USA

Springer

Virtanen

Sound source separation in monaural music signals, Ph.D. thesis2006November

Tampere, Finland

Tampere University of Technology

Kameoka

Statistical approach to multipitch analysis, Ph.D. thesis2007

Tokyo, Japan

University of Tokyo

Cichocki

a.cichocki@riken.jpMørup

morten.morup@gmail.comSmaragdis

paris@media.mit.eduWang

w.wang@surrey.ac.ukZdunek

rafal.zdunek@pwr.wroc.pl

Advances in nonnegative matrix and tensor factorization

Computational Intelligence and Neuroscience200820083

852187

10.1155/2008/852187

Kingman

J. F. C.

Poisson Processes1993

Oxford, UK

Oxford Science

Lee

D. D.

Seung

H. S.

Algorithms for non-negative matrix factorization

Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS '00)

October-November 2000

Denver, Colo, USA

556562

Le Roux

Kameoka

Ono

Sagayama

On the relation between divergence-based minimization and maximum-likelihood estimation for the i-divergence

Personal Communication, 2008

Virtanen

Cemgil

A. T.

Godsill

Bayesian extensions to non-negative matrix factorisation for audio signal modelling

Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08)

March-April 2008

Las Vegas, Nev, USA

18251828

10.1109/ICASSP.2008.4517987

Gaussier

Goutte

Relation between PLSA and NMF and implication

Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

August 2005

Salvador, Brazil

601602

Singh

A. P.

ajit@cs.cmu.eduGordon

G. J.

ggordon@cs.cmu.edu

A unified view of matrix factorization models

5212

Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD '08)

September 2008

Antwerp, Belgium

Springer

358373Lecture Notes in Computer Science

10.1007/978-3-540-87481-2_24

Schmidt

M. N.

mns@imm.dtu.dkLaurberg

hla@es.aau.dk

Nonnegative matrix factorization with Gaussian process priors

Computational Intelligence and Neuroscience2008200810

361705

10.1155/2008/361705

Ghahramani

Beal

Propagation algorithms for variational Bayesian learning

Advances in Neural Information Processing Systems200013

Cambridge, Mass, USA

MIT Press

507513

Gilks

W. R.

Richardson

Spiegelhalter

D. J.

Markov Chain Monte Carlo in Practice1996

London, UK

CRC Press

Liu

J. S.

Monte Carlo Strategies in Scientific Computing2004

New York, NY, USA

Springer

Chib

Marginal likelihood from the gibbs output

Journal of the Acoustical Society of America19959043213131321

Hoyer

P. O.

Non-negative matrix factorization with sparseness constraints

The Journal of Machine Learning Research2004514571469

Banerjee

Merugu

Dhillon

I. S.

Ghosh

Clustering with Bregman divergences

Journal of Machine Learning Research2005617051749

Shashanka

M. V.

Raj

Smaragdis

Sparse overcomplete latent variable decomposition of counts data

Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS '07)

December 2007

Vancouver, Canada

Davis

C. S.

The computer generation of multinomial random variates

Computational Statistics and Data Analysis1993162205217

10.1016/0167-9473(93)90115-A

Tanner

M. A.

Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions19963rd

New York, NY, USA

Springer

Quintana

F. A.

quintana@mat.puc.clLiu

J. S.

del Pino

G. E.

Monte Carlo EM with importance reweighting and its applications in random effects models

Computational Statistics and Data Analysis1999294429444

10.1016/S0167-9473(98)00075-9

Butler

Stephens

The distribution of a sum of binomial random variables

1993April467

Stanford, Calif, USA

Stanford University

prepared for the Office of Naval Research

Minka

Expectation propagation for approximate Bayesian inference, Ph.D. thesis2001

Cambridge, Mass, USA

MIT Media Lab

Winther

owi@imm.dtu.dkPetersen

K. B.

Bayesian independent component analysis: variational methods and non-negative decompositions

Digital Signal Processing2007175858872

10.1016/j.dsp.2007.01.003

Buntine

Variational extensions to EM and multinomial PCA

2430

Proceedings of the 13th European Conference on Machine Learning (ECML '02)

August 2002

Helsinki, Finland

Springer

2334Lecture Notes In Computer Science

Cemgil

A. T.

atc27@eng.cam.ac.ukPeeling

php23@eng.cam.ac.ukDikmen

od225@eng.cam.ac.ukGodsill

sjg@eng.cam.ac.uk

Prior structures for time-frequency energy distributions

Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '07)

October 2007

New Paltz, NY, USA

151154

10.1109/ASPAA.2007.4393007

Flajolet

philippe.flajolet@inria.fr

Singularity analysis and asymptotics of Bernoulli sums

Theoretical Computer Science19992151-2371381

10.1016/S0304-3975(98)00220-5

Jacquet

Szpankowski

Entropy computations via analytic depoissonization

IEEE Transactions on Information Theory199945410721081

10.1109/18.761251