We describe nonnegative matrix factorisation (NMF) with a Kullback-Leibler (KL) error measure in a statistical framework, with a hierarchical generative model consisting of an observation and a prior component. Omitting the prior leads to the standard KL-NMF algorithms as special cases, where maximum likelihood parameter estimation is carried out via the Expectation-Maximisation (EM) algorithm. Starting from this view, we develop full Bayesian inference via variational Bayes or Monte Carlo. Our construction retains conjugacy and enables us to develop more powerful models while retaining attractive features of standard NMF such as monotonic convergence and easy implementation. We illustrate our approach on model order selection and image reconstruction.

In machine learning, nonnegative matrix factorisation
(NMF) was introduced by Lee and Seung [

The interpretation of NMF as a low-rank matrix approximation is sufficient for the derivation of an inference algorithm; yet this view arguably does not provide the complete picture. In this section, we describe NMF from a statistical perspective. This view will pave the way for developing extensions that facilitate more realistic and flexible modelling as well as more sophisticated inference, such as Bayesian model selection.

Our first step is the derivation of the information
divergence error measure from a maximum likelihood principle. We consider the
following hierarchical model:

The
log-likelihood of the observed data

To derive the
posterior of the latent sources, we observe that

It is good news indeed that the posterior has an analytic form; the M step can
now be calculated easily as follows:

We note that our model is valid when

The interpretation of NMF as a maximum likelihood
method in a Poisson model is mentioned in the original NMF paper [

Given the
probabilistic interpretation, it is possible to propose various hierarchical
prior structures to fit the requirements of an application. Here we will
describe a simple choice where we have a conjugate prior as follows:
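Since the prior equations are omitted above, the generative process can only be sketched; the shape/mean parametrisation of the Gamma priors and the hyperparameter names (a_t, b_t, a_v, b_v) below are illustrative assumptions.

```python
import numpy as np

def generate(F, N, K, a_t=1.0, b_t=1.0, a_v=1.0, b_v=1.0, seed=0):
    """Draw (T, V, X) from a hierarchical Gamma-Poisson model:
    t_ik ~ Gamma(shape=a_t, mean=b_t), v_kj ~ Gamma(shape=a_v, mean=b_v),
    x_ij ~ Poisson([T V]_ij), i.e. a superposition of K latent sources."""
    rng = np.random.default_rng(seed)
    # NumPy's gamma takes (shape, scale); mean b implies scale = b / shape
    T = rng.gamma(a_t, b_t / a_t, size=(F, K))
    V = rng.gamma(a_v, b_v / a_v, size=(K, N))
    X = rng.poisson(T @ V)
    return T, V, X
```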

(a) A schematic description of the NMF model with data augmentation.
(b) Graphical model with hyperparameters. Each source element

(Left) The family of
densities

To model missing data, that is, when some of the
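Although the details are elided here, a common device (and the one implied by the experiments below) is a binary mask M with m_ij = 1 for observed entries; missing cells then simply drop out of the multiplicative updates. A hedged sketch, with illustrative names:

```python
import numpy as np

def klnmf_masked(X, M, K, iters=500, eps=1e-12, seed=0):
    """KL-NMF with a 0/1 mask M (same shape as X) marking observed entries.

    Missing cells contribute neither to the data/model ratio nor to the
    normalising denominators, so they are effectively interpolated by T @ V."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    T = rng.random((F, K)) + eps
    V = rng.random((K, N)) + eps
    for _ in range(iters):
        R = M * X / (T @ V + eps)
        T *= (R @ V.T) / (M @ V.T + eps)
        R = M * X / (T @ V + eps)
        V *= (T.T @ R) / (T.T @ M + eps)
    return T, V
```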

The hierarchical model in (

Below, we describe various interesting problems that
can be cast as Bayesian inference problems. In signal analysis and feature
extraction with NMF, we may wish to calculate the posterior distribution of templates and
excitations, given data and hyperparameters

This latter quantity is particularly useful for comparing different classes of models. Unfortunately, the integrations required cannot be computed in closed form. In the sequel, we will describe the Gibbs sampler and variational Bayes as approximate inference strategies.

We sketch here
the Variational Bayes (VB) [

The
expectations of

One of the
attractive features of NMF is its easy and efficient implementation. In this
section, we show that the update equations of Section

(1) Initialise:

(2)

(3)

(4)

(5)

(6)

(7)

(8)
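Since the individual steps above are elided, the following is only a schematic sketch of one possible fixed-point implementation, assuming Gamma variational posteriors with shape/rate parameters and Gamma(shape a, mean b) priors; all variable names are illustrative.

```python
import numpy as np
from scipy.special import digamma

def vb_klnmf(X, K, a_t=1.0, b_t=1.0, a_v=1.0, b_v=1.0, iters=200, seed=0):
    """Schematic variational Bayes for the Gamma-Poisson NMF model.

    q(t_ik) = Gamma(alpha_T, rate=beta_T), q(v_kj) = Gamma(alpha_V, rate=beta_V);
    priors t ~ Gamma(a_t, rate=a_t/b_t), v ~ Gamma(a_v, rate=a_v/b_v)."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    alpha_T = a_t + rng.random((F, K)); beta_T = np.full((F, K), a_t / b_t)
    alpha_V = a_v + rng.random((K, N)); beta_V = np.full((K, N), a_v / b_v)
    for _ in range(iters):
        # exp(E[log .]) enters the multinomial E step; E[.] enters the rates
        L_T = np.exp(digamma(alpha_T)) / beta_T
        L_V = np.exp(digamma(alpha_V)) / beta_V
        # update q(T): expected sufficient statistics of the latent sources
        R = X / (L_T @ L_V + 1e-12)
        alpha_T = a_t + L_T * (R @ L_V.T)
        beta_T = a_t / b_t + (alpha_V / beta_V).sum(axis=1)
        L_T = np.exp(digamma(alpha_T)) / beta_T
        # update q(V)
        R = X / (L_T @ L_V + 1e-12)
        alpha_V = a_v + L_V * (L_T.T @ R)
        beta_V = a_v / b_v + (alpha_T / beta_T).sum(axis=0)[:, None]
    return alpha_T / beta_T, alpha_V / beta_V  # posterior means
```

The quantities exp(E[log t]) play the role that the point estimates play in standard NMF, which is why the whole iteration reduces to a handful of matrix operations.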

Similarly, an iterative conditional modes (ICM)
algorithm can be derived to compute the maximum a posteriori (MAP) solution
(see Appendix

Monte Carlo
methods [

Markov chain Monte Carlo (MCMC) techniques
generate successive samples from a Markov chain defined by a

(1) Initialise:

(2)

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)
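With the step bodies elided above, the sampler can only be sketched; the version below alternates between a multinomial data-augmentation step and Gamma full conditionals, under the same assumed Gamma(shape a, mean b) priors as before, with illustrative names throughout.

```python
import numpy as np

def gibbs_klnmf(X, K, a_t=1.0, b_t=1.0, a_v=1.0, b_v=1.0, n_samples=100, seed=0):
    """Schematic Gibbs sampler for the Gamma-Poisson NMF model.

    (i) Each count x_ij is split over the K components with a multinomial
    whose cell probabilities are proportional to t_ik v_kj;
    (ii) T and V are then drawn from their conjugate Gamma full conditionals."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    T = rng.gamma(a_t, b_t / a_t, (F, K))
    V = rng.gamma(a_v, b_v / a_v, (K, N))
    samples = []
    for _ in range(n_samples):
        S_T = np.zeros((F, K))  # sum over j of s_ikj
        S_V = np.zeros((K, N))  # sum over i of s_ikj
        for i in range(F):
            for j in range(N):
                p = T[i] * V[:, j]
                s = rng.multinomial(X[i, j], p / p.sum())
                S_T[i] += s
                S_V[:, j] += s
        # Gamma full conditionals (numpy uses shape, scale = 1/rate)
        T = rng.gamma(a_t + S_T, 1.0 / (a_t / b_t + V.sum(axis=1)))
        V = rng.gamma(a_v + S_V, 1.0 / (a_v / b_v + T.sum(axis=0)[:, None]))
        samples.append((T, V))
    return samples
```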

The marginal
likelihood can be estimated from the samples generated by the Gibbs sampler
using a method proposed by Chib [

Our goal is to illustrate our approach in a model selection context. We first show that the variational approximation to the marginal likelihood is close to the estimate obtained from the Gibbs sampler via Chib's method. Then, we compare the quality of the solutions obtained via variational NMF to those of the original NMF on a prediction task. Finally, we focus on reconstruction quality in the overcomplete case, where the standard NMF is prone to overfitting.

To test our approach, we generate synthetic
data from the hierarchical model in (

Model selection results. (a) Comparison of
model selection by the variational bound (squares) and the marginal likelihood
estimated by Chib's method (circles). The hyperparameters are assumed to be
known. (b) Box plot of the marginal likelihood estimated by Chib's method
using

Model
selection using variational bound with adapted hyperparameters on face data

As real data, we use a version of the Olivetti face
image database (

We also investigate the nature of the representations
(see Figure

Templates, excitations
for a particular example, and the reconstructions obtained for different
hyperparameter settings.

We now compare variational Bayesian NMF with the maximum likelihood NMF on a missing data prediction task.

To illustrate the self-regularisation effect, we set
up an experiment in which we select a subset of the face data consisting of

Results of a typical run. (a) Example images from the dataset. (b) Comparison of the reconstruction accuracy of the different methods in terms of SNR (in dB), organised according to the sparseness of the solution. (c) From left to right: the ground truth, the data with missing pixels, and the reconstructions of VB, VB + ICM, and ML-NMF with two initialisation strategies (1 = random, 2 = initialised to the image).

In Figure

On the same face dataset, we compare the prediction
error in terms of the SNR for varying model order
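For reference, a reconstruction SNR in dB can be computed as below; the exact normalisation used in the reported figures is not spelled out here, so this is the generic definition (signal power over error power).

```python
import numpy as np

def snr_db(x_true, x_hat):
    """SNR in dB: 10 * log10(sum(x^2) / sum((x - x_hat)^2))."""
    x = np.asarray(x_true, float)
    err = x - np.asarray(x_hat, float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(err ** 2))
```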

Average SNR results for
model orders

We observe that, due to the implicit
self-regularisation in the Bayesian approach, the prediction performance is not
very sensitive to the model order and is immune to overfitting. In contrast,
the ML-NMF with random initialisation is prone to overfitting, and prediction
performance drops with increasing model order. Interestingly, when we
initialise the ML-NMF algorithm to true data points with small perturbations,
the prediction performance in terms of SNR improves. Note that this strategy
would not be possible for data where the pixels were truly missing. However,
visual inspection shows that the interpolation can still be “patchy” (see
Figure

We observe that hyperparameter adaptation is crucial
for obtaining good prediction performance. In our simulations, results for VB
without hyperparameter adaptation were occasionally poorer than the ML
estimates. Good initialisation of the shape hyperparameters also seems to be
important. We obtain the best results when initialising the shape hyperparameters
asymmetrically, for example,

In this paper, we have investigated KL-NMF from a
statistical perspective. We have shown that, in its KL minimisation
formulation, the original algorithm can be derived from a probabilistic model
where the observations are a superposition of

The novel observation in the current article is the
exact characterisation of the approximating distribution

We have also shown that the standard KL-NMF algorithm
with multiplicative update rules is in fact an EM algorithm with data
augmentation. Building upon this observation, we have developed a
hierarchical model with conjugate Gamma priors. We have developed a variational
Bayes algorithm and a Gibbs sampler for inference in this hierarchical model.
We have also developed methods for estimating the marginal likelihood for model
selection. This is an additional feature that is lacking in existing NMF
approaches with regularisation, where only MAP estimates are obtained, such as
[

Our simulations suggest that the variational bound
is a reasonable approximation to the marginal likelihood and can guide
model selection for NMF. The computational requirements are comparable to the
ML-NMF. A potentially time-consuming step in the implementation of the
variational algorithm is the evaluation of the

We first compare the variational inference with a
Gibbs sampler. In our simulations, we observe that both algorithms give
qualitatively very similar results, both for inference of templates and
excitations as well as model order selection. We find the variational approach
somewhat more practical as it can be expressed as simple matrix operations,
where both the fixed point equations as well as the bound can be compactly and
efficiently implemented using matrix computation software. In contrast, our
Gibbs sampler is computationally more demanding, and the calculation of the
marginal likelihood is somewhat trickier. With our implementation of both
algorithms, the variational method is faster by a factor of around

In terms of computational requirements, the
variational procedure has several advantages. First, we circumvent sampling
from multinomial variables, which is the main computational bottleneck with the
Gibbs sampler. Whilst efficient algorithms have been developed for multinomial
sampling [

The efficiency of the Gibbs sampler could be improved
by working out the distribution of the sufficient statistics of sources
directly (namely, quantities

Inference based on VB is easy to implement, but at the
end of the day the fixed-point iteration is just a gradient-based lower-bound
optimisation procedure, and second-order Newton methods can provide more
efficient alternatives. For NMF models, there exist many conditional
independence relations, hence the Hessian matrix has a special block structure
[

From a modelling perspective, our hierarchical model
provides some attractive properties. It is easy to incorporate prior knowledge
about individual latent sources via hyperparameters, and one can easily capture
variability in the templates and excitations that is potentially useful for
developing robust techniques. The prior structure here is qualitatively similar
to an entropic prior [

Our main contribution here is the development of a principled and practical way to estimate both the optimal sparsity criterion and the model order, in terms of the marginal likelihood. By maximising the bound on the marginal likelihood, we obtain a method in which all the hyperparameters can be estimated from data, and the appropriate sparseness criterion is found automatically. We believe that our approach provides a practical improvement to the highly popular KL-NMF algorithm without incurring much additional computational cost.

Gamma

Poisson

Multinomial

We have the
following. Indices:

Here,

The variational
bound in (

When there is
missing data, that is, when some of the

We conclude this subsection by noting that the
standard NMF update equations, given in (

The
hyperparameters

The derivation of the hyperparameter update equations
is straightforward:

The author would like to thank Nick Whiteley, Tuomas Virtanen, and Paul Peeling for fruitful discussions and for their comments on earlier drafts of this paper. This research is funded by the Engineering and Physical Sciences Research Council (EPSRC) under grant EP/D03261X/1, by the Turkish Science and Technology Research Council grant TUBITAK 107E021, and by the Boğaziçi University Research Fund BAP 09A105P. The research was carried out while the author was with the Signal Processing and Communications Laboratory, Department of Engineering, University of Cambridge, UK, and the Department of Computer Engineering, Boğaziçi University, Turkey.