doi:10.1155/2007/98086 Review Article Putting Markov Chains Back into Markov Chain Monte Carlo

Markov chain theory plays an important role in statistical inference both in the formulation 
of models for data and in the construction of efficient algorithms for inference. The use of Markov chains 
in modeling data has a long history, however the use of Markov chain theory in developing algorithms for 
statistical inference has only become popular recently. Using mark-recapture models as an illustration, 
we show how Markov chains can be used for developing demographic models and also in developing 
efficient algorithms for inference. We anticipate that a major area of future research involving mark-recapture 
data will be the development of hierarchical models that lead to better demographic models that account 
for all uncertainties in the analysis. A key issue is determining when the chains produced by Markov 
chain Monte Carlo sampling have converged.


Introduction
Markov chains, and related stochastic models, have long played an important role in helping ecologists understand population dynamics.In the main, this has been through the application of probability models to the problem of predicting the realized dynamics of plant and animal populations over time.In this context, the challenge is to construct models that are relatively simple, in terms of the numbers of parameters and the relationships between (and among) parameters and state variables, and yet that are able to provide a reasonable approximation to the sorts of dynamics typically observed in nature.
The development of these models has been driven by ecologists motivated by an interest in underlying theory, and ecologists with applications in mind.These applications are generally based on inference about the long-term dynamics of the study population.Examples include predicting the time to possible extinction for threatened populations or establishing safe levels of harvest for exploited populations.
Complementary to the problem of developing theoretical models that predict the temporal dynamics of populations has been the development of a body of theory for estimating population parameters.This estimation theory is the main emphasis in the remainder of this contribution.Almost universally, these models treat quantities such as population size at time t, N t , or the numbers of individuals born between t and t + Δt, as fixed quantities to be predicted (here we use prediction in the sense of predicting the value of an unobserved realization of a random variable as well as future realizations).In contrast, the population models discussed above treat these same quantities as random variables whose behavior we are interested in describing.
A particularly important class of estimation models for population dynamics is the mark-recapture model [1].Mark-recapture data comprises repeated measures on individual animals in the population obtained through samples of animals drawn from the population (usually) at discrete sample times t 1 ,...,t k .Because the capture process is imperfect, not all animals in the population at time t j are captured, and a model is required to describe this process of repeated imperfect captures.
In the simplest case, the population is regarded as closed to births and deaths and so the population size at time t j is the same for all k sample occasions.Studies of the dynamics of the population can then be based on repeated experiments generating a sequence of abundance estimates.
A more interesting class of models is the open population models in which individuals may enter (through birth or immigration) or leave (through death or emigration) during the interval (t 1 ,t k ).Here the challenge is to model the sequence of captures of animals in terms of parameters of demographic interest.These are usually survival probabilities S j , and parameters that describe the birth process.Quantities such as N j , the abundance of animals at the time of sample j, and B j , the number of animals born between samples j and j + 1, are then predicted from the model.
Traditionally, inference for mark-recapture models has been based on maximum likelihood.Importantly, demographic models for {N t } have played no role in these estimation models.Inference about the {N t } process has instead been based on ad hoc methods for summarizing sequences of estimates such as { N t }.Recently there have been many developments that apply Bayesian inference methods to mark recapture models based on Markov chain Monte Carlo (MCMC).Here the utility of Markov chain theory appears in a fundamentally different context to that described above for population modeling; it arises as a tool for inference.An exciting feature of Bayesian inference methods based on MCMC is that fitting complicated hierarchical models has become feasible.Hierarchical models provide a link between the demographic models and estimation models in a way that should lead to better and more relevant inference.It is recent developments in Markov chain theory, in particular Gibbs sampling [2] and reversible jump Markov chain Monte Carlo [3], that allow this link to take place.

Open population capture-recapture models
Define ν as the total number of animals alive on at least one of the sampling occasions t j ( j = 1,...,k).Let Y i j be the indicator that animal i (i = 1,...,ν) is caught in sample j, which takes place at time t j .Any animal that is caught for the first time in a particular sample is marked with some form of unique tag and then released, with any recaptures noted.The data are of the form y = {y i j } for i = 1,...,u.and j = 1,...,k, where u j is the number of unmarked animals caught in sample j and u. = j u j is the total number of animals caught at least once.For animal i, the set of values y i = {y i1 ,..., y ik } is called the capture history and these provide censored information on the time of birth and death.We know that the time of birth occurred before the sample time corresponding to the first nonzero value of y i and the time of death occurred sometime after the sample time corresponding to the last nonzero value of y i .

Hidden Markov model.
The problem with capture-recapture methods is that not all animals in the population are caught; this causes the censoring of the birth and death intervals.Define X i j as the indicator of the event that animal i is alive and in the study population at time t j .If X = {x i j }, i = 1,...,ν, j = 1,...,k, were observed for an animal population in which ν animals were ever alive and available for capture on at least one of the k sample occasions, then inference about the parameters {S j } and {β j } would be straight-forward.It would also be straight-forward to calculate observed values of the random variables N j = i x i j .
Instead of observing x i j , we observe y i j with the joint distribution of the sequence of pairs {X i j ,Y i j } described by a hidden Markov model (Figure 2.1).Importantly, y does not include the history given by the vector of k zeros associated with animals that were available for capture at sometime during the experiment but never caught.
Conditional on the event that animal i is alive (and available for capture) on at least one of the capture occasions t j , the sequence (X i1 ,...,X ik ) can then be modeled as a threestate Markov chain.X i j = 0 corresponds to the event that animal i has not yet entered the population.X i j = 1 corresponds to the event that animal i is in the population and alive, and X i j = 2 corresponds to animal i being dead.The possible transitions are 0 → 0, 0 → 1, 1 → 1, 1 → 2, and 2 → 2. Define: β 0 as the probability that an animal is born before the start of the experiment, given that it is alive at some time during the experiment, β j as the probability that an animal is born between the times of sample j and j + 1, given that it was available for capture at sometime during the experiment, S j as the probability that an animal alive at the time of sample j is still alive at the start of sample j + 1. Conditioning on the ν individuals that are alive and available for capture at sometime during the experiment means that If Ꮽ is used to denote the event that animal i is alive and available for capture at sometime during the experiment, then and for j = 2,...,k − 1, where β j = β j / k−1 h= j β h .Note that we exclude from the study population individuals that are born and then die during (t j ,t j+1 ).Clearly these individuals are invisible to the markrecapture experiment.

Observed data likelihood.
A common assumption in mark recapture models is that y i j |x i j is the outcome of a Bernoulli trial with probability p j if x i j = 1 or 0 otherwise.That is, only those animals alive and in the population at the time of sample j are at risk of being caught, which happens with probability p j .The standard approach to fitting this model is to derive an observed data likelihood from the marginal distribution with pdf [y | ν,S,β, p] which is described by summing up across all possible sample paths for X that are compatible with the data (here we use the notation [y|x] to denote the pdf of the distribution for the random variable Y , evaluated at y, conditional on X, evaluated at x).
Using the notation y = {y i j } (i = 1,...,u.; j = 1,...,k) and u = (u 1 ,...,u k ) the likelihood can be expressed as where [u.|ν] describes a binomial distribution with index ν and parameter π 0 and [u|u.] a multinomial distribution with index u.and parameter vector ξ.Both π 0 and ξ are complicated functions of the parameters S and β that account for the censoring of the birth times in y.If we let ψ j denote the probability that an animal is available for capture in sample j but has not yet been caught, then Such an animal is first caught in sample j with probability ψ j p j hence (2.5) The term [y | u,S, p] describes the celebrated Cormack-Jolly-Seber model (see [4][5][6]).If we index the first sample in which an animal i was caught by r i and the last sample in which animal i was seen by l i , then we can write where the term χ j , which accounts for the censoring of time of death, can be defined recursively as (2.7) A nice feature of the observed data likelihood (2.3) is that it is straightforward to find maximum likelihood estimators.[7] showed that the parameter ν is not identifiable, however the partial likelihood [u | u.,β,S][y | u,S, p] obtained by conditioning on u. contains all practically useful information on the identifiable parameters in the model.Closedform solutions to the ML equations exist for all identifiable parameters in this partiallikelihood.
The model can also be easily generalized to allow parameters to be individual-specific, by introducing covariates, or to allow captures to depend on the earlier capture history of the animal.The model (2.3) has also been extended by reparameterizing the model in terms of per capita birth rate f j [7,8] and an index to population growth rate [8] The complete model (2.3), which we refer to as the Crosbie-Manly-Arnason-Schwarz model, was first described in [9] building on earlier work of [10].It differs from the wellknown Jolly-Seber (JS) model ( [5,6]) in the way that captures of unmarked animals are modeled.In the JS model, the terms (2.9) are replaced by and the elements of U = (U 1 ,...,U k ) , where U j is the number of unmarked animals in the population at the time of sample j, are treated as unknown parameters to be estimated.While historically important, this approach does not allow the reparameterizations in terms of f j and λ j described above as f j and λ j cannot be written as deterministic functions of the parameters S j and β j .Also, the U j parameters are of little demographic interest although predictors of N j and B j exist.

Complete data likelihood.
An alternative to using the observed data likelihood is to describe the model in terms of the complete data likelihood (CDL) [11].The CDL is the likelihood of all data, both missing and observed.The observed data are (i) the values y i j for all u.individuals that we observed at least once, (ii) the censored information about X that we obtain from y.The missing data are (i) the unknown number of individuals that were available for capture but not caught (we include this by specifying ν as a parameter), (ii) the realized but unknown values of X i j , x i j .Including both the observed and missing data gives the CDL that we can express as (2.11) The CDL has been naturally factored into a term that describes how the data were corrupted, [y | p,x,ν], and a term that models the underlying birth and death processes of interest, [x | S,β,ν].As for the observed data likelihood section, we model the corruption by assuming that an individual that was alive in sample j (x i j = 1) was observed in sample j (y i j = 1) with probability p j .We model the birth and death process as in the Markov chain model for X described in Section 2.1.
Even though the CDL provides a natural approach to looking at the problem, we must still integrate over the missing data in order to obtain valid inference.Computational techniques such as Markov chain Monte Carlo (MCMC, discussed below) can be used that iteratively integrate out all missing data, allowing models to be specified in terms of the CDL.This means that in each iteration of the MCMC chain we need values for the missing quantities x and ν (and all other parameters) that were obtained from the posterior distribution of all unknowns.One such MCMC algorithm we can use is Gibbs sampling, described in the following section.

Markov chain Monte Carlo.
The Gibbs sampler, also known as alternating conditional sampling, is a remarkable algorithm for efficiently constructing a Markov chain for complex joint probability distributions [Z 1 ,...,Z k | θ] by sampling from the full conditional distributions [Z i | {Z j } j =i ,θ] of each component.The stationary distribution of this Markov chain has the target density [Z 1 ,...,Z k | θ].A particularly useful feature of the Gibbs sampler is that the Markov chain can be constructed even though the target joint probability density may only be known up to the normalizing constant.This has led to a resurgence of interest in Bayesian inference which, historically, has been held back by the need for high-dimension integration needed to normalize posterior probability densities.
Here, we outline the construction of a Gibbs sampler whose target density is the CDL described in the previous section.Once we have collected our mark-recapture data we have data y, that is known; and the unknown are any parameters and any unknown realized values of random variables of interest.For the Crosbie-Manly-Arnason-Schwarz model (2.3), the parameters are p 1 , p 2 ,..., p k ; S 1 ,...,S k−1 , β 0 ,β 1 ,...,β k−1 , and ν.In addition unknown realized random variables of interest (might Starting with (2.3) and specifying independent beta Be(α p ,γ p ) prior distributions for the parameters p j , we can show that the full conditional distribution for p j is a beta Be(n j + α p ,N j − n j + γ p ) for j = 1,...,k, and where n j is the total number of individuals caught at time of sample j.
If we specify independent beta Be(α S ,γ S ) prior distributions for the parameters S j , we obtain beta Be(N j − D j + α S ,D j + γ S ) full conditional distributions for j = 1,...,k − 1.
If we specify independent beta Be(α β ,γ β ) prior distributions for the parameters β j , we obtain beta Be(B j + α β ,N − j h=0 B h + γ β ) full conditional distributions for j = 1,...,k − 2 (note that β k−1 = 1).If desired, we can transform the generated values of β j to any other birth parameter, such as β j or η j .For example, β j , j = 0,...,k − 1, is obtained by taking (2.12) We also need to calculate the full conditional probability of the missing values of x.This can be done a number of ways, but we choose to represent the information in x by matrices that give the interval censored times of birth and death, denoted by b and d, respectively.The value b i j = 1 means individual i was born between sample j and j + 1 with b i j = 0 otherwise.The value b i0 = 1 means the individual was born before the study began.The value d i j = 1 means that individual i died between sample j and j + 1 with d i j = 0 otherwise.The value d ik = 1 means the individual was still alive at the end of the study.The assumptions about the underlying birth and death processes impose (2.13) The matrices b and d are censored.For the u.individuals that were ever observed we know that they were not born after first capture, b i j = 0, j = r i ,...,k, where r i indexes the sample in which animal i first appeared.Likewise we know that they did not die before the last capture, d i j = 0, j = 1,...,l i , where l i indexes the sample in which animal i last appeared.For the ν − u. individuals that were never observed we have no information about either b or d.
The full conditional distribution of the unknown values of b for individual i is a multinomial distribution with probability vector γ b , where and λ b i is the time of first capture r i for all individuals observed, and it is the last period in which the individual was alive (obtained from d) for individuals i = u.+ 1,...,ν that were not observed.
The full conditional distribution of the unknown values of d for individual i is a multinomial distribution with probability vector γ d , where and λ d i is the time of last capture l i for all individuals observed, and it is the first period in which the individual was alive (obtained from b) for individuals i = u.+ 1,...,ν that were not observed.
For the parameter ν we specify a discrete uniform prior distribution DU(0,κ ν ).Obtaining a sample from the full conditional distribution for ν has two problems: (1) the full conditional distribution is only known to a proportionality constant, (2) the value of the parameter changes the dimension of other unknowns in the model.The statistic u j is the number of individuals caught in sample j, R j is the number of individuals released following sample j, q j is the number of the R j that were ever recaptured, and m j is the number of marked animals caught in sample j.
Month u j = R j q j m j To overcome the first problem we can use a sampling scheme, such as the Metropolis-Hastings algorithm, that allows us to sample from a distribution that we only know up to the proportionality constant.The second problem requires an extension of the Metropolis-Hastings algorithm called reversible jump Markov chain Monte Carlo [3] where there is dimension matching to ensure reversibility of the Markov chain.Details of the reversible-jump sampler are given in [11].

Example.
We illustrate the use of the MCMC algorithm described in the previous section by fitting the model (2.3) to meadow vole (Microtus pensylvannicus) data collected at the Patuxent Wildlife Research Center, Laurel, Md, USA [12].The meadow vole population was trapped at one-month intervals; untagged animals were tagged and released, tagged animals had their identity recorded and were then released.The model (2.3) can be fitted using sufficient statistics (Table 2.1) and closed-form solutions to the ML equations [7].The ML estimates (Table 2.2) and the posterior summaries from fitting the model using the Gibbs sample algorithm (Table 2.3) of the previous section are in close agreement.For the Gibbs sampler we used beta Be(1,1) priors for S i , p i , and β i with the f i parameters found deterministically as a function of the β i and S i parameters [7].We used a discrete uniform DU(200000) prior for ν.
A nice feature of the MCMC approach is that it is relatively simple to obtain a posterior distribution for unobserved random variables of interest such as N i and B i , also reported in Table 2.3.Predictions for these in a likelihood analysis must be based on method of moment-type estimators as the N i and B i parameters do not explicitly appear in the likelihood function.Obtaining predictions using MCMC is straightforward with values sampled from the posterior predictive distribution.In practice, we simply use the set of b and d values obtained in the Markov chain and for each iteration compute the current value of the variable of interest.For example, [11] show that (  chains appear well-mixed after a few thousand iterations.Even chains for nonidentifiable parameters such as ν appear well mixed after a few thousand iterations.

Discussion
Markov chain theory plays an important role in statistical inference both in the formulation of models for data and in the construction of efficient algorithms for inference.The ability to compactly represent stochastic processes that evolve over time has meant that Markov chain theory has played an important role in describing animal and plant population dynamics.The fact that each generation gives rise to the next makes the Markov property a natural starting point in developing population models.
The role of the Markov chain in statistical inference is a much more recent development.Although the Metropolis-Hastings algorithm first appeared in 1953 [13], with roots back to the Manhattan project, and the Gibbs sampler in 1984 [2] MCMC entered into widespread use in statistics only in the last 15 years.This followed [14] who illustrated the application of Gibbs sampling to a variety of statistical problems.
MCMC is particularly useful for fitting hierarchical models using Bayesian inference methods.A simple example of a hierarchical model is one where we have two components In order to fit this model by a method such as maximum likelihood (ML), we must find the marginal likelihood: ᏸ In practice this integral is very difficult, if not practically impossible, in most cases of interest.The equivalent Bayesian problem is to find the posterior marginal density which involves essentially the same integral.However, MCMC allows us to approximate a sample from the required density but we must specify a prior distribution from γ.Thus, Bayesian inference methods based on MCMC allows inference for the model [Y | θ][θ | γ] but the same problem may be difficult or impossible using standard ML theory.
In the context of mark-recapture modeling, the advent of MCMC has meant that it is now possible to fit complex hierarchical models.Starting with a model such as (2.3), it is relatively easy to add hierarchical structure.In the context of the hidden Markov model described in Figure 2.1, we mean that it is relatively straightforward to add in components that allow us to further model [X | β,S,ν].These issues have been explored by [7,11].[7] showed how the model (2.3) could be used to explore relationships among parameters and in particular considered a hierarchical model in which it was assumed that survival probabilities were related to per capita birth rates (functions of the β parameters).Such a model would make sense when there were common environmental influences on both survival probabilities and birth rates.Such an analysis is virtually impossible using ML methods.Similarly, [11] showed how the CDL discussed above could be used in hierarchical modeling, and in particular fit models in which there is feedback between the unknown random variables N j and survival S j or per capita birth rates f j .practical benefit of hierarchical modeling is that a much richer class of models is available for the ecologist to explore, and the advent of MCMC means that methods of inference are fully able to incorporate all different sources of uncertainty in the analysis.
A particularly important technical issue with MCMC is that of convergence.A typical problem in Markov chain theory is the determination of a stationary distribution given the transition kernel.In MCMC, the problem is reversed, and is one of constructing a transition kernel that has, as its stationary distribution, the target distribution of interest.Under mild conditions, the Metropolis-Hastings algorithm and the Gibbs sampler as a special case are methods of constructing transition kernels that have the required property.An important practical problem is the development of rules for helping determine when the chain has converged.Experience indicates that for some problems, convergence can be rapid and in others it can be slow.Rapid convergence appears to be an associated with likelihood functions that are well behaved.Methods for assessing convergence are ad hoc, and generally based on informal graphical assessment (Figure 2.2) or computation of simple statistics, such as the Brooks-Gelman-Rubin diagnostic statistic.There is surprisingly little underlying theory about the rate at which Markov chains constructed in MCMC should converge to the stationary distribution that can be applied to the practical problems of MCMC.Clearly, time to stationarity will be a property of the model.Again, observation indicates that problems for which the likelihood function are well behaved tend to converge rapidly.The work in [15,16] is important in this regard.
restrictions on the values of b and d, k−1 j=0

Figure 2 . 2 .
Figure2.2.Gibbs sampler Markov chains for the first 1000 iterations (column of plots on left-hand side), for 50 000 iterations following a discarded burn-in of 10 000 iterations (middle column of plots) and posterior density plots (right-hand column of plots) for the identifiable parameters S 1 and p 1 and the nonidentifiable parameter ν, for the Gibbs sampler posterior simulations from fitting the Crosbie-Manly-Arnason-Schwarz model fitted to the meadow vole data.

Table 2 .
1. Summary statistics for fitting the Crosbie-Manly-Arnason-Schwarz model to the meadow vole data.

Table 2 .
2. Maximum likelihood estimates of identifiable parameters in the Crosbie-Manly-Arnason-Schwarz model fitted to the meadow vole data.

Table 2 .
3.Posterior distribution summary statistics for the Gibbs sampler posterior simulations and for the identifiable parameters in the Crosbie-Manly-Arnason-Schwarz model fitted to the meadow vole data.Included are predictions of the abundances N j at the time of each sample and B j , the number of individuals born in (t j ,t j+1 ).