On determining the order of Markov dependence of an observed process governed by a hidden Markov model

This paper describes a Bayesian approach to determining the order of a finite state Markov chain whose transition probabilities are themselves governed by a homogeneous finite state Markov chain. It extends previous work on homogeneous Markov chains to more general and applicable hidden Markov models. The method we describe uses a Markov chain Monte Carlo algorithm to obtain samples from the (posterior) distribution for both the order of Markov dependence in the observed sequence and the other governing model parameters. These samples allow coherent inferences to be made straightforwardly in contrast to those which use information criteria. The methods are illustrated by their application to both simulated and real data sets.


Introduction
Markov chains are commonly used as models for data which are observed in discrete time and have a discrete and finite state space.Their application to time series data is widespread, ranging from the analysis of sequences of daily rainfall at a particular location to studying patterns of bases in a deoxyribonucleic acid (DNA) sequence.These data can be described using either a qth order Markov chain with state space Y or (equivalently) when q > 1, a first order Markov chain with a larger state space Y q and a constrained transition structure.However, in most practical situations the parameter q is not known, and this leads to fundamental difficulties in making inferences from the data under either model description.
The problem of estimating the order of dependence of a homogeneous Markov chain has a long history dating back to methods based on likelihood ratio tests described by [3,19].This work was followed by contributions describing procedures based on information (penalized likelihood) criteria such as the AIC [30] and the BIC [21].Much more recently, a procedure which uses a Bayesian approach to determine the posterior distribution for the order of dependence has been described in [14].They also show that their method performs favourably when compared to those which use information criteria.In this paper we generalise their method to one which determines the order of dependence in a heterogeneous Markov chain Y = (Y 1 , Y 2 , . . ., Y n ) which follows a hidden Markov model (HMM).These models are characterised by r homogeneous Markov chains together with an additional first order homogeneous r-state hidden Markov chain S = (S 1 , S 2 , . . ., S n ).The hidden chain describes which of the r chains governs the evolution of the observed process at any particular time.HMMs have proved to be very flexible models for describing heterogeneity in time series data and have been applied to a wide variety of problems; [9] provides a comprehensive list of references.For more background on HMMs see, for example, [26] and [22].Note that some authors, for example, [6], refer to these models as double chain Markov models (DCMMs).
We shall assume that each of the r homogeneous Markov chains has order q ( 0), that is, the probability of the current observation Y t depends only on the previous q observations Y t−q , . . ., Y t−1 .Also, we shall assume that each chain has state space Y = {1, 2, . . ., b}.Thus the inferential problem is to determine values for q and the other model parameters which are consistent with the data.The key quantities we require for a Bayesian analysis of this problem are the posterior (model) probabilities of q.These describe in simple terms how likely are different values of q in light of the data.Alternatively (and equivalently) we could calculate Bayes factors as is commonplace in Bayesian model choice problems; see [20].
A method for determining the order of dependence in homogeneous sequences (r = 1) has been provided by [14].Their approach is particularly appealing as it is fairly straightforward to use.Moreover, the ease of their approach is a direct consequence of their choice of prior distribution for the transition probabilities governing the evolution of the underlying process; we will return to this point later.Their method can be easily extended to the more general HMM context (r > 1) if the configuration of hidden states s is known.In particular, formulae for posterior model probabilities (and thus Bayes factors) are available analytically.However, a fundamental drawback of using HMMs is that the configuration s is unknown and has to be determined from the observed data y.This complication precludes a fully analytic treatment of the model and so we resort to computer intensive Markov chain Monte Carlo (MCMC) methods.These methods involve generating samples from a Markov chain which has been constructed so that its stationary distribution is the (posterior) distribution of interest.The resultant dependent samples can be used to approximate posterior quantities of model parameters such as the configuration s and the order of dependence q; see [8] for an overview.These methods also ensure that inferences take due account of uncertainty surrounding the correct configuration, in contrast to many plug-in methods [13].
In many scenarios the number of different hidden states r in the HMM will be unknown a priori.However, for ease of explanation, we restrict our attention to the case where r is known.It is possible to extend the methods described in this paper to also include estimation of r but this would require the use of methods such as those described in [28] which are based on reversible jump MCMC.
The remainder of the paper is organised as follows.The HMM is described in Section 2, followed by details of our Bayesian approach to inference in Section 3. Section 4 outlines an implementation of our MCMC algorithm on both a simulated and a real data set.The paper concludes in Section 5 with a discussion.

Model description
We shall assume that the observed data y = (y 1 , y 2 , . . ., y n ) are a realisation of a hidden Markov model with observation equations for t = q + 1, q + 2, . . ., n, where the notation x i:j denotes the sequence x i , x i+1 , . . ., x j .Note that we have assumed, as is common practice, that the hidden process S follows a first order homogeneous Markov chain.We shall denote its state space by S = {1, 2, . . ., r} and its transition matrix by Λ = (λ ij ), where each row λ i ∈ S r = {(x 1 , x 2 , . . ., x r ); x j > 0 ∀ j, r j=1 x j = 1}, the r-dimensional simplex.Although the order of dependence q is unknown, we shall assume that it can take values in Q = {0, 1, . . ., q max }, where q max 1.
The layout of the qth order transition matrices with elements ρ (k) yt−q:t−1,j is problematic to work with computationally since each increase in order requires an extra dimension in an array.However, we can overcome this problem by reshaping each matrix into its reduced form [25]. Here, for each k, the reduced form b q × b matrix P (k) consists of elements p (k) ij where i ∈ Y q = {1, 2, . . ., b q } indexes the rows of the matrix and j ∈ Y indexes the columns.The elements of the reshaped matrix corresponding to transition probabilities ρ (k) yt−q:t−1,j can be found on row i = I(y, t, q, b) In computational terms, this solution results in the algorithm working with fixed (two) dimensional arrays.Thus, each reshaped matrix P (k) has rows p (k) i ∈ S b and we denote the collection of these transition matrices by P = {P (1) , . . ., P (r) }.
For convenience in the following derivation, we denote the set of unknown hidden and observed state transition matrices by θ = {Λ, P} ∈ S r r × S rb q b , where the space S x r denotes the product of x simplices, each one r-dimensional.

Bayesian inference
The aim of the analysis is to make inferences about the unknown quantities in the model: the order of dependence q, the model transition parameters θ and the hidden states s.We shall adopt a Bayesian approach to inference [24], and begin by quantifying our uncertainty about these unknowns (before observing the data) through a prior distribution.
The discrete probability distribution π(q) defined on Q describes our prior uncertainty surrounding the value of q.For example, without strong prior beliefs as to likely values, we might take π(q) to be a discrete uniform distribution.Alternatively, if this uniform structure were thought inappropriate, for example, because larger values of q were believed to be relatively unlikely, then a truncated Poisson or geometric distribution might be appropriate choices.
The components of θ are all defined on simplices and therefore there are many choices of priors which could be made.One rich family of distributions is provided by Aitchison's A-distribution [2] which has the logistic normal and the Dirichlet distributions as special (limiting) cases.In this paper we shall adopt the same choice as [14] which was fundamental to the simplicity of their method, namely, the Dirichlet distribution: where Γ(•) is the gamma function [1].Specifically, we take independent Dirichlet distributions for the rows of each P (k) and Λ, that is The Dirichlet parameters c and d should be chosen to reflect the goal of the analysis.In this case, we have little knowledge about the transition structures in the data and so the exchangeable weak specification c (k) i = (1, 1, . . ., 1) may be appropriate.The choice of parameters for the transition structure of the hidden chain is more complex.Usually, this is taken to be an off-diagonal exchangeable pattern of the form for some choice of α and β, where δ ij is Kronecker's delta.These parameters are usually chosen to ensure a given prior mean and standard deviation for the length of runs of each state in the hidden chain, that is, for (1 − λ kk ) −1 ; for more details, see [7].

Likelihood
For this model, the complete-data likelihood is determined as the probability of both the observed and the unobserved data (hidden states) given the parameters, and is given by where and denote the observed transition counts and I(x) is the usual indicator function which equals 1 if x is true and 0 otherwise.Throughout this paper we perform inference conditional on the first q max observations.This simplifies the solution by removing the need for marginal models to describe the evolution at the beginning of the sequence.Consequently, the range of values for t in the above expressions is t = q max + 1, . . ., n.

Posterior inference
In the Bayesian paradigm, inferences are based on the posterior distribution π(q, θ, s|y) = π(q, θ, s, y) π(y) and this distribution calibrates our uncertainties about the unknown parameters after observing the data.Although this distribution is highly structured, it does not permit a straightforward analysis.However, the posterior distribution conditional on the hidden states s is much more amenable to analysis.It can be factorised into two component distributions π(θ|q, s, y) and π(q|s, y), and these distributions can be obtained as follows.
The posterior distribution for θ given s and q is easily obtained as the Dirichlet structure of the prior distribution is conjugate to the multinomial form of the likelihood (Eq.( 1)).Using Bayes' Theorem, it can be shown that this posterior distribution has independent components where n and m i = (m ij ).Inferences about the order of Markov dependence q are based on the posterior distribution of q given s π(q|s, y) = π(q)π(y|q, s) In general, computation of the marginal likelihood π(y|q, s) can be problematic and often is intractable in Bayesian model choice problems such as this.However, the conjugate choice of prior distribution for P allows us to determine the marginal likelihood using a simple rearrangement of Bayes' Theorem: , s, y) .
Substituting the constituent parts produces an exact expression for the marginal likelihood, namely This expression is easy to compute and therefore exact calculation of posterior model probabilities/Bayes factors is straightforward.We note that when r = 1, this marginal likelihood calculation correctly reproduces the result in [14].
The simplicity of the marginal likelihood calculation is due to the choice of Dirichlet distribution for the prior distribution of the transition probabilities P.There are many other possible choices of prior distribution which allow a more flexible covariance structure than the Dirichlet but, in general, these choices introduce additional complexity into the analysis.Even when using a different conjugate distribution, the Aitchison A -distribution, no exact expression for the marginal likelihood can be found as the normalising constant for this distribution is algebraically intractable.However, hierarchical generalisations of the Dirichlet distribution, such as a finite mixture or placing a hyper-prior distribution on the Dirichlet parameters, also inherit the simplicity of the marginal likelihood calculation.Both generalisations allow a more general covariance specification for the transition probabilities but would require additional updates on the mixture or hyper-prior parameters.

Posterior inference via MCMC
We have seen that determining posterior distributions is straightforward and exact when the hidden states are assumed known.However, in our model the hidden states are unknown quantities and so we use Markov chain Monte Carlo (MCMC) methods to allow for this uncertainty.Specifically we employ standard Gibbs sampling (data augmentation) procedures for hidden Markov models [11,27].For our analysis, the MCMC algorithm has two parameter blocks (q, θ) and s in which we simulate from the conditional distributions π(q, θ|s, y) and π(s|q, θ, y).
In the second block, a sequence of hidden states s is generated from the conditional distribution π(s|q, θ, y) using a standard forward-backward simulation algorithm.Algorithms of this type originated with the work of [4] and many variants are possible; the algo-  rithm we use is outlined in Fig. 1.We note that the forward sweep is initialised at t * = q max + 1 with y 1:t * /ξ t * where π k is the stationary probability that S t * = k.In Eq. ( 5), the scale factor η t ensures β t (•) is a valid probability distribution; also we use the convention λ jsn+1 ≡ 1.
The overall structure of our MCMC algorithm is outlined in Fig. 2. The joint (q, θ) move is undertaken in steps 1 and 2 using Eqs (2), ( 3) and (4) and the s move, in step 3, using the algorithm in Fig. 1.We note that this two block scheme should have better convergence properties than a standard three block scheme.
In general, particular care must be taken in the construction of MCMC schemes which incorporate a dimensional parameter, such as the order of Markov dependence q, to ensure that they converge to the correct distribution.The scheme described above does satisfy the necessary convergence conditions since q is simulated from a distribution which is marginalised over the θ parameters.An alternative verification can be obtained using the pseudo-priors approach suggested by [10] and subsequently modified by [18].Briefly, this technique provides a convergent scheme in which the dimension parameter is simulated conditionally on the other model parameters, that is, from π(q|θ, s, y).This distribution reduces to that in Eq. ( 4) when the pseudo-priors are chosen appropriately (see [16]).If a more complex prior distribution were thought to be appropriate, updates for the dimension parameter q could be obtained using a different choice of the pseudo-priors or by using reversible jump MCMC techniques [17].Investigation of this strategy is the subject of on-going work.

Posterior summaries
Suppose the MCMC algorithm is run until it is thought that convergence has been achieved (the burnin period) and then for a further N iterations, giving sampled values (q (i) , θ (i) , s (i) ), i = 1, 2, . . ., N on which to base our posterior summaries.We can estimate the (marginal) posterior distribution for the order of dependence parameter q using the sampled values q (i) by Alternatively, the Rao-Blackwellized estimate [15] π RB (q = j|y) = 1 N N i=1 π(q = j|s (i) , y), (7) j ∈ Q will give a more precise estimate at the expense of additional computing effort.The posterior distributions of the hidden states s and the model parameters θ conditional on q are also of interest.However, summarising these distributions is complicated as the parameters are not identifiable.This non-identifiability is caused by the fact that the likelihood is invariant to permutations of the hidden state labels, that is for any permutation ν(•) of the integers {1, 2, . . ., r}.
Consequently, the likelihood has r! symmetric modes corresponding to the permutations of the labels.Combining this likelihood with a symmetric prior distribution (as suggested in Section 3.1) will produce a posterior distribution which also possesses this symmetry and thus the parameters will not be identifiable.
A natural consequence of the symmetry in the posterior distribution is that our posterior sample is subject to label switching in which the hidden state labels randomly permute during the course of the MCMC run; see [29] for a detailed description.As each of the r! label permutations will appear (theoretically) equally often in the posterior sample, naïve summaries which ignore label switching (such as posterior means) will lead to similar values for each of the k = 1, 2, . . ., r hidden states.One solution to this problem is to impose some ordering constraint on the transition parameters for the observed sequence P (k) (for example, using the Fr öbenius norm) in order to encourage the algorithm to focus on one of the r! symmetric modes in the posterior distribution.An alternative solution which focuses on the hidden states is to post-process the MCMC output using a relabelling algorithm formulated in the decision-theoretic framework of [29].The aim of such algorithms is to determine the permutation of (1, 2, . . ., r) (and a relabelling of the sampled values) which minimises posterior expected loss (Monte Carlo risk) for some chosen loss function.
We shall adopt this second alternative and postprocess the output using an algorithm whose goal is to obtain the most likely hidden state at each position in the sequence, that is, the marginal posterior mode (MPM) estimate ŝ.Because of computing storage limitations, we advocate the use of an on-line algorithm as outlined in Fig. 3; the corresponding batch algorithm can be derived easily from this on-line version.
Two ways have been suggested by [29] to obtain a suitable initial best estimate ŝ (0) which rely on running an initial sample (after a burn-in period).However, in extensive testing we have found that taking ŝ (1) = s (1)  and starting the algorithm at iteration i = 2 works well as the algorithm is fairly robust to the choice of starting point.Another advantage of using an algorithm which relabels according to the hidden states rather than the parameters is that its run time does not depend on q.
In the next section we apply our methods to the analysis of both simulated and real data sets.We show how inferences can be made for the order of dependence and use relabelled MCMC output to make inferences about the parameters and the hidden states.

Simulated data
We begin by analysing a simulated sequence of length n = 1000.The sequence was generated from a hidden Markov model with r = 2 hidden states and a q = 1 order Markov dependence for a b = 4 state observed sequence.The transition matrices P (1) and P (2)  were chosen to have roughly similar columns (within each matrix).This ensures that the q = 0 and q = 1 models are fairly close and so gives the algorithm a reasonable challenge in deciding between them.We give below the transition matrices used for simulating the sequence in order to judge the performance of the estimation procedure:

Choice of prior distributions
We restrict our attention to considering dependence structures of order no more than q max = 3.Also we adopt a truncated Poisson distribution to describe our prior uncertainty about q with π(q) = Pr(q = i) ∝ a i /i!, i = 0, 1, 2, 3.
As we are particularly interested in whether the algorithm can choose between q = 0 or q = 1, we take a = 1 as the hyperparameter of this distribution.We will see later that the results are fairly robust to changes in this choice of prior.We also take the weak specification for the transition probabilities in P (1) and P (2) , that is c

Results
The MCMC algorithm was run for 110000 iterations with the first 10000 being discarded as burn-in.Our results are based on a sample of size N = 10000 since we only recorded every 10th iterate to reduce computing overheads.The usual diagnostic checks were made to ensure there was no evidence of lack of convergence.We also confirmed our results using several MCMC runs from different starting points.
Table 1 contains estimates of the marginal posterior distribution for q based on the sampled values of q over the iterations of the sampler (Eq.( 6)) together with the alternative Rao-Blackwellized estimate (Eq.( 7)).It shows a very high probability for the correct choice q = 1.The probability of higher values of q is very low.Repeating the analysis using a uniform prior distribution for q gives similar results.The effect of us-Table 1 Estimates of the marginal posterior distribution for q: simulated sequence n = 1000 Poisson π(q|y) 0.0005 0.9995 0.0000 0.0000 Poisson π RB (q|y) 0.0006 0.9994 0.0000 0.0000 Uniform π RB (q|y) 0.0006 0.9994 0.0000 0.0000 ing other prior distributions can be seen by reweighting these posterior probabilities according to the ratio of the proposed and actual prior probabilities.Such considerations show that the prior odds in favour of q = 0 (against q = 1) would have to be at least 1500 : 1 before the analysis favoured the incorrect choice of q = 0.
We now present results for the hidden states s and for the transition parameters θ conditional on the (posterior) modal value q = 1.In general we can estimate the posterior probabilities of the hidden states S t along the sequence (conditional on a particular q = q * ) by or by its Rao-Blackwellized equivalent.Figure 4 shows the estimate of the probability of hidden state 1 along the sequence.It also indicates the positions of the actual hidden states from which the sequence was simulated.Clearly, the algorithm has uncovered this latent structure very well.Table 2 contains the posterior means and standard deviations for the model transition probabilities θ.We note that the values shown are not too dissimilar to the values from which the data were simulated and well within sampling error.We now illustrate the effect of sequence length n on the ability to detect the correct order of dependence q in the sequence.Clearly, longer sequences will reduce uncertainty about q, but to what extent?We now investigate what conclusions can be drawn about the model parameters using only the first half of the earlier sequence (n = 500).We have retained the same prior distributions and proceeded with a MCMC algorithm as before.The MCMC algorithm produced a wellmixing chain which traversed the different models (corresponding to the different values of q) regularly, with the value of q changing on approximately 38% of the iterations.
The impact of using a much shorter sequence on the marginal posterior of q is clearly seen in Table 3. Again the various choices of estimate and prior distribution produce very similar results.However, for this shorter sequence, there is considerably more uncertainty surrounding the value of q.As before, q = 0 and q = 1 receive nearly all of the posterior support with values of q = 2 or q = 3 highly unlikely.However, there is a high degree of uncertainty about the order of dependence in the data.There needs only to be a shift in prior odds of around 3 : 2 before the incorrect choice of q = 0 is favoured.This shorter sequence also makes it much more difficult to identify the hidden states.Figure 5 shows the  plot of estimated (posterior) probabilities for both q = 0 and q = 1 together with the true sequence.Clearly, the q = 1 plot better describes the actual sequence but nevertheless its predictive capacity is much reduced when compared to that obtained using the full sequence.The posterior means and standard deviations for the model transition probabilities θ (assuming q = 1) are given in Table 4. Again these values are consistent with the model parameters in Eq. ( 8).However, the standard deviations are significantly increased and much more than would be expected due to halving the sequence length.This is due to uncertainty about the hidden states.

DNA sequence data
For our final example we apply the methods to the analysis of a DNA sequence.These sequences comprise a string of b = 4 states (bases) from the alpha-bet Y = {A, C, G, T}.HMMs have been used for some time to model heterogeneity in the composition of DNA sequences with most analyses assuming that q = 0, that is, the observed sequence is independent conditional on the hidden states.However, empirical evidence would suggest that this independence model is not sufficiently complex to capture the rich dependence structure in these sequences.In such circumstances, the methods described in this paper can permit inferences to be made about the order of Markov dependence of the bases, thereby ensuring appropriate conclusions are drawn.Some authors have analysed DNA sequences using larger values of q but these analyses have assumed a known fixed value of q (for example [7,23]) or attempted to choose between various q using information criteria (for example [12]).
We will study the 7th intron of the human αfetoprotein (AFP) gene which was analysed in [7] using a hidden Markov model.The AFP gene is known be an important factor in embryonic development in mammals and is also thought to play a role in the development of tumors; for further details, see [7].The intron is n = 2275 base pairs in length and is stored in the GenBank sequence database [5] under Accession No. M16110.It can be obtained from the National Center for Biotechnology Information (NCBI) web pages at http://www.ncbi.nlm.nih.gov/.

Results
For comparison with the results in [7] we have run the algorithm assuming r = 3 hidden states.We chose the maximum complexity of base updates to be order q max = 3 and used the same truncated Poisson prior distribution as in the analysis of the simulated data.We also chose the same prior distribution for the observed state transition matrices but for the hidden state transition matrix we chose d ij = 99 for i = j and d ij = 0.5 otherwise; this is equivalent to the information content of a sequence of length 300 with an expected run length of 100 for each hidden state.
The MCMC algorithm was again run for 110000 iterations with the first 10000 being discarded as burnin.The usual checks for evidence on lack of convergence were made using multiple runs from different starting points, and our results are based on a sample of size N = 10000 from one such run, recording every 10th iterate.Table 5 contains the sample and Rao-Blackwellized estimates of the marginal posterior distribution for q.It shows that, after convergence, the sampler never moved from the model with q = 1 during the course of the simulation.Ordinarily, this may point to the MCMC scheme suffering from poor mixing over q, possibly due to the update conditioning on the hidden states s.However, our experience with simulated sequences suggests that this is in fact due to the DNA sequence being sufficiently long to provide overwhelming evidence that q = 1.
These results show that the choice of q = 1 employed by [7] is well justified.

Conclusions
We have seen how inferences can be made about the order of Markov dependence of an observed process governed by a HMM.In many practical examples, the sequence will be sufficiently long that by using these methods it will be straightforward to determine this order, particularly if the transition structures in each hidden state are reasonably different.When sequences are short or the transition structures fairly similar, it can be difficult to determine an appropriate order of dependence.In such circumstances it is important to be able to correctly assess uncertainty about the order q and also the associated transition structures and the pattern of hidden states.The methods presented in this paper do provide this information by adopting a fully Bayesian approach through the use of MCMC techniques.
For the hidden state transition matrix we take d 11 = d 22 = 19 and d 12 = d 21 = 1; this is equivalent to the information content of a sequence of length 40 with an expected run length of 20 for both hidden states.

Table 2
Posterior summaries for transition matrices conditional on q = 1: simulated sequence n = 1000

Table 3
Estimates of the marginal posterior distribution for q: simulated sequence n = 500

Table 4
Posterior summaries for transition matrices conditional on q = 1: simulated sequence n = 500

Table 5
Estimates of the marginal posterior distribution for q: intron 7 of the human AFP gene conditional on r = 3