Abstract and Applied Analysis, Volume 2009, Article ID 103723, doi:10.1155/2009/103723

Research Article

Policy Iteration for Continuous-Time Average Reward Markov Decision Processes in Polish Spaces

Quanxin Zhu (Department of Mathematics, Ningbo University, Ningbo 315211, China), Xinsong Yang (Department of Mathematics, Honghe University, Mengzi 661100, China), and Chuangxia Huang (College of Mathematics and Computing Science, Changsha University of Science and Technology, Changsha 410076, China)

Academic Editor: Nikolaos Papageorgiou

Received 24 June 2009; Accepted 9 December 2009

Copyright © 2009. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We study the policy iteration algorithm (PIA) for continuous-time jump Markov decision processes in general state and action spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. The criterion that we are concerned with is expected average reward. We propose a set of conditions under which we first establish the average reward optimality equation and present the PIA. Then under two slightly different sets of conditions we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation.

1. Introduction

In this paper we study the average reward optimality problem for continuous-time jump Markov decision processes (MDPs) in general state and action spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. Here, the approach to deal with this problem is by means of the well-known policy iteration algorithm (PIA)—also known as Howard's policy improvement algorithm.

As is well known, the PIA was originally introduced by Howard (1960) in [1] for finite MDPs (i.e., MDPs whose state and action spaces are both finite). By using the monotonicity of the sequence of iterated average rewards, he showed that the PIA converges in a finite number of steps. However, when the state space is not finite, there are well-known counterexamples showing that the PIA may fail to converge even though the action space is compact (see, e.g., [2]). Thus, an interesting problem is to find conditions ensuring that the PIA converges. An extensive literature addresses this problem; for instance, see [1, 5–14] and the references therein. However, most of the references above concentrate on discrete-time MDPs; for instance, see [1, 5, 11] for finite discrete-time MDPs, [10, 15] for discrete-time MDPs with a finite state space and a compact action set, for denumerable discrete-time MDPs, and [8, 9, 12] for discrete-time MDPs in Borel spaces. For continuous-time models, to the best of our knowledge, only Guo and Hernández-Lerma [6], Guo and Cao [7], and Zhu [14] have addressed this issue. In [6, 7, 14], the authors established the average reward optimality equation and the existence of average optimal stationary policies. However, the treatments in [6, 7] are restricted to a denumerable state space. In [14] we used the policy iteration approach to study the average reward optimality problem for continuous-time jump MDPs in general state and action spaces. One of the main contributions of [14] is the proof of the existence of solutions to the average reward optimality equation and of average optimal stationary policies. However, the PIA is not stated explicitly in [14], and so neither the value of the average optimal reward function nor an average optimal stationary policy can be computed there. In this paper we further study the average reward optimality problem for this class of continuous-time jump MDPs in general state and action spaces.
Our main objective is to use the PIA to compute or at least approximate (when the PIA takes infinitely many steps to converge) the value of the average optimal reward function and an average optimal stationary policy. To do this, we first use the so-called "drift" condition, the standard continuity-compactness hypotheses, and the irreducibility and uniform exponential ergodicity condition to establish the average reward optimality equation and present the PIA. Then, under two slightly different extra conditions, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation. A key feature of this paper is that the PIA provides a constructive way to compute, or at least approximate, these quantities.

The remainder of this paper is organized as follows. In Section 2, we introduce the control model and the optimal control problem that we are concerned with. After our optimality conditions and some technical preliminaries as well as the PIA stated in Section 3, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation in Section 4. Finally, we conclude in Section 5 with some general remarks.

Notation 1.

If X is a Polish space (i.e., a complete and separable metric space), we denote by ℬ(X) its Borel σ-algebra.

2. The Optimal Control Problem

The material in this section is quite standard (see, e.g., [14, 16, 17]), and we state it only briefly. The control model that we are interested in is a continuous-time jump MDP of the form:

{S, (A(x) ⊂ A, x ∈ S), q(·|x,a), r(x,a)}, where one has the following.

S is the state space, assumed to be a Polish space.

A is the action space, also assumed to be a Polish space, and A(x) is a Borel set denoting the set of actions available at state x ∈ S. The set K := {(x,a) : x ∈ S, a ∈ A(x)} is assumed to be a Borel subset of S × A.

q(·|x,a) denotes the transition rates, which are supposed to satisfy the following properties: for each (x,a) ∈ K and D ∈ ℬ(S),

(Q1) D ↦ q(D|x,a) is a signed measure on ℬ(S), and (x,a) ↦ q(D|x,a) is Borel measurable on K;

(Q2) 0 ≤ q(D|x,a) < ∞ whenever x ∉ D;

(Q3) q(S|x,a) = 0 and 0 ≤ −q({x}|x,a) < ∞;

(Q4) q(x) := sup_{a∈A(x)} (−q({x}|x,a)) < ∞ for all x ∈ S.

It should be noted that the property (Q3) shows that the model is conservative, and the property (Q4) implies that the model is stable.
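For intuition, on a finite state space the properties (Q2)–(Q4) reduce to familiar conditions on a rate (Q-) matrix: nonnegative off-diagonal entries, rows summing to zero (conservative), and nonpositive finite diagonal entries (stable). A minimal pure-Python sketch; the matrices below are made-up illustrations, not taken from the paper:

```python
# Illustrative check of (Q2)-(Q4) for a finite rate matrix (made-up numbers).

def is_conservative_and_stable(Q, tol=1e-9):
    n = len(Q)
    for i in range(n):
        if any(Q[i][j] < 0 for j in range(n) if j != i):  # (Q2): off-diagonal >= 0
            return False
        if abs(sum(Q[i])) > tol:                          # (Q3): q(S|x,a) = 0
            return False
        if Q[i][i] > 0:                                   # 0 <= -q({x}|x,a)
            return False
    return True

Q_good = [[-3.0, 2.0, 1.0],
          [ 0.5, -0.5, 0.0],
          [ 1.0, 1.0, -2.0]]
Q_bad = [[1.0, -1.0],
         [0.0, 0.0]]
```

Here `Q_good` satisfies all three conditions, while `Q_bad` violates (Q2).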

r(x,a) denotes the reward rate, and it is assumed to be measurable on K. (As r(x,a) is allowed to take positive and negative values, it can also be interpreted as a cost rate.)

To introduce the optimal control problem that we are interested in, we need to introduce the classes of admissible control policies.

Let Π_m be the family of functions π_t(B|x) such that

for each x ∈ S and t ≥ 0, B ↦ π_t(B|x) is a probability measure on ℬ(A(x)),

for each x ∈ S and B ∈ ℬ(A(x)), t ↦ π_t(B|x) is a Borel measurable function on [0, ∞).

Definition 2.1.

A family π = (π_t, t ≥ 0) ∈ Π_m is said to be a randomized Markov policy. In particular, if there exists a measurable function f on S with f(x) ∈ A(x) for all x ∈ S such that π_t({f(x)}|x) ≡ 1 for all t ≥ 0 and x ∈ S, then π is called a (deterministic) stationary policy, and it is identified with f. The set of all stationary policies is denoted by F.

For each π = (π_t, t ≥ 0) ∈ Π_m, we define the associated transition rates q(D|x,π_t) and reward rates r(x,π_t), respectively, as follows.

For each x ∈ S, D ∈ ℬ(S) and t ≥ 0,

q(D|x,π_t) := ∫_{A(x)} q(D|x,a) π_t(da|x),  r(x,π_t) := ∫_{A(x)} r(x,a) π_t(da|x). In particular, we write q(D|x,π_t) and r(x,π_t) as q(D|x,f) and r(x,f), respectively, when π := f ∈ F.

Definition 2.2.

A randomized Markov policy π is said to be admissible if q(D|x,π_t) is continuous in t ≥ 0, for all D ∈ ℬ(S) and x ∈ S.

The family of all such policies is denoted by Π. Obviously, Π ⊇ F, so Π is nonempty. Moreover, for each π ∈ Π, Lemma 2.1 in [16] ensures the existence of a Q-process—that is, a possibly substochastic and nonhomogeneous transition function P^π(s,x,t,D) with transition rates q(D|x,π_t). As is well known, such a Q-process is not necessarily regular; that is, we might have P^π(s,x,t,S) < 1 for some state x ∈ S and t ≥ s ≥ 0. To ensure the regularity of the Q-process, we use the following so-called "drift" condition, which is taken from [14, 16–18].

Assumption A.

There exist a (measurable) function w1 ≥ 1 on S and constants b1 ≥ 0, c1 > 0, M1 > 0 and M > 0 such that

(1) ∫_S w1(y) q(dy|x,a) ≤ −c1 w1(x) + b1 for all (x,a) ∈ K;

(2) q(x) ≤ M1 w1(x) for all x ∈ S, with q(x) as in (Q4);

(3) |r(x,a)| ≤ M w1(x) for all (x,a) ∈ K.

Remark 2.1 in [14] gives a discussion of Assumption A. In fact, Assumption A(1) is similar to conditions in the previous literature (see, e.g., [19, equation (2.4)]), and together with Assumption A(3) it is used to ensure the finiteness of the expected average reward criterion (2.5) below. In particular, Assumption A(2) is not required when the transition rates are uniformly bounded, that is, when sup_{x∈S} q(x) < ∞.

For each initial state x ∈ S at time s ≥ 0 and each π ∈ Π, we denote by P_{s,x}^π and E_{s,x}^π the probability measure determined by P^π(s,x,t,D) and the corresponding expectation operator, respectively. Thus, for each π ∈ Π, by [20, pages 107–109] there exists a Borel measurable Markov process {x_t^π} (denoted by {x_t} for simplicity when there is no risk of confusion) with values in S and transition function P^π(s,x,t,D), which is completely determined by the transition rates q(D|x,π_t). In particular, if s = 0, we write E_{0,x}^π and P_{0,x}^π as E_x^π and P_x^π, respectively.

If Assumption A holds, then from [17, Lemma 3.1] we have the following facts.

Lemma 2.3.

Suppose that Assumption A holds. Then the following statements hold.

(a) For each x ∈ S, π ∈ Π and t ≥ 0, E_x^π[w1(x_t)] ≤ e^{−c1 t} w1(x) + b1/c1, where the function w1 and the constants b1 and c1 are as in Assumption A.

(b) For each u ∈ B_{w1}(S), x ∈ S and π ∈ Π, lim_{t→∞} E_x^π[u(x_t)]/t = 0.

For each x ∈ S and π ∈ Π, the expected average reward V(x,π) and the corresponding optimal reward value function V*(x) are defined as

V(x,π) := liminf_{T→∞} (1/T) ∫_0^T E_x^π[r(x_t,π_t)] dt,  V*(x) := sup_{π∈Π} V(x,π). (2.5)

As a consequence of Assumption A(3) and Lemma 2.3(a), the expected average reward V(x,π) is well defined.

Definition 2.4.

A policy π* ∈ Π is said to be average optimal if V(x,π*) = V*(x) for all x ∈ S.

The main goal of this paper is to give conditions for ensuring that the policy iteration algorithm converges.

3. Optimality Conditions and Preliminaries

In this section we state conditions for ensuring that the policy iteration algorithm (PIA) converges and give some preliminary lemmas that are needed to prove our main results.

To guarantee that the PIA converges, we need to establish the average reward optimality equation. To do this, in addition to Assumption A, we need two more assumptions. The first one is the following standard continuity-compactness hypotheses, taken from [14, 16–18]. Moreover, it is similar to the version for discrete-time MDPs; see, for instance, [3, 8, 21–23] and their references. In particular, Assumption B(3) is not required when the transition rates are uniformly bounded, since it is only used to ensure the applicability of the Dynkin formula.

Assumption B.

For each x ∈ S,

(1) A(x) is compact;

(2) r(x,a) is continuous in a ∈ A(x), and the function ∫_S u(y) q(dy|x,a) is continuous in a ∈ A(x) for each bounded measurable function u on S, and also for u := w1 as in Assumption A;

(3) there exist a nonnegative measurable function w2 on S and constants b2 ≥ 0, c2 > 0 and M2 > 0 such that q(x) w1(x) ≤ M2 w2(x) and ∫_S w2(y) q(dy|x,a) ≤ c2 w2(x) + b2 for all (x,a) ∈ K.

The second one is an irreducibility and uniform exponential ergodicity condition. To state it, we need the weighted norm used in [8, 14, 22]. For the function w1 ≥ 1 in Assumption A, we define the weighted supremum norm ‖·‖_{w1} for real-valued functions u on S by

‖u‖_{w1} := sup_{x∈S} [w1(x)^{−1} |u(x)|], and the Banach space

B_{w1}(S) := {u : ‖u‖_{w1} < ∞}.

Definition 3.1.

For each f ∈ F, the Markov process {x_t} with transition rates q(·|x,f) is said to be uniformly w1-exponentially ergodic if there exists an invariant probability measure μ_f on S such that sup_{f∈F} |E_x^f[u(x_t)] − μ_f(u)| ≤ R e^{−ρt} ‖u‖_{w1} w1(x) for all x ∈ S, u ∈ B_{w1}(S) and t ≥ 0, where the positive constants R and ρ do not depend on f, and where μ_f(u) := ∫_S u(y) μ_f(dy).
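The decay in this definition can be seen in closed form in the simplest nontrivial case. For a two-state chain with generator [[−a, a], [b, −b]] (illustrative rates, not from the paper), the transition function is p_{xy}(t) = μ(y) + (δ_{xy} − μ(y)) e^{−(a+b)t}, so |E_x[u(x_t)] − μ(u)| decays at rate ρ = a + b:

```python
# Closed-form illustration of exponential ergodicity for a two-state chain
# with generator [[-a, a], [b, -b]] (invented rates).
import math

a, b = 2.0, 3.0
mu0, mu1 = b / (a + b), a / (a + b)   # invariant probability measure

def p(t, x, y):
    """Transition probability p_xy(t) of the two-state chain."""
    decay = math.exp(-(a + b) * t)
    stat = mu0 if y == 0 else mu1
    start = 1.0 if x == y else 0.0
    return stat + (start - stat) * decay

u = [10.0, -4.0]                       # any bounded function u on S = {0, 1}
mu_u = mu0 * u[0] + mu1 * u[1]
gap = abs(p(2.0, 0, 0) * u[0] + p(2.0, 0, 1) * u[1] - mu_u)
bound = abs(u[0] - u[1]) * math.exp(-(a + b) * 2.0)
# gap <= bound: the deviation from mu(u) is controlled by e^{-(a+b)t}
```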

Assumption C.

For each f ∈ F, the Markov process {x_t} with transition rates q(·|x,f) is uniformly w1-exponentially ergodic and λ-irreducible, where λ is a nontrivial σ-finite measure on ℬ(S) independent of f.

Remark 3.2.

(a) Assumption C is taken from [14], and it is used to establish the average reward optimality equation. (b) Assumption C is similar to the uniform w1-exponential ergodicity hypothesis for discrete-time MDPs; see [8, 22], for instance. (c) Sufficient conditions as well as examples verifying Assumption C are given in [6, 16, 19]. (d) Under Assumptions A, B, and C, for each f ∈ F, the Markov process {x_t} with transition rates q(·|x,f) has a unique invariant probability measure μ_f such that ∫_S μ_f(dx) q(D|x,f) = 0 for each D ∈ ℬ(S). (e) As in [14], for any given stationary policy f ∈ F, we consider two functions in B_{w1}(S) to be equivalent, and do not distinguish between equivalent functions, if they are equal μ_f-almost everywhere (a.e.). In particular, if u(x) = 0 μ_f-a.e., then the function u is taken to be identically zero.
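The invariance equation in Remark 3.2(d) can be checked directly for a two-state chain whose generator under f has rates a (from state 0 to 1) and b (from 1 to 0); the invariant measure is then μ_f = (b/(a+b), a/(a+b)). A sketch with illustrative rates:

```python
# Two-state generator under a fixed stationary policy f (invented rates):
# Q = [[-a, a], [b, -b]] with a, b > 0 has invariant measure
# mu_f = (b/(a+b), a/(a+b)), and sum_x mu_f(x) q(D|x,f) = 0 for every D.

a, b = 2.0, 3.0
Q = [[-a, a], [b, -b]]
mu = (b / (a + b), a / (a + b))

# invariance checked set by set (here D = {0} and D = {1})
balance = [mu[0] * Q[0][j] + mu[1] * Q[1][j] for j in range(2)]
```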

Under Assumptions A, B, and C, we can obtain several lemmas, which are needed to prove our main results.

Lemma 3.3.

Suppose that Assumptions A, B, and C hold, and let f ∈ F be any stationary policy. Then one has the following facts.

(a) For each x ∈ S, the function h_f(x) := ∫_0^∞ [E_x^f(r(x_t,f)) − g(f)] dt belongs to B_{w1}(S), where g(f) := ∫_S r(y,f) μ_f(dy) and w1 is as in Assumption A.

(b) (g(f), h_f) satisfies the Poisson equation g(f) = r(x,f) + ∫_S h_f(y) q(dy|x,f), x ∈ S, (3.7) for which the μ_f-expectation of h_f is zero, that is, μ_f(h_f) := ∫_S h_f(y) μ_f(dy) = 0.

(c) For all x ∈ S, |V(x,f)| ≤ M b1/c1.

(d) For all x ∈ S, g(f) = V(x,f), and hence |g(f)| ≤ M b1/c1.

Proof.

Obviously, parts (a) and (b) follow from [14, Lemma 3.2]. We now prove (c). In fact, from the definition of V(x,f) in (2.5), Assumption A(3), and Lemma 2.3(a) we have |V(x,f)| ≤ liminf_{T→∞} (1/T) ∫_0^T M [e^{−c1 t} w1(x) + b1/c1] dt = M b1/c1, which gives (c). Finally, we verify part (d). Obviously, by Assumption A(3) and Assumption C we easily obtain g(f) = V(x,f) for all x ∈ S, which together with part (c) yields the desired result.
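For a two-state chain the Poisson equation of Lemma 3.3(b) can be solved in closed form: g(f) = μ_f(r), the difference h(1) − h(0) follows from the state-0 equation, and the additive constant is fixed by the normalization μ_f(h_f) = 0. A sketch with invented rates and rewards:

```python
# Solving the Poisson equation  g(f) = r(x,f) + sum_y h_f(y) q(y|x,f)
# for a two-state chain (illustrative numbers), normalized so mu_f(h_f) = 0.

a, b = 2.0, 3.0                      # rates 0->1 and 1->0 under policy f
r = [5.0, 1.0]                       # reward rates r(x, f)
mu = (b / (a + b), a / (a + b))      # invariant measure
g = mu[0] * r[0] + mu[1] * r[1]      # long-run average reward g(f)

d = (g - r[0]) / a                   # h(1) - h(0) from the state-0 equation
h = [-mu[1] * d, mu[0] * d]          # shifted so that mu(h) = 0

# residuals: both rows of the Poisson equation should equal g
res0 = r[0] + (-a) * h[0] + a * h[1]
res1 = r[1] + b * h[0] + (-b) * h[1]
```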

The next result establishes the average reward optimality equation. For the proof, see [14, Theorem 4.1].

Theorem 3.4.

Under Assumptions A, B, and C, the following statements hold.

(a) There exist a unique constant g*, a function h* ∈ B_{w1}(S), and a stationary policy f* ∈ F satisfying the average reward optimality equation

g* = max_{a∈A(x)} {r(x,a) + ∫_S h*(y) q(dy|x,a)} (3.10) = r(x,f*) + ∫_S h*(y) q(dy|x,f*), x ∈ S. (3.11)

(b) g* = sup_{π∈Π} V(x,π) for all x ∈ S.

(c) Any stationary policy f ∈ F realizing the maximum in (3.10) is average optimal, and so f* in (3.11) is average optimal.

Next, under Assumptions A, B, and C, we present the PIA that we are concerned with. To do this, we first give the following definition.

For any real-valued function u on S, we define the dynamic programming operator T as follows:

Tu(x) := max_{a∈A(x)} {r(x,a) + ∫_S u(y) q(dy|x,a)}, x ∈ S. (3.12)
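On a finite model the operator T is a finite maximization over actions. A minimal sketch; all numerical data below are invented for illustration:

```python
# The operator T on a finite model: Tu(x) = max_a { r(x,a) + sum_y u(y) q(y|x,a) }.
# All data invented for illustration.

def apply_T(u, r, q, actions):
    n = len(u)
    return [max(r[x][a] + sum(q[x][a][y] * u[y] for y in range(n))
                for a in actions[x])
            for x in range(n)]

actions = [[0, 1], [0, 1]]                   # A(x) for x = 0, 1
r = [[5.0, 4.0], [1.0, 2.0]]                 # r[x][a]
q = [[[-2.0, 2.0], [-1.0, 1.0]],             # q[x][a][y]; each row sums to 0
     [[3.0, -3.0], [1.0, -1.0]]]

u = [0.32, -0.48]
Tu = apply_T(u, r, q, actions)
```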

Algorithm A (policy iteration)

Step 1 (initialization).

Take n = 0 and choose a stationary policy f_0 ∈ F.

Step 2 (policy evaluation).

Find a constant g(f_n) and a real-valued function h_{f_n} on S satisfying the Poisson equation (3.7), that is, g(f_n) = r(x,f_n) + ∫_S h_{f_n}(y) q(dy|x,f_n), x ∈ S. (3.13) Obviously, by (3.12) and (3.13) we have g(f_n) ≤ T h_{f_n}(x) = max_{a∈A(x)} {r(x,a) + ∫_S h_{f_n}(y) q(dy|x,a)}, x ∈ S. (3.14)

Step 3 (policy improvement).

Set f_{n+1}(x) := f_n(x) for all x ∈ S for which r(x,f_n) + ∫_S h_{f_n}(y) q(dy|x,f_n) = T h_{f_n}(x); (3.15) otherwise (i.e., when (3.15) does not hold), choose f_{n+1}(x) ∈ A(x) such that r(x,f_{n+1}) + ∫_S h_{f_n}(y) q(dy|x,f_{n+1}) = T h_{f_n}(x). (3.16)

Step 4.

If f_{n+1} satisfies (3.15) for all x ∈ S, then stop (because, by Proposition 4.1 below, f_{n+1} is average optimal); otherwise, replace f_n with f_{n+1} and go back to Step 2.
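On a finite model, Steps 1–4 above can be run to termination. The sketch below uses a hypothetical two-state, two-action model (all numbers invented): policy evaluation solves the 2×2 Poisson system in closed form, and the improvement step keeps f_n(x) whenever it already attains the maximum, as Step 3 prescribes:

```python
# A full run of Algorithm A on a hypothetical two-state, two-action model.
# Policy evaluation solves the 2x2 Poisson system in closed form; the
# improvement step keeps f[x] whenever it already attains the maximum (Step 3).

actions = [0, 1]
r = [[5.0, 4.0], [1.0, 2.0]]                  # reward rates r[x][a]
q = [[[-2.0, 2.0], [-1.0, 1.0]],              # transition rates q[x][a][y]
     [[3.0, -3.0], [1.0, -1.0]]]

def evaluate(f):
    """Step 2: return (g(f), h_f) with mu_f(h_f) = 0."""
    a01, a10 = q[0][f[0]][1], q[1][f[1]][0]   # rates 0->1 and 1->0 under f
    mu = (a10 / (a01 + a10), a01 / (a01 + a10))
    g = mu[0] * r[0][f[0]] + mu[1] * r[1][f[1]]
    d = (g - r[0][f[0]]) / a01                # h[1] - h[0]
    return g, [-mu[1] * d, mu[0] * d]

def improve(f, h):
    """Step 3: maximize r(x,a) + sum_y h(y) q(y|x,a); keep f[x] on ties."""
    f_new = []
    for x in (0, 1):
        vals = {a: r[x][a] + sum(q[x][a][y] * h[y] for y in (0, 1))
                for a in actions}
        best = max(vals.values())
        f_new.append(f[x] if vals[f[x]] >= best - 1e-12
                     else max(vals, key=vals.get))
    return f_new

f = [1, 1]                                    # Step 1: initial policy
while True:
    g, h = evaluate(f)                        # Step 2
    f_new = improve(f, h)                     # Step 3
    if f_new == f:                            # Step 4: fixed point reached
        break
    f = f_new
```

With these invented numbers the iterates g(f_n) increase monotonically until the improvement step leaves the policy unchanged, illustrating the monotonicity used in Section 4.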

Definition 3.5.

The policy iteration Algorithm A is said to converge if the sequence {g(f_n)} converges to the average optimal reward value function in (2.5), that is, lim_{n→∞} g(f_n) = V*(x) = g* for all x ∈ S, where g* is as in Theorem 3.4.

Obviously, under Assumptions A, B, and C, from Proposition 4.1 we see that the sequence {g(f_n)} is nondecreasing; that is, g(f_n) ≤ g(f_{n+1}) for all n ≥ 1. On the other hand, by Lemma 3.3(d) we see that {g(f_n)} is bounded. Therefore, there exists a constant g̃ such that

lim_{n→∞} g(f_n) = g̃. (3.18) Note that, in general, we only have g̃ ≤ g*. To ensure that the policy iteration Algorithm A converges, that is, g̃ = g*, in addition to Assumptions A, B, and C, we need an additional condition (Assumption D or D′ below).

Assumption D.

There exist a subsequence {h_{f_m}} of {h_{f_n}} and a measurable function h on S such that lim_{m→∞} h_{f_m}(x) = h(x) for all x ∈ S. (3.19)

Remark 3.6.

(a) Assumption D is the same as hypothesis H1 in [14], and Remark 4.6 in [14] gives a detailed discussion of Assumption D. (b) In particular, Assumption D trivially holds when the state space S is a countable set (with the discrete topology). (c) When the state space S is not countable, Assumption D also holds if the sequence {h_{f_n}} is equicontinuous.

Assumption D′.

There exists a stationary policy f* ∈ F such that lim_{n→∞} f_n(x) = f*(x) for all x ∈ S.

Remark 3.7.

Assumption D′ is the same as hypothesis H2 in [14]. Obviously, Assumption D′ trivially holds when the state space S is a countable set (with the discrete topology) and A(x) is compact for all x ∈ S.

Finally, to conclude this section, we present a lemma (Lemma 3.8) that is needed in the proof of Theorem 4.2. For a proof, see [24, Proposition 12.2], for instance.

Lemma 3.8.

Suppose that A(x) is compact for all x ∈ S, and let {f_n} be a sequence of stationary policies in F. Then there exists a stationary policy f ∈ F such that f(x) ∈ A(x) is an accumulation point of {f_n(x)} for each x ∈ S.

4. Main Results

In this section we present our main results, Theorems 4.2 and 4.3. Before stating them, we first give the following proposition, which is needed in their proofs.

Proposition 4.1.

Suppose that Assumptions A, B, and C hold, and let f ∈ F be an arbitrary stationary policy. If f̅ ∈ F is any policy such that T h_f(x) = r(x,f̅) + ∫_S h_f(y) q(dy|x,f̅) for all x ∈ S, (4.1) then (a) g(f) ≤ g(f̅);

(b) if g(f) = g(f̅), then h_f(·) = h_{f̅}(·) + k for some constant k; (4.3)

(c) if f is average optimal, then h_f(·) = h*(·) + k for some constant k, (4.4) where h* is as in Theorem 3.4;

(d) if g(f) = g(f̅), then (g(f), h_f) satisfies the average reward optimality equation (3.10), and so f is average optimal.

Proof.

(a) Combining (3.7) and (4.1), we have g(f) ≤ r(x,f̅) + ∫_S h_f(y) q(dy|x,f̅), x ∈ S. (4.5) Integrating both sides of (4.5) with respect to μ_{f̅} and using Remark 3.2(d), we obtain the desired result.

(b) If g(f) = g(f̅), we may rewrite the Poisson equation for f̅ as

g(f) = r(x,f̅) + ∫_S h_{f̅}(y) q(dy|x,f̅), x ∈ S. (4.6) Then, combining (4.5) and (4.6), we obtain ∫_S [h_f(y) − h_{f̅}(y)] q(dy|x,f̅) ≥ 0, x ∈ S. (4.7) Thus, from (4.7) and the Dynkin formula we get E_x^{f̅}[h_f(x_t) − h_{f̅}(x_t)] ≥ h_f(x) − h_{f̅}(x), x ∈ S. (4.8) Letting t → ∞ in (4.8) and using Assumption C, we have μ_{f̅}(h_f − h_{f̅}) ≥ h_f(x) − h_{f̅}(x), x ∈ S. (4.9) Now take k := sup_{x∈S} [h_f(x) − h_{f̅}(x)]. Taking the supremum over x ∈ S in (4.9) gives μ_{f̅}(h_f − h_{f̅}) ≥ k, while μ_{f̅}(h_f − h_{f̅}) ≤ k by the definition of k; hence μ_{f̅}(h_f − h_{f̅}) = k, which implies h_f(·) = h_{f̅}(·) + k μ_{f̅}-a.e. (4.12) Hence, from Remark 3.2(e) and (4.12) we obtain (4.3).

(c) Since f is average optimal, by Definition 2.4 and Theorem 3.4(b) we have

g(f) = g* = sup_{π∈Π} V(x,π), x ∈ S. (4.13) Hence, the Poisson equation (3.7) for f becomes g* = r(x,f) + ∫_S h_f(y) q(dy|x,f), x ∈ S. (4.14) On the other hand, by (3.10) we obtain g* = max_{a∈A(x)} {r(x,a) + ∫_S h*(y) q(dy|x,a)} ≥ r(x,f) + ∫_S h*(y) q(dy|x,f), x ∈ S, (4.15) which together with (4.14) gives ∫_S [h_f(y) − h*(y)] q(dy|x,f) ≥ 0, x ∈ S. (4.16) Thus, as in the proof of part (b), from (4.16) we see that (4.4) holds with k := sup_{x∈S} [h_f(x) − h*(x)].

(d) By (3.7), (4.1), (4.3), and (Q3) we have

g(f) ≤ T h_f(x) = r(x,f̅) + ∫_S [h_{f̅}(y) + k] q(dy|x,f̅) = r(x,f̅) + ∫_S h_{f̅}(y) q(dy|x,f̅) = g(f̅) = g(f), x ∈ S, (4.17) which gives g(f) = T h_f(x) for all x ∈ S, that is, g(f) = max_{a∈A(x)} {r(x,a) + ∫_S h_f(y) q(dy|x,a)}, x ∈ S. (4.19) Thus, as in the proof of Theorem 4.1 in [14], from Lemma 2.3(b), (3.7), and (4.19) we conclude that f is average optimal, that is, g(f) = g*. Hence, we may rewrite (4.19) as g* = max_{a∈A(x)} {r(x,a) + ∫_S h_f(y) q(dy|x,a)}, x ∈ S. (4.20) Thus, from (4.20) and part (c) we obtain the desired conclusion.

Theorem 4.2.

Suppose that Assumptions A, B, C, and D hold. Then the policy iteration Algorithm A converges.

Proof.

From Lemma 3.3(a) we see that the function h_{f_n} in (3.13) belongs to B_{w1}(S), and so the function h in (3.19) also belongs to B_{w1}(S). Now let {h_{f_m}} be the subsequence of {h_{f_n}} in Assumption D. Then by Assumption D we have lim_{m→∞} h_{f_m}(x) = h(x) for all x ∈ S. (4.21) Moreover, from Lemma 3.8 there is a stationary policy f ∈ F such that f(x) ∈ A(x) is an accumulation point of {f_m(x)} for each x ∈ S; that is, for each x ∈ S there exists a subsequence {m_i} (depending on the state x) such that lim_{i→∞} f_{m_i}(x) = f(x). (4.22) Also, by (3.13) we get g(f_{m_i}) = r(x,f_{m_i}) + ∫_S h_{f_{m_i}}(y) q(dy|x,f_{m_i}), x ∈ S. (4.23) On the other hand, take any real-valued measurable function m on S such that m(x) > q(x) ≥ 0 for all x ∈ S, with q(x) as in (Q4). Then, for each x ∈ S and a ∈ A(x), by the properties (Q1)–(Q3) we can define P(·|x,a) as follows: P(D|x,a) := q(D|x,a)/m(x) + I_D(x), D ∈ ℬ(S). (4.24) Obviously, P(·|x,a) is a probability measure on S. Thus, combining (4.23) and (4.24), we have g(f_{m_i})/m(x) + h_{f_{m_i}}(x) = r(x,f_{m_i})/m(x) + ∫_S h_{f_{m_i}}(y) P(dy|x,f_{m_i}), x ∈ S. (4.25) Letting i → ∞ in (4.25), by (3.18), (4.21), and (4.22) as well as the "extension of Fatou's lemma" [8, Lemma 8.3.7], we obtain g̃ = r(x,f) + ∫_S h(y) q(dy|x,f), x ∈ S. (4.26) To complete the proof of Theorem 4.2, by Proposition 4.1(d) we only need to prove that g̃, h, and f satisfy the average reward optimality equations (3.10) and (3.11), that is, g̃ = Th(x) = r(x,f) + ∫_S h(y) q(dy|x,f), x ∈ S. (4.27) Obviously, from (4.26) and the definition of T in (3.12) we obtain g̃ ≤ Th(x), x ∈ S. It remains to prove the reverse inequality, that is, g̃ ≥ Th(x), x ∈ S. Obviously, by (3.19) we have lim_{i→∞} [h_{f_{m_i}}(x) − h_{f_{m_i − 1}}(x)] = 0, x ∈ S. (4.31) Moreover, from Lemma 3.3(a) again we see that there exists a constant k such that ‖h_{f_n}‖_{w1} ≤ k for all n ≥ 1, which gives ‖h_{f_{m_i}} − h_{f_{m_i − 1}}‖_{w1} ≤ ‖h_{f_{m_i}}‖_{w1} + ‖h_{f_{m_i − 1}}‖_{w1} ≤ 2k. (4.32) Thus, by (4.24), (4.31), (4.32), and the "extension of Fatou's lemma" [8, Lemma 8.3.7], we obtain lim_{i→∞} ∫_S [h_{f_{m_i}}(y) − h_{f_{m_i − 1}}(y)] P(dy|x,f_{m_i}) = 0, x ∈ S, (4.33) which implies lim_{i→∞} ∫_S [h_{f_{m_i}}(y) − h_{f_{m_i − 1}}(y)] q(dy|x,f_{m_i}) = 0, x ∈ S. (4.34)
Also, from (3.7), (3.16), and the definition of T in (3.12) we get g(f_{m_i}) = r(x,f_{m_i}) + ∫_S h_{f_{m_i}}(y) q(dy|x,f_{m_i}) = T h_{f_{m_i − 1}}(x) + ∫_S [h_{f_{m_i}}(y) − h_{f_{m_i − 1}}(y)] q(dy|x,f_{m_i}) ≥ r(x,a) + ∫_S h_{f_{m_i − 1}}(y) q(dy|x,a) + ∫_S [h_{f_{m_i}}(y) − h_{f_{m_i − 1}}(y)] q(dy|x,f_{m_i}), x ∈ S, a ∈ A(x). (4.35) Letting i → ∞ in (4.35), by (3.18), (4.21), (4.22), (4.34), and the "extension of Fatou's lemma" [8, Lemma 8.3.7], we obtain g̃ ≥ r(x,a) + ∫_S h(y) q(dy|x,a), x ∈ S, a ∈ A(x), (4.36) which gives g̃ ≥ max_{a∈A(x)} {r(x,a) + ∫_S h(y) q(dy|x,a)} = Th(x), x ∈ S. (4.37) This completes the proof of Theorem 4.2.
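The construction of P(·|x,a) from q(·|x,a) in the proof above is the standard uniformization device: for m(x) > q(x), P(D|x,a) := q(D|x,a)/m(x) + I_D(x) is a probability measure because q(S|x,a) = 0 and −q({x}|x,a) ≤ q(x) < m(x). A quick sketch with invented rates:

```python
# Uniformization: with m(x) > q(x), the kernel
# P(.|x,a) := q(.|x,a)/m(x) + delta_x(.) is a probability measure on S.
# Rates below are invented for illustration.

q_row = {"x": -2.0, "y": 1.5, "z": 0.5}   # q({.}|x,a); here q(x) = 2.0
m_x = 3.0                                  # any m(x) > q(x)

P_row = {s: rate / m_x + (1.0 if s == "x" else 0.0)
         for s, rate in q_row.items()}

total = sum(P_row.values())                # = 1 because q(S|x,a) = 0
nonneg = all(p >= 0.0 for p in P_row.values())
```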

Theorem 4.3.

Suppose that Assumptions A, B, C, and D′ hold. Then the policy iteration Algorithm A converges.

Proof.

To prove Theorem 4.3, by the proof of Theorem 4.2 we only need to verify that (4.26) and (4.27) hold for f* as in Assumption D′ and some function h in B_{w1}(S). To do this, we first define two functions h̄ and h̲ in B_{w1}(S) as follows: h̄(x) := limsup_{n→∞} h_{f_n}(x), h̲(x) := liminf_{n→∞} h_{f_n}(x), x ∈ S. (4.38) Then by (3.7) we get g(f_n) = r(x,f_n) + ∫_S h_{f_n}(y) q(dy|x,f_n), x ∈ S, (4.39) which together with (4.24) yields g(f_n)/m(x) + h_{f_n}(x) = r(x,f_n)/m(x) + ∫_S h_{f_n}(y) P(dy|x,f_n), x ∈ S. (4.40) Applying the "extension of Fatou's lemma" [8, Lemma 8.3.7] and letting n → ∞ in (4.40), by (3.18), (4.38), and Assumption D′ we obtain g̃/m(x) + h̄(x) ≤ r(x,f*)/m(x) + ∫_S h̄(y) P(dy|x,f*) and g̃/m(x) + h̲(x) ≥ r(x,f*)/m(x) + ∫_S h̲(y) P(dy|x,f*), x ∈ S, (4.41) which implies g̃ ≤ r(x,f*) + ∫_S h̄(y) q(dy|x,f*), x ∈ S, (4.42) and g̃ ≥ r(x,f*) + ∫_S h̲(y) q(dy|x,f*), x ∈ S. (4.43) Thus, combining (4.42) and (4.43), we get ∫_S [h̄(y) − h̲(y)] q(dy|x,f*) ≥ 0, x ∈ S. (4.44) Then, from the proof of Proposition 4.1(b) and (4.44) we have h̄(·) = h̲(·) + k″ for some constant k″, (4.45) which together with (4.42), (4.43), and the definition of T in (3.12) gives g̃ = r(x,f*) + ∫_S h̲(y) q(dy|x,f*) ≤ T h̲(x), x ∈ S. (4.46) It remains to prove the reverse inequality, that is, g̃ ≥ T h̲(x) = max_{a∈A(x)} {r(x,a) + ∫_S h̲(y) q(dy|x,a)}, x ∈ S. (4.47) Obviously, by (3.16) and (4.24) we get r(x,f_{n+1})/m(x) + ∫_S h_{f_n}(y) P(dy|x,f_{n+1}) ≥ r(x,a)/m(x) + ∫_S h_{f_n}(y) P(dy|x,a), x ∈ S, a ∈ A(x). (4.48) Then, letting n → ∞ in (4.48), by (4.38), Assumption D′, and the "extension of Fatou's lemma" [8, Lemma 8.3.7], we obtain r(x,f*)/m(x) + ∫_S h̲(y) P(dy|x,f*) ≥ r(x,a)/m(x) + ∫_S h̲(y) P(dy|x,a), x ∈ S, a ∈ A(x), (4.49) which implies r(x,f*) + ∫_S h̲(y) q(dy|x,f*) ≥ r(x,a) + ∫_S h̲(y) q(dy|x,a), x ∈ S, a ∈ A(x), (4.50) and so r(x,f*) + ∫_S h̲(y) q(dy|x,f*) ≥ max_{a∈A(x)} {r(x,a) + ∫_S h̲(y) q(dy|x,a)}, x ∈ S. (4.51) Thus, combining (4.46) and (4.51), we see that (4.47) holds, and so Theorem 4.3 follows.

5. Concluding Remarks

In the previous sections we have studied the policy iteration algorithm (PIA) for average reward continuous-time jump MDPs in Polish spaces. Under two slightly different sets of conditions we have shown that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation. It should be mentioned that the approach presented here differs from the policy iteration approach used in [14], because the PIA in this paper provides a way to compute or at least approximate (when the PIA takes infinitely many steps to converge) the value of the average optimal reward function and an average optimal stationary policy.

Acknowledgments

The authors would like to thank the editor and the anonymous referees for their valuable comments and suggestions, which have helped us to improve the paper. This work was jointly supported by the National Natural Science Foundation of China (10801056), the Natural Science Foundation of Ningbo (201001A6011005), the Scientific Research Fund of Zhejiang Provincial Education Department, the K. C. Wong Magna Fund in Ningbo University, the Natural Science Foundation of Yunnan Provincial Education Department (07Y10085), the Natural Science Foundation of Yunnan Province (2008CD186), and the Foundation of the Chinese Society for Electrical Engineering (2008).

References

1. R. A. Howard, Dynamic Programming and Markov Processes, The Technology Press of M.I.T., Cambridge, Mass, USA, 1960.
2. R. Dekker, "Counter examples for compact action Markov decision chains with average reward criteria," Communications in Statistics, vol. 3, no. 3, pp. 357–368, 1987.
3. M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York, NY, USA, 1994.
4. P. J. Schweitzer, "On undiscounted Markovian decision processes with compact action spaces," RAIRO—Operations Research, vol. 19, no. 1, pp. 71–86, 1985.
5. E. V. Denardo and B. L. Fox, "Multichain Markov renewal programs," SIAM Journal on Applied Mathematics, vol. 16, pp. 468–487, 1968.
6. X. P. Guo and O. Hernández-Lerma, "Drift and monotonicity conditions for continuous-time controlled Markov chains with an average criterion," IEEE Transactions on Automatic Control, vol. 48, no. 2, pp. 236–245, 2003.
7. X. P. Guo and X. R. Cao, "Optimal control of ergodic continuous-time Markov chains with average sample-path rewards," SIAM Journal on Control and Optimization, vol. 44, no. 1, pp. 29–48, 2005.
8. O. Hernández-Lerma and J. B. Lasserre, Further Topics on Discrete-Time Markov Control Processes, vol. 42 of Applications of Mathematics, Springer, New York, NY, USA, 1999.
9. O. Hernández-Lerma and J. B. Lasserre, "Policy iteration for average cost Markov control processes on Borel spaces," Acta Applicandae Mathematicae, vol. 47, no. 2, pp. 125–154, 1997.
10. A. Hordijk and M. L. Puterman, "On the convergence of policy iteration in finite state undiscounted Markov decision processes: the unichain case," Mathematics of Operations Research, vol. 12, no. 1, pp. 163–176, 1987.
11. J. B. Lasserre, "A new policy iteration scheme for Markov decision processes using Schweitzer's formula," Journal of Applied Probability, vol. 31, no. 1, pp. 268–273, 1994.
12. S. P. Meyn, "The policy iteration algorithm for average reward Markov decision processes with general state space," IEEE Transactions on Automatic Control, vol. 42, no. 12, pp. 1663–1680, 1997.
13. M. S. Santos and J. Rust, "Convergence properties of policy iteration," SIAM Journal on Control and Optimization, vol. 42, no. 6, pp. 2094–2115, 2004.
14. Q. X. Zhu, "Average optimality for continuous-time Markov decision processes with a policy iteration approach," Journal of Mathematical Analysis and Applications, vol. 339, no. 1, pp. 691–704, 2008.
15. A. Y. Golubin, "A note on the convergence of policy iteration in Markov decision processes with compact action spaces," Mathematics of Operations Research, vol. 28, no. 1, pp. 194–200, 2003.
16. X. P. Guo and U. Rieder, "Average optimality for continuous-time Markov decision processes in Polish spaces," The Annals of Applied Probability, vol. 16, no. 2, pp. 730–756, 2006.
17. Q. X. Zhu, "Average optimality inequality for continuous-time Markov decision processes in Polish spaces," Mathematical Methods of Operations Research, vol. 66, no. 2, pp. 299–313, 2007.
18. Q. X. Zhu and T. Prieto-Rumeau, "Bias and overtaking optimality for continuous-time jump Markov decision processes in Polish spaces," Journal of Applied Probability, vol. 45, no. 2, pp. 417–429, 2008.
19. R. B. Lund, S. P. Meyn, and R. L. Tweedie, "Computable exponential convergence rates for stochastically ordered Markov processes," The Annals of Applied Probability, vol. 6, no. 1, pp. 218–237, 1996.
20. I. I. Gīhman and A. V. Skorohod, Controlled Stochastic Processes, Springer, New York, NY, USA, 1979.
21. Q. X. Zhu and X. P. Guo, "Markov decision processes with variance minimization: a new condition and approach," Stochastic Analysis and Applications, vol. 25, no. 3, pp. 577–592, 2007.
22. Q. X. Zhu and X. P. Guo, "Another set of conditions for Markov decision processes with average sample-path costs," Journal of Mathematical Analysis and Applications, vol. 322, no. 2, pp. 1199–1214, 2006.
23. Q. X. Zhu and X. P. Guo, "Another set of conditions for strong n (n = −1, 0) discount optimality in Markov decision processes," Stochastic Analysis and Applications, vol. 23, no. 5, pp. 953–974, 2005.
24. M. Schäl, "Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 32, no. 3, pp. 179–196, 1975.