Policy Iteration for Continuous-Time Average Reward Markov Decision Processes in Polish Spaces

Introduction
In this paper we study the average reward optimality problem for continuous-time jump Markov decision processes (MDPs) in general state and action spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. Our approach to this problem is the well-known policy iteration algorithm (PIA), also known as Howard's policy improvement algorithm.
As is well known, the PIA was originally introduced by Howard (1960) in [1] for finite MDPs (i.e., the state and action spaces are both finite). By using the monotonicity of the sequence of iterated average rewards, he showed that the PIA converges in a finite number of steps. However, when the state space is not finite, there are well-known counterexamples showing that the PIA need not converge even though the action space is compact (see, e.g., [2-4]).
Thus, an interesting problem is to find conditions that ensure that the PIA converges. An extensive literature addresses this question; see, for instance, [1, 5-14] and the references therein. However, most of these references concentrate on discrete-time MDPs; see, for instance, [1, 5, 11] for finite discrete-time MDPs, [10, 15] for discrete-time MDPs with a finite state space and a compact action set, [13] for denumerable discrete-time MDPs, and [8, 9, 12] for discrete-time MDPs in Borel spaces. For continuous-time models, to the best of our knowledge, only Guo and Hernández-Lerma [6], Guo and Cao [7], and Zhu [14] have addressed this issue. In [6, 7, 14], the authors established the average reward optimality equation and the existence of average optimal stationary policies. However, the treatments in [6, 7] are restricted to a denumerable state space. In [14] we used the policy iteration approach to study the average reward optimality problem for continuous-time jump MDPs in general state and action spaces. One of the main contributions of [14] is to establish the average reward optimality equation and the existence of average optimal stationary policies. But the PIA is not stated explicitly in [14], and so neither the value of the average optimal reward value function nor an average optimal stationary policy is computed there.
In this paper we further study the average reward optimality problem for this class of continuous-time jump MDPs in general state and action spaces. Our main objective is to use the PIA to compute (or at least approximate, when the PIA takes infinitely many steps to converge) the value of the average optimal reward value function and an average optimal stationary policy. To do this, we first use the so-called "drift" condition, the standard continuity-compactness hypotheses, and the irreducible and uniform exponential ergodicity condition to establish the average reward optimality equation and present the PIA. Then, under two different extra conditions, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation. A key feature of this paper is that the PIA provides a way to compute, or at least approximate, these quantities.
The remainder of this paper is organized as follows. In Section 2, we introduce the control model and the optimal control problem that we are concerned with. After stating our optimality conditions, some technical preliminaries, and the PIA in Section 3, we show in Section 4 that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation. Finally, we conclude in Section 5 with some general remarks.
Notation 1. If $X$ is a Polish space (i.e., a complete and separable metric space), we denote by $\mathcal{B}(X)$ its Borel σ-algebra.

The Optimal Control Problem
The material in this section is quite standard (see, e.g., [14, 16, 17]), and we shall state it briefly. The control model that we are interested in is a continuous-time jump MDP of the form
$$\{S,\ (A(x) \subset A),\ q(\cdot \mid x, a),\ r(x, a)\}, \tag{2.1}$$
where one has the following.
(i) $S$ is a state space, and it is supposed to be a Polish space.

Abstract and Applied Analysis 3
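(ii) $A$ is an action space, which is also supposed to be a Polish space, and $A(x)$ is a Borel set which denotes the set of available actions at state $x \in S$. The set $K := \{(x, a) : x \in S,\ a \in A(x)\}$ is assumed to be a Borel subset of $S \times A$.
(iii) $q(\cdot \mid x, a)$ denotes the transition rates, and they are supposed to satisfy the following properties: for each $(x, a) \in K$ and $D \in \mathcal{B}(S)$,
(Q1) $D \mapsto q(D \mid x, a)$ is a signed measure on $\mathcal{B}(S)$, and $(x, a) \mapsto q(D \mid x, a)$ is Borel measurable on $K$;
(Q2) $0 \le q(D \mid x, a) < \infty$ for all $x \notin D \in \mathcal{B}(S)$;
(Q3) $q(S \mid x, a) = 0$ and $0 \le -q(\{x\} \mid x, a) < \infty$;
(Q4) $q(x) := \sup_{a \in A(x)} (-q(\{x\} \mid x, a)) < \infty$ for all $x \in S$.
It should be noted that property (Q3) shows that the model is conservative, and property (Q4) implies that the model is stable.
(iv) $r(x, a)$ denotes the reward rate, and it is assumed to be measurable on $K$. As $r(x, a)$ is allowed to take positive and negative values, it can also be interpreted as a cost rate.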
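For intuition, properties (Q2)-(Q4) can be checked directly when the state and action spaces are finite. The sketch below is only an illustration under that assumption (the paper works in Polish spaces); the array layout `q[a, x, y]`, the function name `check_rates`, and the two-state example are hypothetical.

```python
import numpy as np

def check_rates(q, tol=1e-12):
    """Check (Q2)-(Q4) for a finite model; q[a, x, y] is the rate from x to y under action a."""
    q = np.asarray(q, dtype=float)
    n = q.shape[1]
    # (Q2): rates to sets not containing x are nonnegative, i.e. off-diagonal entries >= 0.
    off_diagonal = q[:, ~np.eye(n, dtype=bool)]
    nonnegative = bool(np.all(off_diagonal >= 0))
    # (Q3): conservative model, every row sums to zero.
    conservative = bool(np.allclose(q.sum(axis=2), 0.0, atol=tol))
    # (Q4): q(x) = sup_a (-q({x} | x, a)) must be finite for each state.
    idx = np.arange(n)
    exit_rates = (-q[:, idx, idx]).max(axis=0)
    stable = bool(np.all(np.isfinite(exit_rates)))
    return nonnegative, conservative, stable, exit_rates

# Hypothetical two-state, two-action example.
q = [[[-1.0, 1.0], [2.0, -2.0]],
     [[-3.0, 3.0], [0.5, -0.5]]]
nonnegative, conservative, stable, exit_rates = check_rates(q)
```

In a finite model stability is automatic; (Q4) only becomes a real restriction when the supremum over $A(x)$ is taken over an infinite action set.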
To introduce the optimal control problem that we are interested in, we need to introduce the classes of admissible control policies.
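Let $\Pi_m$ be the family of functions $\pi_t(B \mid x)$ such that
(i) for each $x \in S$ and $t \ge 0$, $B \mapsto \pi_t(B \mid x)$ is a probability measure on $\mathcal{B}(A(x))$;
(ii) for each $x \in S$ and $B \in \mathcal{B}(A(x))$, $t \mapsto \pi_t(B \mid x)$ is a Borel measurable function on $[0, \infty)$.
Definition 2.1. A family $\pi = (\pi_t, t \ge 0) \in \Pi_m$ is said to be a randomized Markov policy. In particular, if there exists a measurable function $f$ on $S$ with $f(x) \in A(x)$ for all $x \in S$, such that $\pi_t(\{f(x)\} \mid x) \equiv 1$ for all $t \ge 0$ and $x \in S$, then $\pi$ is called a deterministic stationary policy and it is identified with $f$. The set of all stationary policies is denoted by $F$.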
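For each $\pi = (\pi_t, t \ge 0) \in \Pi_m$, we define the associated transition rates $q(D \mid x, \pi_t)$ and the reward rates $r(x, \pi_t)$, respectively, as follows. For each $x \in S$, $D \in \mathcal{B}(S)$ and $t \ge 0$,
$$q(D \mid x, \pi_t) := \int_{A(x)} q(D \mid x, a)\, \pi_t(da \mid x), \qquad r(x, \pi_t) := \int_{A(x)} r(x, a)\, \pi_t(da \mid x). \tag{2.2}$$
In particular, we will write $q(D \mid x, \pi_t)$ and $r(x, \pi_t)$ as $q(D \mid x, f)$ and $r(x, f)$, respectively, when $\pi := f \in F$.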
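In a finite model the integrals in (2.2) reduce to weighted sums over actions. The following sketch illustrates this for a time-independent randomized policy; the shapes `q[a, x, y]`, `r[x, a]`, `pi[x, a]` and the function name `averaged_rates` are assumptions made for the illustration.

```python
import numpy as np

def averaged_rates(q, r, pi):
    """Finite-model analogue of (2.2): average q and r over the action distribution pi[x, :]."""
    q = np.asarray(q, dtype=float)    # q[a, x, y]
    r = np.asarray(r, dtype=float)    # r[x, a]
    pi = np.asarray(pi, dtype=float)  # pi[x, a], each row sums to one
    q_pi = np.einsum("xa,axy->xy", pi, q)  # q(y | x, pi) = sum_a q(y | x, a) pi(a | x)
    r_pi = (pi * r).sum(axis=1)            # r(x, pi)     = sum_a r(x, a) pi(a | x)
    return q_pi, r_pi

# Hypothetical data: state 0 mixes both actions equally, state 1 always uses action 0.
q = [[[-1.0, 1.0], [2.0, -2.0]],
     [[-3.0, 3.0], [0.5, -0.5]]]
r = [[1.0, 0.0], [0.0, 2.0]]
pi = [[0.5, 0.5], [1.0, 0.0]]
q_pi, r_pi = averaged_rates(q, r, pi)
```

A deterministic stationary policy $f$ corresponds to a `pi` whose rows are unit vectors, in which case `q_pi` simply selects row $x$ of the matrix for action $f(x)$.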

Definition 2.2. A randomized Markov policy is said to be admissible if $q(D \mid x, \pi_t)$ is continuous in $t \ge 0$, for all $D \in \mathcal{B}(S)$ and $x \in S$.
The family of all such policies is denoted by $\Pi$. Obviously, $\Pi \supseteq F$, so $\Pi$ is nonempty. Moreover, for each $\pi \in \Pi$, Lemma 2.1 in [16] ensures that there exists a Q-process, that is, a possibly substochastic and nonhomogeneous transition function $P^\pi(s, x, t, D)$ with transition rates $q(D \mid x, \pi_t)$. As is well known, such a Q-process is not necessarily regular; that is, we might have $P^\pi(s, x, t, S) < 1$ for some state $x \in S$ and $t \ge s \ge 0$. To ensure the regularity of a Q-process, we shall use the following so-called "drift" condition, which is taken from [14, 16-18].
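Assumption A. There exist a measurable function $w_1 \ge 1$ on $S$ and constants $b_1 \ge 0$, $c_1 > 0$, $M_1 > 0$ and $M > 0$ such that
(1) $\int_S w_1(y)\, q(dy \mid x, a) \le -c_1 w_1(x) + b_1$ for all $(x, a) \in K$;
(2) $q(x) \le M_1 w_1(x)$ for all $x \in S$, with $q(x)$ as in (Q4);
(3) $|r(x, a)| \le M w_1(x)$ for all $(x, a) \in K$.
Remark 2.1 in [16] gives a discussion of Assumption A. In fact, Assumption A(1) is similar to conditions in the previous literature (see, e.g., [19, equation (2.4)]), and together with Assumption A(3) it is used to ensure the finiteness of the average expected reward criterion (2.5) below. In particular, Assumption A(2) is not required when the transition rate is uniformly bounded, that is, $\sup_{x \in S} q(x) < \infty$.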
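To see what Assumption A(1) asks for, it can help to test the drift inequality numerically on a small finite model. The sketch below assumes finite state and action sets and a candidate pair $(w_1, c_1)$; it returns the smallest $b_1 \ge 0$ for which the inequality holds at every state-action pair. The function name `smallest_b1` and the three-state chain are hypothetical.

```python
import numpy as np

def smallest_b1(q, w, c1):
    """Smallest b1 >= 0 with sum_y w(y) q(y | x, a) <= -c1 * w(x) + b1 for all (x, a)."""
    q = np.asarray(q, dtype=float)  # q[a, x, y]
    w = np.asarray(w, dtype=float)  # candidate weight function, w >= 1
    drift = q @ w                   # drift[a, x] = sum_y w(y) q(y | x, a)
    return max(0.0, float(np.max(drift + c1 * w)))

# Hypothetical three-state birth-death chain with a single action and w1(x) = 1 + x.
q = [[[-1.0, 1.0, 0.0],
      [2.0, -3.0, 1.0],
      [0.0, 4.0, -4.0]]]
w1 = [1.0, 2.0, 3.0]
b1 = smallest_b1(q, w1, c1=1.0)
```

On a finite state space such a $b_1$ always exists; the condition becomes restrictive only when $w_1$ is unbounded, where it forces the process to drift back toward states with small $w_1$.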
For each initial state $x \in S$ at time $s \ge 0$ and $\pi \in \Pi$, we denote by $P^\pi_{s,x}$ and $E^\pi_{s,x}$ the probability measure determined by $P^\pi(s, x, t, D)$ and the corresponding expectation operator, respectively. Thus, for each $\pi \in \Pi$, by [20, pages 107-109] there exists a Borel measurable Markov process $\{x^\pi_t\}$ (we shall denote $\{x^\pi_t\}$ by $\{x_t\}$ for simplicity when there is no risk of confusion) with values in $S$ and the transition function $P^\pi(s, x, t, D)$, which is completely determined by the transition rates $q(D \mid x, \pi_t)$. In particular, if $s = 0$, we write $E^\pi_{0,x}$ and $P^\pi_{0,x}$ as $E^\pi_x$ and $P^\pi_x$, respectively.
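If Assumption A holds, then from [17, Lemma 3.1] we have the following facts.
Lemma 2.3. Suppose that Assumption A holds. Then the following statements hold.
(a) For each $x \in S$, $\pi \in \Pi$ and $t \ge 0$,
$$E^\pi_x w_1(x_t) \le e^{-c_1 t} w_1(x) + \frac{b_1}{c_1}, \tag{2.3}$$
where the function $w_1$ and the constants $b_1$ and $c_1$ are as in Assumption A.
(b) For each $u \in B_{w_1}(S)$, $x \in S$ and $\pi \in \Pi$,
$$\lim_{t \to \infty} \frac{E^\pi_x u(x_t)}{t} = 0. \tag{2.4}$$
For each $x \in S$ and $\pi \in \Pi$, the expected average reward $V(x, \pi)$ and the corresponding optimal reward value function $V^*(x)$ are defined as
$$V(x, \pi) := \liminf_{T \to \infty} \frac{\int_0^T E^\pi_x r(x_t, \pi_t)\, dt}{T}, \qquad V^*(x) := \sup_{\pi \in \Pi} V(x, \pi). \tag{2.5}$$
As a consequence of Assumption A(3) and Lemma 2.3(a), the expected average reward $V(x, \pi)$ is well defined.
Definition 2.4. A policy $\pi^* \in \Pi$ is said to be average optimal if $V(x, \pi^*) = V^*(x)$ for all $x \in S$.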
The main goal of this paper is to give conditions for ensuring that the policy iteration algorithm converges.
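As a concrete instance of the criterion (2.5), consider a finite, irreducible chain under a fixed stationary policy. In that special case the time average converges to the reward integrated against the invariant distribution, so the average reward can be computed by solving a linear system. The sketch below assumes this finite setting; the names `stationary_distribution` and `average_reward` and the two-state data are hypothetical.

```python
import numpy as np

def stationary_distribution(Q):
    """Solve mu Q = 0 with sum(mu) = 1 for an irreducible finite generator Q."""
    Q = np.asarray(Q, dtype=float)
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])  # stack the balance equations and the normalization
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def average_reward(Q, r):
    """Long-run average reward of the chain with generator Q and reward rates r."""
    return float(stationary_distribution(Q) @ np.asarray(r, dtype=float))

# Hypothetical two-state chain: leave state 0 at rate 1, leave state 1 at rate 2.
Q = [[-1.0, 1.0], [2.0, -2.0]]
r = [1.0, 0.0]
g = average_reward(Q, r)
```

Here the chain spends two thirds of its time in state 0, so the average reward does not depend on the initial state.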

Optimality Conditions and Preliminaries
In this section we state conditions for ensuring that the policy iteration algorithm (PIA) converges and give some preliminary lemmas that are needed to prove our main results.
To guarantee that the PIA converges, we need to establish the average reward optimality equation. To do this, in addition to Assumption A, we also need two more assumptions. The first one is the following set of so-called standard continuity-compactness hypotheses, which is taken from [14, 16-18]. Moreover, it is similar to the version for discrete-time MDPs; see, for instance, [3, 8, 21-23] and their references. In particular, Assumption B(3) is not required when the transition rate is uniformly bounded, since it is only used to justify the application of the Dynkin formula.
Assumption B. For each $x \in S$,
(1) $A(x)$ is compact;
(2) $r(x, a)$ is continuous in $a \in A(x)$, and the function $\int_S u(y)\, q(dy \mid x, a)$ is continuous in $a \in A(x)$ for each bounded measurable function $u$ on $S$, and also for $u := w_1$ as in Assumption A;
(3) there exist a nonnegative measurable function $w_2$ on $S$ and constants $b_2 \ge 0$, $c_2 > 0$ and $M_2 > 0$ such that
$$q(x) w_1(x) \le M_2 w_2(x), \qquad \int_S w_2(y)\, q(dy \mid x, a) \le c_2 w_2(x) + b_2 \tag{3.1}$$
for all $(x, a) \in K$.
The second one is the irreducible and uniform exponential ergodicity condition. To state this condition, we need to introduce the concept of the weighted norm used in [8, 14, 22]. For the function $w_1 \ge 1$ in Assumption A, we define the weighted supremum norm $\|\cdot\|_{w_1}$ for real-valued functions $u$ on $S$ by
$$\|u\|_{w_1} := \sup_{x \in S} \left[ w_1(x)^{-1} |u(x)| \right] \tag{3.2}$$
and the Banach space
$$B_{w_1}(S) := \{ u : \|u\|_{w_1} < \infty \}. \tag{3.3}$$
Definition 3.1. For each $f \in F$, the Markov process $\{x_t\}$, with transition rates $q(\cdot \mid x, f)$, is said to be uniform $w_1$-exponentially ergodic if there exists an invariant probability measure $\mu_f$ on $S$ such that
$$\sup_{f \in F} \left| E^f_x u(x_t) - \mu_f(u) \right| \le R e^{-\rho t} \|u\|_{w_1} w_1(x) \tag{3.4}$$
for all $x \in S$, $u \in B_{w_1}(S)$ and $t \ge 0$, where the positive constants $R$ and $\rho$ do not depend on $f$, and where $\mu_f(u) := \int_S u(y)\, \mu_f(dy)$.
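The weighted supremum norm (3.2) is straightforward to evaluate when $S$ is finite. The following sketch assumes functions on $S$ are stored as arrays indexed by state; the name `weighted_sup_norm` and the sample values are hypothetical.

```python
import numpy as np

def weighted_sup_norm(u, w):
    """Finite-state version of (3.2): max over x of |u(x)| / w(x), with w >= 1."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    if np.any(w < 1.0):
        raise ValueError("the weight function must satisfy w >= 1")
    return float(np.max(np.abs(u) / w))

# Hypothetical function u on three states with weight w1(x) = 1 + x.
norm_u = weighted_sup_norm([0.0, -4.0, 9.0], [1.0, 2.0, 3.0])
```

With $w_1 \equiv 1$ this reduces to the usual supremum norm; an unbounded $w_1$ lets $B_{w_1}(S)$ contain unbounded reward functions, which is why Assumption A(3) is stated in terms of $w_1$.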
Assumption C. For each $f \in F$, the Markov process $\{x_t\}$, with transition rates $q(\cdot \mid x, f)$, is uniform $w_1$-exponentially ergodic and $\lambda$-irreducible, where $\lambda$ is a nontrivial σ-finite measure on $\mathcal{B}(S)$ independent of $f$.
Remark 3.2(e). As in [9], for any given stationary policy $f \in F$, we shall also consider two functions in $B_{w_1}(S)$ to be equivalent, and we do not distinguish between equivalent functions, if they are equal $\mu_f$-almost everywhere (a.e.). In particular, if $u(x) = 0$ $\mu_f$-a.e. holds for all $x \in S$, then the function $u$ will be taken to be identically zero.
Under Assumptions A, B, and C, we can obtain several lemmas, which are needed to prove our main results.
Lemma 3.3. Suppose that Assumptions A, B, and C hold, and let $f \in F$ be any stationary policy. Then one has the following facts.
(a) For each $x \in S$, the function
$$h_f(x) := \int_0^\infty \left[ E^f_x\big(r(x_t, f)\big) - g(f) \right] dt \tag{3.6}$$
belongs to $B_{w_1}(S)$, where $g(f) := \int_S r(y, f)\, \mu_f(dy)$ and $w_1$ is as in Assumption A.
(b) $(g(f), h_f)$ satisfies the Poisson equation
$$g(f) = r(x, f) + \int_S h_f(y)\, q(dy \mid x, f) \quad \forall x \in S, \tag{3.7}$$
for which the $\mu_f$-expectation of $h_f$ is zero, that is,
$$\mu_f(h_f) := \int_S h_f(y)\, \mu_f(dy) = 0.$$
Proof. The proofs of parts (a) and (b) are from [14, Lemma 3.2]. We now prove (c). In fact, part (c) follows from the definition of $V(x, f)$ in (2.5), Assumption A(3), and Lemma 2.3(a). Finally, we verify part (d). By Assumption A(3) and Assumption C we easily obtain $g(f) = V(x, f)$ for all $x \in S$, which together with part (c) yields the desired result.
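The proof of part (c) cites (2.5), Assumption A(3), and Lemma 2.3(a). As a sketch of the kind of estimate these three ingredients give (this is an assumption about the intended bound, not a quotation of the lemma), one can bound the average reward of a stationary policy $f$ as follows:

```latex
\begin{aligned}
|V(x,f)|
 &\le \limsup_{T\to\infty} \frac{1}{T}\int_0^T E^f_x\,|r(x_t,f)|\,dt
  \le \limsup_{T\to\infty} \frac{M}{T}\int_0^T E^f_x\, w_1(x_t)\,dt \\
 &\le \limsup_{T\to\infty} \frac{M}{T}\int_0^T \Big( e^{-c_1 t} w_1(x) + \frac{b_1}{c_1} \Big)\,dt
  = \limsup_{T\to\infty} \Big[ \frac{M\, w_1(x)\,(1 - e^{-c_1 T})}{c_1 T} + \frac{M b_1}{c_1} \Big]
  = \frac{M b_1}{c_1}.
\end{aligned}
```

The second inequality uses $|r(x,a)| \le M w_1(x)$ and the third uses (2.3); the resulting bound does not depend on $x$ or on $f$, which is what makes the sequence of average rewards produced by policy iteration bounded.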
The next result establishes the average reward optimality equation. For the proof, see [14, Theorem 4.1].

Theorem 3.4. Under Assumptions A, B, and C, the following statements hold.
(a) There exist a unique constant $g^*$, a function $h^* \in B_{w_1}(S)$, and a stationary policy $f^* \in F$ satisfying the average reward optimality equation (3.10)-(3.11).
(c) Any stationary policy $f \in F$ realizing the maximum of (3.10) is average optimal, and so $f^*$ in (3.11) is average optimal.
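Next, under Assumptions A, B, and C, we present the PIA that we are concerned with. To do this, we first give the following definition.
For any real-valued function $u$ on $S$, we define the dynamic programming operator $T$ as follows:
$$Tu(x) := \max_{a \in A(x)} \left\{ r(x, a) + \int_S u(y)\, q(dy \mid x, a) \right\} \quad \forall x \in S. \tag{3.12}$$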
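Algorithm A (policy iteration).
Step 1 (initialization). Take $n = 0$ and choose a stationary policy $f_n \in F$.
Step 2 (policy evaluation). Find a constant $g(f_n)$ and a real-valued function $h_{f_n}$ on $S$ satisfying the Poisson equation (3.7), that is,
$$g(f_n) = r(x, f_n) + \int_S h_{f_n}(y)\, q(dy \mid x, f_n) \quad \forall x \in S. \tag{3.13}$$
Obviously, by (3.12) and (3.13) we have
$$g(f_n) \le Th_{f_n}(x) = \max_{a \in A(x)} \left\{ r(x, a) + \int_S h_{f_n}(y)\, q(dy \mid x, a) \right\} \quad \forall x \in S. \tag{3.14}$$
Step 3 (policy improvement). Set $f_{n+1}(x) := f_n(x)$ for all $x \in S$ for which
$$r(x, f_n) + \int_S h_{f_n}(y)\, q(dy \mid x, f_n) = Th_{f_n}(x); \tag{3.15}$$
otherwise (i.e., when (3.15) does not hold), choose $f_{n+1}(x) \in A(x)$ such that
$$r(x, f_{n+1}) + \int_S h_{f_n}(y)\, q(dy \mid x, f_{n+1}) = Th_{f_n}(x). \tag{3.16}$$
Step 4. If $f_{n+1}$ satisfies (3.15) for all $x \in S$, then stop (because, from Proposition 4.1 below, $f_{n+1}$ is average optimal); otherwise, replace $f_n$ with $f_{n+1}$ and go back to Step 2.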
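The four steps of Algorithm A can be sketched for a finite state and action space, where the Poisson equation becomes a linear system and the maximum in (3.12) is over finitely many actions. This is only an illustration under that finiteness assumption and under the assumption that every stationary policy induces an irreducible chain. The names `evaluate` and `policy_iteration`, the array layout, and the example data are hypothetical. The sketch pins $h(0) = 0$ instead of normalizing $\mu_f(h_f) = 0$; the two solutions differ by a constant, which does not affect the improvement step because each row of $q$ sums to zero.

```python
import numpy as np

def evaluate(Qf, rf):
    """Step 2: solve g = rf(x) + sum_y h(y) Qf[x, y] for (g, h) with h(0) pinned to 0."""
    n = len(rf)
    M = np.empty((n, n))
    M[:, 0] = -1.0        # coefficient of the unknown g
    M[:, 1:] = Qf[:, 1:]  # coefficients of h(1), ..., h(n-1)
    z = np.linalg.solve(M, -rf)
    return z[0], np.concatenate(([0.0], z[1:]))

def policy_iteration(q, r, f, tol=1e-10, max_iter=100):
    """Finite-model sketch of Algorithm A; q[a, x, y] are rates, r[x, a] reward rates."""
    q = np.asarray(q, dtype=float)
    r = np.asarray(r, dtype=float)
    f = np.array(f)
    states = np.arange(r.shape[0])
    for _ in range(max_iter):
        g, h = evaluate(q[f, states, :], r[states, f])  # policy evaluation, cf. (3.13)
        values = r + np.einsum("axy,y->xa", q, h)       # r(x, a) + sum_y h(y) q(y | x, a)
        Th = values.max(axis=1)                         # operator T, cf. (3.12)
        improvable = values[states, f] < Th - tol       # states where (3.15) fails
        if not improvable.any():                        # Step 4: stop
            return g, h, f
        # Step 3: keep f_n(x) where (3.15) holds, otherwise pick a maximizer as in (3.16).
        f = np.where(improvable, values.argmax(axis=1), f)
    raise RuntimeError("policy iteration did not stop within max_iter iterations")

# Hypothetical two-state, two-action model, started from the policy f_0 = (0, 0).
q = [[[-1.0, 1.0], [2.0, -2.0]],
     [[-3.0, 3.0], [0.5, -0.5]]]
r = [[1.0, 0.0], [0.0, 2.0]]
g_opt, h_opt, f_opt = policy_iteration(q, r, [0, 0])
```

In this example the average rewards of the successive policies are $2/3$, $5/3$, and $12/7$, so the sequence $\{g(f_n)\}$ is nondecreasing and the iteration stops after three evaluations.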
Definition 3.5. The policy iteration Algorithm A is said to converge if the sequence $\{g(f_n)\}$ converges to the average optimal reward value function in (2.5), that is,
$$\lim_{n \to \infty} g(f_n) = g^*, \tag{3.17}$$
where $g^*$ is as in Theorem 3.4.
Under Assumptions A, B, and C, Proposition 4.1 shows that the sequence $\{g(f_n)\}$ is nondecreasing; that is, $g(f_n) \le g(f_{n+1})$ holds for all $n \ge 1$. On the other hand, by Lemma 3.3(d) we see that $\{g(f_n)\}$ is bounded. Therefore, there exists a constant $g$ such that
$$\lim_{n \to \infty} g(f_n) = g. \tag{3.18}$$
Note that, in general, we have $g \le g^*$. In order to ensure that the policy iteration Algorithm A converges, that is, $g = g^*$, an additional condition (Assumption D) is needed in addition to Assumptions A, B, and C.
Finally, we present a lemma (Lemma 3.8) to conclude this section, which is needed to prove Theorem 4.2. For a proof, see [24, Proposition 12.2], for instance.
Lemma 3.8. Suppose that $A(x)$ is compact for all $x \in S$, and let $\{f_n\}$ be a sequence of stationary policies in $F$. Then there exists a stationary policy $f \in F$ such that $f(x) \in A(x)$ is an accumulation point of $\{f_n(x)\}$ for each $x \in S$.

Main Results
In this section we will present our main results, Theorems 4.2-4.3.Before stating them, we first give the following proposition, which is needed to prove our main results.Proposition 4.1.Suppose that Assumptions A, B, and C hold, and let f ∈ F be an arbitrary stationary policy.If any policy f ∈ F such that where h * is as in Theorem 3.4; d if g f g f , then g f , h f satisfies the average reward optimality equation 3.10 , and so f is average optimal.
Proof. a Combining 3.7 and 4.1 we have Obviously, taking the integration on both sides of 4.5 with respect to μ f and by Remark 3.2 d we obtain the desired result.b If g f g f , we may rewrite the Poisson equation for f as Then, combining 4.5 and 4.6 we obtain Thus, from 4.7 and using the Dynkin formula we get Letting t → ∞ in 4.8 and by Assumption C we have Now take k : sup x∈S h f x − h f x .Then take the supremum over x ∈ S in 4.9 to obtain and so

4.23
On the other hand, take any real-valued measurable function $m$ on $S$ such that $m(x) > q(x) \ge 0$ for all $x \in S$. Then, for each $x \in S$ and $a \in A(x)$, by the properties (Q1)-(Q3) we can define $P(\cdot \mid x, a)$ as follows:

4.30
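A standard way to build a probability kernel from transition rates with a dominating function $m(x) > q(x)$ is $P(D \mid x, a) := \delta_x(D) + q(D \mid x, a)/m(x)$, where $\delta_x$ is the Dirac measure at $x$. The sketch below assumes that this is the construction meant in (4.30) and illustrates it for a finite model; the function name `embedded_kernel` and the data are hypothetical.

```python
import numpy as np

def embedded_kernel(q, m):
    """Assumed construction: P[a, x, y] = 1{x == y} + q[a, x, y] / m[x], with m[x] > q(x)."""
    q = np.asarray(q, dtype=float)  # q[a, x, y]
    m = np.asarray(m, dtype=float)  # m[x], strictly larger than the exit rate q(x)
    n = q.shape[1]
    return np.eye(n) + q / m[None, :, None]

# Hypothetical data: the exit rates are q(0) = 3 and q(1) = 2, so m = (4, 4) dominates them.
q = [[[-1.0, 1.0], [2.0, -2.0]],
     [[-3.0, 3.0], [0.5, -0.5]]]
P = embedded_kernel(q, [4.0, 4.0])
```

Conservativeness (Q3) makes each row of `P` sum to one, and $m(x) > q(x)$ keeps the diagonal entries strictly positive, so each $P(\cdot \mid x, a)$ is a probability measure under this construction.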
Moreover, from Lemma 3.3(a) again we see that there exists a constant $k$ such that

4.34
Also, from (3.7), (3.16), and the definition of $T$ in (3.12) we get

Concluding Remarks
In the previous sections we have studied the policy iteration algorithm (PIA) for average reward continuous-time jump MDPs in Polish spaces. Under two slightly different sets of conditions we have shown that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation. It should be mentioned that the approach presented here differs from the policy iteration approach used in [14], because the PIA in this paper provides a way to compute (or at least approximate, when the PIA takes infinitely many steps to converge) the value of the average optimal reward value function and an average optimal stationary policy.
Remark 3.2. (a) Assumption C is taken from [14], and it is used to establish the average reward optimality equation. (b) Assumption C is similar to the uniform $w_1$-exponentially ergodic hypothesis for discrete-time MDPs; see [8, 22], for instance. (c) Some sufficient conditions, as well as examples, for verifying Assumption C are given in [6, 16, 19]. (d) Under Assumptions A, B, and C, for each $f \in F$, the Markov process $\{x_t\}$, with transition rates $q(\cdot \mid x, f)$, has a unique invariant probability measure $\mu_f$ such that
$$\int_S \mu_f(dx)\, q(D \mid x, f) = 0 \quad \text{for each } D \in \mathcal{B}(S). \tag{3.5}$$
There exist a subsequence $\{h_{f_m}\}$ of $\{h_{f_n}\}$ and a measurable function $h$ on $S$
Assumption D is the same as the hypothesis H2 in [9]. Obviously, Assumption D trivially holds when the state space $S$ is a countable set (with the discrete topology) and $A(x)$ is compact for all $x \in S$.
In addition to Assumptions A, B, and C, ensuring $g = g^*$ requires an additional condition (Assumption D or D′ below).
Assumption D. gives a detailed discussion of Assumption D. (b) In particular, Assumption D trivially holds when the state space $S$ is a countable set (with the discrete topology). (c) When the state space $S$ is not countable, if the sequence $\{h_{f_n}\}$ is equicontinuous, Assumption D also holds.
From Lemma 3.3(a) we see that the function $h_{f_n}$ in (3.13) belongs to $B_{w_1}(S)$, and so the function $h$ in (3.19) also belongs to $B_{w_1}(S)$. Now let $\{h_{f_n}\}$ be as in Assumption D, and let $\{h_{f_m}\}$ be the corresponding subsequence of $\{h_{f_n}\}$. Then by Assumption D we have lim Lemma 3.8 there is a stationary policy $f \in F$ such that $f(x) \in A(x)$ is an accumulation point of $\{f_m(x)\}$ for each $x \in S$; that is, for each $x \in S$ there exists a subsequence $\{m_i\}$ (depending on the state $x$) such that lim $\pi \in \Pi$ $V(x, \pi)$ $\forall x \in S$.
− h y q dy | x, f * ≥ 0 ∀x ∈ S. 4.44 Then, from the proof of Proposition 4.1 b and 4.44 we have h • h • k for some constant k , 4.45 which together with 4.42 , 4.43 , and the definition of T in 3.12 gives Thus, combining 4.46 and 4.51 we see that 4.47 holds.And so Theorem 4.3 follows.
Proof.To prove Theorem 4.3, from the proof of Theorem 4.2 we only need to verify that 4.26 and 4.27 hold true for f * as in Assumption D and some function h in B w 1 S .To do this, we first define two functions h, h in B w 1 S as follows:h x : lim sup S h y S h y q dy | x, f * ≤ Th x ∀x ∈ S.4.46The remainder is to prove the reverse inequality, that is,g ≥ Th x max S h f n y P dy | x, a ∀x ∈ S, a ∈ A x .S h y q dy | x, f * ≥ r x, a S h y q dy | x, a ∀x ∈ S, a ∈ A x ,