
In this paper we study the average reward optimality problem for continuous-time jump Markov decision processes (MDPs) in general state and action spaces. The corresponding transition rates are allowed to be

As is well known, the PIA was originally introduced by Howard (1960) in [

The remainder of this paper is organized as follows. In Section

If

The material in this section is quite standard (see [

It should be noted that the property

To introduce the optimal control problem that we are interested in, we need to introduce the classes of admissible control policies.

Let

for each

for each

A family

For each

For each

A randomized Markov policy is said to be

The family of all such policies is denoted by

There exist a (measurable) function

Remark

For each initial state

If Assumption A holds, then from [

Suppose that Assumption A holds. Then the following statements hold.

For each

For each

For each

As a consequence of Assumption A(

A policy

The main goal of this paper is to give conditions for ensuring that the policy iteration algorithm converges.

In this section we state conditions for ensuring that the policy iteration algorithm (PIA) converges and give some preliminary lemmas that are needed to prove our main results.

To guarantee that the PIA converges, we need to establish the average reward optimality equation. To do this, in addition to Assumption A, we need two more assumptions. The first is the following standard continuity-compactness condition, which is taken from [
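In the notation common to this literature (the paper's own symbols may differ; here $r(x,a)$ denotes the reward rate, $q(dy\mid x,a)$ the transition rates, $g^{*}$ the optimal average reward, and $h$ a bias function), the average reward optimality equation to be established typically takes the form:

```latex
g^{*} \;=\; \sup_{a \in A(x)} \Big\{ r(x,a) \;+\; \int_{S} h(y)\, q(dy \mid x, a) \Big\},
\qquad x \in S .
```

A stationary policy attaining the supremum on the right-hand side for every state $x$ is then average reward optimal.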

For each

there exist a nonnegative measurable function

The second one is the irreducible and uniform exponential

For each

For each
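A common way to state uniform exponential ergodicity in this setting (the paper's exact weight function and constants may differ; here $p^{f}(t,x,\cdot)$ denotes the transition probability function under a stationary policy $f$, $\mu_{f}$ its invariant probability measure, and $\|\cdot\|_{w}$ a $w$-weighted total variation norm) is:

```latex
\sup_{f} \big\| p^{f}(t, x, \cdot) - \mu_{f} \big\|_{w}
\;\le\; L\, w(x)\, e^{-\delta t},
\qquad t \ge 0, \; x \in S,
```

for some constants $L > 0$ and $\delta > 0$. The uniformity over stationary policies is what keeps the average rewards well defined and uniformly bounded across policies.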

(a) Assumption C is taken from [

Under Assumptions A, B, and C, we can obtain several lemmas, which are needed to prove our main results.

Suppose that Assumptions A, B, and C hold, and let

For each

For all

For all

The proofs of parts (a) and (b) follow directly from [

The next result establishes the

Under Assumptions A, B, and C, the following statements hold.

There exist a unique constant

Any stationary policy

We now present, under Assumptions A, B, and C, the PIA that we are concerned with. To do this, we first give the following definition.

For any real-valued function

Take

Find a constant

Set

If
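The iteration just described can be sketched for a finite-state, finite-action analogue of the model (the paper works in general Polish spaces, where the evaluation step solves a Poisson equation rather than a finite linear system; all names and the example data below are illustrative):

```python
import numpy as np

def evaluate(q, r, f):
    """Policy evaluation: solve g = r(x, f(x)) + sum_y q[x, y, f(x)] * h[y]
    with the normalization h[0] = 0 (rows of q sum to zero: transition rates)."""
    n = q.shape[0]
    A = np.zeros((n, n))
    b = np.zeros(n)
    for x in range(n):
        A[x, :n - 1] = q[x, 1:, f[x]]  # coefficients of h[1], ..., h[n-1]
        A[x, n - 1] = -1.0             # coefficient of the gain g
        b[x] = -r[x, f[x]]
    u = np.linalg.solve(A, b)
    h = np.concatenate(([0.0], u[:n - 1]))
    return u[n - 1], h                 # (gain g, bias function h)

def policy_iteration(q, r, tol=1e-10, max_iter=100):
    """Average reward PIA: evaluate the current policy, improve it statewise,
    keep the old action on ties to avoid cycling, stop when unchanged."""
    n, _, m = q.shape
    f = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        g, h = evaluate(q, r, f)
        vals = np.array([[r[x, a] + q[x, :, a] @ h for a in range(m)]
                         for x in range(n)])
        f_new = f.copy()
        for x in range(n):
            best = int(vals[x].argmax())
            if vals[x, best] > vals[x, f[x]] + tol:
                f_new[x] = best
        if np.array_equal(f_new, f):
            return f, g, h
        f = f_new
    return f, g, h

# Illustrative 2-state, 2-action jump MDP: q[x, y, a] are transition rates.
q = np.zeros((2, 2, 2))
q[0, :, 0] = [-1.0, 1.0];  q[0, :, 1] = [-3.0, 3.0]
q[1, :, 0] = [2.0, -2.0];  q[1, :, 1] = [0.5, -0.5]
r = np.array([[1.0, 0.0],
              [3.0, 2.0]])  # reward rates r[x, a]

f, g, h = policy_iteration(q, r)
print(f, g)  # converges to f = [1, 0] with average reward g = 1.8
```

On this toy data the improvement step changes the policy once and then stabilizes; the returned pair $(g, h)$ satisfies the finite analogue of the average reward optimality equation.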

The policy iteration Algorithm A is said to

Obviously, under Assumptions A, B, and C from Proposition

There exist a subsequence

(a) Assumption D is the same as the hypothesis H1 in [

There exists a stationary policy

Assumption

Finally, we present a lemma (Lemma

Suppose that

In this section we will present our main results, Theorems

Suppose that Assumptions A, B, and C hold, and let

if

if

if

(a) Combining (

(b) If

(c) Since

(d) By (

Suppose that Assumptions A, B, C, and D hold. Then the policy iteration Algorithm A converges.

From Lemma

Suppose that Assumptions A, B, C, and

To prove Theorem

In the previous sections we have studied the policy iteration algorithm (PIA) for average reward continuous-time jump MDPs in Polish spaces. Under two

The author would like to thank the editor and the anonymous referees for their helpful comments and valuable suggestions, which have greatly improved the paper. This work was jointly supported by the National Natural Science Foundation of China (10801056), the Natural Science Foundation of Ningbo (201001A6011005), the Scientific Research Fund of Zhejiang Provincial Education Department, the K.C. Wong Magna Fund in Ningbo University, the Natural Science Foundation of Yunnan Provincial Education Department (07Y10085), the Natural Science Foundation of Yunnan Province (2008CD186), and the Foundation of the Chinese Society for Electrical Engineering (2008).