A SEMIMARTINGALE CHARACTERIZATION OF AVERAGE OPTIMAL STATIONARY POLICIES FOR MARKOV DECISION PROCESSES

This paper deals with discrete-time Markov decision processes with Borel state and action spaces. The criterion to be minimized is the long-run average expected cost, and the costs may have neither upper nor lower bounds. In our earlier paper (to appear in Journal of Applied Probability), we proposed weaker conditions that ensure the existence of average optimal stationary policies. In this paper, we study further properties of optimal policies. Under these weaker conditions, we not only obtain two necessary and sufficient conditions for optimal policies, but also give a “semimartingale characterization” of an average optimal stationary policy.


Introduction
The long-run average expected cost criterion for discrete-time Markov decision processes has been widely studied in the literature; see, for instance, [3, 12-14], the survey paper [1], and their extensive references. As is well known, when the state and action spaces are both finite, the existence of average optimal stationary policies is guaranteed [2, 3, 11, 12]. However, when the state space is countably infinite, an average optimal policy may fail to exist even when the action space is compact [3, 12]. Thus, many authors have been interested in finding optimality conditions when the state space is not finite. We now briefly describe some existing work. (I) When the costs/rewards are bounded, the minorant condition [3] or the ergodicity condition [5, 6, 8] ensures the existence of a bounded solution to the average optimality equation and of an average optimal stationary policy; both approaches rely on Banach's fixed point theorem. (II) When the costs are nonnegative (or bounded below), the optimality inequality approach [1, 9, 10] is used to prove the existence of average optimal stationary policies. A key feature of this approach is its use of an Abelian theorem, which requires the costs to be nonnegative (or bounded below). In particular, Hernández-Lerma and Lasserre [9] also obtain the average optimality equation under an additional equi-continuity condition and give a "martingale characterization" of an average optimal stationary policy. (III) In the much more general case, when the costs have neither upper nor lower bounds, establishing the average optimality equation and then proving the existence of an average optimal stationary policy requires the equi-continuity condition [4, 9] or an irreducibility condition (e.g., [10, Assumption 10.3.5]). In [7], however, we proposed weaker conditions under which the existence of average optimal stationary policies is proved via two optimality inequalities rather than the "optimality equation" of [4, 9, 10]. Moreover, we removed the equi-continuity condition used in [4, 9, 10] and the irreducibility condition in [10]. In this paper, we study further properties of optimal policies. Under these weaker conditions, we not only obtain two necessary and sufficient conditions for optimal policies, but also give a semimartingale characterization of an average optimal stationary policy.
The rest of the paper is organized as follows. In Section 2, we introduce the control model and the optimality problem that we are concerned with. After stating the optimality conditions and a technical preliminary lemma in Section 3, we present the semimartingale characterization of an average optimal stationary policy in Section 4.

The optimal control problem
Notation 1. If X is a Borel space (i.e., a Borel subset of a complete and separable metric space), we denote by Ꮾ(X) its Borel σ-algebra.
In this section, we first introduce the control model, where S and A are the state and action spaces, respectively, which are assumed to be Borel spaces, and A(x) denotes the set of available actions at state x ∈ S. We suppose that the set K := {(x,a) : x ∈ S, a ∈ A(x)} is a Borel subset of S × A. Furthermore, Q(· | x,a) with (x,a) ∈ K, the transition law, is a stochastic kernel on S given K.
Finally, c(x,a), the cost function, is assumed to be a real-valued measurable function on K. (Since c(x,a) may take both positive and negative values, it can also be interpreted as a reward function rather than a "cost.") To introduce the optimal control problem that we are concerned with, we now describe the classes of admissible control policies.
For each t ≥ 0, let H_t be the family of admissible histories up to time t; that is, H_0 := S and H_t := K^t × S for t ≥ 1. The class of all randomized history-dependent policies is denoted by Π. A randomized history-dependent policy π := (π_t, t ≥ 0) ∈ Π is called (deterministic) stationary if there exists a measurable function f on S with f(x) ∈ A(x) for all x ∈ S, such that π_t({f(x_t)} | h_t) = 1 for every h_t ∈ H_t and t ≥ 0. For simplicity, we denote this policy by f. The class of all stationary policies is denoted by F, which means that F is the set of all measurable functions f on S with f(x) ∈ A(x) for all x ∈ S.
For each x ∈ S and π ∈ Π, by the well-known Ionescu-Tulcea theorem [3, 8, 10], there exist a unique probability measure space (Ω, Ᏺ, P^π_x) and a stochastic process {x_t, a_t, t ≥ 0} defined on Ω such that, for each D ∈ Ꮾ(S) and t ≥ 0,

P^π_x(x_{t+1} ∈ D | h_t, a_t) = Q(D | x_t, a_t), with h_t = (x_0, a_0, ..., x_{t-1}, a_{t-1}, x_t) ∈ H_t,

where x_t and a_t denote the state and action variables at time t ≥ 0, respectively. The expectation operator with respect to P^π_x is denoted by E^π_x.
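For concreteness, the long-run average expected cost (AEC) criterion that defines our optimal control problem can be written in the following standard form; this is a sketch consistent with the notation used in Sections 3 and 4, and its identification with the display numbered (2.5) below is an assumption:

```latex
V(x,\pi) \;:=\; \limsup_{n\to\infty} \frac{1}{n}\, E^{\pi}_{x}\Bigl[\sum_{t=0}^{n-1} c(x_t,a_t)\Bigr],
\qquad
V^{*}(x) \;:=\; \inf_{\pi\in\Pi} V(x,\pi), \qquad x\in S.
```

A policy π* is then called AEC-optimal if V(x, π*) = V*(x) for all x ∈ S.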
The main goal of this paper is to give conditions for a semimartingale characterization of an average optimal stationary policy.

Optimality conditions
In this section, we state conditions for a semimartingale characterization of an average optimal stationary policy, and give a preliminary lemma that is needed to prove our main results.
We first introduce two sets of hypotheses. The first one, Assumption 3.1, combines a "Lyapunov-like inequality" with a growth condition on the one-step cost c. The second set of hypotheses, Assumption 3.3, consists of standard continuity-compactness conditions; see, for instance, [7, 12, 13, 15, 16] and their references. To ensure the existence of average optimal stationary policies, in addition to Assumptions 3.1 and 3.3, we impose a weaker condition (Assumption 3.5 below). To state this assumption, we introduce the following notation.
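For orientation, a typical "Lyapunov-like inequality plus growth condition" pair in this literature (cf. [10]) reads as follows; the constants β, b, M and the exact form of the drift inequality are assumptions of this sketch, not necessarily the precise hypotheses of Assumption 3.1 in [7]:

```latex
% w \ge 1 is a measurable weight function on S; 0 < \beta < 1, b \ge 0, M > 0 are constants.
\int_{S} w(y)\, Q(dy \mid x,a) \;\le\; \beta\, w(x) + b
\qquad \text{for all } (x,a)\in K \quad \text{(Lyapunov-like drift inequality)},
\\[4pt]
|c(x,a)| \;\le\; M\, w(x)
\qquad \text{for all } (x,a)\in K \quad \text{(growth condition on the one-step cost)}.
```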
For the function w ≥ 1 in Assumption 3.1, we define the weighted supremum norm ‖u‖_w of a real-valued function u on S by ‖u‖_w := sup_{x ∈ S} |u(x)|/w(x), and the Banach space B_w(S) := {u : ‖u‖_w < ∞}.

Lemma 3.7. Suppose that Assumptions 3.1, 3.3, and 3.5 hold. Then the following hold.
(a) There exist a unique constant g*, two functions h*_k ∈ B_w(S) (k = 1, 2), and a stationary policy f* ∈ F satisfying the two optimality inequalities (3.5) and (3.6).
(c) Any stationary policy f ∈ F realizing the minimum in (3.5) is average optimal, and so f* in (3.6) is an average optimal stationary policy.
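In the optimality-inequality approach, the two optimality inequalities of part (a) typically take the following "sandwich" form; this is a sketch, and the pairing of h*_1, h*_2 with the two inequalities is an assumption rather than a statement of the exact displays (3.5)-(3.6) of [7]:

```latex
g^{*} + h^{*}_{1}(x) \;\le\; \min_{a\in A(x)}\Bigl[c(x,a) + \int_{S} h^{*}_{1}(y)\,Q(dy\mid x,a)\Bigr],
\qquad x\in S,
\\[6pt]
g^{*} + h^{*}_{2}(x) \;\ge\; \min_{a\in A(x)}\Bigl[c(x,a) + \int_{S} h^{*}_{2}(y)\,Q(dy\mid x,a)\Bigr]
\;=\; c\bigl(x,f^{*}(x)\bigr) + \int_{S} h^{*}_{2}(y)\,Q\bigl(dy\mid x,f^{*}(x)\bigr),
\qquad x\in S.
```

The first inequality bounds the average cost of every policy from below by g*, while the second, evaluated along a minimizing stationary policy f*, bounds the average cost of f* from above by g*; together they yield the average optimality of f* without requiring the optimality equation to hold.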
(d) In addition, from the proof of part (b), it follows that, for each h ∈ B_w(S), x ∈ S, and π ∈ Π,

lim_{n→∞} (1/n) E^π_x[h(x_n)] = 0. (3.7)

Proof. See [7].

A semimartingale characterization of average optimal stationary policies
In this section, we present our main results. To do this, we use the following notation. Let h*_1, h*_2, and g* be as in Lemma 3.7, and define the processes {M^(1)_n} and {M^(2)_n} by (4.1) and (4.2), respectively.

Theorem 4.1. (a) A policy π* is AEC-optimal and V(x, π*) = V*(x) = g* for all x ∈ S if and only if ...
(b) A policy π* is AEC-optimal and V(x, π*) = V*(x) = g* for all x ∈ S if and only if ...

Proof. (a) For each π ∈ Π and x ∈ S, it follows from (4.1) that ..., which together with (2.5) yields ..., and so .... Multiplying by 1/n and letting n → ∞, from (3.7) we see that part (a) is satisfied. Similarly, combining (4.2) and (3.7), we see that part (b) is also true.
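The processes {M^(1)_n} and {M^(2)_n} were introduced via the displays (4.1) and (4.2). A standard construction, consistent with the submartingale/supermartingale properties asserted in Theorem 4.2 below, is the following sketch (its exact form is an assumption):

```latex
% For k = 1,2, with h^{*}_{k} and g^{*} as in Lemma 3.7:
M^{(k)}_{0} \;:=\; h^{*}_{k}(x_0),
\qquad
M^{(k)}_{n} \;:=\; \sum_{t=0}^{n-1}\bigl[c(x_t,a_t) - g^{*}\bigr] \;+\; h^{*}_{k}(x_n),
\quad n \ge 1.
```

With this choice, E^π_x[M^(k)_{n+1} − M^(k)_n | h_n, a_n] = c(x_n, a_n) − g* + ∫_S h*_k(y) Q(dy | x_n, a_n) − h*_k(x_n), so the submartingale (respectively, supermartingale) property reduces to the optimality inequality satisfied by h*_1 under every policy (respectively, by h*_2 under f*).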
Theorem 4.2. Suppose that Assumptions 3.1, 3.3, and 3.5 hold. Then the following hold:
(a) {M^(1)_n} is a P^π_x-submartingale for all π ∈ Π and x ∈ S;
(b) let f* be the average optimal stationary policy obtained in Lemma 3.7; then {M^(2)_n} is a P^{f*}_x-supermartingale for all x ∈ S;
(c) if {M^(2)_n} is a P^{π*}_x-supermartingale, then π* is AEC-optimal and V(x, π*) = g* for all x ∈ S.
Remark 4.3. Theorems 4.1 and 4.2 are our main results: Theorem 4.1 gives two necessary and sufficient conditions for AEC-optimal policies, whereas Theorem 4.2 further provides a semimartingale characterization of an average optimal stationary policy.

Assumption 3.3.
(1) For each x ∈ S, A(x) is compact.
(2) For each fixed x ∈ S, c(x,a) is lower semicontinuous in a ∈ A(x), and the function ∫_S u(y) Q(dy | x,a) is continuous in a ∈ A(x) for each bounded measurable function u on S, and also for u := w as in Assumption 3.1.

Remark 3.4. Assumption 3.3 is the same as [10, Assumption 10.2.1]. Obviously, Assumption 3.3 holds when A(x) is finite for each x ∈ S.
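As the remark notes, the continuity-compactness conditions trivialize when the action sets are finite. The following self-contained Python sketch, on a hypothetical two-state, two-action toy MDP (an illustration, not an example from the paper), computes g* by relative value iteration and checks numerically that the pair (g*, h) satisfies the average optimality equation g* + h(x) = min_a [c(x,a) + Σ_y h(y) Q(y | x,a)]:

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions (illustration only).
# c[x, a] = one-step cost; Q[x, a, y] = transition probability.
c = np.array([[1.0, 2.0],
              [4.0, 0.5]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

def relative_value_iteration(c, Q, ref_state=0, tol=1e-12, max_iter=10_000):
    """Compute the optimal average cost g* and a relative value function h."""
    n_states = c.shape[0]
    h = np.zeros(n_states)
    for _ in range(max_iter):
        # Bellman operator: (Th)(x) = min_a [ c(x,a) + sum_y Q(y|x,a) h(y) ]
        Th = (c + Q @ h).min(axis=1)
        g = Th[ref_state] - h[ref_state]   # current estimate of g*
        h_new = Th - g                     # renormalize so h(ref_state) = 0
        if np.max(np.abs(h_new - h)) < tol:
            return g, h_new
        h = h_new
    return g, h

g_star, h = relative_value_iteration(c, Q)
# (g*, h) should satisfy the average optimality equation up to tolerance:
residual = np.max(np.abs(g_star + h - (c + Q @ h).min(axis=1)))
# A minimizing stationary policy f* (an action index per state):
policy = (c + Q @ h).argmin(axis=1)
```

For this toy instance the minimizing stationary policy selects the second action in both states, and g* equals the stationary average cost of that policy, consistent with Remark 3.4's observation that finite action sets make the existence theory immediate.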