Asymptotic Optimality and Rates of Convergence of Quantized Stationary Policies in Continuous-Time Markov Decision Processes

This paper is concerned with the asymptotic optimality of quantized stationary policies for continuous-time Markov decision processes (CTMDPs) in Polish spaces with state-dependent discount factors, where the transition rates and reward rates are allowed to be unbounded. Using the dynamic programming approach, we first establish the discounted optimal equation and the existence of its solutions. Then, we obtain the existence of optimal deterministic stationary policies under suitable conditions via more concise proofs. Furthermore, we discretize and quantize the action space and construct a sequence of quantizer policies that approximates the optimal stationary policies of the CTMDPs, and we obtain the approximation result together with the rates of convergence of the expected discounted rewards of the quantized stationary policies. We also give an iteration algorithm for approximately optimal policies. Finally, we give an example to illustrate the asymptotic optimality.


Introduction
This paper deals with infinite horizon discounted continuous-time Markov decision processes (CTMDPs); it studies the asymptotic optimality of quantized stationary policies for CTMDPs and gives convergence rate results. The discount factors are state-dependent, and the transition rates and reward rates are allowed to be unbounded.
It is well known that discounted CTMDPs have been widely studied as an important class of stochastic control problems. Generally speaking, according to the various forms of discount factors, infinite horizon discounted CTMDPs can be classified into the following three groups: (i) MDPs with a fixed constant discount factor α, see, for instance, Doshi [1], Dynkin and Yushkevich [2], Feinberg [3], Guo [4,5], Guo and Song [6], Guo and Hernández-Lerma [7], Hernández-Lerma and Lasserre [8,9], Puterman [10], and the references therein; (ii) MDPs with varying (state-dependent or state-action-dependent) discount factors, see, for instance, Feinberg and Shwartz [11], González-Hernández et al. [12], Wu and Guo [13], Wu and Zhang [14], and the references therein; (iii) MDPs whose discount factor is a function of the history, see Hinderer [15], for example. This paper studies infinite horizon discounted CTMDPs in the case of the second group.
For the discounted criterion of MDPs, there are many works on the existence of solutions to the discounted optimality equation and of discounted optimal stationary policies; see, for instance, [1,4,6,7,16] for CTMDPs and [8-10,13-15] for discrete-time Markov decision processes (DTMDPs). These references, however, deal with discounted MDPs with a constant discount factor or with discounted DTMDPs with varying discount factors. Recently, discounted CTMDPs with state-dependent discount factors were studied in [16], in which the authors established the discounted reward optimality equation (DROE) and obtained the existence of discounted optimal stationary policies. However, in [16], the discussion is restricted to the class of all randomized stationary policies (i.e., the policies are time-independent). Following these ideas, still within the framework of discounted continuous-time MDPs, models with Polish spaces are studied in this paper. We extend some results in [16] to the case of all randomized Markov policies and obtain the existence of discounted optimal stationary policies by a more concise proof.
Although the existence of optimal policies is proved, it is difficult to compute an optimal policy, even within the class of stationary policies, for nonfinite Polish (i.e., complete and separable metric) state and action spaces. Furthermore, in applications to networked control, the transmission of such control actions to an actuator is not realistic when there is an information transmission constraint (imposed by the presence of a communication channel) between a plant, a controller, and an actuator. Thus, from a practical point of view, it is important to study the approximation of optimal stationary policies. Several approaches have been developed in the literature to solve this problem for finite or countable state spaces; see [17-20]. Lately, for infinite Borel state and action spaces, [21,22] established the asymptotic optimality of quantized stationary policies in stochastic control for DTMDPs. Inspired by these works, in this paper, we are concerned with the asymptotic optimality of quantized stationary policies for CTMDPs with Polish spaces. To the best of our knowledge, the corresponding asymptotic optimality for CTMDPs with varying (state-dependent) discount factors has not been studied. Therefore, this paper contains the following three main contributions: (a) For CTMDPs with state-dependent discount factors, we extend some results in [16] to the case of all randomized Markov policies, simplify the proof of the existence of discounted optimal stationary policies under mild conditions, and give an algorithm to obtain ε-optimal policies. (b) We show that deterministic stationary quantizer policies are able to approximate the optimal deterministic stationary policies under mild technical conditions, and thus that one can search for approximately optimal policies within the class of quantized control policies. (c) For the asymptotic optimality, we give the corresponding convergence rate results.
This paper is organized as follows. In Section 2, we introduce the model of CTMDPs with the expected discounted reward criterion and state the discounted optimality problem. In Section 3, under suitable conditions, we prove the main results on the existence of solutions to the discounted optimal equation (DOE) and the existence of optimal stationary policies. In Section 4, we give an iteration algorithm for ε-optimal policies. In Section 5, we establish conditions under which quantized control policies are asymptotically optimal and give the corresponding rates of convergence of the expected discounted rewards of the quantized stationary policies. Finally, we illustrate the asymptotic optimality by an example in Section 6.

The Markov Decision Processes and Discounted Optimal Problem
Consider the following model of continuous-time Markov decision processes:

M ≔ {S, A, (A(x), x ∈ S), q(·|x, a), r(x, a), α(x)},   (1)

where S is the state space, A(x) are the sets of admissible actions, and A is a compact action space. S and A are assumed to be Polish spaces (i.e., complete and separable metric spaces) with Borel σ-fields B(S) and B(A), respectively. A(x) and K ≔ {(x, a): x ∈ S, a ∈ A(x)} are Borel subsets of A and S × A, respectively. q(·|x, a) denotes the function of transition rates, which satisfies the standard properties of a conservative and stable transition-rate function. The discount factors α(x) are nonnegative measurable functions on S. Finally, the reward rate function r(x, a) is assumed to be Borel-measurable on K. Note that r(x, a) is allowed to be unbounded from both above and below, so it can also be regarded as a cost rate rather than a reward rate only.

The definitions of the randomized Markov policy π ≔ (π_t, t ≥ 0), the randomized stationary policy φ, and the (deterministic) stationary policy f are given in [8, Definitions 2.2.3 and 2.3.2]. The sets of all randomized Markov policies, randomized stationary policies, and (deterministic) stationary policies are denoted by Π, Φ, and F, respectively. It is clear that F ⊂ Φ ⊂ Π. For each π = (π_t, t ≥ 0) ∈ Π, x ∈ S, and B ∈ B(S), we define the associated functions of transition rates q_π(·|x, π_t) and reward rates r_π(x, π_t) by averaging q and r over π_t(·|x); in general, we also write them as q(B|x, π_t) and r(x, π_t), respectively. Furthermore, for each φ ∈ Φ, we define the functions of transition rates and reward rates analogously; in particular, we write them as q(B|x, f) and r(x, f), respectively, when φ is a deterministic stationary policy f ∈ F.

For any fixed policy π = (π_t, t ≥ 0) ∈ Π, q(·|x, π_t) is also called an infinitesimal generator (see Doshi [1]). As is well known, any transition function p_π(s, x, t, B) depending on π that satisfies the Kolmogorov equations associated with q(·|x, π_t), together with the initial condition p_π(s, x, s, B) = δ_x(B), for all x ∈ S and B ∈ B(S) is called a Q-process with transition rates q(·|x, π_t), where δ_x(B) is the Dirac measure at x ∈ S. By Guo [4], there exists a minimal Q-process p_π^min(s, x, t, B) with transition rates q(·|x, π_t), but such a Q-process might not be regular; that is, we may have p_π^min(s, x, t, S) < 1 for some x ∈ S and t ≥ s ≥ 0. To ensure the regularity of the Q-process, we propose the following "drift conditions."

Assumption 1. There exists a measurable function w₁ ≥ 1 on S and constants c₁ ≠ 0, b₁ ≥ 0, and M_q > 0 such that the drift conditions (a) and (b) hold (a sketch of their standard form is given below).

Remark 1. (a) The function w₁ in Assumption 1(a) is used to guarantee the finiteness of the optimal value function below, and by [4, Remark 2.2(b)], it is an extension of the "drift condition" in Lund et al. [23] for a time-homogeneous Q-process. Moreover, Assumption 1(b) is used to guarantee the regularity of the Q-process, and it is not required when the transition rates are bounded (i.e., sup_{x∈S} q*(x) < ∞). (b) Under Assumption 1, it holds that p_π^min(s, x, t, S) ≡ 1 by Guo [4, Theorem 3.2]. Then, the Q-process with transition rates q(·|x, π_t) is regular and unique. Hence, we write p_π^min(s, x, t, B) simply as p_π(s, x, t, B). Since it is time-homogeneous, we discuss the case in which the initial time is s = 0, and then we write p_π(0, x, t, B) simply as p_π(x, t, B).
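As a point of reference for Assumption 1, drift conditions of this type are usually stated as follows (cf. Guo [4]); this sketch uses the constants named above and may differ from the paper's exact displays.

\begin{align*}
\text{(a)}\;\; & \int_S w_1(y)\, q(dy \mid x, a) \le c_1 w_1(x) + b_1 \quad \text{for all } (x, a) \in K, \\
\text{(b)}\;\; & q^*(x) := \sup_{a \in A(x)} \bigl(-q(\{x\} \mid x, a)\bigr) \le M_q\, w_1(x) \quad \text{for all } x \in S.
\end{align*}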
As is well known (e.g., see Doshi [1] and Guo [5]), for each π = (π_t, t ≥ 0) ∈ Π and each initial state x ∈ S, there exists a unique probability space (Ω, B(Ω), P_x^π), where the probability measure P_x^π is completely determined by p_π(x, t, B) (see Guo [6, Section 2.3]), together with a state-action process {x(t), a(t), t ≥ 0} governed by the transition probability function p_π(x, t, B) (see Guo [5, Lemma 2.1]). The expectation operator corresponding to P_x^π is denoted by E_x^π. Moreover, for each x ∈ S, π ∈ Π, and t ≥ 0, the expected reward at time t is given by ∫_S r(y, π_t) p_π(x, t, dy).

Now, we state the discounted optimality problem. For each π ∈ Π and x ∈ S, the expected discounted reward criterion is denoted by J(x, π) (a standard form under a state-dependent discount rate is sketched below), and the corresponding optimal value function is given by J*(x) ≔ sup_{π∈Π} J(x, π), x ∈ S. A policy π* ∈ Π is called an optimal policy if J(x, π*) ≥ J(x, π) for all x ∈ S and π ∈ Π. Our main aim in Section 3 is to give conditions for the existence of optimal deterministic stationary policies.
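As a point of reference, the expected discounted reward criterion and the optimal value function under a state-dependent discount rate α(·) are usually written as follows (cf. Ye and Guo [16]); this is a sketch of the standard definitions rather than a verbatim copy of the paper's displays.

$$J(x, \pi) := \mathbb{E}_x^{\pi}\!\left[\int_0^{\infty} \exp\!\Bigl(-\!\int_0^t \alpha(x(s))\, ds\Bigr)\, r\bigl(x(t), \pi_t\bigr)\, dt\right], \qquad J^*(x) := \sup_{\pi \in \Pi} J(x, \pi), \quad x \in S.$$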

The Existence of Optimal Stationary Policies
In this section, the existence and uniqueness of solutions to the discounted optimal equation (DOE) are shown, and the existence of optimal policies is established for the CTMDP model M defined in (1).
Note that, for any given measurable function w ≥ 1 on S, a real-valued measurable function v on S is called w-bounded if the w-weighted norm ‖v‖_w ≔ sup_{x∈S} |v(x)/w(x)| is finite. Such a function w is called a weight function. It is clear that B_w(S) ≔ {v: ‖v‖_w < ∞}, the set of w-bounded measurable functions on S, is a Banach space. To guarantee the finiteness of the optimal value function, we need the following assumptions (Assumption 2).

(e) There exists a nonnegative measurable function w₂(x) on S and constants c₂ > 0, b₂ ≥ 0, and M₂ > 0 such that q*(x)w₁(x) ≤ M₂w₂(x) and

$$\int_S w_2(y)\, q(dy \mid x, a) \le c_2 w_2(x) + b_2 \quad \text{for all } x \in S \text{ and } a \in A(x).$$
For each x ∈ S, let m(x) be any positive measurable function on S such that m(x) ≥ q*(x) ≥ 0, and define the kernel P(·|x, a) accordingly, where δ_x(B) is the Dirac measure (i.e., it is equal to 1 if x ∈ B and 0 otherwise). It is clear that P(·|x, a) is a probability measure on S for each (x, a) ∈ K. For any u ∈ B_{w₁}(S), define an operator T on B_{w₁}(S), and define a recursive sequence {u_n, n ≥ 0} via T (a sketch of the standard forms of P, T, and u_n is given below). Now, we give the discounted optimal equation (DOE).
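As a point of reference, the kernel P(·|x, a), the operator T, and the iterates u_n are usually defined as follows under the uniformization m(x) ≥ q*(x) (cf. Ye and Guo [16]); this sketch assumes the standard forms and may differ from the paper's exact displays.

$$P(B \mid x, a) := \delta_x(B) + \frac{q(B \mid x, a)}{m(x)}, \qquad (x, a) \in K, \; B \in \mathcal{B}(S),$$
$$Tu(x) := \sup_{a \in A(x)} \frac{r(x, a) + m(x)\int_S u(y)\, P(dy \mid x, a)}{\alpha(x) + m(x)}, \qquad u \in B_{w_1}(S),$$
$$u_{n+1} := T u_n, \quad n \ge 0, \qquad \text{with a suitably chosen } u_0 \in B_{w_1}(S).$$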

Theorem 1. Under Assumptions 1 and 2(b)-(c), the following assertions hold.
(a) J(·, π) ∈ B_{w₁}(S) for all x ∈ S and π ∈ Π. (b) Let u* ≔ lim_{n→∞} u_n; then u* ∈ B_{w₁}(S), and it is the solution of the discounted optimal equation (DOE) in (14), namely u* = Tu*.

Proof. (a) By the assumptions, the expected discounted reward is bounded in absolute value by a constant multiple of w₁(x), where the last inequality holds by Guo [4, Theorem 3.2(b)]. Then, J(x, π) ∈ B_{w₁}(S) for each x ∈ S, and part (a) holds. (b) First, by a calculation similar to Ye and Guo [16, Equation (15)], we obtain Tu₀ ≥ u₀. Furthermore, it is clear that the operator T is monotone. Then, by induction, the sequence {u_n} is nondecreasing, which yields that u* ≔ lim_{n→∞} u_n ≥ u_n for all n ≥ 0.
Next, we show that u* ∈ B_{w₁}(S). Note that w₁(x) ≥ 1 by Assumption 1. Then, by an induction argument, for all n ≥ 1, we obtain a bound on |u_n(x)| in terms of w₁(x) that is uniform in n. Thus, u* ∈ B_{w₁}(S). Last, we show Tu* = u*. By the monotonicity of T and {u_n}, we have Tu* ≥ Tu_n = u_{n+1} for all n ≥ 0, and so Tu* ≥ u*.
On the other hand, by the definition of the operator T, we have u_{n+1} = Tu_n for all n ≥ 0. Then, letting n → ∞ and applying Hernández-Lerma and Lasserre [9, Lemma 8.3.7], we obtain that u* ≥ Tu*. Thus, Tu* = u*; that is, u* is the solution of the DOE in (14).
□

Remark 2. Theorem 1 generalizes not only the control model with a constant discount factor in Guo [4, Theorem 3.3(a)-(b)] but also the model in Ye and Guo [16], whose policies are restricted to the family Φ of all randomized stationary policies.

The following Lemma 1 is a direct consequence of [16, Theorem 3.2].

Lemma 1. Under Assumptions 1 and 2, for each x ∈ S and φ ∈ Φ, the expected discounted reward criterion J(x, φ) is the unique solution of the following equation:
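As a point of reference, for a randomized stationary policy φ the fixed-point equation of Lemma 1 typically takes the following generator form under a state-dependent discount rate (cf. [16, Theorem 3.2]); this sketch may differ from the paper's exact display.

$$\alpha(x)\, J(x, \varphi) = r(x, \varphi) + \int_S J(y, \varphi)\, q(dy \mid x, \varphi), \qquad x \in S.$$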
Proof. By (21), there exists a nonnegative measurable function v(x) on S such that r(x, a) + v(x) ≥ 0 for all (x, a) ∈ K. Now, let r̃(x, a) ≔ r(x, a) + v(x), and we get the new Markov decision process model (24), in which only the reward rate function differs from the model in (1). Moreover, for each x ∈ S and φ ∈ Φ, the expected discounted reward criterion of the new model can be computed accordingly, and part (a) follows. Similarly, we can prove (b).

Theorem 2.
Under Assumptions 1 and 2, for each x ∈ S, the optimal value function J*(x) is the solution of the DOE in (12), and there exists a (deterministic) stationary policy f* ∈ F such that J(x, f*) = J*(x) for all x ∈ S.

Proof. By Theorem 1(b), for each x ∈ S and π ∈ Π, we have u* = Tu*, which together with Lemma 2(a) yields that u*(x) ≥ J(x, π), and then u*(x) ≥ J*(x). Note that $\int_S u^*(y)\, q(dy \mid x, a)$ is upper semicontinuous in a ∈ A(x); then, by [9, Lemma 8.3.8], there exists a policy f* ∈ F attaining the supremum in the DOE for all x ∈ S. Thus, by Lemma 1, we have u*(x) = J(x, f*). Combining this with u*(x) ≥ J*(x) ≥ J(x, f*), we conclude that J*(x) = u*(x) = J(x, f*), which completes the proof.
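For orientation, the DOE satisfied by J* and the defining property of the maximizing selector f* in Theorem 2 can be written in the equivalent generator form below (cf. [16]); this is an assumed standard form, not a verbatim copy of display (12).

$$\alpha(x)\, J^*(x) = \sup_{a \in A(x)}\Bigl[r(x, a) + \int_S J^*(y)\, q(dy \mid x, a)\Bigr] = r\bigl(x, f^*(x)\bigr) + \int_S J^*(y)\, q\bigl(dy \mid x, f^*(x)\bigr), \qquad x \in S.$$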

Remark 4.
(a) Theorem 2 shows that the optimal value function is a solution to the DOE (21) and ensures the existence of an optimal (deterministic) stationary policy. (b) By the construction of the new Markov decision process, the proof of Theorem 2 is more concise than that in [16].

An Iteration Algorithm for ε-Optimal Policies
In this section, we provide an iteration algorithm for ε-optimal policies. In fact, for the operator T on B_{w₁}(S) in Section 3, with m(x) = q*(x) + 1, the estimate (29) holds. Then, by Algorithm 1, we obtain a bound on the difference between successive iterates, which yields an explicit error bound for the iterates; by a similar argument, we obtain the corresponding bound ensuring that the policy returned by Algorithm 1 is ε-optimal.
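To make the iteration concrete, the following Python sketch performs value iteration with the uniformized operator T (using m(x) = q*(x) + 1) on a finite-state, finite-action approximation of the model; the function name, array layout, contraction-based stopping rule, and the assumption inf_x α(x) > 0 are illustrative choices, not taken from the paper.

```python
import numpy as np

def value_iteration_ctmdp(q, r, alpha, eps=1e-6, max_iter=100_000):
    """Sketch of epsilon-optimal value iteration for a finite CTMDP approximation.

    q[x, a, y] : conservative transition rates (each row sums to 0 over y).
    r[x, a]    : reward rates.
    alpha[x]   : state-dependent discount rates, assumed strictly positive.
    Returns (u, f): approximate value function and a greedy stationary policy.
    """
    n_states, n_actions, _ = q.shape
    idx = np.arange(n_states)

    # Uniformization constant m(x) = q*(x) + 1, with q*(x) = max_a(-q({x}|x, a)).
    q_star = np.max(-q[idx, :, idx], axis=1)              # shape (n_states,)
    m = q_star + 1.0

    # P(y|x, a) = delta_x(y) + q(y|x, a) / m(x) is a probability kernel.
    P = q / m[:, None, None] + np.eye(n_states)[:, None, :]

    # Contraction factor of T in the sup norm when inf_x alpha(x) > 0.
    beta = np.max(m / (alpha + m))

    u = np.zeros(n_states)
    for _ in range(max_iter):
        # (T u)(x) = max_a [ r(x,a) + m(x) * sum_y P(y|x,a) u(y) ] / (alpha(x) + m(x))
        Q = (r + m[:, None] * np.einsum('xay,y->xa', P, u)) / (alpha + m)[:, None]
        u_new = Q.max(axis=1)
        gap = np.max(np.abs(u_new - u))
        u = u_new
        if beta * gap / (1.0 - beta) < eps:   # standard fixed-point stopping rule
            break

    # Greedy policy with respect to the final value estimate.
    Q = (r + m[:, None] * np.einsum('xay,y->xa', P, u)) / (alpha + m)[:, None]
    return u, Q.argmax(axis=1)
```

For instance, a two-state, two-action model can be passed as q of shape (2, 2, 2), r of shape (2, 2), and alpha of shape (2,); the loop stops once the standard contraction bound guarantees that the returned value function is within ε of the fixed point of T.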

Approximation of Deterministic Stationary Policies
In Section 3, we established the existence of optimal deterministic stationary policies for the CTMDPs in (1) under suitable conditions. However, in practice, the action space sometimes cannot satisfy the continuity conditions used in the theoretical analysis. Thus, in this section, we discretize and quantize the action space so that we can construct a sequence of policies, namely "quantizer policies," which approximate the deterministic stationary policies of the CTMDPs in (1).
To this end, we first give the definitions of quantizers and deterministic stationary quantizer policies.

Definition 1.
A measurable function f: S → A is called a quantizer from S to A if f(S) ≔ {f(x) ∈ A: x ∈ S} is finite. Let F denote the set of all quantizers from S to A.

Definition 2.
A policy is called a deterministic stationary quantizer policy if there exists a constant sequence π = {π_n, n ≥ 0} of stochastic kernels on A given S such that π_n(·|x) = δ_{f(x)}(·) for all n, for some f ∈ F, where δ_{f(x)}(·) is the Dirac measure as in (11).
For any finite set Λ ⊂ A, let F(Λ) denote the set of all quantizers having range Λ, and let SF(Λ) denote the set of all deterministic stationary quantizer policies induced by F(Λ).
Denote the metric on A by d_A; then, the action space A is totally bounded by its compactness. For any fixed integer k ≥ 1, there exists a finite set of points $\{a_i\}_{i=1}^{n_k}$ such that $\min_{1 \le i \le n_k} d_A(a, a_i) < 1/k$ for all a ∈ A, where $\{a_i\}_{i=1}^{n_k}$ is called the 1/k-net in A. From this, for any deterministic stationary policy f ∈ F, we can construct a sequence of quantizer policies approximating f by the following method.

Lemma 3 (the construction of quantizer policies). Let Λ_k ≔ $\{a_i\}_{i=1}^{n_k}$ be the 1/k-net in A. For each x ∈ S and each deterministic stationary policy f ∈ F, define f_k(x) as a nearest point of Λ_k to f(x), that is, $f_k(x) \in \arg\min_{a \in \Lambda_k} d_A(f(x), a)$. Then, {f_k, k ≥ 1} is a sequence of deterministic stationary quantizer policies, and f_k converges uniformly to f as k → ∞.

Proof. Lemma 3 follows directly from [21, Section 3].
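As an illustration of the nearest-neighbor construction in Lemma 3, the following Python sketch builds a 1/k-net and the induced quantized policies; it assumes, purely for illustration, a one-dimensional action space A = [0, 1] with the Euclidean metric, and the helper names (uniform_net, quantize_policy) and the example policy are hypothetical.

```python
import numpy as np

def uniform_net(k, lo=0.0, hi=1.0):
    """A finite 1/k-net of [lo, hi]: every point of the interval is within 1/k of a net point."""
    n_points = int(np.ceil((hi - lo) * k)) + 1
    return np.linspace(lo, hi, n_points)

def quantize_policy(f, net):
    """Given a stationary policy f: S -> A and a finite net, return the
    deterministic stationary quantizer policy f_k(x) = argmin_{a in net} |f(x) - a|."""
    def f_k(x):
        a = f(x)
        return net[np.argmin(np.abs(net - a))]
    return f_k

# Example: quantize f(x) = sin(x)^2 through the 1/k-net for increasing k.
f = lambda x: np.sin(x) ** 2
for k in (1, 2, 4, 8):
    f_k = quantize_policy(f, uniform_net(k))
    x_grid = np.linspace(0.0, 10.0, 1001)
    err = max(abs(f(x) - f_k(x)) for x in x_grid)
    print(f"k = {k}: sup_x |f(x) - f_k(x)| <= {err:.4f}")   # decays like O(1/k)
```

The printed suprema illustrate the uniform convergence f_k → f stated in Lemma 3.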
We also call {f_k, k ≥ 1} the quantized approximations of f. Next, we show that their expected discounted rewards also converge accordingly. For this purpose, we need the following conditions.