A Version of the Euler Equation in Discounted Markov Decision Processes

This paper deals with Markov decision processes (MDPs) on Euclidean spaces with an infinite horizon. One approach to studying this kind of MDP is the dynamic programming (DP) technique, under which the optimal value function is characterized through the value iteration functions. The paper provides conditions that guarantee the convergence of the maximizers of the value iteration functions to the optimal policy. Then, using the Euler equation and an envelope formula, the optimal solution of the optimal control problem is obtained. Finally, this theory is applied to a linear-quadratic control problem in order to find its optimal policy.


Introduction
This paper deals with the optimal control problem in discrete time and with an infinite horizon. This problem is presented with the help of the theory of Markov decision processes (MDPs). To describe the MDPs, it is necessary to provide a Markov control model. The components of the Markov control model are used to describe the dynamics of the system. In this way, at each time t = 0, 1, . . ., the state of the system is affected by an admissible action. This sequence of actions is called a policy. The optimal control problem consists in determining an optimal policy, which is characterized through a performance criterion. In this paper, the infinite-horizon expected total discounted reward is considered.
An approach for solving the optimal control problem is the dynamic programming (DP) technique (see [1–4]). DP characterizes the optimal solution of the optimal control problem using a functional equation, known as the dynamic programming equation (see [1–4]). In the literature there exist conditions that guarantee the convergence of the value iteration (VI) procedure, which is used to approximate the optimal value function of the MDP.

Markov Decision Process
A discrete-time Markov control model is a quintuple (X, A, {A(x) | x ∈ X}, Q, r), where X is the state space, A is the action space, A(x) is the set of feasible actions in the state x ∈ X, Q is a transition law, and r : K → R is the one-step reward function (see [3]). X and A are nonempty Borel spaces with Borel σ-algebras B(X) and B(A), respectively. Q(· | ·) is a stochastic kernel on X given K, where K := {(x, a) | x ∈ X, a ∈ A(x)}, and r is a measurable function.
Consider a Markov control model and, for each t = 0, 1, . . ., define the space H_t of admissible histories up to time t as H_0 := X and H_t := K × H_{t−1}, for t = 1, 2, . . ..
A policy is a sequence π = {π_t} of stochastic kernels π_t on the action space A given H_t. The set of policies will be denoted by Π.
Let F be the set of decision functions (or measurable selectors), that is, the set of all measurable functions f : X → A such that f(x) ∈ A(x) for all x ∈ X.
A sequence {f_t} of functions f_t ∈ F is called a Markov policy. A stationary policy is a Markov policy π = {f_t} such that f_t = f for all t = 0, 1, 2, . . ., with f ∈ F; it will be denoted by f (see [3]).
Given the initial state x_0 = x ∈ X and any policy π ∈ Π, there is a probability measure P^π_x on the space (Ω, F), with Ω := (X × A)^∞ and F the product σ-algebra (see [3]). The corresponding expectation operator will be denoted by E^π_x. The stochastic process (Ω, F, P^π_x, {x_t}) is called a discrete-time Markov decision process.
The total expected discounted reward is defined as

v(π, x) := E^π_x [ Σ_{t=0}^∞ α^t r(x_t, a_t) ], (2.1)

where α ∈ (0, 1) is the discount factor, π ∈ Π, and x ∈ X. The function defined by

V(x) := sup_{π∈Π} v(π, x), x ∈ X,

will be called the optimal value function.
The optimal control problem consists in determining an optimal policy, that is, a policy π* ∈ Π such that v(π*, x) = V(x) for all x ∈ X.
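For intuition, the discounted criterion can be evaluated exactly for a stationary policy on a finite model. The following Python sketch (the finite state space, transition matrix P_f, and reward vector r_f are illustrative assumptions, not part of the paper's Euclidean setting) computes v(f, x) by solving the linear system it satisfies and cross-checks it against a truncated version of the defining series:

```python
import numpy as np

# For a stationary policy f on a finite MDP, the total expected discounted
# reward v(f, x) solves the linear system v = r_f + alpha * P_f v, hence
# v = (I - alpha * P_f)^{-1} r_f.
alpha = 0.9
rng = np.random.default_rng(1)
n = 5
P_f = rng.uniform(size=(n, n))          # transition matrix under policy f
P_f /= P_f.sum(axis=1, keepdims=True)   # rows are probability distributions
r_f = rng.uniform(-1.0, 0.0, size=n)    # one-step rewards under f

v = np.linalg.solve(np.eye(n) - alpha * P_f, r_f)

# cross-check against the series E[sum_t alpha^t r(x_t, f(x_t))], truncated
acc, Pt = np.zeros(n), np.eye(n)
for t in range(500):
    acc += (alpha ** t) * Pt @ r_f
    Pt = Pt @ P_f
assert np.allclose(v, acc, atol=1e-6)
```

The truncation error of the series is bounded by α^500/(1 − α) times the reward bound, which is negligible here.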

Dynamic Programming
Assumption 2.3. (a) The one-step reward function r is upper semicontinuous, sup-compact, and bounded above; (b) the transition law Q is strongly continuous; (c) there exists a policy π such that v(π, x) > −∞, for each x ∈ X.
Definition 2.4. The value iteration (VI) functions are defined as

v_n(x) := max_{a∈A(x)} { r(x, a) + α ∫ v_{n−1}(y) Q(dy | x, a) },

for all x ∈ X and n = 1, 2, . . ., with v_0(x) := 0.
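As an illustration of Definition 2.4, the VI recursion can be run numerically. The Python sketch below applies value iteration on a small finite MDP (the finite state and action grid, the reward matrix, and the transition law are illustrative assumptions; the paper itself works on Euclidean spaces):

```python
import numpy as np

alpha = 0.9                      # discount factor
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
# r[x, a]: one-step reward; P[a, x, y]: transition law Q(y | x, a)
r = rng.uniform(-1.0, 0.0, size=(n_states, n_actions))
P = rng.uniform(size=(n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

def value_iteration(r, P, alpha, n_iter=500):
    """Return (v, f): approximate optimal value function and its maximizer."""
    v = np.zeros(n_states)              # v_0 = 0, as in Definition 2.4
    for _ in range(n_iter):
        # q[x, a] = r(x, a) + alpha * sum_y Q(y | x, a) v_{n-1}(y)
        q = r + alpha * np.einsum('axy,y->xa', P, v)
        v = q.max(axis=1)               # v_n(x)
        f = q.argmax(axis=1)            # maximizer f_n (cf. Remark 2.6)
    return v, f

v, f = value_iteration(r, P, alpha)
# v satisfies the dynamic programming (optimality) equation up to tolerance:
q = r + alpha * np.einsum('axy,y->xa', P, v)
assert np.allclose(v, q.max(axis=1), atol=1e-6)
```

Since the DP operator is an α-contraction, 500 iterations bring the Bellman residual far below the tolerance used in the final check.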
The following theorem is well known in the literature on MDPs (see [1–4]). The proof can be consulted in [3, page 46].

Theorem 2.5. Suppose that Assumption 2.3 holds. Then (a) the optimal value function V is a solution of the OE (see Definition 2.2); (b) there exists f ∈ F such that

V(x) = r(x, f(x)) + α ∫ V(y) Q(dy | x, f(x)), x ∈ X.

Remark 2.6. Under Assumption 2.3, it is possible to demonstrate that for each n = 1, 2, . . . there exists a stationary policy f_n ∈ F such that

v_n(x) = r(x, f_n(x)) + α ∫ v_{n−1}(y) Q(dy | x, f_n(x)), x ∈ X

(see [3, pages 27-28]).

Notation and Preliminaries
Let X and Y be Euclidean spaces and consider the following notation: C²(X, Y) denotes the set of functions l : X → Y with a continuous second derivative; when Y = R, C²(X, Y) will be denoted by C²(X), and in some cases it will be written only as C². The proof of the following lemma is similar to the proof of Theorem 1 in [16].
Lemma 3.1. Suppose that … . Then there exists a function l : X → A … . Observe that (a) implies that Λ(x, ·) is a strictly concave function, for each x ∈ X. Then the maximizer l : X → A is unique.
The proof of the following lemma can be consulted in [19, Theorem 25.7, page 248].
Lemma 3.3. Let C ⊂ R^n be an open and convex set. Let g : C → R be a concave and differentiable function, and let {g_n} be a sequence of differentiable, concave, real-valued functions on C such that g_n(x) → g(x) for each x ∈ C. Then g'_n(x) → g'(x) for each x ∈ C.
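Lemma 3.3 can be checked on a concrete example. The sketch below (the particular functions g and g_n are illustrative choices, not from the paper) verifies that for concave, differentiable g_n converging pointwise to a concave, differentiable g on an open convex set, the derivatives converge as well:

```python
import numpy as np

# g_n(x) = -x**2 - x**4/n are concave and differentiable on C = (-1, 1)
# (second derivative -2 - 12 x**2 / n < 0) and converge pointwise to
# g(x) = -x**2; Lemma 3.3 then gives g_n' -> g'.
x = np.linspace(-0.99, 0.99, 201)

g   = lambda x: -x**2
dg  = lambda x: -2.0 * x
gn  = lambda x, n: -x**2 - x**4 / n
dgn = lambda x, n: -2.0 * x - 4.0 * x**3 / n

# errors in the functions and in the derivatives both shrink as n grows
err_fun = [np.max(np.abs(gn(x, n)  - g(x)))  for n in (1, 10, 100)]
err_der = [np.max(np.abs(dgn(x, n) - dg(x))) for n in (1, 10, 100)]
assert err_fun[0] > err_fun[1] > err_fun[2]
assert err_der[0] > err_der[1] > err_der[2]
```

Both error sequences decay like 1/n here, consistent with the lemma's conclusion.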

An Envelope Formula in MDPs
Let (X, A, {A(x) | x ∈ X}, Q, r) be a fixed Markov control model. Throughout this section it is assumed that Assumption 2.3 holds. Also, it is supposed that X ⊆ R^n and A ⊆ R^m are convex sets with nonempty interiors, and that X is partially ordered. It is considered that the set-valued mapping x → A(x) is nondecreasing and convex, and that A(x) has nonempty interior, for each x ∈ X. Also, it is assumed that the transition law Q is given by a difference equation

x_{t+1} = L(F(x_t, a_t), ξ_t), t = 0, 1, . . .,

with a given initial state x_0 = x ∈ X fixed, where {ξ_t} is a sequence of independent and identically distributed (iid) random variables, independent of x_0 = x ∈ X and taking values in a Borel space S ⊂ R^k. Let ξ be a generic element of the sequence {ξ_t}. The density of ξ is denoted by Δ; L : X̃ × S → X is a measurable function, with X̃ ⊂ R^m, and F : K → X̃ is a measurable function too.
Since Assumption 2.3 holds, Theorem 2.5 applies. Therefore, the optimal value function (see Definition 2.1) satisfies

V(x) = max_{a∈A(x)} { r(x, a) + α ∫ V(y) Q(dy | x, a) },

and the VI functions (see Definition 2.4) satisfy

v_n(x) = max_{a∈A(x)} { r(x, a) + α ∫ v_{n−1}(y) Q(dy | x, a) },

for each n = 1, 2, . . ., with v_0(x) := 0. In addition, by Theorem 2.5, the optimal policy exists and will be denoted by f. Furthermore, there exists the maximizer f_n of v_n for each n = 1, 2, . . . (see Remark 2.6). Let G : K → R be the function defined as

G(x, a) := r(x, a) + αH(x, a), (3.6)

for (x, a) ∈ K, where

H(x, a) := E[V(L(F(x, a), ξ))], (3.7)

and, analogously, G_n(x, a) := r(x, a) + αH_n(x, a) with H_n(x, a) := E[v_{n−1}(L(F(x, a), ξ))], (3.8) for each n = 1, 2, . . ., with v_0(x) := 0 and (x, a) ∈ K.
Assumption 3.4. (a) r is a strictly concave function, and r(·, a) is an increasing function on X for each fixed a ∈ A; (b) L(·, s) is a concave and increasing function for each s ∈ S; F is a concave function, and F(·, a) is an increasing function on X, for each a ∈ A.

Lemma 3.5. Under Assumption 3.4, it results that v_n is a strictly concave function and f_n is unique, for all n = 1, 2, . . .. Also, V is a strictly concave function and f is unique.
Proof. By Assumption 3.4(a), it suffices to prove Condition C1 (see [20, Lemma 6.2]), which guarantees the result. Let Ψ : K × S → X be defined by Ψ(x, a, s) := L(F(x, a), s). (3.9) Then, for each s ∈ S, the function Ψ(·, ·, s) is concave on K by Assumption 3.4(b). Indeed, since F is a concave function,

F(β(x, a) + (1 − β)(y, b)) ≥ βF(x, a) + (1 − β)F(y, b),

for (x, a), (y, b) ∈ K and β ∈ (0, 1). Furthermore, since L(·, s) is a concave and increasing function for each s ∈ S,

Ψ(β(x, a) + (1 − β)(y, b), s) ≥ βΨ(x, a, s) + (1 − β)Ψ(y, b, s).

By similar arguments, it can be shown that if x < y, then Ψ(x, a, s) ≤ Ψ(y, a, s), for each s ∈ S and a ∈ A(y). Then the result follows.
where, in this case, R_s denotes the derivative of R with respect to the second variable, and the determinant of R_s is denoted by det(R_s); (d) Δ ∈ C²(int S; R), and the interchange between derivatives and integrals is valid (see Remark 3.8).

Lemma 3.7. Under Assumption 3.6, it results that H ∈ C²(int K; R), with H defined in (3.7).
Proof. The proof is similar to the proof of Lemma 5 in [16]. Assumption 3.6 allows expressing the stochastic kernel (see (3.3)) in the following form: for each measurable subset B of X and (x, a) ∈ K,

Q(B | x, a) = ∫ I_B(L(F(x, a), u)) Δ(u) du. (3.13)

Then, by the change of variable theorem, (3.14) is obtained. It follows from (3.13) that H can be expressed as H(x, a) = ∫ K(x, a, u) du, where K is the integrand obtained in (3.14). Now, using Assumption 3.6, the result follows.
Remark 3.8. In Lemma 3.7, Assumption 3.6(d) was used to guarantee the second-order differentiability of the integral ∫ K(x, a, u) du with respect to x or a, where (F(x, a), u) belongs to the interior of the corresponding product space. This condition can be verified in practice when the first- and second-order derivatives of K can be bounded by functions g_i integrable with respect to u, for i = 1, . . ., 5 (see Remark 10 in [16]).
Assumption 3.9. (a) The optimal policy f satisfies f(x) ∈ int A(x), for each x ∈ X; (b) the sequence {f_n} of maximizers of the VI functions satisfies f_n(x) ∈ int A(x), for each x ∈ X and n = 1, 2, . . ..

Define W by

W(x, a) := r_x(x, a) − r_a(x, a) F_a(x, a)^{−1} F_x(x, a), (3.17)

for (x, a) ∈ K.
Remark 3.10. Assumption 3.9 evidently holds if A(x) is open for every x ∈ X, since then f(x) and f_n(x) belong to the interior of A(x), x ∈ X. Also, in some particular cases (see [8, 16]), the interiority of f(x) and f_n(x) is guaranteed by the mean value theorem.
Theorem 3.11. Under Assumptions 3.4, 3.6, and 3.9(a), it results that f ∈ C¹(int X; A), V ∈ C²(int X; R), and

V'(x) = W(x, f(x))

for each x ∈ int X, where W is defined in (3.17).
Proof. Let x ∈ int X be fixed. Note that Assumptions 3.4 and 3.6 imply that G ∈ C²(int K; R), where G is defined in (3.6). Indeed, since Assumptions 3.4(a) and 3.6(a) hold, it is known that r ∈ C²(int K; R) and r_aa(x, ·) is negative definite. Moreover, Lemma 3.5 implies that H(x, a) = E[V(L(F(x, a), ξ))] is a concave function, and by Lemma 3.7 it follows that H ∈ C²(int K; R), obtaining that H_aa(x, ·) is negative semidefinite (see [21, page 260]). Furthermore, by Assumption 3.9(a) and applying Lemma 3.1, it is concluded that f ∈ C¹(int X; A) and V ∈ C²(int X; R).
On the other hand, it is obtained that G_a(x, a) = r_a(x, a) + αH_a(x, a), for each a ∈ int A(x). Then the first-order condition and the invertibility of F_a (see Assumption 3.6(b)) imply that G_a(x, f(x)) = 0, that is,

αH_a(x, f(x)) = −r_a(x, f(x)). (3.20)

Moreover, since V satisfies (2.4) and f ∈ F is the optimal policy,

V(x) = G(x, f(x)). (3.21)
Using the fact that G_a(x, f(x)) = 0, it is possible to obtain the following envelope formula:

V'(x) = G_x(x, f(x)) = r_x(x, f(x)) + αH_x(x, f(x)). (3.23)

Finally, substituting (3.20) into (3.23), it follows that V'(x) = W(x, f(x)), with W defined in (3.17).
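In compact form, the envelope argument just used can be sketched as follows (a sketch in the notation of this section; the chain rule step uses the differentiability of f provided by Lemma 3.1):

```latex
V(x) = G\bigl(x, f(x)\bigr)
\quad\Longrightarrow\quad
V'(x) = G_x\bigl(x, f(x)\bigr) + G_a\bigl(x, f(x)\bigr)\, f'(x)
      = G_x\bigl(x, f(x)\bigr),
```

since the first-order condition gives G_a(x, f(x)) = 0, which eliminates the second term.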
Theorem 3.12. Under Assumptions 3.4, 3.6, and 3.9(b), it results that f_n ∈ C¹(int X; A), v_n ∈ C²(int X; R), and v'_n(x) = W(x, f_n(x)) for each x ∈ int X and n = 1, 2, . . ., where W is defined in (3.17).

Proof. The proof will be made by induction. Let x ∈ int X be fixed. Since v_1(x) = max_{a∈A(x)} G_1(x, a), where G_1 is defined in (3.8), and by Assumptions 3.4(a) and 3.6(a), it follows that G_1 ∈ C²(int K; R) and G_1,aa(x, a) is negative definite. By Assumption 3.9(b), f_1(x) ∈ int A(x), and applying Lemma 3.1, it follows that f_1 ∈ C¹(int X; A) and v_1 ∈ C²(int X; R). Analogously, G_2 ∈ C²(int K; R) and G_2,aa is negative definite. Now, since f_2(x) ∈ int A(x) (see Assumption 3.9(b)), applying Lemma 3.1 again, it follows that f_2 ∈ C¹(int X; A) and v_2 ∈ C²(int X; R). Furthermore, the first-order condition implies that G_2,a(x, f_2(x)) = 0. By the invertibility of F_a (see Assumption 3.6(b)), it follows that

αH_2,a(x, f_2(x)) = −r_a(x, f_2(x)). (3.30)
On the other hand, the envelope formula gives v'_2(x) = G_2,x(x, f_2(x)) (3.31), and substituting (3.30) into (3.31), it is obtained that v'_2(x) = W(x, f_2(x)), where W is defined in (3.17). Now, suppose that v_{n−1} ∈ C²(int X; R) with n > 2. Using arguments similar to the case n = 2, it is possible to demonstrate that

v'_n(x) = W(x, f_n(x)). (3.33)
Journal of Applied Mathematics

Assumption 3.13. For each x ∈ X, the function W(x, ·) has a continuous inverse function, denoted by w.
Theorem 3.14. Under Assumptions 3.4, 3.6, 3.9, and 3.13, it follows that f_n(x) → f(x) as n → ∞, for each x ∈ int X.

Proof. It is known by Lemma 3.5 and Theorem 3.11 that the optimal value function V is concave and differentiable on int X. In addition, for each n ∈ N, v_n is a concave and differentiable function on int X. Then from Lemma 3.3 it follows that v'_n(x) → V'(x) when n goes to ∞. Now, by Assumption 3.13, it is concluded that, for n = 2, 3, . . .,

f_n(x) = w(x, v'_n(x)), f(x) = w(x, V'(x)), (3.36)

where f_n is the maximizer of v_n and f is the optimal policy. Finally, the convergence f_n(x) → f(x) is guaranteed by the continuity of w (see Assumption 3.13).

Euler Equation
Theorem 3.15. Under Assumptions 3.4, 3.6, 3.9, and 3.13, it follows that

f_n(x) = w(x, v'_n(x)) (3.37)

for each x ∈ int X and n ∈ N, where w is the function given in Assumption 3.13.
Proof. Let x ∈ int X be fixed. By Lemma 3.5 and Theorem 3.12, it is known that v_n ∈ C²(int X; R) and that it is a concave function. Now, from the first-order condition and the invertibility of F_a (see Assumption 3.6(b)), it follows that v'_n(x) = W(x, f_n(x)), and using the invertibility of W(x, ·) (see Assumption 3.13), it follows that f_n(x) = w(x, v'_n(x)).
Corollary 3.16. The optimal value function satisfies

f(x) = w(x, V'(x)), x ∈ int X. (3.41)

Proof. Let x ∈ int X be fixed. It is known that the VI functions satisfy the Euler equation (3.37), so applying Lemma 3.3, it is obtained that v'_n(x) → V'(x) when n → ∞. Also, from Assumption 3.13, w is a continuous function. Then, letting n go to infinity in (3.37), it follows that the optimal value function satisfies (3.41).

A Linear-Quadratic Model
Consider X = A = A(x) = R^n, for each x ∈ X. The dynamics of the system are given by

x_{t+1} = Bx_t + Ca_t + ξ_t, t = 0, 1, . . .,

with x_0 = x ∈ X given. B and C are invertible matrices of size n × n, and {ξ_t} is a sequence of iid column random vectors with values in R^n. Let ξ be a generic element of the sequence {ξ_t}; assume that ξ has a density Δ with Δ ∈ C², and that E[ξ] equals the zero vector. Furthermore, it is assumed that if P is a symmetric negative definite matrix of size n × n, then E[ξ^T Pξ] is finite. In addition, it is assumed that the interchange between derivatives and integrals is valid (see Remark 3.8). A particular case of this assumption can be found in [16, page 315].
The reward function is given by

r(x, a) = x^T Qx + a^T Ra, (4.2)

where x^T and a^T denote the transposes of the vectors x and a, and Q and R are symmetric negative definite matrices of size n × n.

Lemma 4.1. Assumption 2.3 holds for the linear-quadratic model.

Proof. Note that O_x(γ) := {a ∈ A | r(x, a) ≥ γ} is a compact set, for each x ∈ X and γ ∈ R. Indeed, let x ∈ X and γ ∈ R. If some sequence {a_n} in O_x(γ) satisfied x^T Qx + a_n^T Ra_n → −∞, this would contradict r(x, a_n) ≥ γ; hence, since Q and R are negative definite, O_x(γ) is a bounded set. In addition, if {a_n} ⊂ O_x(γ) is such that a_n → a, then by the continuity of r it follows that r(x, a) ≥ γ, implying that O_x(γ) is a closed set. Therefore, the reward function r is sup-compact. Finally, note that r is a nonpositive and continuous function on K. So Assumption 2.3(a) holds.
On the other hand, a direct computation of W in (3.17) for this model, with F(x, a) = Bx + Ca (so that F_x = B and F_a = C), shows that W(x, ·) is an invertible affine function of a, since B, C, and R are invertible; hence its inverse w(x, ·) exists and is a continuous function. Therefore, Assumption 3.13 is satisfied.

Since v'_1(x) = 2Qx and E[ξ] is equal to the zero vector, then f_1(x) = w(x, 2Qx), and by direct calculations it is obtained that v'_2(x) = 2K_2 x, where K_2 is given by the recursion defining K_n in (4.15), with K_1 = Q. Now, suppose that v'_n(x) = 2K_n x for n > 2, with K_n defined in (4.15). Then, by Theorem 3.15 and (4.13), it is known that f_n(x) = w(x, 2K_n x), for each n = 1, 2, . . . and x ∈ X. Moreover, the validity of Theorem 3.14 is guaranteed by Lemma 4.2, that is, f_n(x) → f(x), implying the convergence of the sequence {K_n}, which, according to its definition in (4.15), guarantees that its limit, denoted by K, must satisfy (4.25). Finally, using matrix algebra, (4.24) is obtained.
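Numerically, the limit K can be approximated by iterating the recursion from K_1 = Q. The Python sketch below uses a Riccati-type map consistent with the standard discounted linear-quadratic derivation (the specific matrices B, C, Q, R and the discount factor are illustrative assumptions, and the exact form of the map, including the placement of α, follows that standard derivation rather than quoting the paper's equation (4.15)):

```python
import numpy as np

alpha = 0.9
B = np.array([[0.5, 0.1], [0.0, 0.4]])
C = np.eye(2)                      # invertible, as the model requires
Q = -np.eye(2)                     # symmetric negative definite
R = -2.0 * np.eye(2)               # symmetric negative definite

def riccati_step(K):
    """One step of the Riccati-type map:
    K -> Q + alpha*B^T (K - alpha*K C (R + alpha*C^T K C)^{-1} C^T K) B."""
    M = np.linalg.inv(R + alpha * C.T @ K @ C)
    return Q + alpha * B.T @ (K - alpha * K @ C @ M @ C.T @ K) @ B

K = Q.copy()                       # K_1 = Q, since v_1(x) = x^T Q x
for _ in range(200):
    K = riccati_step(K)

# K is (numerically) a fixed point of the Riccati-type equation:
assert np.allclose(K, riccati_step(K), atol=1e-10)

# corresponding linear feedback gain, f(x) = F x, from the first-order condition
F = -alpha * np.linalg.inv(R + alpha * C.T @ K @ C) @ C.T @ K @ B
```

With these stable illustrative matrices the iteration contracts quickly, mirroring the convergence of {K_n} asserted in the text.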

Conclusion
In this paper a method to solve the optimal control problem is presented. This method is based on the use of the Euler equation. The procedure proposed to solve the optimal control problem relies on an envelope formula and on the convergence of the maximizers of the value iteration functions to a stationary optimal policy. Future work aims to study possible error bounds for the approximation of the maximizers to the optimal policy.


Lemma 4.4. The optimal policy for the linear-quadratic problem is f(x) = w(x, 2Kx), where K is the limit of the sequence {K_n} and satisfies the Riccati-type equation

K = Q + αB^T [K − αKC(R + αC^T KC)^{−1} C^T K] B.

Γ_x and Γ_y denote the partial derivatives of Γ with respect to x and y, respectively. The notation for the second partial derivatives of Γ is Γ_xx, Γ_xy, Γ_yx, and Γ_yy. For any set C ⊂ X, a point x ∈ C is called an interior point of C if there exists an open set U such that x ∈ U ⊂ C. The interior of C is the set of all interior points of C, denoted by int C. The set-valued mapping Θ from X to Y is said to be …

For the dynamics x_{t+1} = Bx_t + Ca_t + ξ_t, the transition law satisfies, for U ∈ B(X),

Q(U | x, a) = Pr(x_{t+1} ∈ U | x_t = x, a_t = a) = ∫ I_U(Bx + Ca + s) Δ(s) ds, (4.4)

where I_U denotes the indicator function of U. Since the density Δ is continuous, it is obtained that the transition law Q is weakly continuous, that is, Assumption 2.3(b) holds. Finally, let h ∈ F be defined as in (4.8), where v is given by (2.1); therefore, Assumption 2.3(c) holds. Assumption 3.6(c) also holds. Furthermore, since Δ ∈ C², it results that Assumption 3.6(d) holds. On the other hand, Assumption 3.9 is satisfied since A(x) = R^n is open, for each (x, a) ∈ K. Since C is an invertible matrix, Assumption 3.6(b) holds.
Observe that the validity of Theorem 3.15 is guaranteed by Lemmas 4.1 and 4.2. Now, since Q and R are negative definite, then …