Decomposition Methods for Solving Finite-Horizon Large MDPs

Conventional algorithms for solving Markov decision processes (MDPs) become intractable for large finite state and action spaces. Several studies have been devoted to this issue, but most of them treat only infinite-horizon MDPs. This paper is one of the first works to deal with non-stationary finite-horizon MDPs. It proposes a new decomposition approach that partitions the problem into smaller restricted finite-horizon MDPs; each restricted MDP is solved independently, in a specific order, using the proposed hierarchical backward induction (HBI) algorithm, which builds on the backward induction (BI) algorithm. The sub-local solutions are then combined to obtain a global solution. An example of racetrack problems demonstrates the performance of the proposed decomposition technique.


Introduction
Stochastic models have recently gained much attention in the artificial intelligence (AI) community; they offer a suitable framework for solving problems with uncertainty. MDPs [1] are one such model and have achieved promising results in numerous applications [2][3][4]. Most real-world problems have very large state spaces that require many mathematical operations and substantial memory, so it is intractable to solve them with classical MDP algorithms [5]. Motivated by these considerations, several recent lines of research have used decomposition techniques to overcome the computational complexity. The decomposition approach introduced by Bather [6] divides the state space into strongly connected components (SCCs) organized into levels and solves the small problems, called restricted MDPs, separately at each level; the global solution of the original MDP is obtained by combining the partial solutions. Subsequently, Ross and Varadarajan [7] proposed a similar decomposition technique for solving constrained limiting-average MDPs. Decomposition has been employed for various categories of MDPs (discounted, average, and weighted MDPs) in diverse studies [8,9]. The weak point of these approaches is their polynomial execution complexity. To accelerate execution time, Chafik and Daoui integrated the decomposition and parallelism schemes [10]; unfortunately, the decomposition algorithm remains polynomial at runtime. Subsequently, to accelerate the convergence of the decomposition, Larach and Daoui [11] investigated decomposing the state space into SCCs arranged in levels based on Tarjan's algorithm [12]. Later work [13][14][15] developed approaches to solve MDPs with the factorization methods introduced by [16]. The goal of factoring a problem is to decompose it into smaller items. Factored MDPs produce compact representations of complex and uncertain systems, allowing an exponential reduction in the complexity of the representation [17].
These factored approaches represent states as factored states with an internal structure and state transition matrices as dynamic Bayesian networks (DBNs). However, methods for solving factored DBN representations do not exploit advances in tensor decomposition methods for representing large atomic MDPs. More recently, research such as [17,18], exploiting an idea similar to the thesis of Smart [19], aims at improving the efficiency of MDP solvers by using tensor decomposition methods to compact state transition matrices. The solver uses the value iteration and policy iteration algorithms to compute the solution compactly. The authors tried different ways to parallelize their proposed approaches, but no improvement in execution time was observed. These methods are based on multiplications between small tensor components. Further work has addressed this problem in the context of parallelism [20], presenting a way to decompose an MDP into SCCs and find dependency chains for these SCCs. The authors solve independent chains of SCCs with a proposed variant of the topological value iteration (TVI) algorithm, called parallel chained TVI, aimed at improving execution time on GPUs. In this context, research groups [21][22][23] have developed parallel versions of iterative algorithms to accelerate their convergence. The literature mentioned above focuses only on solving different types of MDPs under the infinite-horizon criterion; papers that address large non-stationary finite-horizon MDPs are scarce. This paper is oriented in this underexplored direction: its main objective is to propose a new decomposition technique tackling the challenges of reducing memory requirements and computational cost. The proposed technique partitions the global problem into smaller restricted finite-horizon MDPs; each restricted MDP is solved independently, in a specific order, using the backward induction algorithm.
Next, the sub-local solutions are combined to obtain a global solution.
Many problems are modeled as MDPs with a given initial state i_0; then the optimal action f(i_0) and the optimal value V_T(i_0) are computed by solving only the restricted MDPs corresponding to the classes accessible from the class containing i_0 (one does not need to consider all states).
This is a further advantage of the method, which reduces memory consumption and speeds up computation.
The remainder of the article is organized as follows: the second section introduces the fundamentals of finite-horizon MDPs. The third section focuses on the decomposition technique and describes the new finite-horizon restricted MDP. The fourth section presents the proposed hierarchical backward induction algorithm. The last section illustrates the advantages of this decomposition technique through its application to a racetrack problem. The paper ends with conclusions and prospects for future work.

Markov Decision Process
Markov decision processes have been widely studied as an elegant mathematical formalism for many decision-making problems in a variety of fields of science and engineering [24]. The objective is to compute the best decision policies (action selections) that achieve the maximum expected reward (minimum cost) in a given stochastic dynamic environment satisfying the Markov property [1]. In this section, we present non-stationary finite-horizon MDPs with finite state and action spaces.
Formally, a non-stationary finite-horizon MDP is defined by the five-tuple (S, A, T, P, R), where S and A are the state and action spaces; T is the time horizon; P denotes the state transition probability function, where p^t_{iaj} is the probability of moving from state i to state j by taking action a at time t; S_t (A_t) is a random variable indicating the state (action) at time t; and R is the reward function defined on state transitions, where r^t_{ia} indicates the reward gained if action a is executed in state i at period t. Most MDP solvers attempt to find an optimal policy that specifies the (optimal) action to be taken in each state. Over a finite planning horizon T, an optimal policy π* is a policy that maximizes the expected reward; π* maximizes the value function of the Bellman equation [24]

V^π_T(i) = E_π[ Σ_{t=1}^{T} r^t_{X_t Y_t} | X_1 = i ],

which is the total expected reward in T periods, given that the process starts from initial state i and policy π is used,
where X_t and Y_t are the random variables representing, respectively, the state and action at time t. We also define the optimal value vector V_T by V_T(i) = max_π V^π_T(i) for all i ∈ S. It is well known that the backward induction algorithm is one of the most common iterative methods used to find an optimal policy. The next section discusses it in more detail.

Backward Induction Algorithm
In this section, we compute an optimal policy as well as the optimal value vector using the backward induction algorithm. Its iterative process starts at the end of the planning horizon T and computes the values for the preceding periods; after T iterations an optimal policy is found. The following theorem, introduced in [25], establishes the validity of the BI algorithm.

Theorem 1. Define recursively, for t = T, T−1, ..., 1, a deterministic decision rule f_t and the vector x^t as follows: x^{T+1} = 0 and

x^t(i) = max_{a∈A(i)} [ r^t_{ia} + Σ_{j∈S} p^t_{iaj} x^{t+1}(j) ],   f_t(i) ∈ arg max_{a∈A(i)} [ r^t_{ia} + Σ_{j∈S} p^t_{iaj} x^{t+1}(j) ].

Then R = (f_1, f_2, ..., f_T) is an optimal policy and x^1 is the optimal value vector V_T.
To accelerate the execution time of the classical BI algorithm, the authors used the proposal of [11]: for each action a, they introduced the list of state-action successors (states reachable with nonzero probability), so that the sums in the recursion run only over nonzero transitions.
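The backward recursion of Theorem 1 can be sketched as follows (a minimal illustration using sparse successor lists; the data layout and names are our own, not the paper's C++ implementation):

```python
def backward_induction(states, actions, T, succ, reward):
    """Compute an optimal policy and value vector by backward induction.

    succ[t][i][a]   -> list of (j, p) pairs with p = p^t_{iaj} > 0
    reward[t][i][a] -> immediate reward r^t_{ia}
    Returns (policy, value) with policy[t][i] an optimal action at time t
    and value[i] = V_T(i), i.e., the vector x^1 of Theorem 1.
    """
    x_next = {i: 0.0 for i in states}          # x^{T+1} = 0
    policy = [None] * (T + 1)                  # policy[t] used for t = 1..T
    for t in range(T, 0, -1):                  # t = T, T-1, ..., 1
        x_t, f_t = {}, {}
        for i in states:
            best_val, best_a = float("-inf"), None
            for a in actions[i]:
                q = reward[t][i][a] + sum(p * x_next[j] for j, p in succ[t][i][a])
                if q > best_val:
                    best_val, best_a = q, a
            x_t[i], f_t[i] = best_val, best_a
        x_next, policy[t] = x_t, f_t
    return policy, x_next
```

Only states j with p^t_{iaj} > 0 appear in `succ`, which is exactly the successor-list speed-up of [11].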

Hierarchical Backward Induction Algorithm
The BI algorithm becomes quite impractical for computing an optimal policy for finite-horizon MDPs with a large state space. For non-stationary finite-horizon MDPs, the computational load increases further. To overcome this issue, we describe in this section a new decomposition technique for improving performance and reducing running time.

The Decomposition Technique.
Let us consider a directed graph G = (S, U) associated with the original MDP, where S is the set of nodes representing the state space and U is the set of arcs (i, j) such that p^t_{iaj} > 0 for some action a ∈ A(i) and time t. There exists a unique partition S = C_1 ∪ C_2 ∪ ... ∪ C_p of the state space S into strongly connected classes. Note that the SCCs are the classes of the relation on G defined by: i is strongly connected to j if and only if i = j or there exist both a directed path from i to j and a directed path from j to i. There are many good graph algorithms for computing such a partition, e.g., see [11]. Now we construct the levels of the graph G by induction. Level L_0 is formed by all closed classes C_i, that is, classes such that for all i ∈ C_i and a ∈ A(i): p_{iaj} = 0 for all j ∉ C_i. Level L_p is formed by all classes C_i such that the endpoint of any arc emanating from C_i lies in some level L_{p−1}, L_{p−2}, ..., L_0. After finding the SCCs using Tarjan's algorithm, their levels are determined using the following algorithm (Algorithm 2), introduced in [26].
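The SCC-and-level construction can be sketched as follows (a compact illustration; we use Kosaraju's algorithm for brevity, whereas the paper uses Tarjan's algorithm [12] — both yield the same partition):

```python
def scc_levels(nodes, edges):
    """Partition a directed graph into SCCs (Kosaraju's algorithm) and
    assign each SCC a level: level 0 = closed classes (no arcs leaving
    the class), level p = classes whose outgoing arcs all end in levels
    below p.  Returns (comp, level): comp[node] -> SCC id,
    level[SCC id] -> level."""
    adj = {u: [] for u in nodes}
    radj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    # First pass: post-order finishing times on the original graph.
    order, seen = [], set()
    for s in nodes:
        if s in seen:
            continue
        stack = [(s, iter(adj[s]))]
        seen.add(s)
        while stack:
            u, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                order.append(u)
                stack.pop()
            elif nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, iter(adj[nxt])))

    # Second pass: collect SCCs on the reversed graph.
    comp, ncomp = {}, 0
    for s in reversed(order):
        if s in comp:
            continue
        stack = [s]
        comp[s] = ncomp
        while stack:
            u = stack.pop()
            for v in radj[u]:
                if v not in comp:
                    comp[v] = ncomp
                    stack.append(v)
        ncomp += 1

    # Levels computed on the condensation DAG.
    csucc = {c: set() for c in range(ncomp)}
    for u, v in edges:
        if comp[u] != comp[v]:
            csucc[comp[u]].add(comp[v])
    level = {}
    def lvl(c):
        if c not in level:
            level[c] = 0 if not csucc[c] else 1 + max(lvl(d) for d in csucc[c])
        return level[c]
    for c in range(ncomp):
        lvl(c)
    return comp, level
```

The condensation (one node per SCC) is acyclic, so the level recursion always terminates.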
For each level L_l, l = 0, 1, ..., L, let (C_lk), k ∈ {1, 2, ..., K(l)}, be the strongly connected classes corresponding to the nodes in level l (see Figure 1). Each class C_lk leads to a restricted MDP_lk that is solved independently; the global solution is obtained by combining these partial solutions. The hierarchical method, used by several researchers for several categories of MDPs to address the "curse of dimensionality" of large MDPs, was described by [27] and later further developed by [11,26]. It consists of breaking the state space into small subsets, solving the restricted MDP problems corresponding to these subsets, and combining their solutions to determine the solution of the global problem. Based on the above decomposition technique, we propose a hierarchical backward induction (HBI) algorithm that decomposes the original finite-horizon MDP into restricted MDPs corresponding to each SCC. These restricted MDPs are solved independently and according to their level.
ALGORITHM 2: Finding levels.
Journal of Mathematics

The performance of the proposed algorithm is best exposed when the optimal policy for a known initial state is sought: the algorithm then solves only the restricted MDPs corresponding to the classes reachable from the initial state. For example, in Figure 1 the initial state S_0 belongs to class C_10, so only the restricted MDPs corresponding to the SCCs C_10, C_00, and C_01 are solved.
In the next subsection, we define the new MDPs, called restricted finite-horizon MDPs.
Each class C_pk at level p induces a restricted finite-horizon MDP_pk: its state space is S_pk = C_pk ∪ {j ∈ E_p : p^t_{iaj} > 0 for some i ∈ C_pk, a ∈ A(i)}, where E_p denotes the set of states belonging to levels below p; for i ∈ C_pk the actions, rewards, and transition probabilities are those of the original MDP, while each boundary state i ∈ S_pk \ C_pk is made absorbing through a single dummy action θ with p^t_pk(i, θ, i) = 1 and reward r^t_pk(i, θ) = V^t_mh(i)/N, where V^t_mh is the value already computed for the lower-level class C_mh containing i. According to this definition of the restricted finite-horizon MDPs, we remark that they are solved in ascending order of levels, and that within the same level L_p the restricted finite-horizon MDPs are independent, so they can be solved in parallel.
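The level-by-level scheme can be sketched as follows (an illustrative simplification assuming stationary rewards and transitions; instead of the dummy action θ, boundary values are looked up directly from the already-solved lower-level classes, and all names are ours):

```python
def hbi(states, actions, T, succ, reward, comp, level):
    """Hierarchical backward induction (illustrative sketch).

    Classes (SCCs) are solved in ascending level order; transitions that
    leave a class land only in classes already solved, so their values
    are simply looked up.  comp[i] -> class id, level[c] -> level of c,
    succ[i][a] -> list of (j, p) pairs, reward[i][a] -> r_{ia}.
    Returns (policy, x) with x[t][i] = optimal reward over periods t..T.
    """
    x = {t: {} for t in range(1, T + 2)}
    for i in states:
        x[T + 1][i] = 0.0                      # x^{T+1} = 0 everywhere
    classes = {}
    for i in states:
        classes.setdefault(comp[i], []).append(i)
    policy = {t: {} for t in range(1, T + 1)}
    # Ascending level order: level-0 (closed) classes are solved first.
    for c in sorted(classes, key=lambda c: level[c]):
        for t in range(T, 0, -1):
            for i in classes[c]:
                best, arg = float("-inf"), None
                for a in actions[i]:
                    q = reward[i][a] + sum(p * x[t + 1][j] for j, p in succ[i][a])
                    if q > best:
                        best, arg = q, a
                x[t][i], policy[t][i] = best, arg
    return policy, x
```

Because arcs from a level-p class end only inside the class or in lower levels, every value `x[t+1][j]` needed by a backup has already been computed when it is read.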

Hierarchical Backward Induction Algorithm.

Based on the above restricted finite-horizon MDPs, we present in this section a new algorithm, the hierarchical backward induction (HBI) algorithm (Algorithm 3). The main contribution is to show that the optimal value in the restricted MDP_pk equals the optimal value in the original MDP (Theorem 2). The corresponding restricted finite-horizon MDPs are constructed and immediately solved using this procedure. The following theorem establishes the validity of the HBI algorithm.

Theorem 2. Let R* = (f_1, f_2, ..., f_T) and V_T be, respectively, an optimal policy and the optimal value vector of the original MDP. If R_pk = (f^1_pk, f^2_pk, ..., f^T_pk) and V_T^pk are, respectively, an optimal policy and the optimal value vector of the restricted MDP_pk, then for all i ∈ C_pk and t = 1, 2, ..., T, f^t_pk(i) = f^t(i) is an optimal action in the original MDP and V_T^pk(i) = V_T(i).
Proof. The proof is by induction on the level p. For p = 0 (level L_0) and k ∈ {1, ..., K(0)}, let R_0k = (f^1_0k, ..., f^T_0k) and V_T^0k be, respectively, an optimal policy and the optimal value vector of the restricted MDP_0k.
According to Theorem 1, we have for t = T, T−1, ..., 1:

f^t_0k(i) = arg max_{a∈A_0k(i)} [ r^t_0k(i, a) + Σ_{j∈S_0k} p^t_0k(i, a, j) x^{t+1}(j) ].

From the definition of the restricted MDP_0k, the state space is S_0k = C_0k, the action space is A_0k(i) = A(i) for i ∈ S_0k, and for t = T, T−1, ..., 1 the transition probabilities satisfy p^t_0k(i, a, j) = p^t(i, a, j) for all i, j ∈ C_0k and a ∈ A_0k(i), while the rewards satisfy r^t_0k(i, a) = r^t_{ia}. Furthermore, the class C_0k is closed, so for t = 1, ..., T these backups coincide with those of the original MDP restricted to C_0k.
Therefore, for all i ∈ C_0k and t = 1, ..., T, f^t_0k(i) = f^t(i) is an optimal action for the global MDP and V_T^0k(i) = V_T(i). Suppose now that the result holds up to level p−1; we show that it still holds at level p. The state space of the restricted MDP_pk is S_pk = C_pk ∪ {j ∈ E_p : p^t_{iaj} > 0 for some i ∈ C_pk, a ∈ A(i)}, where E_p denotes the set of states belonging to the levels below p. Let R_pk = (f^1_pk, f^2_pk, ..., f^T_pk) and V_T^pk be, respectively, an optimal policy and the optimal value vector of the restricted MDP_pk.
According to Theorem 1, we have for t = T, T−1, ..., 1:

∀i ∈ S_pk:  f^t_pk(i) = arg max_{a∈A_pk(i)} [ r^t_pk(i, a) + Σ_{j∈S_pk} p^t_pk(i, a, j) x^{t+1}(j) ],   x^t = r_{f^t_pk} + P_{f^t_pk} x^{t+1}.

Based on the definition of the restricted MDP_pk, for i ∈ C_pk we have A_pk(i) = A(i), and for a ∈ A_pk(i) the rewards satisfy r^t_pk(i, a) = r^t_{ia} and the transition probabilities satisfy p^t_pk(i, a, j) = p^t(i, a, j) for all j ∈ S_pk. Moreover, since p^t_pk(i, a, j) = 0 for all i ∈ S_pk and all j ∉ S_pk, it follows that for t = 1, ..., T, f^t_pk satisfies

∀i ∈ C_pk:  f^t_pk(i) = arg max_{a∈A(i)} [ r^t_{ia} + Σ_j p^t_{iaj} x^{t+1}(j) ],   x^t = r_{f^t_pk} + P_{f^t_pk} x^{t+1}.
Consequently, for i ∈ C_pk and t = 1, ..., T, f^t_pk(i) = f^t(i) is an optimal action for the global MDP and V_T^pk(i) = V_T(i). It remains to treat the case i ∈ S_pk \ C_pk. Let m ∈ {0, 1, ..., p−1} and h ∈ {1, 2, ..., K(m)} be such that i ∈ C_mh, so that r^t_pk(i, θ) = V^t_mh(i)/N. From the induction hypothesis, f^t_mh(i) = f^t(i) is an optimal action for the global MDP, computed at the previous levels; it remains to verify that V_T^pk(i) = V_T(i). Since for i ∈ S_pk \ C_pk and t = 1, ..., T we have f^t_pk(i) = θ, it follows that P(f^t_pk)_ii = 1 and r(f^t_pk)_i = V^t_mh(i)/N, and the claim follows from (5). □

Remark 1. If the initial state i_0 is known, its optimal action f(i_0) and its optimal value V_T(i_0) are computed by solving only a few restricted MDPs: one does not need to consider all states. The following algorithm (Algorithm 4) explains this issue. It is clear that f(i_0) and V_T(i_0) are obtained by solving only MDP_mk and the restricted MDPs it depends on. To demonstrate the benefit of the proposed HBI algorithm, we consider a case study of the racetrack problem described in the following section.
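Remark 1 can be illustrated as follows: given the class of the initial state, only that class and the classes reachable from it in the condensation DAG need to be solved (a small illustrative helper; the names are ours):

```python
def classes_to_solve(i0, comp, class_edges):
    """Return the set of SCC ids that must be solved when the initial
    state i0 is known: the class containing i0 plus every class
    reachable from it in the condensation DAG.
    comp[state] -> class id; class_edges[c] -> successor class ids."""
    needed, stack = set(), [comp[i0]]
    while stack:
        c = stack.pop()
        if c in needed:
            continue
        needed.add(c)
        stack.extend(class_edges.get(c, ()))
    return needed
```

In the Figure 1 example, the class of S_0 is C_10 and its reachable classes are C_00 and C_01, so only those three restricted MDPs are solved.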

Case Study and Experimental Results
To show the advantage of the proposed HBI algorithm, we consider the standard racetrack control problem described by Martin Gardner [28] and Barto [29]. The goal is to control the movement of a race car along a predefined racetrack so that it reaches the finish line from the starting line in the minimum amount of time.
At each time step, the state of the racer is given by a tuple (X_t, Y_t, V_x(t), V_y(t)) representing the position and the speed of the car in the x and y dimensions at time t. The actions are pairs a = (a_x, a_y) of instantaneous accelerations, where a_x, a_y ∈ {−1, 0, 1}. We assume that the road is "slippery" and the car may fail to accelerate. An action a = (a_x, a_y) has its intended effect 90% of the time; 10% of the time the action's effect corresponds to that of the action a_0 = (0, 0). Also, when the car hits a wall, its velocity is set to zero and its position is left intact. When the car is in state (X_t, Y_t, V_x(t), V_y(t)) and the action taken is a = (a_x, a_y), it transits with probability 90% to the state (X_t + V_x(t) + a_x, Y_t + V_y(t) + a_y, V_x(t) + a_x, V_y(t) + a_y).
Let s = (X_t, Y_t, V_x(t), V_y(t)), a = (a_x, a_y), and s′ = (X_t + V_x(t) + a_x, Y_t + V_y(t) + a_y, V_x(t) + a_x, V_y(t) + a_y); the transition probability assigns probability 0.9 to s′ and probability 0.1 to the successor obtained with the acceleration (0, 0). To complete the formulation of the finite-horizon MDP problem, we need to define the reward function and the horizon. Independently of the action taken, the immediate reward is −1 for all non-goal states, i.e., R_i = −1, and 0 for any goal state reached, i.e., R_g = 0. The horizon is determined after the decomposition into levels; indeed, the maximum level obtained during this decomposition is taken as the horizon.
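The slippery dynamics above can be sketched as a transition function (an illustrative model; the wall test `on_track` and the grid layout are our own stand-ins for the actual track geometry):

```python
def racetrack_transitions(s, a, on_track):
    """Return [(next_state, prob)] for state s = (x, y, vx, vy) and
    action a = (ax, ay): the intended acceleration applies with
    probability 0.9, and with probability 0.1 the car fails to
    accelerate (effect of (0, 0)).  Hitting a wall zeroes the velocity
    and leaves the position intact."""
    x, y, vx, vy = s
    out = []
    for (ax, ay), p in [(a, 0.9), ((0, 0), 0.1)]:
        nvx, nvy = vx + ax, vy + ay
        nx, ny = x + nvx, y + nvy
        if on_track(nx, ny):
            out.append(((nx, ny, nvx, nvy), p))
        else:                       # wall: velocity reset, position kept
            out.append(((x, y, 0, 0), p))
    return out
```

Feeding these transition lists into a backward-induction solver (with reward −1 per step until a goal cell) completes the finite-horizon formulation.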
Step 1. Determine the class C_mk such that i_0 ∈ C_mk.
Step 2. Determine the classes C_nh, n ∈ {0, 1, ..., m}, h ∈ {1, 2, ..., K(n)}, such that the end of any arc emanating from C_mk lies in some class C_nh.
Step 3. Solve the restricted MDP_nh found in Step 2.

Note. The value iteration (VI) algorithm under the infinite-horizon discounted criterion was used in [30] to solve racetrack problems. Table 2 presents the comparison between the VI, BI, and HBI algorithms. As can be seen, the BI algorithm outperforms the VI algorithm, and the proposed HBI algorithm is more efficient than the BI algorithm. To restrict the possibly infinite state space, we assume that the speed of the car in the x and y dimensions is bounded to the range [−7, +7], i.e., V_x(t), V_y(t) ∈ [−7, +7]. The speed does not change if the agent attempts to accelerate beyond these limits. The proposed algorithms were tested on an Intel(R) Core(TM) i7-6500U (2.6 GHz) with a C++ implementation under the Windows 10 (64-bit) operating system. Figure 2 presents the three racetracks considered; blue cells represent the initial states and green cells the goal states. Table 1 presents the horizon, the number of SCCs, and the number of possible states obtained with the level decomposition algorithm for the three racetrack problems considered. As can be seen, the number of states is reduced. Figures 3-5 show the policies obtained with the VI, BI, and HBI algorithms for the three racetrack problems; as can be seen, the three algorithms yield the same policy.

Conclusion
In this paper, we have presented a new hierarchical backward induction algorithm for finite-horizon non-stationary MDPs that scales to large state spaces. It consists in decomposing the original problem into smaller restricted MDPs: in each level and for each SCC, a restricted finite-horizon MDP is constructed and solved independently of the other restricted MDPs of the same level. The proposed method accelerates the computation time and reduces the memory requirement.
To show the advantage of the proposed HBI algorithm, we applied it to racetrack problems. The experimental results show that the HBI algorithm outperforms the standard BI and value iteration algorithms. As a perspective, the use of parallelism techniques could further accelerate the convergence of hierarchical finite-horizon MDP solvers.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Figure 4: Road generated by the VI, BI, and HBI algorithms for racetrack-2.