A Sarsa(λ) Algorithm Based on Double-Layer Fuzzy Reasoning

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Introduction
Reinforcement learning is a class of machine learning methods in which an agent maximizes cumulative reward by interacting with the environment [1,2]. If a reinforcement learning problem can be modeled as a Markov decision process (MDP), methods such as dynamic programming (DP), Monte Carlo (MC), and temporal difference (TD) can be used to obtain an optimal policy.
Classic reinforcement learning methods are generally used for problems with discrete state and action spaces, where each state value or state-action value is stored in a lookup table. Such methods can effectively solve simple tasks, but not large, continuous-space problems. At present, the most common approach to this problem is to use function approximation to approximate the state value or action value function. The approximate function can generalize the learned experience from a subset of the state space to the entire state space; besides, an agent can choose the best action sequence through the function approximation [3,4]. A variety of function approximation methods have been applied to reinforcement learning problems. Sutton et al. proposed the gradient TD (GTD) learning algorithm [5], which combined TD algorithms with linear function approximation and introduced a new objective function related to Bellman errors. Sherstov and Stone proposed a linear function approximation algorithm based on online adaptive tile coding, whose effectiveness was verified experimentally [6]. Heinen and Engel used an incremental probabilistic neural network to approximate the value function in reinforcement learning, which handles continuous state space problems well [7].
Reinforcement learning algorithms with the function approximation methods mentioned above usually converge slowly and generally can only obtain discrete action policies [5][6][7][8][9]. By introducing prior knowledge, reinforcement learning algorithms based on fuzzy inference systems (FIS) not only can effectively accelerate the convergence rate, but also may obtain continuous action policies [10][11][12]. Horiuchi et al. put forward fuzzy interpolation-based Q-learning, which can solve continuous space problems [13]. Glorennec and Jouffe combined FIS and Q-learning, using prior knowledge to build the global approximator, which can effectively speed up convergence; however, the algorithm cannot be used to obtain a continuous action policy [14]. Fuzzy Sarsa proposed by Tokarchuk et al. can effectively reduce the scale of the state space and accelerate convergence, but it easily causes the "curse of dimensionality" when applied to multidimensional state-space problems [15]. Type-2 fuzzy Q-learning proposed by Hsu and Juang is strongly robust to noise, but its time complexity is relatively high and its convergence is not guaranteed [12].
Although classic Q-iteration algorithms based on a single fuzzy inference system can solve continuous action space problems, there is still a reason for their slow convergence: at each iteration step of the learning process, a state-action pair may correspond to different Q-values due to the structure of the FIS. If the next iteration step needs the Q-value of that state-action pair to update the value function, the algorithm simply selects a Q-value at random, since there is no criterion for choosing the best one among the different Q-values, which slows down learning. Because this situation may occur many times during learning, it greatly slows the convergence rate.
To address the problems that classic Q-iteration algorithms based on lookup tables or a fuzzy inference system converge slowly and cannot obtain continuous action policies, this paper proposes DFR-Sarsa(λ), a Sarsa(λ) algorithm based on double-layer fuzzy reasoning, and proves its convergence theoretically. The algorithm performs two layers of fuzzy reasoning. Firstly, it takes states as input of the first fuzzy reasoning layer and produces continuous actions as output. Secondly, the second fuzzy reasoning layer takes the actions obtained from the first layer as input and produces the Q-value component of each activation rule of the first layer. Finally, by combining the two layers of fuzzy reasoning, the Q-values of the input states are obtained. Moreover, a new eligibility trace based on gradient descent is defined, which depends on the membership degrees of the activation rules in the two-layer fuzzy reasoning. Applying DFR-Sarsa(λ) and other algorithms to the Mountain Car and Cart-pole Balancing problems, the results show that DFR-Sarsa(λ) not only obtains a continuous action policy, but also has better convergence performance.

Markov Decision Process.
In the reinforcement learning framework, the process of interacting with the environment can be modeled as an MDP [16], described as a quadruple M = ⟨S, A, R, P⟩, where (1) S is the state set and s_t ∈ S is the state at time t; (2) A is the action set and a_t ∈ A is the action that the agent takes at time t; (3) R: S × A × S → ℝ is the reward function; (4) P: S × A × S → [0, 1] is the state transition function. The objective of reinforcement learning is to obtain the optimal policy h*. It satisfies, for all s ∈ S, V^{h*}(s) ≥ V^h(s). Under the optimal policy h*, the optimal V-function and optimal Q-function satisfy equations (3) and (4). If P and R are known, DP is a good way to obtain the optimal action policy. However, if P and R are unknown, TD algorithms such as Q-learning or Sarsa can be used. Sarsa is an on-policy algorithm, and when the eligibility trace mechanism is introduced it becomes more efficient, effectively handling temporal credit assignment. Besides, Sarsa(λ) can be combined with function approximation to solve continuous state space problems.
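The tabular backward-view Sarsa(λ) that this section refers to can be sketched in a few lines. This is an illustrative sketch of the classic algorithm, not the paper's DFR variant; the `env` interface (`reset`, `step`, `actions`) and helper names are assumptions made for the example.

```python
import random
from collections import defaultdict

def sarsa_lambda(env, episodes, alpha=0.1, gamma=0.9, lam=0.9, eps=0.1):
    """Tabular Sarsa(lambda) with accumulating traces (illustrative sketch).

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), plus a finite
    `env.actions` list; these names are placeholders, not a fixed API.
    """
    Q = defaultdict(float)          # Q[(state, action)]
    for _ in range(episodes):
        e = defaultdict(float)      # eligibility traces, reset per episode
        s = env.reset()
        a = _eps_greedy(Q, s, env.actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = _eps_greedy(Q, s2, env.actions, eps)
            # on-policy TD error: bootstraps on the action actually chosen next
            delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
            e[(s, a)] += 1.0        # accumulating trace
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam   # decay every trace each step
            s, a = s2, a2
    return Q

def _eps_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

On a small chain task the traces let the terminal reward propagate back along the visited state-action pairs in a single episode, which is the "temporal credit assignment" benefit the text mentions.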
Definition 1 is a constraint on bounded MDPs (mainly concerning the state space, action space, reward, and value function). Note that all algorithms in this paper satisfy the definition.

Fuzzy Inference System
FIS is a system that can handle fuzzy information. Typically, it mainly consists of a set of fuzzy rules, whose design and interaction are crucial to the FIS's performance.
There are many types of fuzzy inference systems at present [17]; a simple type, the TSK-FIS, is described as follows:

R_i: If x_1 is A_{i,1} and x_2 is A_{i,2} and ... and x_n is A_{i,n}, then y = f_i(x),

where the first part is called the antecedent and the second part the consequent. R_i denotes the ith rule in the rule base. x = (x_1, x_2, ..., x_n) is an n-dimensional input variable. A_{i,j} is the fuzzy set in the ith fuzzy rule corresponding to the jth dimension of the input variable; a membership function μ_{A_{i,j}}(x_j) is usually used to describe it. y = f_i(x) is a polynomial function of the input variable x. If the input x is a vector, the output y is also a vector. When f_i(x) is a constant, the FIS is called a zero-order FIS.
When the FIS receives an exact input value x = (x_1, x_2, ..., x_n), the firing strength φ_i(x) of the ith rule (for the product T-norm) is calculated as

φ_i(x) = ∏_{j=1}^{n} μ_{A_{i,j}}(x_j).

The firing strength φ_i(x) is then used to calculate the output of the FIS: taking φ_i(x) as a weight, multiply each rule's consequent by its firing strength and sum up, obtaining the final output

Y(x) = ∑_i φ_i(x) f_i(x).

TSK-FIS can be used for function approximation, approximating the objective function by updating the consequents of the fuzzy rules. In general, the approximation error is measured by the mean square error (MSE). When the FIS achieves optimal approximation performance, the vector θ consisting of all rule consequents satisfies (8), where Y(x) is the objective function and Ŷ(x) is its approximation. In the proposed framework, the first-layer FIS maps states to continuous actions; then the two-layer FISs are combined to obtain the approximate Q-function of the continuous action U(x).
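As a concrete illustration of the firing-strength and weighted-output computations above, a minimal zero-order TSK-FIS with product T-norm might look as follows. The function names and rule encoding (a rule as a pair of per-dimension membership callables and a constant consequent) are assumptions made for this sketch.

```python
def firing_strength(memberships, x):
    """Product T-norm over per-dimension membership values for one rule.

    `memberships` is a list of callables, one per input dimension.
    """
    phi = 1.0
    for mu, xj in zip(memberships, x):
        phi *= mu(xj)
    return phi

def tsk_output(rules, x):
    """Zero-order TSK output: firing-strength-weighted average of the
    constant consequents, normalized by the total firing strength."""
    num = den = 0.0
    for memberships, consequent in rules:
        phi = firing_strength(memberships, x)
        num += phi * consequent
        den += phi
    return num / den if den > 0 else 0.0
```

With overlapping membership functions, the output interpolates smoothly between rule consequents as the input moves between rule cores, which is exactly the generalization property the section attributes to function approximation.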

DFR-Sarsa(𝜆)
The main structure of the two-layer FIS is described as follows.
(1) The rule of FIS1 is given as follows:

R_i: If x_1 is A_{i,1} and ... and x_n is A_{i,n}, then u = u_{i,1} with q_{i,1} or ... or u = u_{i,m} with q_{i,m},

where x = (x_1, x_2, ..., x_n) is the state and u_{i,j} is the jth discrete action in the ith fuzzy rule. The action space is divided into m discrete actions. q_{i,j} is a Q-value component corresponding to the jth discrete action in the ith fuzzy rule. When the state is x, the firing strength of the ith rule is φ_i(x). If φ_i(x) > 0, we call the ith rule an "activation rule." In each activation rule R_i, we select an action from the m discrete actions by the ε-greedy action selection policy according to the values q_{i,j}. The selected action is called the activation action, denoted by ũ_i. Therefore, multiplying each activation action selected in FIS1 by its firing strength φ_i(x) and summing up, we get the continuous action U(x) as in (11). We call U(x) a continuous action because U(x) changes smoothly with the state x, which does not mean that any action in the action space can be selected in state x. To simplify (11), the firing strength φ_i(x) is normalized as

φ̄_i(x) = φ_i(x) / ∑_k φ_k(x),     (12)

so (11) can be written as

U(x) = ∑_i φ̄_i(x) ũ_i.

(2) The rule of FIS2 is given as follows:

R'_i: If u is ν_{i,1} then q_{i,1} or ... or u is ν_{i,m} then q_{i,m}.

The construction of R'_i depends on FIS1. The core of the fuzzy set ν_{i,j} is the jth action of the ith rule in FIS1, and its membership function is denoted μ_{ν_{i,j}}(u); the value q_{i,j} in the consequent part of the rule equals the value q_{i,j} in FIS1.
Setting the continuous action U(x) obtained from FIS1 as the input of FIS2 activates rules of FIS2. Through the fuzzy reasoning of FIS2, the Q-value component of the ith rule of FIS1 is obtained as in (15). In the same way as (12), normalizing the membership function μ_{ν_{i,j}}(U(x)) in (15) gives (16), so (15) can be written as

Q_i(x, U(x)) = ∑_j μ̄_{ν_{i,j}}(U(x)) q_{i,j},     (17)

the Q-value component obtained by the activation rule R_i of FIS1. So, when taking the continuous action U(x), the Q-value over all activation rules of FIS1 is

Q(x, U(x)) = ∑_i φ̄_i(x) Q_i(x, U(x)).     (18)

From (18), the Q-value depends on the fuzzy sets of the two-layer FIS and their shared consequent variables q_{i,j}. Since the fuzzy sets are set in advance according to prior knowledge, they are not changed by the algorithm. To obtain a convergent Q-value, the FISs require updating the q_{i,j} until convergence.
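The two-layer reasoning above — normalized FIS1 firing strengths blending the activation actions into U(x), then normalized FIS2 action memberships blending the shared q_{i,j} into per-rule components — can be sketched as follows. The helper names and data layout are hypothetical; the memberships and activation actions are assumed to be precomputed.

```python
def normalize(ws):
    """Divide a list of nonnegative weights by their sum (eq. (12)/(16))."""
    s = sum(ws)
    return [w / s for w in ws] if s > 0 else ws

def continuous_action(phis, activation_actions):
    """FIS1 output (eq. (13)): normalized-firing-strength-weighted blend
    of the discrete actions selected (e.g. epsilon-greedily) per rule."""
    return sum(p * u for p, u in zip(normalize(phis), activation_actions))

def q_value(phis, action_memberships, q_components, u):
    """FIS2 reasoning (eqs. (17)-(18)): each FIS1 rule's Q-component is a
    normalized membership-weighted sum over its discrete-action
    consequents q[i][j]; the total Q blends the components by the
    normalized FIS1 firing strengths."""
    phis_n = normalize(phis)
    total = 0.0
    for phi_n, mus, qs in zip(phis_n, action_memberships, q_components):
        w = normalize([mu(u) for mu in mus])
        q_i = sum(wj * qj for wj, qj in zip(w, qs))
        total += phi_n * q_i
    return total
```

Note that the product φ̄_i(x) · μ̄_{ν_{i,j}}(U(x)) multiplying each q_{i,j} is exactly the basis function analyzed in Lemma 4.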
To minimize the approximation error of the FIS, that is, to make the parameter vector θ satisfy (8), the algorithm updates θ by gradient descent as in (19), where the bracketed part of (19) is the TD error. Setting δ_t = r_{t+1} + γQ_t(x_{t+1}, u_{t+1}) − Q_t(x_t, u_t) and combining the backward TD(λ) algorithm [1], we get

θ_{t+1} = θ_t + αδ_t e_t,     (20)

where α is a step-size parameter and e_t is the eligibility trace vector at time t, corresponding to the parameter vector θ_t. It is updated as

e_t = γλe_{t−1} + ∇_{θ_t}Q_t(x_t, u_t),     (21)

which is a kind of accumulating trace [1], where γ is the discount factor and λ is the decay factor. ∇_{θ_t}Q_t(x_t, u_t) is the gradient vector obtained by taking the partial derivatives of the Q-function with respect to each dimension of the parameter vector at time t [1].
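Updates (20) and (21) amount to a standard backward-view linear Sarsa(λ) step. A minimal sketch follows, assuming the feature vector already holds the products of normalized FIS1 firing strengths and FIS2 action memberships (so that ∇_θ Q is just the feature vector); the function name and plain-list representation are illustrative.

```python
def sarsa_lambda_update(theta, e, features_t, features_t1, r,
                        alpha=0.001, gamma=0.9, lam=0.9, done=False):
    """One backward-view Sarsa(lambda) step for a linear approximator
    Q(x, u) = theta . features(x, u).  Mutates `theta` and `e` in place.

    For a linear approximator the gradient in (21) is the feature
    vector itself, so the trace simply accumulates features."""
    q_t = sum(th * f for th, f in zip(theta, features_t))
    q_t1 = 0.0 if done else sum(th * f for th, f in zip(theta, features_t1))
    delta = r + gamma * q_t1 - q_t              # TD error
    for i, f in enumerate(features_t):
        e[i] = gamma * lam * e[i] + f           # eq. (21): decay + gradient
        theta[i] += alpha * delta * e[i]        # eq. (20)
```

Because both the trace decay and the parameter update touch every component, dense features should be kept short or replaced with a sparse representation in practice.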

The Learning Process of DFR-Sarsa(𝜆).
In this section, DFR-Sarsa(λ) is proposed based on the Sarsa algorithm in literature [1] and the MDP framework in Section 2.1. DFR-Sarsa(λ) can solve reinforcement learning problems not only with continuous state and discrete action spaces, but also with continuous state and continuous action spaces. Algorithm 1 describes the general process of DFR-Sarsa(λ).

Convergence Analysis.
In the literature [18,19], the convergence of on-policy TD(λ) with linear function approximation is analyzed in detail: when such an algorithm meets certain assumptions and lemmas, it converges with probability 1. Since DFR-Sarsa(λ) is exactly such an on-policy TD(λ) algorithm, it can be proved convergent when it satisfies the corresponding assumptions and lemmas in literature [18], so this paper does not go into full detail on the convergence proof.
Assumption 2. The state transition function and reward function of MDP follow stable distributions.

Lemma 3. The Markov chain that DFR-Sarsa(𝜆) depends on is irreducible and aperiodic, and the reward and value function are bounded.
Proof. Firstly, we prove irreducibility. By the properties of Markov processes, if any two states of a Markov process can be reached from each other, the process is irreducible [20]. DFR-Sarsa(λ) is used for solving reinforcement learning problems that satisfy the MDP framework, and the MDP meets Definition 1. Thus, for any state s in the MDP, there must exist an action a such that P(s, a, s') > 0, which indicates that state s' can be visited infinitely often. Therefore, each state can be transferred to any other state, so the Markov chain of DFR-Sarsa(λ) is irreducible.
Secondly, we prove aperiodicity. For an irreducible Markov chain, if one state is proved aperiodic, the entire chain is aperiodic. In addition, if a state of the Markov chain can transition back to itself, that state is aperiodic [20]. For a state s of the MDP, there exists a transition satisfying P(s, a, s) > 0, which indicates that state s is self-returning. From the above analysis, the MDP is aperiodic; therefore, the Markov chain that DFR-Sarsa(λ) depends on is aperiodic.
Finally, we prove that the reward and value function are bounded. Literature [1] shows that the value function is a discounted accumulated reward, satisfying Q(s, a) = ∑_{t=0}^{∞} γ^t R(s_t, a_t), γ ∈ (0, 1). By Definition 1, the reward function R is bounded and satisfies 0 ≤ R(s, a) ≤ C, where C is a constant. Hence

0 ≤ Q(s, a) ≤ C/(1 − γ).     (24)

By inequality (24), the value function Q(s, a) is bounded. In summary, Lemma 3 is proved.
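Written out, the geometric-series bound behind (24) is (with C the reward bound and γ ∈ (0, 1) the discount factor, as above):

```latex
0 \le Q(s,a) \;=\; \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} C
  \;=\; \frac{C}{1-\gamma}.
```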
Condition 1. For each membership function μ_i, there exists a unique state s_i such that μ_i(s_i) > μ_i(s) for all s ≠ s_i, while the other membership functions are 0 at state s_i; that is, μ_j(s_i) = 0 for all j ≠ i.
Lemma 4. The basis functions of DFR-Sarsa(λ) are bounded, and the basis function vector is linearly independent.
Proof. Firstly, we prove that the basis functions are bounded. From φ̄_i(x) ∈ [0, 1] and μ̄_{ν_{i,j}}(U(x)) ∈ [0, 1], we get

‖φ̄_i(x) μ̄_{ν_{i,j}}(U(x))‖_∞ ≤ 1,     (25)

where ‖·‖_∞ denotes the infinity norm. Since the basis function of DFR-Sarsa(λ) is φ̄_i(x) μ̄_{ν_{i,j}}(U(x)), from (25) the basis functions of DFR-Sarsa(λ) are bounded. Secondly, we prove that the basis function vector is linearly independent. To make the basis function vector linearly independent, let the basis functions satisfy Condition 1 [21], whose form is shown in Figure 4. From literature [21] we know that, when Condition 1 is met, the basis function vector is linearly independent.
The requirement in Condition 1 can be relaxed appropriately by making the membership degree μ_j(s_i) at state s_i a small value, for example, by using a Gaussian membership function with a smaller standard deviation. Applying such membership functions to DFR-Sarsa(λ), experimental results show that DFR-Sarsa(λ) still converges, though in that case the convergence cannot yet be established theoretically.
In summary, Lemma 4 is proved.

Theorem 6. Under the condition of Assumption 2, if DFR-Sarsa(𝜆) satisfies Lemma 3 to Lemma 5, the algorithm converges with probability 1.
Proof. Literature [18] gives the related conclusion that, under Assumption 2, when on-policy TD(λ) algorithms with linear function approximation meet certain conditions (Lemma 3 to Lemma 5), they converge with probability 1. DFR-Sarsa(λ) is exactly such an algorithm and meets Assumption 2 and Lemma 3 to Lemma 5, so DFR-Sarsa(λ) converges with probability 1.

Experiments
To verify DFR-Sarsa(λ)'s performance in terms of convergence rate, iteration steps after convergence, and the effectiveness of its continuous action policy, we take two problems as experimental benchmarks: Mountain Car and Cart-pole Balancing. These are classic episodic tasks with continuous state and action spaces in reinforcement learning, shown in Figures 2 and 3, respectively.

Mountain Car.
Mountain Car is a representative problem with continuous state space, as shown in Figure 2. The underpowered car cannot accelerate enough to drive directly up to the top of the right-side hill, so it has to swing back and forth more than once to get there. Modeling the task as an MDP, the state is a two-dimensional variable consisting of position and velocity, that is, s = [x, v]. The action is the horizontal force that drives the car, bounded in [−1, 1]. The system dynamics are described as follows:

v_{t+1} = bound(v_t + 0.001a_t − g cos(3x_t)),
x_{t+1} = bound(x_t + v_{t+1}),

where bound(v_t) ∈ [−0.07, +0.07], bound(x_t) ∈ [−1.5, +0.5], and g = 0.0025 is a constant related to gravity. In addition, the time step is 0.1 s, and the reward function is

r_t = −1     (30)

for every time step before the car reaches the goal. Equation (30) is a punishment reward function, where r_t denotes the reward received at time t.
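A one-step simulator matching the description above might look as follows. The 0.001 force gain is the value used in the classic Mountain Car benchmark and is an assumption here (the coefficient was lost from the extracted equations), as is zeroing the velocity at the left wall.

```python
import math

V_MIN, V_MAX = -0.07, 0.07
X_MIN, X_MAX = -1.5, 0.5
G = 0.0025            # gravity-related constant from the problem statement

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def mountain_car_step(x, v, a):
    """One transition of the Mountain Car dynamics.

    `a` is the horizontal force in [-1, 1].  The 0.001 force gain and
    the wall behavior are assumptions taken from the classic benchmark."""
    v = clamp(v + 0.001 * a - G * math.cos(3.0 * x), V_MIN, V_MAX)
    x = clamp(x + v, X_MIN, X_MAX)
    if x <= X_MIN:    # inelastic stop at the left wall
        v = 0.0
    r = -1.0          # per-step punishment until the goal is reached
    done = x >= 0.5
    return x, v, r, done
```

From the start state x = −0.5, v = 0, even full throttle barely moves the car, which is why the learned policy must rock the car back and forth to build momentum.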
In the simulation, the number of episodes is set to 1000, and the maximum number of time steps in each episode is also 1000. The initial state of the car is x = −0.5, v = 0. When the car arrives at the destination (x = 0.5) or the time steps exceed 1000, the episode ends and a new one begins. The experiment ends after 1000 episodes.
To show the effectiveness of DFR-Sarsa(λ), we compare the algorithm with Fuzzy Sarsa proposed by Tokarchuk et al. [15], GD-Sarsa(λ) proposed by Sutton et al. [3], and Fuzzy Q(λ) proposed by Zajdel [22]. Additionally, the effect of the eligibility trace on convergence performance is also tested. At present, there is no accepted way to select parameters that give each of the four algorithms its best performance. To make the comparison fair, parameters shared by all four algorithms are set to the same value, while parameters that are not shared are set to the values given in the work they originate from.
We first set the parameters of DFR-Sarsa(λ): 20 triangular fuzzy sets with equidistant cores are used to partition each state variable, which results in 400 fuzzy rules. Similarly, eight triangular fuzzy sets with equidistant cores are used to partition the continuous action space, where the number of fuzzy rules is 8. The other parameters are set to 0.001, 0.9, 0.9, and 1.0. The fuzzy partition in Fuzzy Sarsa takes the same form as in DFR-Sarsa(λ); its other parameters are set to 0.001, 0.9, and 0.9. GD-Sarsa(λ) uses 10 tilings of 9 × 9 to divide the state space, with the best experimental parameters given in literature [1]: 0.001, 0.14, 0.3, and 1.0. The fuzzy partition in Fuzzy Q(λ) also takes the same form as in DFR-Sarsa(λ); its other parameters are set in accordance with literature [22] to 0.005, 0.1, 0.1, and 0.995.
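The equidistant triangular partition described above can be generated programmatically; this sketch (names hypothetical) builds n triangular membership functions whose cores are evenly spaced and whose overlap makes memberships at any interior point sum to 1, a standard "strong" fuzzy partition.

```python
def triangular_partition(lo, hi, n):
    """Return (cores, membership functions): n triangular fuzzy sets with
    equidistant cores covering [lo, hi].  Adjacent triangles overlap so
    the membership degrees at any point in [lo, hi] sum to 1."""
    step = (hi - lo) / (n - 1)
    cores = [lo + i * step for i in range(n)]

    def make(c):
        def mu(x):
            return max(0.0, 1.0 - abs(x - c) / step)
        return mu

    return cores, [make(c) for c in cores]
```

With 20 such sets per state variable, the rule base of FIS1 is the 20 × 20 grid product over the two state dimensions, giving the 400 rules quoted in the text.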
DFR-Sarsa(λ), Fuzzy Sarsa, GD-Sarsa(λ), and Fuzzy Q(λ) are applied to Mountain Car. Figure 5 shows the average result over 30 independent simulation experiments. The x-coordinate indicates the number of episodes, and the y-coordinate represents the average number of time steps the car takes from the initial state to the target. As can be seen from Figure 5, the convergence performance of DFR-Sarsa(λ) is better than those of the other three algorithms.
The detailed performance of the four algorithms is shown in Table 1 (the benchmark time is the average time of a single iteration of DFR-Sarsa(λ)).
To test the effectiveness of the proposed eligibility trace, DFR-Sarsa(λ) with the eligibility trace and DFR-Sarsa without it are both applied to Mountain Car. Figure 6 shows the convergence performance of the two algorithms. It can be seen that both converge to the same average number of time steps, but DFR-Sarsa(λ) converges faster than DFR-Sarsa. The two state variables of the Cart-pole Balancing task satisfy θ ∈ [−π/2, π/2] (rad) and θ̇ ∈ [−16, 16] (rad/s). The inference system mainly depends on the fuzzy sets and fuzzy rules. In this paper, the type of fuzzy sets and the number of rules are given as prior knowledge and are not changed during the learning process. In future work, to achieve better convergence performance, we will focus on using appropriate optimization algorithms to optimize the membership functions and adjust the fuzzy rules adaptively.

Figure 1 :
Figure 1: Framework of approximating the Q-value functions by using double-layer fuzzy reasoning.

Figure 4 :
Figure 4: Triangular membership functions (except that the domains of the states differ, the membership functions of position and velocity take the form shown here; the membership functions in Section 4.2 also take the same form).

Figure 5 :
Figure 5: Comparison of the convergence efficiency of the four algorithms.

Figure 6 :
Figure 6: The effect of eligibility traces on the convergence efficiency of DFR-Sarsa(λ).
(3) R: S × A × S → ℝ is the reward function; that is, after the agent takes action a_t at time t, the current state transfers from s_t to s_{t+1}, and the agent receives an immediate reward R(s_t, a_t, s_{t+1}) at the same time. r_{t+1} represents a random reward generated from a distribution with mean R(s_t, a_t, s_{t+1}); (4) P: S × A × S → [0, 1] is the state transition function, where P(s, a, s') represents the probability of reaching s' after taking action a in state s. The state value function satisfies V^h(s) = ∑_{s'∈S} P(s, h(s), s')[R(s, h(s), s') + γV^h(s')].

Table 1 :
Performance comparison of the four algorithms in the Mountain Car problem.

Figure 3 shows a Cart-pole Balancing system, in which the cart can move left or right on the horizontal plane. A pole is hinged to the cart and can rotate freely within a certain angle. The task is to move the cart horizontally to keep the pole standing within a certain range [−π/2, π/2]. Similarly modeling the task as an MDP, the state is a two-dimensional variable represented by the vertical angle of the pole θ and its angular velocity θ̇; that is, s = [θ, θ̇].