Deep Q-Network with Predictive State Models in Partially Observable Domains

While deep reinforcement learning (DRL) has achieved great success in some large domains, most of the related algorithms assume that the state of the underlying system is fully observable. However, many real-world problems are actually partially observable. For systems with continuous observations, most of the related algorithms, e.g., the deep Q-network (DQN) and deep recurrent Q-network (DRQN), use history observations to represent states; however, they are often computationally expensive and ignore the information carried by actions. Predictive state representations (PSRs), which represent the latent state using completely observable actions and observations, offer a powerful framework for modelling partially observable dynamical systems with discrete or continuous state spaces. In this paper, we present a PSR model-based DQN approach which combines the strengths of the PSR model and DQN planning. We use a recurrent network to establish the recurrent PSR model, which can fully learn the dynamics of a partially observable environment with continuous observations. The model is then used for the state representation and update of DQN, so that DQN no longer relies on a fixed number of history observations or on a recurrent neural network (RNN) to represent states in partially observable environments. The strong performance of the proposed approach is demonstrated on a set of robotic control tasks from OpenAI Gym by comparison with the memory-based DRQN and the state-of-the-art recurrent predictive state policy (RPSP) networks. Source code is available at https://github.com/RPSRDQN/paper-code.git.


Introduction
For agents operating in stochastic domains, determining the (near) optimal policy is a central and challenging issue. While (deep) reinforcement learning has provided a powerful framework for decision-making and control and has achieved great success in recent years in some large-scale applications, e.g., AlphaGo [1], most of the related approaches rely on the strong assumption that the agent can completely know the environment surrounding it, i.e., that the environment is fully observable. However, many real-world applications are actually partially observable Markov decision processes (POMDPs), where the state of the environment may be partially observable or even unobservable [2,3].
Much effort has been devoted to planning in partially observable environments. Some of the work aims at learning the complete model of the underlying system. Huang et al. [4][5][6] proposed planning methods based on the PSR model. Song et al. [7] and Somani et al. [8] proposed planning methods based on the POMDP model. However, these methods are only suitable for systems with discrete observations. In this paper, we mainly focus on systems with continuous observations, where there are two main approaches for dealing with partial observability. One relies on recurrent neural networks to summarize the past, and the neural network is then trained in a model-free reinforcement learning manner [2,9,10]. However, training such networks becomes a heavy burden when everything relies on them. The other approach is to directly use past histories, i.e., past observations (frames), for the state representation; its main problem is that the number of observations (frames) used for the state representation can only be determined empirically. Using too many observations for the state representation may be computationally expensive, while too few may not form a sufficient statistic of the past. Moreover, neither approach considers the effect of action information on the state representation.
Predictive state representations (PSRs) provide a general framework for modelling partially observable systems, and unlike the latent-state based approaches, such as POMDPs, the core idea of PSRs is to work only with the observable quantities, which leads to easier learning of a more expressive model [11][12][13]. PSRs can also combine with the recurrent network for the modelling and planning in partially observable dynamic systems with continuous state space [14,15].
In this paper, with the benefits of the PSR approach and the great success of the deep Q-network in some real-world applications, we propose the RPSR-DQN approach. Firstly, a recurrent PSR model of the underlying partially observable system is built; then the true state, namely, the PSR state or belief state, can be updated and provide sufficient information for DQN planning; finally, the tuple ⟨currentPSRstate, action, reward, nextPSRstate⟩, where currentPSRstate is the information of the current state and nextPSRstate is the information of the next state obtained by taking the action in the current state, is stored and used as training data for the deep Q-network. The performance of our proposed approach is first demonstrated on a set of robotic control tasks from OpenAI Gym by comparison with the deep recurrent Q-network (DRQN) algorithm, which uses the current observation as input and plans based on memory. We then compare our approach with the state-of-the-art recurrent predictive state policy (RPSP) networks [14]. Experimental results show that, with the benefits of the DQN framework and the separation of model learning from policy training, our approach outperforms the state-of-the-art baselines.

Related Work
A central problem in artificial intelligence is for agents to find optimal policies in stochastic, partially observable environments, a ubiquitous and challenging problem in science and engineering [16]. The commonly used technique for solving such partially observable problems is to first model the dynamics of the environment using the POMDP approach or the PSR approach [3,12]; the problem can then be solved using the obtained model. Although POMDPs provide a general framework for solving partially observable problems, they rely heavily on a known and accurate model of the environment [17]. In real-world applications, however, it is extremely difficult to build an accurate model [18]. Also, most POMDP-based approaches are difficult to extend to larger-scale real-world applications.
As mentioned previously, PSR is an effective method for modelling partially observable environments, and many related works have been proposed based on the idea of running a fully observable RL method on the PSR state. The main idea in the work of Boots et al. [19] is to first build sufficiently accurate transformed PSRs with indicative and characteristic features and then apply the point-based value iteration technique [20] to find the planning solution, where a subset B of the state space is first selected under the strict conditions that B is both small enough to reduce the computational difficulty and large enough to obtain a good approximation function. In the work of Liu and Zheng [5,21], the learned PSR model was combined with Monte-Carlo tree search both online and offline, achieving state-of-the-art performance in some environments. However, the application of these approaches is limited to domains with discrete state and action spaces.
For partially observable systems with continuous state space, most work relies on recurrent neural networks to summarize the past, and the neural network is then trained in a model-free reinforcement learning manner. To solve the customer relationship management (CRM) problem, which is considered partially observable, Li et al. [22] proposed a hybrid recurrent reinforcement learning approach (SL-RNN + RL-DQN) which uses an RNN to compute the hidden states of the CRM environment. While our method is tested on several control environments, as shown in the experiments, and takes both past observations and actions into account when representing the underlying states, SL-RNN + RL-DQN and its experiments focus on the CRM problem. Also, SL-RNN + RL-DQN does not consider the effect of actions when computing the state representation, which may lead to inaccurate representations of the underlying states. Moreover, while RPSR-DQN builds a model of the underlying system, which makes it easy to extend to model-based reinforcement learning approaches, SL-RNN + RL-DQN can only be combined with model-free reinforcement learning frameworks. In the work of Hausknecht and Stone [9], recurrence is added to a deep Q-network (DQN) by replacing the first fully connected layer with a recurrent LSTM that considers all historical information. Igl et al. [2] extended the RNN-based approach to explicitly support belief inference. In our approach, with suitable features, the mapping between the predictive state and the prediction of the observations given the actions can be fully known and simple to learn consistently; in contrast, the main problem of these latent-state RNN approaches is that they rely on nonconvex optimization, which usually makes training more difficult than with convex optimization [14].
Recently, some works have used the PSR state to replace or improve the quality of the internal state of an RNN. In the work of Venkatraman et al. [15], recurrent neural networks are combined with predictive state decoders (PSDs), which add supervision to the network's internal state representation so that it targets predicting future observations. Hefny et al. [14] proposed recurrent predictive state policy (RPSP) networks, which consist of a recursive filter that tracks a belief about the state of the environment and a reactive policy that directly maps beliefs to actions so as to maximize the cumulative reward. While RPSP networks show promising performance on some benchmark domains, the recursive filter and the reactive policy are trained simultaneously by defining a joint loss function in an online manner. However, balancing the loss of the recursive filter against the loss of the reactive policy is difficult, and in many cases, as also shown in our experiments, simultaneously training the two objective functions may lead to worse final performance.

Background
This section is divided into three parts. In the first part, we briefly review predictive state representations (PSRs) [12]. Then, we introduce the recurrent PSRs that can be applied to continuous observation systems. Finally, we briefly describe the DQN algorithm.

Predictive State Representations.
Predictive state representations (PSRs) offer a powerful framework for modelling partially observable and stochastic systems without prior knowledge by using completely observable events to represent states [23]. For discrete systems with a finite set of actions A = {a_1, ..., a_{|A|}}, at time τ the observable state representation of the system is a prediction vector composed of the probabilities of test occurrence conditioned on the current history, where a test is a sequence of action-observation pairs starting from time τ + 1, a history at time τ is a sequence of action-observation pairs that starts from the beginning of time and ends at time τ, and the prediction of a length-m test t = a_1 o_1 ... a_m o_m at history h is defined as p(t | h) = Pr(o_{τ+1} = o_1, ..., o_{τ+m} = o_m | h, a_{τ+1} = a_1, ..., a_{τ+m} = a_m) [24].
If a set of tests T = {t_1, ..., t_K} satisfies that, for any test t, there exists a function f_t such that p(t | h) = f_t(p(T | h)) for all histories h, then T is considered to constitute a PSR. The set T is called the test core, and the prediction vector p(T | h) is called the PSR state. In this paper, we only consider linear PSRs, so the function f_t can be represented by a weight vector m_t, with p(t | h) = p(T | h)^T m_t. When the action a is performed from the history h and the observation o is obtained, the next PSR state p(T | hao) can be updated from p(T | h) as follows [12]:

p(T | hao) = M_ao^T p(T | h) / (m_ao^T p(T | h)). (1)

In formula (1), the superscript T denotes the transposing operation, m_ao is the weight vector of the test ao, and M_ao is a K × K matrix whose ith column is the weight vector m_{ao t_i}.
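As a concrete illustration, the linear state update of formula (1) can be sketched with NumPy (a minimal sketch; the function and variable names are ours, not from the paper):

```python
import numpy as np

def psr_update(p, M_ao, m_ao):
    """One linear-PSR state update: p(T | hao) from p(T | h).

    p    : (K,) current prediction vector p(T | h)
    M_ao : (K, K) matrix whose i-th column is the weight vector m_{ao t_i}
    m_ao : (K,) weight vector of the one-step test ao
    """
    numerator = M_ao.T @ p    # unnormalised predictions of the extended tests
    denominator = m_ao @ p    # p(ao | h), the normaliser
    return numerator / denominator
```

For a valid model the denominator m_ao^T p(T | h) is exactly the probability of observing o after taking a from history h, so the update is a Bayesian conditioning step.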

Recurrent Predictive State Representation.
The PSR model obtained by using the substate-space method [25] or the spectral learning algorithm [26] can only be applied to the modelling of discrete observation systems. More recently, Ahmed et al. [27] proposed the recurrent predictive state representation (RPSR), which treats predictive state models as a recurrent network and is able to represent systems with continuous observations. Similar to PSR, the RPSR state p_t is the conditional distribution of future observations, so the mapping between the RPSR state p_t and the predicted observation o_t for a given action can be fully known or easy to learn by a suitable choice of features. This characteristic turns network learning into supervised learning, which makes the modelling simple and efficient [28,29]. The state update process of RPSR can be divided into two steps. As can be seen from Section 3.1, if T is the test core, then p(T | h_t) is a sufficient state representation at time t. An extended test core T' is then established, which ensures that p(T' | h_t) is a sufficient statistic of the distribution Pr(a_t, o_t, T | h_t) for any a_t, o_t. Given an estimate of p(T | h_t), p(T | h_{t+1}) can be obtained once a_t, o_t are observed. The quantity p(T' | h_t) is called the extended state q_t. The steps of the state update are as follows [14]: (i) State extension: the state p_t is transformed to the extended state q_t through the linear map W_ext.
W_ext is a parameter that needs to be learned:

q_t = W_ext p_t. (2)

(ii) Conditioning: given a_t and o_t, the next state p_{t+1} is calculated from the current extended state q_t by the conditioning function f_cond, where the kernel Bayes rule with the Gaussian RBF kernel is used [30]:

p_{t+1} = f_cond(q_t, a_t, o_t). (3)

The calculation proceeds as follows. Because the extended feature is a Kronecker product of the immediate feature matrix and the future feature matrix, the extended state can be divided into two parts, derived from the skipped future observation and the present observation, respectively. First, the feature vectors ϕ(a_t) and ϕ(o_t) are extracted for the given action a_t and observation o_t. Second, ϕ(a_t) is multiplied with the second part of the extended state to compute the observation covariance after a_t is executed, and the inverse observation covariance is multiplied with the first part of the extended state to turn "predicting the observation" into "conditioning on the observation", i.e., to transform the joint expectation of the immediate ao and T into the conditional expectation from the immediate ao to T. Finally, the conditional expectation is multiplied by ϕ(a_t) and ϕ(o_t) to obtain the next state p_{t+1}. The RPSR model can be seen as a recursive filter implemented by transforming formulas (2) and (3) into a recurrent network. The output of the recurrent network is a predicted observation ô_t = W_pred(p_t, a_t), where W_pred is the predictive observation function that needs to be learned. Both p_t and q_t are represented in terms of observable quantities and can be estimated by supervised regression, and W_ext follows from the linearity of formula (2). So, in the process of network training, the two-stage regression method [28] is used to initialize the state p_t, the extended state q_t, and the linear map W_ext.
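The two-step filter above can be sketched as follows. This is an illustrative sketch under assumed shapes: the extended state is viewed as a linear operator over immediate action-observation features and conditioning is reduced to a plain matrix-vector product with a Kronecker feature, standing in for the paper's kernel-Bayes-rule conditioning; all names are ours.

```python
import numpy as np

def rpsr_step(p_t, W_ext, phi_a, phi_o, d):
    """One RPSR filter step: extension (formula (2)) then a simplified
    stand-in for the conditioning of formula (3).

    p_t   : (d,) predictive state
    W_ext : (d * len(phi_a) * len(phi_o), d) learned extension map
    phi_a : feature vector of the executed action a_t
    phi_o : feature vector of the received observation o_t
    """
    q_t = W_ext @ p_t                       # extended state q_t = W_ext p_t
    k = len(phi_a) * len(phi_o)
    C = q_t.reshape(d, k)                   # view q_t as an operator over immediate ao features
    ao = np.kron(phi_a, phi_o)              # joint immediate action-observation feature
    p_next = C @ ao                         # condition on (a_t, o_t)
    return p_next / np.linalg.norm(p_next)  # renormalize to keep the state well scaled
```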

Deep Q-Network.
DQN is a method combining deep learning and Q-learning, which has succeeded in handling environments with high-dimensional perceptual input [31]. It is a multilayered neural network which outputs a predicted future reward Q(s, a; θ) for each possible action, where θ are the network parameters. In other words, DQN uses a neural network as an approximation of the action value function.
In DQN, the last four frames of observations are directly input to the CNN, the first layer of DQN, to compute the current state information. Then, the state information is mapped to a vector of action values for the current state through fully connected layers [32]. DQN optimizes the action value function by updating the network weights θ to minimize a differentiable loss function L(θ) [9]:

L(θ) = E[(r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²], (4)

where θ⁻ denotes the parameters of an earlier (target) network.
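The loss in formula (4) decomposes into a bootstrapped target and a squared error; both pieces can be sketched framework-free with NumPy (a minimal sketch; names are ours):

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """DQN targets r + γ max_a' Q(s', a'; θ⁻), with the bootstrap term
    zeroed at terminal transitions. next_q_values: (batch, n_actions)."""
    return rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)

def dqn_loss(q_taken, targets):
    """Mean squared TD error L(θ) = E[(target − Q(s, a; θ))²]."""
    return np.mean((targets - q_taken) ** 2)
```

In practice the target values come from the frozen parameters θ⁻, and the gradient flows only through Q(s, a; θ).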

RPSR-DQN
With the benefits of the RPSR approach and the great success of the deep Q-network, we propose a model-based method which combines the RPSR with deep Q-learning. Firstly, we use the recurrent network to build a PSR model of the partially observable dynamic system. Then, the true state p_t, namely, the RPSR state, can provide sufficient information for selecting the best action and can be updated as a new action a_t is executed and a new observation o_t is received. Finally, the tuple ⟨p_t, a_t, r_t, p_{t+1}⟩, where r_t is the reward returned for taking action a_t in the current state p_t, is stored and used as training data for the deep Q-network (DQN).
As depicted in Figure 1, the architecture of our method consists of the RPSR model part and the value-based policy part. In the RPSR model, the state p_t is transformed to the extended state q_t through the extension part, i.e., a linear map. Then, the extended state q_t is updated to the next state p_{t+1} according to the action a_t and observation o_t. The total state update process can be represented as

p_{t+1} = f_cond(W_ext p_t, a_t, o_t). (5)

For the policy part, the deep Q-network is used to select the action with the best long-term reward according to the current state information calculated by the RPSR model. The learning process is divided into two stages: building the model and training the policy network. In the first stage, an exploration strategy is used to collect training data to build the model of the environment. We use the data-processing method proposed by Ahmed et al. [27]: we use 1000 random Fourier features (RFFs) [33] as approximate features of observations and actions and then apply principal component analysis (PCA) [34] to project the features into 25 dimensions. Here, the number of features and dimensions depends on the complexity of the environment. We denote the feature function by ϕ. The linear map W_ext, the states p_t, and the extended states q_t in the RPSR model are initialized by using a two-stage regression algorithm [28]. Let φ^O_t, φ^A_t, ζ^O_t = ϕ(o_{(t:t+k)}), and ζ^A_t = ϕ(a_{(t:t+k)}) denote sufficient features of future observations, future actions, extended future observations, and extended future actions at time t, respectively. Because p_t and q_t are represented in terms of observable quantities and follow from the linearity of expectation, they are computed by using the kernel Bayes rule (stage-1 regression). Since the state extension function is q_t = W_ext p_t, we can then linearly regress the extended state q_t onto the state p_t, using a least squares approach (stage-2 regression), to compute W_ext.
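The first-stage feature pipeline (1000 RFFs followed by a PCA projection to 25 dimensions) can be sketched as follows. This is a minimal sketch: the bandwidth, seeding, and SVD-based PCA are our choices, not the paper's exact preprocessing.

```python
import numpy as np

def rff_features(X, n_features=1000, bandwidth=1.0, seed=0):
    """Random Fourier features approximating a Gaussian RBF kernel.
    X: (n_samples, dim) raw observations or actions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))  # spectral samples
    b = rng.uniform(0, 2 * np.pi, size=n_features)               # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def pca_project(F, n_components=25):
    """Project centered features onto the top principal components via SVD."""
    F = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    return F @ Vt[:n_components].T
```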
After initialization, the parameters θ_RPSR of the RPSR model can be optimized by using backpropagation through time [35] to minimize the prediction error

L(θ_RPSR) = Σ_t ‖ô_t − o_t‖², (6)

where ô_t is the predicted observation of the RPSR model. After the model of the dynamic environment is established, the current state information of the partially observable environment can be expressed by the model, and the policy part is trained on this basis. In the process of policy training, we build an evaluation network and a target network, which are both composed of two fully connected layers. We use experience replay [32] to train the networks. When the agent interacts with the environment, we store transitions (p_t, a_t, r_t, p_{t+1}) in the data set D. We then sample random transitions to train the policy network by minimizing the value difference between the target network and the evaluation network. These losses are backpropagated into the weights of both the encoder and the Q-network. The value of the target network is R = r_t + γ max_{a′} Q(p_{t+1}, a′; θ⁻_policy), where θ⁻_policy denotes the parameters of an earlier Q-network. The details are shown in Algorithm 1.
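The experience replay over predictive states can be sketched as a small buffer of ⟨p_t, a_t, r_t, p_{t+1}⟩ tuples (a minimal sketch; the class name and default capacity are our assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (p_t, a_t, r_t, p_{t+1}) transitions over predictive states
    and serves uniform random minibatches for DQN training."""

    def __init__(self, capacity=100000):
        self.buf = deque(maxlen=capacity)  # old transitions are evicted first

    def push(self, p, a, r, p_next):
        self.buf.append((p, a, r, p_next))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)
```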

Experiments
We select the following three Gym environments to evaluate the performance of RPSR-DQN (see Figure 2): the classic control environment CartPole-v1 and the MuJoCo robot environments Swimmer-v1 and Reacher-v1. These environments provide qualitatively different challenges. To fit our experimental setting, we make some changes to the three environments.
CartPole-v1: this task is controlled by applying a left or right force to the cart to move it to the left or right. A reward of +1 is provided for every time step during which the angle of the pole is less than 15 degrees. The episode is terminated when the pole is more than 15 degrees from vertical or the cart moves more than 2.4 units from the center. The goal is to prevent the pole, which is attached to the cart by an unactuated joint, from falling over. There are two action values in this environment, namely, the direction of the force applied to the cart. To make the environment partially observable, we remove the observations that represent velocity, reducing the original four observations to two: the position of the cart and the angle of the pole. The agent therefore needs the ability to infer velocity from successive positions.

[Algorithm 1 (model building): (2) compute the sufficient features φ^h_{n,t}, φ^O_{n,t}, φ^A_{n,t}, ζ^O_{n,t}, ζ^A_{n,t} of every trajectory (n denotes the nth trajectory); (3) establish the recurrent predictive state representation; (4) initialize the PSR by two-stage regression; (5) use the kernel Bayes rule to estimate p_{n,t}, q_{n,t}; (6) apply the least squares method to formula (2) to compute W_ext; (7) set p_0 to the average of p_{n,t}; (8) local optimization.]

Reacher-v1: this environment involves a 2-link robot arm connected to a central point. The goal of this task is to move the endpoint of the robot arm to the target location. The reward at each step is the negative of the sum of the distance between the endpoint of the robot arm and the target point. To make the environment partially observable, we reduce the original six observable values to four, which represent the angles of the two links and the relative distance between the link and the target position. This task requires finding a balance between exploration and exploitation.
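The velocity-masking modification described for CartPole can be sketched as a small environment wrapper. This is a minimal sketch: the class name and the old four-tuple `step` interface are our assumptions, and the kept indices follow the standard CartPole-v1 observation layout (position at index 0, pole angle at index 2).

```python
import numpy as np

class MaskObservation:
    """Hides velocity components to make CartPole partially observable,
    keeping only cart position and pole angle."""

    def __init__(self, env, keep=(0, 2)):
        self.env = env
        self.keep = list(keep)  # indices of the observation entries to expose

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        return np.asarray(obs)[self.keep]

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return np.asarray(obs)[self.keep], reward, done, info
```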
In this section, we compare methods using two metrics: the best reward, which is the best return R_n over all iterations, where R_n is the total return of the nth iteration, and the mean reward, R̄_n = (1/25) Σ_{i=n−24}^{n} R_i, the average return over the last 25 iterations. Comparison to model-free methods: we compare the performance of RPSR-DQN with the model-free methods DQN-1frame and DRQN. The result is shown in Figure 3. Compared with DQN-1frame, which selects the best action using only the current observation, RPSR-DQN shows that the predictive state model can effectively track and update the state of the environment. Because RPSR-DQN has a model-learning stage, it learns faster than DRQN and converges to a stable state in fewer iterations. Even with sufficient update iterations, RPSR-DQN still obtains better rewards than DRQN in the final stable situation. The first three rows of Tables 1 and 2 show the numerical results, which include the performance of the three methods in all tasks. Comparison to policy-based methods: Figure 4 shows the results of comparing RPSR-DQN with the policy-based method RPSP [14]. Note that, as a policy-based method, RPSP can be applied to both continuous and discrete environments. In the discrete-action environment, our method obtains better mean rewards in the final stable situation than the policy-gradient method RPSP. In the Reacher-v1 task, the reasons for the ineffectiveness of RPSP may be as follows: the initial random weights tend to produce strongly positive or negative outputs, which means that most initial actions give the link the maximum or minimum acceleration. This causes a problem: the link manipulator cannot stop its rotary movement as long as maximum force is applied at the joint. In this case, once the robot has started training, this meaningless state will cause it to deviate from its current strategy.
RPSP may not explore enough to select the action that stops the link manipulator from rotating. The last two rows of Tables 1 and 2 show the numerical results, which include the performance of the two methods in all tasks.

Conclusion
In this paper, we propose RPSR-DQN, a method that can learn a model and make decisions in partially observable environments. Combining the predictive state model with a value-based approach results in good performance in a partially observable environment. We compare RPSR-DQN with DRQN in different partially observable environments and show that our method achieves better performance in terms of learning speed and expected rewards. We also compare our approach with the state-of-the-art recurrent predictive state policy (RPSP) networks, in which the PSR model and a reactive policy are trained simultaneously in an end-to-end manner. Experimental results show that, with the benefits of the DQN framework and the separation of model learning from policy training, our approach outperforms the state-of-the-art baselines.

Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.