Variational Quantum Circuit-Based Reinforcement Learning for POMDP and Experimental Implementation

Variational quantum circuits have been proposed for applications in supervised learning and reinforcement learning to harness potential quantum advantage. However, many practical applications in robotics and time-series analysis involve partially observable environments. In this work, we propose an algorithm based on variational quantum circuits for reinforcement learning in partially observable environments. Simulations suggest a learning advantage over several classical counterparts. The learned parameters are then tested on IBMQ systems to demonstrate the applicability of our approach for real-machine-based predictions.


Introduction
The combination of deep neural networks and reinforcement learning has been demonstrated as an effective way to tackle computational problems that are difficult for other traditional approaches [1,2]. In the usual reinforcement learning setting, the underlying model is the Markov decision process (MDP) [3,4], in which the learning agent can completely observe the state of the environment. However, in many real-world applications such as robotics [5], the observations come from the sensors of mobile robots and are hence limited. In such cases, it is necessary to model the problem as a partially observable Markov decision process (POMDP) [6,7]. POMDP is a framework for environments in which complete information cannot be obtained. One difficulty that occurs in robotic POMDP settings is the perceptual aliasing problem, in which the learning agent cannot distinguish one state from another due to its limited observation ability. To make proper decisions under limited observation, the agent has to memorize its history to distinguish one state from another.
One traditional method for POMDPs is belief value iteration [5–7], where the agent maintains a belief distribution over possible states. The value function then becomes a functional over continuous functions and is hence computationally expensive. To deal with these computational difficulties, other methods using Monte Carlo search [5,8] and recurrent neural networks [9,10] have been proposed. These methods are difficult to execute without sufficient computing capacity and memory. Complex-valued reinforcement learning has been proposed as a POMDP algorithm that can be executed with fewer computational resources [11]. In complex-valued reinforcement learning, the action value function returns a complex number, and an additional complex-valued internal reference value represents time-series information. Tables [11,12] and neural networks [13,14] have been used to express complex action value functions. However, expressing correlated complex numbers on a classical computer is not a good choice from the viewpoint of memory efficiency. On the other hand, since quantum computers perform calculations over a Hilbert space, it is reasonable to expect that quantum processors can represent complex-valued functions efficiently. If the abovementioned complex action value function can be expressed by a quantum computer, it may yield a more memory-efficient POMDP learning method.
In this work, we propose a method for performing complex-valued reinforcement learning by expressing a complex action value function with a quantum computer. Previous works apply current quantum hardware [15] to supervised machine learning using the variational quantum circuit method [16–18]. Applications of variational quantum circuits to the value function in reinforcement learning have also been demonstrated [19–22]. Some works use classical-quantum hybrid models to solve large problems [23,24]. Other works use variational quantum circuits to represent the policy in reinforcement learning [25,26]. Variational quantum circuits have also been applied to represent both the policy and the value function in the actor-critic method [27]. Quantum algorithms have further been applied to sampling from the policy in deep energy-based reinforcement learning [28]. All of these quantum algorithms for reinforcement learning are based on the MDP. We implement a variational quantum circuit for complex-valued action value function approximation and compare its performance against other methods such as complex-valued neural networks. The learning performance shows an advantage over several classical methods. We further use the parameters learned from simulation for predictions on IBMQ systems. The agent is able to reach the goal state with the predictions made by the IBM machines.
This discovery suggests that the use of variational quantum circuits for POMDPs provides a possible advantage. This paper is organized as follows. The Background section introduces the concept of the POMDP. In the Methods section, variational quantum circuits and neural networks are applied to complex-valued Q-learning. In the Results and Discussion section, the results of the partially observable maze environment experiments are shown. The Conclusions section presents conclusions and future work.
1.1. Background. POMDP is a general framework for planning in environments where perfect information cannot be obtained. In a POMDP, the agent cannot fully observe the state but instead receives an observation from the environment, and the observation does not satisfy the Markov property. A POMDP is defined as a tuple (S, A, T, R, Ω, O, γ). S is the set of states. A is the set of actions. T is the state transition probability. R is the reward function. Ω is the set of observations. O is the observation probability. γ is the discount rate.
When the agent executes an action a in the environment, the state transitions from a state s to a next state s′ according to the state transition probability T = Pr(s′|s, a), and the agent receives an observation o from the environment according to the observation probability O = Pr(o|s′, a) and a reward r according to the reward function R(s, a). The history h_t is the time series of actions and observations and is expressed as h_t = (a_0, o_1, a_1, o_2, . . . , a_{t−1}, o_t). As the agent cannot fully observe the state, it uses the belief state, which is the probability distribution representing which state the agent is in. The belief state B is denoted by B(s, h) = Pr(s|h), and it can be updated by the following formula:

B(s′, hao) = Pr(o|s′, a) Σ_{s∈S} Pr(s′|s, a) B(s, h) / Pr(o|h, a),   (1)

where Pr(o|h, a) is a normalizing constant. The agent's policy is denoted by π(h, a) = Pr(a|h) or π(b, a) = Pr(a|b). The purpose of the agent is to obtain a policy π* that maximizes the expected total reward E_π[Σ_{i=t}^{T} γ^{i−t} r_i]. There are two types of methods for POMDP planning. One is based on the value iteration method [7,29]. This type of method updates the belief state and the value function using a model. The belief state is updated by equation (1), and the value function is updated by updating the set of alpha vectors. Another method uses a black box simulator [8]. A black box simulator is a model that receives a state and an action and returns the next state, observation, and reward. This method executes Monte Carlo tree search and a Monte Carlo update of the belief state using this simulator. Although these two planning methods use a model or a black box model, a model is normally not given to the agent, so algorithms that learn the model or algorithms that do not use a model are needed. The former are model-based methods, and the latter are model-free methods.
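The belief update above can be sketched in a few lines of NumPy. The array layouts (T indexed as [s, a, s′], O indexed as [s′, a, o]) and the tiny two-state example are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """POMDP belief update: b'(s') ∝ Pr(o | s', a) * sum_s Pr(s' | s, a) * b(s)."""
    predicted = b @ T[:, a, :]                # sum_s Pr(s'|s,a) b(s)
    unnormalized = O[:, a, o] * predicted     # weight by observation likelihood
    return unnormalized / unnormalized.sum()  # divide by the normalizer Pr(o|h,a)

# Two-state example: action 0 deterministically swaps the states,
# and the observations are uninformative (probability 1/2 each).
T = np.zeros((2, 1, 2)); T[0, 0, 1] = T[1, 0, 0] = 1.0
O = np.full((2, 1, 2), 0.5)
b_next = belief_update(np.array([1.0, 0.0]), a=0, o=0, T=T, O=O)
```

Starting from certainty in state 0, the updated belief is certain of state 1, as the deterministic transition dictates.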
Model-based methods infer the model and then execute planning using the learned model. To learn the model, Bayesian methods [30,31] and a nonparametric approach [32] have been proposed. Model-free methods include RNN methods [9,10,33], complex-valued reinforcement learning [11–13], etc. These methods do not use a belief state but incorporate time-series information directly into the value function. In this paper, we focus on model-free methods, especially complex-valued reinforcement learning.

Complex-Valued Q-Learning for POMDP.
The POMDP problem that we are interested in here can be described by a tuple (S, A, T, R, Ω, O, γ). S is a discrete set of states. A is a discrete set of actions. T(s′|s, a) is a state transition probability matrix for s, s′ ∈ S and a ∈ A. R(s, a) is a reward function for s ∈ S and a ∈ A. Ω is a discrete set of observations. O(o|s, a) is an observation probability matrix for o ∈ Ω, s ∈ S, and a ∈ A. γ ∈ (0, 1) is the discount rate. In the examples in this work, both the state transition and the observation are deterministic. We look for a policy that maximizes the expected future cumulative reward. The policy can be derived from the action value Q-function in Q-learning.
Complex-valued Q-learning is based on the iteration

Q̇(o_t, a_t) ← (1 − α) Q̇(o_t, a_t) + α (r_{t+1} + γ Q̇_max(t)),   (2)

where Q̇ is the complex-valued observation-action value function. The dot notation Ẋ for a quantity X is used throughout the manuscript as a reminder that the quantity is complex-valued. o_t and a_t are the observation and action at time t, respectively. The learning rate α ∈ (0, 1) is a real number, as is the reward r_{t+1}. γ ∈ (0, 1) is the discount rate, which is also a real number. β̇ = e^{iω} is a complex hyperparameter for some real constant ω. Q̇_max(t) is defined as

Q̇_max(t) = β̇ Q̇(o_{t+1}, a_max),  a_max = argmax_{a∈A} R[Q̇(o_{t+1}, a) İ_t*].   (3)

Here, İ_t* means the complex conjugate of the internal reference value İ_t, and R denotes taking the real part of a complex number. The learned time series is reproduced by updating each complex number in the opposite phase direction. An eligibility trace method is further implemented such that the action value function is updated according to

Q̇(o_{t−k}, a_{t−k}) ← (1 − α) Q̇(o_{t−k}, a_{t−k}) + α (r_{t+1} + γ Q̇_max(t)) u̇_t(k),   (4)

where u̇_t(k) = β̇^k for 0 ≤ k < N_e and N_e is the trace length. This update rule is exact for the table method, where the complete Q̇-table must be stored in memory. In the variational approaches, the Q̇ function is variationally optimized by minimizing the functional

L(θ) = Σ_{k=0}^{N_e−1} |(r_{t+1} + γ Q̇_max(t)) u̇_t(k) − Q̇(o_{t−k}, a_{t−k}; θ)|²,   (5)

with gradient descent θ ← θ − α(∂L/∂θ). Finally, the policy is the Boltzmann stochastic policy

π(a|o_t) = exp(R[Q̇(o_t, a) İ_t*]/T) / Σ_{a′∈A} exp(R[Q̇(o_t, a′) İ_t*]/T),   (6)

where T is a temperature hyperparameter.
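A minimal tabular sketch of this learning rule (without the eligibility trace) is given below. The greedy rule inside Q̇_max, the rotation of the internal reference by β̇ at every step, and all sizes are our illustrative reading of the description above, not verbatim details of the paper.

```python
import numpy as np

n_obs, n_act = 4, 2
Q = np.zeros((n_obs, n_act), dtype=complex)   # complex-valued Q-table
alpha, gamma = 0.1, 0.9
beta = np.exp(1j * np.pi / 6)                 # complex hyperparameter e^{i omega}
I_ref = 1.0 + 0.0j                            # internal reference value (assumed init)

def q_update(o, a, r, o_next):
    """One complex-valued Q-learning step (assumed reading of the update rule)."""
    global I_ref
    a_max = np.argmax(np.real(Q[o_next] * np.conj(I_ref)))  # greedy wrt Re[Q conj(I)]
    q_max = beta * Q[o_next, a_max]
    Q[o, a] = (1 - alpha) * Q[o, a] + alpha * (r + gamma * q_max)
    I_ref = beta * I_ref                      # rotate the reference each step (assumed)

def boltzmann(o, temperature=0.5):
    """Boltzmann stochastic policy over Re[Q(o, a) conj(I_ref)]."""
    logits = np.real(Q[o] * np.conj(I_ref)) / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()

q_update(o=0, a=1, r=1.0, o_next=2)
```

With an all-zero table, the first update simply moves Q̇(0, 1) toward α·r; the phase structure only appears once nonzero complex targets propagate.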

Variational Quantum Circuit for Q-Function.
The Q̇ function is constructed from the expectation value

Q̇(o, a) = c_Q ⟨ψ_out| U_out(θ_out, a) U(θ) U_in(θ_in(o, a)) |0⟩^⊗n,   (7)

where c_Q is a scaling constant and θ is the set of trainable variational parameters of the trainable unitary U(θ), which is composed of l + 1 layers; the quantity (l + 1) will be called the circuit depth. θ_in(o, a) is a set of input parameters which encode the observation o and action a; θ_in(o, a) parameterizes the input-layer local unitaries. U_out(θ_out, a) is an output unitary to be measured by a Hadamard test [34] against the output wave function ψ_out. U_out(θ_out, a) is parameterized by trainable parameters θ_out and the action a. The circuit structure is depicted in Figure 1. We implement three types of encoding schemes, which are summarized in the following paragraphs.
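As a numerical illustration of the Hadamard test used here: for the real-part circuit, the ancilla measures 0 with probability (1 + R⟨ψ_out|U|ψ_in⟩)/2, and the variant with an extra S† on the ancilla gives the imaginary part. The random state and the local phase gate below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_state(n_qubits):
    v = rng.normal(size=2**n_qubits) + 1j * rng.normal(size=2**n_qubits)
    return v / np.linalg.norm(v)

def hadamard_test_probs(psi_out, U, psi_in):
    """Ancilla-0 probabilities of the real- and imaginary-part Hadamard tests."""
    amp = np.vdot(psi_out, U @ psi_in)   # exact complex overlap <psi_out|U|psi_in>
    return (1 + amp.real) / 2, (1 + amp.imag) / 2, amp

psi = random_state(2)
U = np.kron(np.diag([1, 1j]), np.eye(2))          # S gate on one qubit (illustrative)
p_re, p_im, amp = hadamard_test_probs(psi, U, psi)
estimate = (2 * p_re - 1) + 1j * (2 * p_im - 1)   # recover the complex overlap
```

On hardware, p_re and p_im are estimated from a finite number of shots, so the recovered overlap carries shot noise; here they are exact.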
In Type 1 and Type 3 quantum circuits, the input layer is

U_in^i = R_Z^i(θ_in^Z) R_Y^i(θ_in^Y),   (8)

where the index i means the local rotation acts on the i-th qubit. The parameters θ_in^Z and θ_in^Y are uniform for all the qubits (independent of the qubit index i). For the Type 1 circuit, the encoding is obtained by scaling the rotation angles with the observation o. For the Type 3 circuit, the encoding is done by the binary representation of the observation o, with the i-th binary digit of o encoded into the local rotations on the i-th qubit. The output layer consists of local rotations R_Z(θ_{i,out}^{Z′}) R_Y(θ_{i,out}^{Y}) R_Z(θ_{i,out}^{Z}), where all the θ_{i,out}^{Z′}, θ_{i,out}^{Y}, and θ_{i,out}^{Z} are trainable. For Type 1 and Type 2, we use U_out(θ_out, a) = ⊗_{i∈{0, a+1}} u_i(θ_out) for an action a ∈ A. Figure 2 depicts the quantum circuit encodings used in this work.
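The input-layer construction can be sketched as per-qubit R_Z R_Y rotations assembled by tensor products. The angle choices below (scaling by o for Type 1, π times the binary digits of o for Type 3) are illustrative assumptions about the encodings described above.

```python
import numpy as np

def rz(t):
    return np.array([[np.exp(-0.5j * t), 0], [0, np.exp(0.5j * t)]])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def input_layer(angles_z, angles_y):
    """Tensor product of per-qubit RZ(theta_z) RY(theta_y) local rotations."""
    U = np.eye(1, dtype=complex)
    for tz, ty in zip(angles_z, angles_y):
        U = np.kron(U, rz(tz) @ ry(ty))
    return U

n, o = 3, 5
type1 = [o * np.pi / 8] * n                         # uniform angles scaled by o
type3 = [((o >> i) & 1) * np.pi for i in range(n)]  # binary digits of o
U1 = input_layer(type1, type1)
U3 = input_layer(type3, type3)
```

Both layers are unitary by construction, so they can be prepended to any variational circuit acting on the same register.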
All circuit parameters are updated by gradient descent with the loss function (5). The gradient of this loss function with respect to each parameter is calculated by the back propagation method on a classical simulator. As a quantum circuit calculates the output state by the dot product of the gate matrix and the input state, this calculation is a special case of complex-valued neural networks. Therefore, the back propagation method of complex-valued neural networks [35] can be used for the back propagation of the quantum circuit. The gradient of loss function (5) with respect to the Q̇ function is calculated by

∂L/∂Q̇(o_{t−k}, a_{t−k}) = Q̇(o_{t−k}, a_{t−k}) − Q̇_Target,   (9)

where Q̇_Target = (r_{t+1} + γ Q̇_max(t)) u̇_t(k).
These back propagations are almost the same as the back propagation calculation of a neural network; the difference is the use of quantum gates. When a gate u(ϑ) with parameter ϑ is applied to an input |ψ_i⟩, the output |ψ_o⟩ is obtained by |ψ_o⟩ = U(ϑ)|ψ_i⟩, where U(ϑ) is the matrix transformed from the gate matrix u(ϑ) to calculate the inner product. The gradient is obtained by

∂L/∂|ψ_i⟩ = U(ϑ)† (∂L/∂|ψ_o⟩),   (10)

where ∂L/∂|ψ_o⟩ is the gradient of the loss function L with respect to the output |ψ_o⟩. The gradient with respect to the parameter ϑ is calculated by

∂L/∂ϑ = 2 R[Σ_{i,j} (∂L/∂|ψ_o⟩)_i* (∂U(ϑ)/∂ϑ)_{ij} (|ψ_i⟩)_j],   (11)

where M_{ij} represents the (i, j) element of the matrix M. Figure 3(b) shows the back propagation of the gate calculation. The gradient of the loss with respect to the last output is back propagated by equation (10), and the parameters are updated by the gradient of equation (11).
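The chain rule through a gate can be checked numerically. The R_Z gate and the quadratic loss below are illustrative assumptions; the back-propagated gradient agrees with a finite-difference derivative.

```python
import numpy as np

def rz(t):
    return np.array([[np.exp(-0.5j * t), 0], [0, np.exp(0.5j * t)]])

def drz(t):
    # element-wise derivative of the gate matrix with respect to its parameter
    return np.array([[-0.5j * np.exp(-0.5j * t), 0], [0, 0.5j * np.exp(0.5j * t)]])

psi_in = np.array([1.0, 1.0]) / np.sqrt(2)
target = np.array([1.0, 0.0], dtype=complex)
theta = 0.7

def loss(t):
    # real quadratic loss on the gate output (illustrative)
    return float(np.linalg.norm(rz(t) @ psi_in - target) ** 2)

# Back propagation: dL/dtheta = 2 Re[(psi_o - target)^dagger (dU/dtheta) psi_in]
psi_o = rz(theta) @ psi_in
grad_bp = 2 * np.real(np.vdot(psi_o - target, drz(theta) @ psi_in))

eps = 1e-6
grad_fd = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
```

The same pattern extends to multi-qubit layers: the loss gradient is pulled back through each gate matrix, and the parameter gradient is a real part of a matrix-element sum.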

Neural Networks.
For experimental comparison, we also implement complex-valued neural networks. The details of the network architectures are presented in this section. We use two types of neural networks in the experiments. Both Type 1 and Type 2 neural networks have three layers: one input layer, one hidden layer, and one output layer. The Type 1 neural network has two input-layer neurons and one output-layer neuron. Both the action a and the observation o are encoded in the input-layer neurons, and the output neuron gives Q̇(o, a). The Type 2 neural network has one input neuron and |A| output neurons. Only the observation o is encoded in the input-layer neuron, and each output neuron gives Q̇(o, a) for an action a. For both types, the single hidden layer contains 30 neurons. The networks are depicted in Figure 4. We use a learning rate of 0.0001 for the hidden layer and 0.001 for the output layer. We defined the Type 1 architecture based on [13] and the Type 2 architecture for comparison. We update the parameters by gradient descent using back propagation [35]. In complex-valued neural networks, a fully connected layer is calculated by

o = Wx + b,  z = f_c(o),   (12)

where x is the input vector, W is the weight matrix, b is the bias vector, o is the weighted sum of the inputs, and z is the output vector; they are all complex-valued. f_c(v) is the activation function for complex numbers, defined by

f_c(v) = f(R(v)) + i f(I(v)),   (13)

where f(u) is an activation function for real numbers and I denotes taking the imaginary part.
The gradients for the back propagation are

∂L/∂W = (∂L/∂o) x*ᵀ,  ∂L/∂b = ∂L/∂o,  ∂L/∂x = W*ᵀ (∂L/∂o),

with ∂L/∂o = R(∂L/∂z) ⊙ f′(R(o)) + i I(∂L/∂z) ⊙ f′(I(o)), where a* means the complex conjugate of a, ⊙ is the element-wise product, and ∂L/∂z is the gradient of the loss L with respect to the output vector z of this layer. The parameters are updated by gradient descent using these gradients.
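The complex fully connected layer and its back propagation can be sketched as follows, with a tanh split activation as an illustrative choice of f; the weight gradient is checked against a finite difference on its real part.

```python
import numpy as np

f = np.tanh
df = lambda u: 1.0 - np.tanh(u) ** 2

def forward(W, b, x):
    o = W @ x + b                        # complex weighted sum
    z = f(o.real) + 1j * f(o.imag)       # split activation f_c
    return o, z

def backward(W, x, o, dL_dz):
    """Back propagation; dL_dz carries (dL/dRe z) + i (dL/dIm z)."""
    dL_do = dL_dz.real * df(o.real) + 1j * dL_dz.imag * df(o.imag)
    dL_dW = np.outer(dL_do, x.conj())    # conjugated input, as in complex backprop
    dL_dx = W.conj().T @ dL_do
    return dL_dW, dL_do, dL_dx           # dL/db equals dL/do

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
b, x, t = (rng.normal(size=2) + 1j * rng.normal(size=2) for _ in range(3))
o, z = forward(W, b, x)
dL_dW, dL_db, dL_dx = backward(W, x, o, 2 * (z - t))   # loss L = sum |z - t|^2
```

Because the activation treats real and imaginary parts separately, the gradient routes through f′ on each part independently before the conjugated outer product forms the weight gradient.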

The Maze Environment.
The partially observable maze environment used in our experiments is depicted in Figure 5. We defined the environment with reference to the maze environments of [11,12]. In particular, the structure is the same as that of the maze in [12]. Each cell in the maze represents a possible state. The agent starts from the start state and aims for the goal state.

Results and Discussion
We perform numerical simulations to compare the learning curves of the different approaches. Figure 6(a) shows the learning curves for the Q̇-table (blue solid line with circle marker), complex neural networks (orange solid line and green dashed line), quantum circuits (red dotted line, purple dashed-dotted line, and brown solid line with "+" cross marker), and long short-term memory (LSTM, pink solid line with "x" cross marker). The vertical axis is the number of steps to the goal, so a lower value means a better policy. The data is the average of 50 independent runs. Each episode has a maximum of 1000 steps. The total number of episodes is 5000, and the data is further smoothed by taking the average over 500 sequential episodes. In all the Q̇-learning experiments, the trace length is N_e = 6. For the LSTM, the history sequence length is 6. All the quantum circuits use 6 qubits with depth = 3. All the neural networks have one hidden layer consisting of 30 neurons. We observe that the Q̇-table method has the most stable learning curve, and its learned policy reaches the goal in the lowest number of steps. This is expected since the Q̇-table does not use any approximation in the representation of the Q̇-function. The Type 3 quantum circuit gives a poor result: its learned policy does not really improve the time to reach the goal. The performance of the Type 1 quantum circuit is not significantly different from the other classical complex neural network approaches. We note that the Type 2 quantum circuit provides the best result among all the Q̇ approximation schemes. It is even better than the LSTM approach in our experiments.
We compare the results for various hyperparameters. Figure 6(b) shows the learning curves for the Type 2 quantum circuit with different trace numbers. We observe that the performance is improved by using trace number 4 (green dashed line) rather than trace number 1 (blue solid line with circle marker) or 2 (orange solid line). However, the performance of trace number 10 (brown solid line with cross marker) is not good either. We observe the best learning curves for a trace number N_e around 6 (red dotted line).
Since the best performance is obtained by the Type 2 circuit with a trace number around 6, we fix the circuit type to Type 2 and the trace number to 6 and compare the learning curves for different widths and depths. Figure 7(a) shows the learning curves for various circuit depths. The learning curve improves as the circuit depth increases from 1 (blue solid line), 2 (orange dashed line), and 3 (green dotted line) to 4 (red solid line with circle marker) for n = 6. Figure 7(b) shows the learning curves for various circuit widths with fixed circuit depth = 4. We observe that a higher circuit width makes the learning task more difficult. The learning curve for n = 5 (blue solid line) is better than those of n = 6 (orange dashed line) and n = 7 (green dotted line).
We then take an offline learning scheme to compare the prediction performance of different machines. That is, the parameters are obtained from a state-vector simulator-based training process, and the predictions are made by various methods. The test results are depicted in Figure 8. We test the predictions using a NumPy-based state-vector simulator [36], the QASM simulator provided by Qiskit [37], and the IBMQ system ibmq_guadalupe [38]. The horizontal axis "episode" indicates the number of training episodes; hence, episodes = 1000 means that 1000 episodes of training are performed and the learned parameters are then used for the corresponding prediction experiment. Each data point is the average of 5000 runs. The 5000 runs comprise 50 independently learned parameter sets, with 100 prediction experiments for each set of learned parameters. The number of shots for the Hadamard-test estimation is 4096 for the QASM simulator. The quantum circuit type is Type 2, the number of qubits is 5, the circuit depth is 4, and the trace number is 6. For the IBMQ system, a set of parameters trained for 5000 episodes is used in a prediction experiment with 1024 shots. Five experiments are executed, and the numbers of steps to reach the target state are 67, 1000, 502, 1000, and 1000, respectively. Since the maximum number of steps is 1000, "1000 steps" means that the agent does not reach the target in that experiment.

Conclusions
In this work, we propose a quantum circuit algorithm for POMDPs based on complex-valued Q-learning. We implemented several encoding schemes and compared them to other classical approaches by numerical simulations. The observed learning curves suggest that, with a suitable encoding, the learning efficiency of quantum complex-valued Q-learning can be better than that of other classical methods such as complex-valued neural networks. The performance of the quantum circuit can be further improved by a suitable choice of hyperparameters. The parameters learned on simulators were then tested on IBMQ systems. The agent is able to reach the goal state with predictions made by real quantum machines. Our results provide a new method for POMDP problems with potential quantum advantage. The partially observable maze environment used in this work is a discrete-state environment. In future work, our proposed method could be applied to other simulation problems, such as the partially observable continuous-space Mountain Car environment [4,39]. Since the encoding scheme that showed the better results in this work applies only to discrete spaces, a new encoding scheme for continuous environments will be needed to solve continuous problems without discretization.

Data Availability
The code used to support the findings of this study has been deposited in the GitHub repository https://github.com/tomo920/QC-ComplexValuedRL.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Mathematical Problems in Engineering