Ensemble Network Architecture for Deep Reinforcement Learning

The popular deep Q learning algorithm is known to be instability because of the Q-value’s shake and overestimation action values under certain conditions. These issues tend to adversely affect their performance. In this paper, we develop the ensemble network architecture for deep reinforcement learning which is based on value function approximation.The temporal ensemble stabilizes the training process by reducing the variance of target approximation error and the ensemble of target values reduces the overestimate and makes better performance by estimating more accurate Q-value. Our results show that this architecture leads to statistically significant better value evaluation and more stable and better performance on several classical control tasks at OpenAI Gym environment.


Introduction
Reinforcement learning (RL) algorithms [1,2] are very suitable for learning to control an agent by letting it interact with an environment.In recent years, deep neural networks (DNN) have been introduced into reinforcement learning, and they have achieved a great success on the value function approximation.The first deep -network (DQN) algorithm which successfully combines a powerful nonlinear function approximation technique known as DNN together with the -learning algorithm was proposed by Mnih et al. [3].In this paper, experience replay mechanism was proposed.Following the DQN work, a variety of solutions have been proposed to stabilize the algorithms [3][4][5][6][7][8][9].The deep -networks classes have achieved unprecedented success in challenging domains such as Atari 2600 and some other games.
Although DQN algorithms have been successful in solving many problems because of their powerful function approximation ability and strong generalization between similar state inputs, they are still poor in solving some issues.Two reasons for this are as follows: (a) the randomness of the sampling is likely to lead to serious shock and (b) these systematic errors might cause instability, poor performance, and sometimes divergence of learning.In order to address these issues, the averaged target DQN (ADQN) [10] algorithm is implemented to construct target values by combining target -networks continuously with a single learning network, and the Bootstrapped DQN [11] algorithm is proposed to get more efficient exploration and better performance with the use of several -networks learning in parallel.Although these algorithms do reduce the overestimate, they do not evaluate the importance of the past learned networks.Besides, high variance in target values combined with the max operator still exists.
There are some ensemble algorithms [4,12] solving this issue in reinforcement learning, but these existing algorithms are not compatible with nonlinearly parameterized value functions.
In this paper, we propose the ensemble algorithm as a solution to this problem.In order to enhance learning speed and final performance, we combine multiple reinforcement learning algorithms in a single agent with several ensemble algorithms to determine the actions or action probabilities.In supervised learning, ensemble algorithms such as bagging, boosting, and mixtures of experts [13] are often used for learning and combining multiple classifiers.But in RL, ensemble algorithms are used for representing and learning the value function.
Based on an agent integrated with multiple reinforcement learning algorithms, multiple value functions are learned at the same time.The ensembles combine the policies derived from the value functions in a final policy for the agent.The majority voting (MV), the rank voting (RV), the Boltzmann multiplication (BM), and the Boltzmann addition (BA) are used to combine RL algorithms.While these methods are costly in deep reinforcement learning (DRL) algorithms, we combine different DRL algorithms that learn separate value functions and policies.Therefore in our ensemble approaches we combine the different policies derived from the update targets learned by deep -networks, deep Sarsa networks, double deep -networks, and other DRL algorithms.As a consequence, this leads to reduced overestimations, more stable learning process, and improved performance.

Related Work
2.1.Reinforcement Learning.Reinforcement learning is a machine learning method that allows the system to interact with and learn from the environment to maximize cumulative return rewards.Assume that the standard reinforcement learning setting where an agent interacts with the environment .We can describe this process with Markov Decision Processes (MDP) [2,9].It can be specified as a tuple (, , , , ).At each step , the agent receives a state   , and select an action   from the set of legal actions  according to the policy , where  is a policy mapping sequences to actions.The action is passed to the environment .In addition, the agent receives the next state  +1 and a reward signal   .This process continues until the agent reaches a terminal state.
The agent seeks to maximize the expected discounted return, where we define the future discounted return at time  as   = ∑ ∞ =0    + with discount factor  ∈ (0, 1].The goal of the RL agent is to learn a policy which makes the future discounted return maximize.For an agent behaving according to a stochastic policy , the value of the stateaction pair can be defined as follows:   (, ) = {  |   = ,   = , }.The optimal action-value function  satisfies the Bellman equation The reinforcement learning algorithms estimate the action value function by iteratively updating the Bellman equation  * (, ) =    ∼ [ +  max    * (  ,   ) | , ].When  → ∞, the algorithm makes -value function converge to the optimal action value function [1].If the optimal function  * is known, the agent can select optimal actions by selecting the action with the maximal value in a state:  * = argmax   * (, ).

Target Deep 𝑄
Learning.RL agents update their model parameters while they observe a stream of transitions like (  ,   ,  +1 ,  +1 ).They discard the incoming data after a single update.There are two issues with this method.The first one is that there are strong correlations among the incoming data, which may break the assumption of many popular stochastic gradient-based algorithms.Secondly, the minor changes in the  function may result in a huge change in the policy, which makes the algorithm difficult to converge [7,9,14,15].
As for the deep -networks algorithms proposed in (Mnih et al., 2013), two aspects are improved.On the one hand, the action value function is approximated by the DNN, DQN uses the DNN with a parameter  to approximate the value function, (, ; ) ≈  * (, ; ); on the other hand, the experience replay mechanism is adopted.The algorithm learns from sampled transitions from an experience buffer, rather than learning fully online.This mechanism makes it possible to break the temporal correlations by mixing more and less recent experience for updating and training.This model free reinforcement learning algorithm solves the problem of "model disaster" and uses the generalized approximation method of the value function to solve the problem of "dimension disaster." The convergence issue was mentioned in 2015 by Schaul et al. [14].The above -learning update rules can be directly implemented in a neural network.DQN uses the DNN with parameters  to approximate the value function ∘ .The parameter  updates from transition (  ,   ,  +1 ,  +1 ) are given by the following [11]: with The update targets for Sarsa can be described as follows: where  is the scalar learning rate. − are target network parameters which are fixed to  − =   .In case the squared error is taken as a loss function   (  ) =    ∼ (  − (, ;   )) 2 .
In general, experience replay can reduce the amount of experience required to learn and replace it with more computation and more memory, which are often cheaper resources than the RL agent's interactions with its environment [14].

Double Deep 𝑄 Learning.
In -learning and DQN, the max operator uses the same values to both select and evaluate an action.This can therefore lead to overoptimistic value estimates (van Hasselt, 2010).To mitigate this problem, the update targets value of double -learning error can then be written as follows: DDQN is the same as for DQN [8], but with the target  DQN  replaced with  DDQN  .

Ensemble Methods for Deep Reinforcement Learning
As Note that the more recent target network is likely to be more accurate at the beginning of the training and the accuracy of the target networks is increasing as the training goes on.So we denote a learning rate parameter  ∈ (0, 1] here for target network.The weight of th target network is   =  −1 / ∑  =1  −1 .So the learned -value function by temporal ensemble can be described as follows: ) .
As lim →1  −1 / ∑  =1  −1 = 1/, we can see that the target networks have the same weights when  equals 1.This formula indicates that the closer the target networks are, the greater the target networks' weight is.As target networks become more accurate, their weights become equal.The loss function remains the same as in DQN and so does the parameter update equation: In every iteration, the parameters of the oldest ones are removed from the target network buffer and the newest ones are added to the buffer.Note that the -value functions are inaccurate at the beginning of training.So the parameter  may be a function of time and even the state space.

Ensemble of Target
Besides these update targets formula, other algorithms based on value function approximators can be also used to combine.The update targets according to the algorithm  at time  will be denoted by   = ∑  =1      .The loss function remains the same as in DQN and so does the parameter update equation: The architecture of our ensemble algorithm is shown in Figure 1; these two parts are combined together by evaluated network.
The temporal ensemble stabilizes the training process by reducing the variance of target approximation error [10].Besides, the ensemble of target values reduces the overestimate and makes better performance by estimating more accurate -value.The temporal and target values ensemble algorithm are given by Algorithm 1.
As the ensemble network architecture shares the same input-output interface with standard -networks and target networks, we can recycle all learning algorithms with -networks to train the ensemble architecture.

Experiments
4.1.Experimental Setup.So far, we have carried out our experiments on several classical control and Box2D environments on OpenAI Gym: CartPole-v0, MountainCar-v0, and LunarLander-v2 [15].We use the same network architecture, learning algorithms, and hyperparameters for all these environments.
We trained the algorithms using 10,000 episodes and used the Adaptive Moment Estimation (Adam) algorithm to minimize the loss with learning rate  = 0.00001 and set  the batch size to 32.The summary of the configuration is provided below.The target network updated each 300 steps.The behavior policy during training was -greedy with  annealed linearly from 1 to 0.01 over the first five thousands steps and fixed at 0.01 thereafter.We used a replay memory of ten thousands most recent transitions.We independently executed each method 10 times, respectively, on every task.For each running time, the learned policy will be tested 100 times without exploration noise or prior knowledge by every 100 training episodes to calculate the average scores.We report the mean and standard deviation of the convergence episodes and the scores of the best policy.

Results and Analysis.
We consider three baseline algorithms that use target network and value function approximation, namely, the version of the DQN algorithm from the Nature paper [8], DSN that reduce over estimation [17], and DDQN that substantially improved the state-of-the-art by reducing the overestimation bias with double -learning [9].
Using this 10 no-ops performance measure, it is clear that the ensemble network does substantially better than a single network.For comparison we also show results for DQN, DSN, and DDQN. Figure 2 shows the improvement of the ensemble network over the baseline single network of DQN, DSN, and DDQN.Again, we see that the improvements are often very dramatic.
The results in Table 1 show that algorithms we presented can successfully train neural network controllers on the classical control domain on OpenAI Gym.A detailed comparison shows that there are several games in which TE DQN greatly improves upon DQN, DSN, and DDQN.Noteworthy examples include CartPole-v0 (performance has been improved by 13.6%, 79.5%, and 7.8%, and variance has been reduced by 100%, 100%, and 100%), MountainCar-v0 (performance has been improved by 26.7%, 21.2%, and 24.8%, and variance has been reduced by 31.6%, 77.9%, and 8.4%), and LunarLander-v2 (performance has been improved by 28.3%, 32.8%, and 50.5%, and variance has been reduced by 19.2%, 46.4%, and 50.5%).

Conclusion
We introduced a new learning architecture, making temporal extension and the ensemble of target values for deep  learning algorithms, while sharing a common learning module.The new ensemble architecture, in combination with some algorithmic improvements, leads to dramatic improvements  Although the ensemble algorithms are superior to a single reinforcement learning algorithm, it is noted that the computational complexity is higher.The experiments also show that the temporal ensemble makes the training process more stable, and the ensemble of a variety of algorithms makes the estimation of the -value more accurate.The combination of the two ways enables the training to achieve a stable convergence.This is due to the fact that ensembles improve independent algorithms most if the algorithms predictions are less correlated.So that the output of the -network based on the choice of action can achieve balance between exploration and exploitation.
In fact, the independence of the ensemble algorithms and their elements is very important on the performance for ensemble algorithms.In further works, we want to analyze the role of each algorithm and each -network in different stages, so as to further enhance the performance of the ensemble algorithm.

) 3 . 3 .
The Ensemble Network Architecture.The temporal and target values ensemble algorithm (TEDQN) is an integrated architecture of the value-based DRL algorithms.As shown in Sections 3.1 and 3.2, the ensemble network architecture has two parts to avoid divergence and improve performance.

Figure 2 :
Figure 2: Training curves tracking the agent's average score and average predicted action-value.(a) Performance comparison of all algorithms in terms of the average reward on each task.(b) Average predicted action-value on a held-out set of states on each task.Each point on the curve is the average of the action-value  computed over the held-out set of states.(c) The performance of DQN and TEDQN on each task.The darker line shows the average scores of each algorithm, and the orange shaded area shows the two extreme values of DQN and the green shaded area shows TE DQN.
So we can solve this issue by integrating different versions of the target network.In contrast to a single classifier, ensemble algorithms in a system have been shown to be more effective.They can lead to a higher accuracy.Bagging, boosting, and Ada Boosting are methods to train multiple classifiers.But in RL, ensemble algorithms are used for representing and learning the value function.They are combined by major voting, Rank Voting, Boltzmann Multiplication, mixture model, and other ensemble methods.If the errors of the single classifiers are not strongly correlated, this can significantly improve the classification accuracy.3.1.Temporal Ensemble.As described in Section 2.2, the DQN classes of deep reinforcement learning algorithms use a target network with parameters  − copied from   every  steps.Temporal Ensemble method is suitable for the algorithms which use a target network for updating and training.Temporal ensemble uses the previous  learned networks to produce the value estimate and builds up  ∈  complete networks with  distinct memory buffers.The recent -value function is trained according to its own target network (, ;   ).So each one of -value functions  1 ,  2 , . . .,   represents temporally extended estimate of value function.
DQN classes use DNNs to approximate the value function, it has strong generalization ability between similar state inputs.The generalization can cause divergence in the case of repeated bootstrapped temporal difference updates.

Table 1 :
The columns present the average performance of DQN, DSN, DDQN, EDQN, and TE-DQN after 10000 episodes, using -greedy policy with  = 0.0001 after 10000 steps.The standard variation represents the variability over seven independent trials.Average performance improved with the number of averaged networks.existing approaches for deep RL in the challenging classical control issues.In practice, this ensemble architecture can be very convenient to integrate the RL methods based on the approximate value function. over