Visual Navigation with Asynchronous Proximal Policy Optimization in Artificial Agents

Vanilla policy gradient methods suffer from high variance, leading to unstable policies during training, where the policy's performance fluctuates drastically between iterations. To address this issue, we analyze the policy optimization process of a navigation method based on deep reinforcement learning (DRL) that uses asynchronous gradient descent for optimization. We present a variant navigation method (asynchronous proximal policy optimization navigation, appoNav) that can guarantee monotonic policy improvement during policy optimization. Experiments in DeepMind Lab show that artificial agents with appoNav perform better than the compared algorithm.


Introduction
Navigation in an unstructured environment is one of the most important abilities for mobile robotics and artificial agents [1][2][3]. Traditional methods mainly divide navigation into several parts [4]: simultaneous localization and mapping (SLAM) [5][6][7], path planning [8], and semantic segmentation [9,10]. These methods do not form an end-to-end algorithm: each part is a challenging research subject in its own right, and fusing the parts often leads to large computational errors. To reduce the fusion error, we focus on end-to-end navigation based on deep reinforcement learning, where navigational abilities can emerge as the byproduct of an artificial agent learning a policy through reward maximization.
With the fast development of deep learning [11][12][13][14], a variety of DRL architectures have been proposed [2]. Mnih et al. [15] built on advances in training deep neural networks to develop the deep Q-network (DQN), which can learn successful policies directly from high-dimensional image inputs using end-to-end reinforcement learning. On-policy reinforcement learning methods such as actor-critic (AC) [16,17] were proposed, in which the actor is the policy and the critic is the baseline. Mnih et al. [18] presented asynchronous variants of AC algorithms, termed asynchronous advantage actor-critic (A3C), and showed that parallel actor-learners have a stabilizing effect on training artificial agents. Researchers can construct navigation agents based on these DRL algorithms. However, vanilla policy gradient methods have poor data efficiency [19], which leads to navigation agents suffering from high variance and unstable policies.
In this work, we take A3C as an example to show how to guarantee monotonic policy improvement. The training environment is DeepMind Lab [20], a first-person 3D virtual environment designed for research and development of general artificial intelligence. DeepMind Lab can be used to study how autonomous artificial agents learn complex tasks in large, partially observed, and visually diverse worlds. In addition, the worlds are rendered with rich science fiction-style visuals. The available actions let the agent look around and move in the 3D virtual world, and example tasks include navigation in different mazes. Mirowski et al. [21] proposed a DRL navigation method based on A3C [18], augmented with auxiliary learning targets, to train artificial agents to navigate in DeepMind Lab. For ease of expression, we call the DRL navigation method using A3C a3cNav.
In this paper, we analyze the issues of policy optimization for navigation based on the vanilla policy gradient; this type of navigation cannot control the change of the expected advantage when an artificial agent learns to navigate in a maze. Based on the navigation techniques presented in [21], we show how to reduce training variance and obtain higher reward when an artificial agent interacts with an environment. Inspired by [19,22], we adjust the policy update process of the navigation method in [21] to guarantee the monotonic improvement of the navigation policy. Experimental results show that an artificial agent using appoNav learns a better navigation policy in DeepMind Lab and has a lower reward standard deviation than a3cNav.

Related Work
Traditional navigation, which is model-based, includes simultaneous localization and mapping (SLAM) [5,7,23], path planning [8,24], and semantic segmentation [9]. Each of these parts is a challenging research area, and fusing them often leads to large computational errors. Moreover, model-based navigation needs to model the environment effectively in dynamic and complex scenes, which severely affects navigation performance.
With recent advances in DRL, many navigation methods based on DRL have been proposed [2]. DRL navigation, which is end to end, avoids the computational error caused by the fusion step of traditional navigation. Mirowski et al. [21] addressed navigation via auxiliary depth prediction and loop-closure classification tasks. Jaderberg et al. [25] also used auxiliary tasks for navigation and combined A3C with control tasks and prediction tasks, including pixel control and reward prediction. By using features extracted from the world model as inputs to an agent, Ha and Schmidhuber [26] used DRL to construct a world model and used the model in a car navigation task. Bruce et al. [27] leveraged an interactive world model based on DRL built from a single traversal of the environment and utilized a pretrained visual feature encoder to demonstrate successful zero-shot transfer under real-world environmental variations without fine-tuning. Banino et al. [28] proposed a vector-based navigation method that fuses DRL with grid-like representations in the artificial agent. When these DRL navigation agents interact with environments, the state sequences of each interaction change a lot, leading to large fluctuations in rewards. Therefore, these DRL navigation methods suffer from high variance and have unstable policies during training.

Reinforcement Learning.
We consider the standard reinforcement learning setting where an artificial agent interacts with an environment over a number of discrete time steps. At each time step $t$, the agent receives a state $s_t$ from the environment and outputs an action $a_t$ according to its learned policy $\pi$. In return, the environment gives the agent a next state $s_{t+1}$ and a reward $r_t$. The goal of reinforcement learning is to maximize the accumulated reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, which is a discounted sum of rewards with discount factor $\gamma \in (0, 1]$. The action-value function $Q^{\pi}(s, a) = E[R_t \mid s_t = s, a]$ is the expected return following action $a$ from state $s$ under policy $\pi$. The value function $V^{\pi}(s) = E[R_t \mid s_t = s]$ is the expected return from state $s$.
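As a small illustration of the discounted return defined above, the following sketch computes $R_t$ for every step of a finite episode. The function name `discounted_returns` and the example rewards are illustrative assumptions, and truncating the infinite sum at the end of the episode is a simplification.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return R_t = sum_k gamma^k * r_{t+k} for each step of a finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards, using the recursion R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Hypothetical episode: an apple (1 point), nothing, then a goal (10 points).
    print(discounted_returns([1.0, 0.0, 10.0], gamma=0.5))
```

The backward recursion avoids recomputing the sum at every step, so the cost is linear in the episode length.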
In policy-based methods, let $\pi(a \mid s; \theta)$ be a policy with parameters $\theta$, which is updated by performing gradient ascent on $E[R_t]$. Policy gradient algorithms adjust the policy by updating parameters $\theta$ in the direction $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta) R_t$. To reduce the variance of this estimate, Williams [29] subtracted a learned function called the baseline $b_t(s_t)$ from the return, so the improved gradient becomes $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)(R_t - b_t(s_t))$. When the baseline is an estimate of the value function, $R_t - b_t(s_t)$ can be seen as an estimate of the advantage of action $a_t$ in state $s_t$. $R_t$ is an estimate of $Q^{\pi}(s_t, a_t)$; hence, the advantage function can be rewritten as $A(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$. This method is called the actor-critic (AC) architecture, where the actor is the policy $\pi$ and the critic is the baseline $b_t$ [16,17]. Mnih et al. [18] presented asynchronous variants of AC algorithms, termed asynchronous advantage actor-critic (A3C), and showed that parallel actor-learners have a stabilizing effect on training artificial agents.
When a DRL agent interacts with its environment, the state sequences of each interaction change a lot, leading to fluctuations in rewards. Therefore, DRL algorithms (such as DQN and A3C) fluctuate unstably during training. This raises the question of whether such fluctuations can be reduced while maintaining a steady improvement in the policy. Schulman et al. [22] proposed trust region policy optimization (TRPO) to make the policy improve monotonically. Furthermore, Schulman et al. [19] proposed proximal policy optimization (PPO) to simplify the calculation of TRPO. In addition, Heess et al. [30] proposed a distributed implementation of PPO, called distributed PPO. Besides a gradient update process similar to that of A3C, distributed PPO includes various tricks, such as normalizations (observation normalization, reward rescaling, and per-batch normalization of the advantages), sharing of algorithm parameters across local workers, and an additional trust region constraint. These tricks make the computation of distributed PPO more complex than that of appoNav.
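To make the baseline-corrected gradient concrete, the sketch below evaluates $\nabla \log \pi(a_t \mid s_t)(R_t - b_t)$ for a softmax policy, taking the gradient with respect to the logits. The softmax parameterization and the names `softmax` and `policy_gradient_wrt_logits` are assumptions made for illustration; the networks used in this paper are much larger.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_wrt_logits(
    logits: List[float], action: int, ret: float, baseline: float
) -> List[float]:
    """Gradient of log pi(action) * (ret - baseline) with respect to the logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    """
    probs = softmax(logits)
    advantage = ret - baseline  # estimate of A(s_t, a_t)
    return [
        advantage * ((1.0 if i == action else 0.0) - p)
        for i, p in enumerate(probs)
    ]


if __name__ == "__main__":
    # Two equally likely actions; action 0 earned a return above the baseline.
    print(policy_gradient_wrt_logits([0.0, 0.0], action=0, ret=3.0, baseline=1.0))
```

A positive advantage pushes the logit of the chosen action up and the others down; subtracting the baseline changes only the scale and sign of the step, not which action it concerns.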

NavA3C + $D_1D_2$.
In this work, we use the NavA3C + $D_1D_2$ architecture [21] shown in Figure 1, which includes 2 CNNs and 2 LSTMs. NavA3C + $D_1D_2$ has 4 inputs: the current RGB image $x_t$, the previous reward $r_{t-1}$, the previous action $a_{t-1}$, and the current velocity $v_t$. The 2 CNNs act as the encoder for the RGB image $x_t$, and the first LSTM makes associations between the reward $r_{t-1}$ and the visual observations $x_t$, which are provided as context to the second LSTM, from which the policy $\pi(a_t \mid s_t; \theta)$ and the value $V(s_t; \theta_v)$ are computed. Artificial agents based on this architecture try to maximize the cumulative reward $R_t$ during their interaction with the maze and to minimize the auxiliary depth losses $L_{Depth1}$ and $L_{Depth2}$. In this way, the agent can learn how to navigate in DeepMind Lab. For ease of expression, we rename NavA3C + $D_1D_2$ as a3cNav. a3cNav is based on the A3C framework, into which unsupervised auxiliary tasks are incorporated. Therefore, its loss function includes the A3C loss $L_{A3C}$ and the losses of the auxiliary tasks. a3cNav can be optimized as follows:

$$L_{a3cNav} = L_{A3C} + \lambda_{Depth1} L_{Depth1} + \lambda_{Depth2} L_{Depth2}, \tag{1}$$

where $\lambda_{Depth1}$ and $\lambda_{Depth2}$ are weighting terms on the individual loss components. The global parameters $\theta$ of a3cNav are updated in a multithreaded environment, and $\theta$ is copied to the local worker parameters $\theta'$. Each local worker of a3cNav interacts with the maze, and the policy gradients with respect to $\theta'$ and the value gradients with respect to $\theta_v'$ are computed from the policy loss and the value loss. The policy gradient used for the parameter update is scaled by the advantage function $A_t$. Equation (2) shows the calculation of the gradients:

$$d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_t \mid s_t; \theta') A_t + \beta \nabla_{\theta'} H(\pi(s_t; \theta')), \qquad d\theta_v \leftarrow d\theta_v + \frac{\partial A_t^2}{\partial \theta_v'}, \tag{2}$$

where $A_t = R_t - V(s_t; \theta_v')$ is the advantage estimate, $\beta$ weights the entropy term, and $H(\pi(s_t; \theta'))$ is the entropy of the policy $\pi$, which improves exploration by discouraging premature convergence to suboptimal deterministic policies. Then, asynchronous updates of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$ are applied to the global network.
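The copy-then-apply pattern between the global network and the local workers can be sketched with Python threads. This is a toy illustration under stated assumptions, not the a3cNav implementation: the class `GlobalNetwork`, the quadratic toy objective in `toy_gradient`, the learning rate, and the use of a lock are all hypothetical choices made so that the example is deterministic enough to check.

```python
import threading
from typing import List


class GlobalNetwork:
    """Holds the shared parameters theta (hypothetical stand-in for the global net)."""

    def __init__(self, theta: List[float]) -> None:
        self.theta = list(theta)
        self._lock = threading.Lock()

    def snapshot(self) -> List[float]:
        """Copy the global parameters into local worker parameters theta'."""
        with self._lock:
            return list(self.theta)

    def apply_gradient(self, grad: List[float], lr: float) -> None:
        """Gradient ascent step on the global parameters using a worker's gradient."""
        with self._lock:
            for i, g in enumerate(grad):
                self.theta[i] += lr * g


def toy_gradient(theta_local: List[float], target: List[float]) -> List[float]:
    # Gradient of the toy objective -0.5 * ||theta - target||^2 (an assumption;
    # a real worker would compute the gradients of equation (2) from a rollout).
    return [t - x for x, t in zip(theta_local, target)]


def worker(net: GlobalNetwork, target: List[float], steps: int, lr: float) -> None:
    for _ in range(steps):
        theta_local = net.snapshot()              # theta -> theta'
        grad = toy_gradient(theta_local, target)  # computed on the local copy
        net.apply_gradient(grad, lr)              # asynchronous update of theta


def train(num_workers: int = 4, steps: int = 500, lr: float = 0.01) -> List[float]:
    net = GlobalNetwork([0.0, 0.0])
    target = [1.0, -2.0]
    threads = [
        threading.Thread(target=worker, args=(net, target, steps, lr))
        for _ in range(num_workers)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return net.theta


if __name__ == "__main__":
    print(train())
```

Because a worker may compute its gradient from a slightly stale copy of the parameters, the updates are not identical to sequential gradient ascent; the lock here only protects the read and write of the shared list.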

Monotonic Policy Improvement.
The artificial agent interacts randomly with the environment, which in turn gives high-dimensional images to the agent. Hence, a3cNav has poor data efficiency and robustness. In addition, a complex navigation environment that sends changing images to the artificial agent aggravates the variance and instability of training. In detail, each local worker of a3cNav interacts with the maze, and high-variance gradients are applied to the global network of a3cNav, leading to unstable training of the agent. In this section, we improve the parameter updates of a3cNav to guarantee monotonic improvement of its policy.
In [22], the expected cost of a policy can be rewritten as

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a), \tag{3}$$

where $\pi$ denotes a stochastic policy and $\tilde{\pi}$ is another policy. $\eta(\pi)$ and $\eta(\tilde{\pi})$ are the expected discounted costs of $\pi$ and $\tilde{\pi}$, respectively. Here, $\rho_{\tilde{\pi}}(s)$ is the distribution of the state $s$ according to $\tilde{\pi}$, and $A_{\pi}$ is the advantage function following $\pi$. Equation (3) implies that if we want to reduce $\eta$ or leave it constant, we should keep the expected advantage $\sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a) \le 0$ at every state $s$ under a policy update $\pi \to \tilde{\pi}$. This shows that if we want to reduce the training variance of a3cNav and keep its policy improving monotonically, we must guarantee $\sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a) \le 0$. However, a3cNav cannot control the change of the expected advantage when the artificial agent learns to navigate in the maze.
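A small worked example may help with the expected-advantage condition. The sketch below evaluates $\sum_a \tilde{\pi}(a \mid s) A_{\pi}(s, a)$ at a single state in the cost setting used above; the function name `expected_advantage` and all numbers are hypothetical.

```python
from typing import List


def expected_advantage(new_policy_probs: List[float], advantages: List[float]) -> float:
    """Compute sum_a pi_new(a|s) * A_pi(s, a) at one state s."""
    return sum(p * a for p, a in zip(new_policy_probs, advantages))


if __name__ == "__main__":
    # Hypothetical state with two actions whose costs under the old policy are
    # Q = [1.0, 3.0]. With pi = [0.5, 0.5] the state cost is V = 2.0, so the
    # advantages (in the cost sense) are A = Q - V = [-1.0, 1.0].
    advantages = [-1.0, 1.0]

    # Under the old policy itself the expected advantage is zero.
    print(expected_advantage([0.5, 0.5], advantages))

    # Shifting probability towards the cheaper action makes it negative,
    # which by equation (3) lowers the expected cost contribution of s.
    print(expected_advantage([0.8, 0.2], advantages))

    # Shifting towards the costlier action violates the condition.
    print(expected_advantage([0.2, 0.8], advantages))
```

The condition has to hold at every visited state, which is why an unconstrained update on noisy advantage estimates cannot ensure it.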
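To make the policy improve monotonically, Schulman et al. [22] proposed a trust region constraint on the policy update, shown in equation (4), to keep $\sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a) \le 0$:

$$E_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)\right]\right] \le \delta. \tag{4}$$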
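For discrete action spaces such as the one used here, the constraint in equation (4) can be checked directly from the action probabilities. The sketch below is a minimal illustration; the function names and the sample-mean estimate of $E_t$ over a batch of states are assumptions.

```python
import math
from typing import List


def kl_categorical(p: List[float], q: List[float]) -> float:
    """KL[p, q] = sum_i p_i * log(p_i / q_i) for two categorical distributions."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # q assigns zero mass where p does not
        total += p_i * math.log(p_i / q_i)
    return total


def within_trust_region(
    old_probs: List[List[float]], new_probs: List[List[float]], delta: float
) -> bool:
    """Check the mean KL over sampled states against the bound delta."""
    kls = [kl_categorical(p, q) for p, q in zip(old_probs, new_probs)]
    return sum(kls) / len(kls) <= delta


if __name__ == "__main__":
    old = [[0.5, 0.5]]
    new = [[0.25, 0.75]]
    print(kl_categorical(old[0], new[0]))
    print(within_trust_region(old, new, delta=0.2))
```

Note the argument order: the old policy comes first, matching equation (4), and the KL divergence is not symmetric.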
Optimizing under the constraint of equation (4) is relatively complex and is not compatible with architectures that share parameters between the policy function and the value function, or with auxiliary tasks [19]. The policy and the value function of a3cNav share the same network, and a3cNav has auxiliary depth prediction. Therefore, TRPO cannot be applied to a3cNav.
PPO [19] improves TRPO with only first-order optimization and replaces the constraint with the clipped surrogate objective of equation (5):

$$E_t\left[\min\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t,\ \mathrm{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}, 1 - \epsilon, 1 + \epsilon\right) A_t\right)\right], \tag{5}$$

where $\epsilon$ is the clipping parameter. Hence, PPO is a first-order optimization method and is compatible with parameter sharing and auxiliary tasks.

Figure 1: a3cNav architecture. In the architecture, image $x_t$ is the input of a3cNav, and following the fully connected layer is a two-layer CNN which outputs depth $D_1$ as well as a two-layer stacked LSTM which outputs depth $D_2$, policy $\pi$, and value $V$. In addition, auxiliary tasks are used in this architecture, in which the first LSTM receives only the reward, while the velocity and the previously selected action are fed into the second LSTM.
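The clipped surrogate objective of PPO [19] can be sketched over a batch of probability ratios and advantage estimates. The function name, the sample-mean estimate of $E_t$, and the example value $\epsilon = 0.2$ are illustrative assumptions.

```python
from typing import List


def clipped_surrogate(
    ratios: List[float], advantages: List[float], epsilon: float
) -> float:
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratios` holds pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) for each sample.
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Taking the minimum keeps the pessimistic (lower) estimate.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


if __name__ == "__main__":
    # A ratio far above 1 with a positive advantage is clipped at 1 + eps,
    # so the objective gives no incentive to move the policy further.
    print(clipped_surrogate([1.5], [1.0], epsilon=0.2))
    # A ratio far above 1 with a negative advantage is not clipped,
    # so the penalty for a harmful move is kept in full.
    print(clipped_surrogate([1.5], [-1.0], epsilon=0.2))
```

Only first-order gradients of this scalar are needed, which is what makes the objective easy to combine with shared networks and auxiliary losses.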

appoNav.
To make the navigation policy improve monotonically, we incorporate the features of PPO into the local worker of a3cNav. In each thread, the improved local policy tends to improve monotonically. The new local gradients are then applied to the global network, so that the whole network improves monotonically. As the navigation method is based on the monotonic policy improvement of PPO, we call it appoNav.
Assume that the global network has the shared parameter vector $\theta$ and each local worker has the parameter vector $\theta'$. Equation (6) is the policy optimization loss of A3C [18]. When the clipped surrogate objective is added to the local worker of a3cNav, the loss function takes the form of equation (5) with the entropy of the policy, rewritten for the local workers as equation (7). Equation (7) is the policy update of the local worker of a3cNav, that is, appoNav. Each local worker has a lower variance than before and applies the new gradient to the global network for the policy update. Finally, the whole policy generated by appoNav has lower variance and more stable training performance.
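For reference, one plausible way to write the two objectives discussed above is sketched here, assuming the standard A3C policy objective [18] and the clipped objective of equation (5). The entropy weight $\beta$, the symbol $\theta'_{old}$ (the local worker parameters before the update), and the exact arrangement of terms are assumptions and may differ from the original equations (6) and (7).

```latex
% Assumed form of the A3C policy objective for a local worker (cf. equation (6)).
L^{A3C}(\theta') = E_t\left[ \log \pi(a_t \mid s_t; \theta')\, A_t
    + \beta\, H\big(\pi(s_t; \theta')\big) \right]

% Assumed form of the appoNav local objective (cf. equation (7)):
% the clipped surrogate of equation (5) plus the policy entropy.
L^{appoNav}(\theta') = E_t\left[ \min\Big(
      \frac{\pi(a_t \mid s_t; \theta')}{\pi(a_t \mid s_t; \theta'_{old})} A_t,\
      \mathrm{clip}\Big(\frac{\pi(a_t \mid s_t; \theta')}{\pi(a_t \mid s_t; \theta'_{old})},
                        1 - \epsilon, 1 + \epsilon\Big) A_t \Big)
    + \beta\, H\big(\pi(s_t; \theta')\big) \right]
```

Both expressions are written as quantities to be maximized by gradient ascent, consistent with the gradient accumulation used by the local workers.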

Experimental Settings.
We implement our algorithm in TensorFlow and train it on an Nvidia GeForce GTX Titan X GPU and an Intel Xeon E5-2687W v2@3.4GHz * 17 CPU. The proposed method is evaluated in DeepMind Lab environments [20]. The action space in DeepMind Lab has 8 actions: the agent can rotate in small increments, accelerate forward, backward, or sideways, or induce rotational acceleration while moving. The reward encourages the agent to learn navigation; a reward is obtained when the artificial agent reaches a goal from a random start location and orientation. If the agent reaches the goal, a new episode starts, and the same interaction restarts. Fruit represents the reward in DeepMind Lab: apples are worth 1 point, strawberries 2 points, and goals 10 points.
appoNav is evaluated by training the agent in stairway_to_melon and nav_maze_static_01 of DeepMind Lab. For ease of expression, we refer to stairway_to_melon as the stairway maze and nav_maze_static_01 as the static01 maze. In each case, the blue curve stands for a3cNav and the orange curve for appoNav. For experimental analysis, we run 2500 episodes for the stairway maze and 7800 episodes for the static01 maze. Table 1 shows the images that the artificial agent sees in the stairway maze; we randomly select 3 episodes and sample them from time 600 to 2500 at intervals of 100, which shows three different states at the same time in different episodes. The artificial agents receive different images and do not get stuck in one place, which demonstrates that the agents learn to navigate in the stairway maze. Figure 2 shows the reward achieved by the artificial agent in stairway_to_melon; it shows that appoNav gets higher reward than a3cNav. In addition, we calculate the standard deviation (std) of the reward curve. From Table 2, the reward std of appoNav and a3cNav is 27.24 and 30.16, respectively; this shows that the learning process of the former is more stable than that of the latter. The reason why our method converges faster is that the local worker of appoNav can generate a more stable policy with monotonic improvement when it interacts with the stairway maze. During the training iterations, improved accumulated gradients are applied for the parameter update of appoNav, which makes appoNav more stable than a3cNav.

For further verification of appoNav's effectiveness, we test our agent in the static01 maze, which is more complex than the stairway maze. Because the agent needs more time to converge, we randomly select 3 episodes and sample them from time 1000 to 4800 at intervals of 200, as shown in Table 3. Figure 3 shows the reward achieved by the artificial agent in nav_maze_static_01; it demonstrates that appoNav performs better than a3cNav and obtains higher reward.
Table 4 shows that the reward std of a3cNav is 28.99 and that of appoNav is 24.79. The policy learnt by appoNav is therefore more stable than the policy learnt by a3cNav.
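The stability comparison above rests on the standard deviation of a reward curve. A minimal sketch of that computation follows; the function name `reward_std`, the sample rewards, and the choice of the population standard deviation (rather than the sample one) are assumptions, since the exact estimator is not stated in the text.

```python
import statistics
from typing import List


def reward_std(episode_rewards: List[float]) -> float:
    """Population standard deviation of a reward curve (assumed estimator)."""
    return statistics.pstdev(episode_rewards)


if __name__ == "__main__":
    # Hypothetical reward curves, not the paper's data.
    steady = [40.0, 42.0, 41.0, 43.0, 44.0]
    noisy = [20.0, 60.0, 30.0, 70.0, 30.0]
    # Same mean reward, but the second curve fluctuates far more.
    print(statistics.mean(steady), reward_std(steady))
    print(statistics.mean(noisy), reward_std(noisy))
```

A lower value indicates that the reward fluctuates less between episodes, but it should be read together with the mean reward, since a policy that never improves would also have a small spread.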

Experimental Results and Analysis.
Because appoNav uses better gradient ascent updates for each policy, the artificial agent with appoNav learns a stronger navigation ability, as each local worker produces a more stable policy in the complex maze.

Conclusion
Visual navigation methods based on vanilla policy gradients suffer from high variance and instability during training, where the navigation performance fluctuates greatly between iterations. We analyze why visual navigation suffers from this issue and improve its policy update to guarantee monotonic policy improvement. The improved method, appoNav, has lower reward standard deviation and obtains higher reward. In short, appoNav can learn a better navigation policy.

Data Availability
The raw data required to reproduce these findings are available to download from https://github.com/deepmind/lab.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.