Fast Transfer Navigation for Autonomous Robots

Navigation technology enables indoor robots to arrive at their destinations safely. However, the variety of interior environments makes robotic navigation difficult and hurts its performance. This paper proposes a transfer navigation algorithm and improves its generalization by leveraging deep reinforcement learning and a self-attention module. To simulate unfurnished indoor environments, we build the virtual indoor navigation (VIN) environment and use it to compare our model with its competitors. In the VIN environment, our method outperforms other algorithms when adapting to an unseen indoor environment. The code of the proposed model and the virtual indoor navigation environment will be released.


Introduction
Autonomous navigation is a key capability of an intelligent robot, enabling it to reach its destination along an optimal route. It plays an important role in many fields, e.g., mobile robotics [1], unmanned aerial vehicles [2], and human-machine interaction [3].
In recent years, a shortage of skilled labor has promoted the development of social robots for home renovation. Therefore, this paper focuses on the indoor navigation of these robots, especially its efficiency. The methods for autonomous navigation can be classified into model-based approaches [4][5][6] and model-free approaches [7].
Model-based methods [4][5][6] explicitly implement robot localization and map construction for navigation. Cui et al. [4] modeled the path of the vehicle as line segments and proposed to simultaneously estimate the positions of the global positioning system receiver and the parameters of the lines. Yoon and Raychowdhury [5] designed NeuroSLAM to mimic place cells and head direction cells in a rodent brain for robot localization. Engel et al. [6] implemented large-scale direct monocular SLAM by operating on image intensities for both tracking and mapping. The application of radar sensors for situational awareness leads to high hardware cost. In addition, iterative localization and mapping inevitably accumulate error, which hurts the navigation performance of intelligent robots. Given the positions of a robot and a destination, planning algorithms [8] can find the optimal path between them, and dynamic control [9,10] is then used to implement robotic motion and navigation.
Model-free methods adopt deep reinforcement learning (DRL) [11] to model the intelligent agent, where the interaction between the agent and environment is formulated as a Markov decision process (MDP).
In our previous work [12], we proposed a tutor-student network for transfer navigation. Specifically, the tutor module must be pretrained in a maze to guide the student module to navigate in an unseen maze. However, this two-stage training process inevitably incurs a high computational cost. In this paper, we employ an attention mechanism to enhance the feature representation in the tutor module and facilitate training for fast transfer navigation. In addition, a comprehensive navigation platform, virtual indoor navigation (VIN), is built on ViZDoom [13] to simulate the indoor scenes of rough (unfurnished) houses. To our knowledge, it is the first 3D virtual environment for robotic navigation in rough houses. In summary, our contributions are threefold:
(1) We propose an attention-based tutor-student model (ATS) for fast transfer navigation. We incorporate a self-attention module into the tutor-student network to enhance the feature representation and facilitate the training of robots.
(2) We build the first 3D rough indoor environment for robotic navigation.
(3) We evaluate the proposed model in VIN and validate its effectiveness on fast transfer navigation.

Model-Free Navigation.
With its rapid development, deep learning has been combined with reinforcement learning to achieve model-free navigation. A Siamese actor-critic network [14] was proposed for goal-driven visual navigation by introducing the features of a goal. In addition to the goal, Mirowski et al. [15] leveraged multimodal sensory inputs for depth prediction and loop closure classification tasks. Mirowski et al. [16] employed a dual pathway structure to combine locale-specific knowledge with general policies for navigating in multiple cities. Wang et al. [17] proposed a reinforced cross-modal matching (RCM) approach that enforces cross-modal grounding via reinforcement learning and introduced a self-supervised imitation learning (SIL) method that encourages an agent to imitate its own past good decisions to improve its generalization. Mousavian et al. [18] exploited an advanced detector together with semantic segmentation to develop the transfer ability of agents. Li et al. [19] employed role-playing learning formulated with DRL for robotic navigation in a socially concomitant manner. Zeng et al. [12] presented a tutor-student framework which consists of a "tutor" module and a "student" module. The "tutor" module can facilitate the training of the "student" module with its general features and knowledge.
The attention mechanism [20][21][22] has been used frequently in various fields of computer vision; it makes more efficient use of regions of interest and reduces distraction. Mnih et al. [20] adopted a recurrent network to adaptively extract features from a sequence of regions at high resolution. Google [21] proposed the Transformer, an attention-based network architecture whose self-attention modules replace the recurrent layers in the encoder and decoder of traditional translation methods. It can greatly improve the efficiency of the model and outperforms earlier methods. Hu et al. [22] designed a squeeze-and-excitation block to weight the channel-wise features, which enables the model to focus on the significant features. Besides channel-wise attention, Yu et al. [23] introduced conditional pixel-wise attention to improve the performance of naive Lite-HRNet by means of an element-wise weighting operation.

Formulation.
Robotic navigation is formulated as an MDP, in which an agent modeled with deep reinforcement learning interacts with the environment, as shown in Figure 1.
Specifically, the agent takes an action after observing the current state of the environment and then receives a reward. In visual navigation, the state is an RGB image observed by the agent, and the reward indicates whether the agent has arrived at its destination. The agent aims to maximize the accumulated reward at the lowest time cost by learning an optimal policy. Considering the distribution of all potential actions, the accumulated reward R is formulated as the mathematical expectation of all possible rewards under the actions taken.
where θ represents the learnable parameters of policy π. It can be optimized iteratively to maximize the expected accumulated reward R by the policy gradient algorithm.
where π(a_t | s_t; θ_{k−1}) is the probability of the selected action a_t under state s_t following the policy π and α is the learning rate. Advantage Actor-Critic (A2C) [24], a typical policy gradient method, is used to learn this optimal policy for robotic navigation. The A2C network contains two head blocks, i.e., an actor block and a critic block. The critic block estimates the accumulated reward, and the actor block predicts the optimal action for navigation.
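The two displayed expressions that the "where" clauses in this section refer to can be written in the standard policy gradient form. The following is a sketch consistent with the symbols defined in the text, not necessarily the authors' exact notation; the per-step reward r_t and the horizon T are symbols assumed here for illustration.

```latex
% Expected accumulated reward under policy \pi with parameters \theta
% (r_t and T are assumed symbols, not taken from the text).
R(\theta) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t;\, \theta)} \left[ \sum_{t=0}^{T} r_t \right]

% One policy gradient ascent step with learning rate \alpha.
\theta_k = \theta_{k-1} + \alpha \, \nabla_{\theta} \log \pi(a_t \mid s_t;\, \theta_{k-1}) \, R
```

In actor-critic methods such as A2C, the return R in the update is typically replaced by an advantage estimate, i.e., the return minus the critic's value prediction, which reduces the variance of the gradient.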

Attention-Based Tutor-Student Net.
We propose an attention-based tutor-student net to build the agent, which is composed of a tutor module and a student module. The tutor module, trained beforehand in other mazes, extracts general features for navigating in an unseen maze. The student module extracts locale-specific features and feeds them, together with the general features, into the actor-critic network (Figure 1). Training the tutor module beforehand inevitably increases the time cost. To mitigate this problem, we design a pixel-wise attention module in the tutor module to enhance the representation of general features.
Consecutive frames contain much similar and redundant information. Therefore, every four RGB images are converted into gray images and stacked as a four-channel input tensor in data preprocessing. Feature maps are extracted from this input tensor by feature extraction modules which consist of four convolution layers (see Figure 2).
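The preprocessing step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names rgb_to_gray and stack_frames are hypothetical, and the ITU-R BT.601 luma weights and the scaling to [0, 1] are assumptions, since the text does not say how the gray conversion is done.

```python
import numpy as np


def rgb_to_gray(frame):
    """Convert an H x W x 3 uint8 RGB frame to an H x W float32 gray image in [0, 1].

    The BT.601 luma weights are an assumption; the paper does not specify them.
    """
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return (frame.astype(np.float32) @ weights) / 255.0


def stack_frames(frames):
    """Stack consecutive RGB frames into a (num_frames, H, W) gray input tensor."""
    return np.stack([rgb_to_gray(f) for f in frames], axis=0)


if __name__ == "__main__":
    # Four dummy frames standing in for consecutive first-person observations.
    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8) for _ in range(4)]
    state = stack_frames(frames)
    print(state.shape)  # channels-first tensor with one channel per frame
```

Stacking along the first axis gives a channels-first tensor; a framework that expects channels-last input would stack along the last axis instead.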
In the tutor module, we take advantage of the attention mechanism to capture the dependencies within those features and enhance the feature representations. Since in-plane information plays an important role in the performance of an agent, we emphasize the in-plane relationships. Specifically, the C × H × W feature maps from the feature extraction block are separated into H · W pixel-wise embeddings along the channel axis, as shown in Figure 3. Each embedding X is projected to a query vector Q, a key vector K, and a value vector V through mapping matrices W_Q, W_K, and W_V, respectively. For the ith embedding X_i, the distribution A_i of query Q_i measures the similarity between Q_i and all key vectors,
A_i = softmax(Q_i K^T / √d),
where d is the dimension of the key. After obtaining the attention matrix A, we generate a feature representation Z = AV by weighting the value vectors adaptively according to A. Essentially, the attention block aggregates similar features globally, which can effectively alleviate the local dependency of convolutional operations.
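The pixel-wise attention described in this section can be sketched in NumPy as below. It assumes the standard scaled dot-product form, in which each row of the attention matrix is a softmax over query-key similarities divided by the square root of the key dimension d. The function name pixelwise_self_attention, the projection sizes, and the random projection matrices are illustrative assumptions, not the authors' trained TensorFlow layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pixelwise_self_attention(features, w_q, w_k, w_v):
    """Apply self-attention over the H*W pixel embeddings of a C x H x W map.

    features: (C, H, W) feature maps.
    w_q, w_k: (C, d) projections; w_v: (C, d_v) projection.
    Returns Z reshaped to (d_v, H, W) and the (H*W, H*W) attention matrix A.
    """
    c, h, w = features.shape
    x = features.reshape(c, h * w).T           # one C-dimensional embedding per pixel
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    d = k.shape[-1]                            # key dimension
    a = softmax(q @ k.T / np.sqrt(d), axis=-1) # row i: similarity of Q_i to all keys
    z = a @ v                                  # attention-weighted sum of values
    return z.T.reshape(-1, h, w), a


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((8, 5, 6))     # hypothetical C=8, H=5, W=6
    w_q = rng.standard_normal((8, 4))
    w_k = rng.standard_normal((8, 4))
    w_v = rng.standard_normal((8, 8))
    z, a = pixelwise_self_attention(feats, w_q, w_k, w_v)
    print(z.shape, a.shape)
```

Because the attention matrix has (H · W)² entries, memory grows quadratically with the spatial size of the feature maps, which is one reason to apply such a block after the convolution layers have reduced the resolution.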

Virtual Environment
Currently, there is no platform that contains various unfurnished rooms for robotic navigation. To verify the effectiveness of the proposed algorithm, a virtual unfurnished indoor environment is built on ViZDoom [13]. Specifically, to construct the building environment, a floor plan dataset is collected online by web crawling. It consists of 4 parts: a single-bedroom set, a two-bedroom set, a three-bedroom set, and a four-bedroom set. Each set includes 20 plans with various shapes and scales. Figure 4 shows four randomly selected examples from each part. For each floor plan, a 3D navigation environment is built using a generation tool. Finally, we obtain an indoor navigation platform that includes 80 different 3D scenes.
In the 3D navigation environment, an agent observes the unfurnished room through a first-person camera, as shown in Figure 5. The ceiling and floor are earth yellow, and the walls are covered with a brick texture. In addition, a red cylinder represents the destination. The agent reaches the destination by moving forward, turning left, or turning right.
Besides the first-person view image, RGB-D images and label images can be captured in our platform. In addition, an agent can obtain its location and orientation as auxiliary information.
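The action interface described above (move forward, turn left, turn right, with location and orientation available as auxiliary information) can be illustrated with a toy 2D stand-in. This sketch does not use the ViZDoom API and is not the VIN platform: the class ToyFloorPlanEnv, the grid encoding, and the sparse reward of 1.0 on arrival are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Headings as (row, col) offsets: north, east, south, west.
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]
ACTIONS = ("forward", "turn_left", "turn_right")


@dataclass
class ToyFloorPlanEnv:
    """Hypothetical grid world: '#' is a wall, any other character is free space.

    The grid is assumed to be fully enclosed by walls.
    """
    grid: list          # list of equal-length strings
    start: tuple        # (row, col) of the agent at reset
    goal: tuple         # (row, col) of the destination
    heading: int = 0    # index into HEADINGS, 0 = north

    def __post_init__(self):
        self.pos = self.start

    def step(self, action):
        """Apply one action and return ((position, heading), reward, done)."""
        if action == "turn_left":
            self.heading = (self.heading - 1) % 4
        elif action == "turn_right":
            self.heading = (self.heading + 1) % 4
        elif action == "forward":
            dr, dc = HEADINGS[self.heading]
            r, c = self.pos[0] + dr, self.pos[1] + dc
            if self.grid[r][c] != "#":   # walls block movement
                self.pos = (r, c)
        else:
            raise ValueError(f"unknown action: {action}")
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0    # assumed sparse arrival reward
        return (self.pos, self.heading), reward, done


if __name__ == "__main__":
    env = ToyFloorPlanEnv(grid=["####", "#..#", "####"], start=(1, 1), goal=(1, 2))
    for action in ("turn_right", "forward"):
        print(action, env.step(action))
```

In the actual platform the observation is a first-person image rather than a grid position; the toy version only mirrors the three-action control scheme and the arrival signal.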

Experiments
The navigation task of a decoration robot is to reach a specified target location along an optimal route from any starting location within a single decoration scene. In this paper, we test the proposed self-attention-based algorithm in an interior decoration environment built on ViZDoom. Specifically, we evaluate our model in four different house types, i.e., the single-bedroom, two-bedroom, three-bedroom, and four-bedroom settings. Such experiments are effective in testing our model's ability to navigate across scenes.

Experiment Configuration.
At the beginning of each navigation task, the agent randomly initializes its position. This effectively prevents the agent from overfitting to a single navigation route and improves the generalization performance of the model. In each decoration scene, the agent is trained for 100 episodes of 50000 steps each. Considering that the input sequential images contain a lot of redundant information, we use three sequential frames as a state tensor with an image size of 640 × 480. Our model is implemented in TensorFlow [25], optimized with the Adam optimizer, and trained on an Nvidia GeForce GTX 1080 Ti GPU.

Experiment Results.
To validate the effectiveness of our proposed model, we compare it with the A2C and tutor-student models under the same experiment configuration. The tutor-student model is implemented in a two-stage manner. Therefore, we present the experiment results in the following two sections.

Training from Scratch.
The ATS, tutor-student, and A2C models are trained from scratch in four mazes with different styles, i.e., a one-bedroom maze, a two-bedroom maze, a three-bedroom maze, and a four-bedroom maze. Their reward curves during the training process are shown in Figure 6. In all mazes, the reward curves of ATS not only rise and converge much earlier than those of the other two models but also reach higher rewards, especially in the two-bedroom (see Figure 6(b)), three-bedroom (see Figure 6(c)), and four-bedroom mazes (see Figure 6(d)). Except in the single-bedroom maze (see Figure 6(a)), the tutor-student model performs similarly to A2C. The advantage of the ATS model comes from the self-attention module, which effectively facilitates the agent's training process.

Transfer Navigation.
To evaluate the performance of the three models in a transfer navigation task, we also select one maze for testing from each part in VIN. As in the tutor-student network, the tutor module of ATS is trained in the two-bedroom and three-bedroom mazes and fine-tuned in the one-bedroom and four-bedroom mazes for transfer navigation. To illustrate their advantage in robotic navigation, A2C is trained from scratch in the one-bedroom and four-bedroom mazes as a baseline. Figure 7 shows that ATS outperforms the tutor-student network and the A2C model in the transfer navigation task. Specifically, in all mazes, the average reward curves of vanilla A2C start to rise at almost 40M steps and converge at 140M steps, while the tutor-student model improves its navigation ability and arrives at the destination earlier than A2C, which is evident for transfer navigation in Figure 7(a). Because the structure of the one-bedroom maze is simpler than that of the four-bedroom maze, the average reward curves of ATS and the tutor-student network converge towards similar average rewards in Figures 7(a) and 7(c). However, in the four-bedroom maze with a complex structure, ATS achieves higher rewards than the tutor-student network and the A2C model, which benefits from the attention module.

Conclusion
Indoor navigation plays an important role in robot movement. The state-of-the-art tutor-student model achieves transfer navigation in two steps, which greatly increases the time cost and decreases its efficiency. Our paper proposes an ATS model to facilitate its training process for fast transfer navigation. Specifically, a self-attention network is used to reweight the extracted features and focus on the task-aware features in the tutor module. To evaluate our model in unfurnished indoor navigation, we build VIN, a virtual navigation platform based on the Doom engine. The experiment results show that the self-attention module effectively speeds up the training process and benefits the fast transfer navigation of the proposed ATS model.

Data Availability
The training data used to support the findings of this study have been deposited in the GitHub repository (https://github.com/WangChen100/vizdoomgymmaze).