Incremental Instance-Oriented 3D Semantic Mapping via RGB-D Cameras for Unknown Indoor Scene



Introduction
Robot vision plays an important role in the development of the artificial intelligence industry. With the aid of RGB-D cameras (such as Kinect), robots can "see" and analyze the surrounding environment easily. How to make robots accurately and rapidly perceive the meaning of objects in real-world environments without prior knowledge is therefore one of the most important problems in the robotics community. For tasks such as path planning, object grasping, or even autonomous driving, we need not only a semantic understanding of a single object but, more importantly, the spatial relationships and layout among individual instances in a 3D environment. This leads to a demand for high-level instance-oriented representations of the scene, which would greatly advance human-robot interaction. Hence, progressively building a semantic instance-level 3D map of indoor scenes from multiview RGB-D images has long been a major goal for researchers. The conventional methods of constructing object-aware semantic maps generally consist of two inseparable aspects: instance segmentation of 3D images and transformation across multiple views. The former focuses on obtaining semantic information via convolutional neural networks [1][2][3][4][5][6], followed by a geometric segmentation approach to label the 3D objects in the scene. The latter usually carries out simultaneous localization and mapping (SLAM) [7][8][9], which completes 3D scene reconstruction using RGB-D cameras. Motivated by these technologies, several works combine them to generate a semantically segmented 3D map [10][11][12] and have achieved impressive results. However, such methods suffer from oversegmentation or lack a proper data association strategy, and they are computationally inefficient, making them unsuitable for real-time applications.
Some other works focus on processing large-scale video retrieval [13][14][15], but they mainly deal with the entire scene.
This paper aims to incrementally build instance-oriented semantic 3D maps via RGB-D cameras in real time. Without the need for prior knowledge, the proposed mapping system contains optimized semantic information about the individual object instances in the scene and, meanwhile, integrates semantic probabilities from multiple viewpoints into a globally consistent 3D semantic map. The entire algorithm is carried out in three steps. First, RGB images captured by the camera are passed through the Mask R-CNN [1] algorithm to generate 2D instance and class predictions. In the second step, the proposed system associates the prediction results online with the corresponding point cloud produced by the SLAM system. To improve the instance accuracy, we use a Gaussian mixture model with the EM algorithm to cluster and optimize the semantic labels predicted by the convolutional neural network. In the last step, we propose a voxel-based Bayesian update strategy for incremental class updates across different frames, which is incorporated into the truncated signed distance function (TSDF)-based reconstruction maps to improve computational efficiency and reduce time complexity. The major difference between our system and other works [10,16,17] is that we directly use the projection relation between voxels and pixels to obtain instance semantics in the 3D map, instead of combining geometric segmentation on depth images with 2D instance segmentation. Doing so helps avoid oversegmentation without increasing computation. Moreover, our goal is to build an instance-level indoor map consisting of reconstructed object instances with semantic annotations. So, unlike many other dense reconstruction works [18][19][20] that pursue accurate instance segmentation, the proposed approach aims to achieve real-time performance, facilitating real-life robotic applications.
To sum up, the main contributions of this work are as follows:
(i) A novel incremental instance-oriented mapping system that uses an RGB-D camera to obtain sequential images and represents them as a TSDF-based voxel map
(ii) An optimization method based on a Gaussian mixture model that clusters the point cloud, which is then integrated into TSDF volumes that contain semantic class and instance IDs
(iii) A voxel-based Bayesian update strategy that tracks and updates the class probability distribution across different frames to perform consistent global scene mapping
(iv) Qualitative and quantitative analysis of the proposed system on the SceneNN [21] dataset in multiple scenarios

Related Works
2.1. Dense 3D Scene Reconstruction. We can roughly divide 3D reconstruction technologies based on RGB-D images into three categories: feature-based methods, voxel-based methods, and surfel-based methods. Feature-based methods, in general, involve front-end frame-to-frame motion through feature matching and back-end "loop closing" constraints from a heuristic search to perform pose graph optimization.
The first popular open-source system was RGB-D SLAM [22], proposed by Endres et al. Subsequent similar methods include DVO-SLAM by Kerl et al. [23] and ORB-SLAM2 by Mur-Artal and Tardos [24]. Although such methods directly consume the point cloud, they can cause incomplete instance segmentation in object-level mapping tasks. Voxel-based methods, such as [8,25,26], integrate all depth data from the sensor into a volumetric model of the 3D space and use the iterative closest point (ICP) algorithm to track camera poses and reconstruct dense 3D scene maps.

Semantic Instance-Aware Mapping. Previous methods have addressed the task of mapping at the level of individual objects. Civera et al. [27] used a monocular SLAM system to create 3D environment maps and then inserted modeled objects from a prebuilt database. Similarly, Pavel et al. [28] also required a priori 3D object models. Although these methods perform object-oriented semantic mapping, the requirement for a priori knowledge of the modeled objects makes them difficult to apply in real-time human-robot interaction.
Recent developments in deep learning have also enabled the integration of rich semantic information within real-time simultaneous localization and mapping (SLAM) systems. The work in [11] fuses semantic predictions from a CNN into a dense map built with a SLAM framework. However, conventional semantic segmentation is unaware of object instances, i.e., it does not disambiguate between individual instances that belong to the same category. Thus, the approach in [11] does not provide any information about the geometry and relative placement of individual objects in the scene. A number of other works have addressed the task of detecting and segmenting individual semantically meaningful objects in 3D scenes without predefined shape templates [10,16,17,27,29,30,31,32,33,34]. Runz et al. [32] employed an object detector as the first step and then updated the class probabilities of each element of the reconstructed 3D map. Because of the high time complexity, they suggested extracting semantic information on only a subset of the input frames. McCormac et al. [29] used the same prediction model but aimed at extending the SLAM system by means of object-level pose graph optimization and relocalization. The works in [16,17] are similar, but they employ depth segmentation methods to segment 3D instances, which leads them to take different approaches and reach different goals. The work in [22] proposes an object-oriented mapping system that combines a Single Shot MultiBox Detector (SSD) [6] with ORB-SLAM2 [24]. There are also several object-oriented dense 3D mapping methods [30,31], the main idea of which is to obtain 2D semantic information with a CNN framework, associate the 2D semantics with the 3D map, and then use conditional random fields (CRFs) as a postprocessing step to refine the semantic segmentation results. Another project worth mentioning is [35]. Although it also combines a CNN and SLAM to generate a 3D semantic map, it adds a recurrent neural network (RNN) [28] for data association.

Instance Detection and Segmentation.
Nowadays, with the rapid development of convolutional neural networks, semantic tasks in real-world environments have shown remarkable results. Starting from object detection [3,28] in RGB images, Mask R-CNN soon followed; it additionally predicts a per-pixel semantically annotated mask for each detected instance, achieving state-of-the-art results on the COCO [36] instance-level semantic segmentation task. Other similar works worth mentioning include YOLO [5] and SSD [6], which deliver outstanding performance in accurately detecting object instances. With the help of 2D semantic information, we explore semantic objects in 3D environments.

Materials and Methods
The architecture of our system is shown in Figure 1. Each RGB image from the incoming video stream is processed with the Mask R-CNN framework to produce semantically annotated segmentation masks. Together with the corresponding depth image, these masks are converted into a labeled point cloud by projecting between coordinate frames, followed by an optimization step that uses a Gaussian mixture model (GMM) to obtain more accurate instance labels. Next, we employ a voxel-based Bayesian update method to merge class semantics and instance IDs across different frames. Finally, we complete the construction of an incremental instance-oriented semantic map. Details of the proposed system are discussed in the following sections.

Semantic Instance Segmentation Method.
In order to annotate and segment the 3D instances in the scene, we need to combine the 3D point cloud with its corresponding semantic class distribution and instance IDs. To label objects, we first apply Mask R-CNN as an object detector to the input image. Mask R-CNN achieves real-time performance while showing high accuracy on computer vision benchmarks, including the Microsoft COCO dataset [37] and the Pascal VOC collection of datasets [38]. Given the input image $I_k$, Mask R-CNN provides a set of bounding boxes $b_i$, $i \in \mathbb{N}$, $1 \le i \le M$, and class probabilities $P(c_i \mid I_k) \in \mathbb{R}$ are assigned to each bounding box, where $M \in \mathbb{R}^{100 \times 15 \times 15}$ is the number of bounding boxes and $c \in \mathbb{R}^{100}$ is the class category. Note that, although there is a good deal of related research, we chose Mask R-CNN for this task because of its stability and its ability to obtain good results on different datasets. Our system can in principle accommodate another similar network when faster inference or higher accuracy is required.
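As an illustration of how the detector output could be turned into a single labeled image for the later 2D-3D association, the sketch below paints per-instance boolean masks into one instance-ID image. The function name, the array layout, and the rule that higher-scoring masks win where masks overlap are assumptions made for this example; the paper does not specify them.

```python
import numpy as np

def masks_to_instance_image(masks, scores):
    """Paint (n, H, W) boolean masks into one (H, W) instance-ID image.

    ID 0 means "no instance"; mask j gets ID j + 1. Where masks overlap,
    the mask with the higher score wins (an assumed tie-breaking rule).
    """
    masks = np.asarray(masks, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    instance_image = np.zeros(masks.shape[1:], dtype=np.int32)
    # Paint low-score masks first so higher-score masks overwrite them.
    for j in np.argsort(scores):
        instance_image[masks[j]] = j + 1
    return instance_image
```

With two overlapping masks, the shared pixels end up carrying the ID of the more confident detection, and unmasked pixels stay at 0.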

Incremental 3D Semantic Instance-Oriented Update
3.2.1. 2D-3D Association with Semantic Information. One requirement of the proposed system is to know the camera pose in the target scene. In view of real-time requirements and computing costs, we chose voxel hashing [9] as our SLAM system. This takes advantage of volumetric approaches to achieve a dense surface representation while using spatial hashing techniques to avoid memory overhead.
The proposed system takes both RGB and depth information as input and incrementally projects them into a single 3D model to achieve volumetric reconstruction. For each arriving RGB-D frame, the 6-DoF camera pose is estimated by combining ICP [36] and RGB alignment and is denoted as $T_{WC} \in SE(3)$, where W represents the world coordinate frame and C represents the camera coordinate frame. Then, we employ the homogeneous transformation matrix $T_{WC}^{-1}(k) = T_{CW}(k)$ to transform from the world coordinate frame to the camera coordinate frame. In our case, instead of integrating the original incoming RGB image, the proposed system takes the semantic image $I_k$ produced by Mask R-CNN as input, along with the corresponding depth image $D_k$, and then generates the 3D reconstruction with the estimated camera pose. In this way, the initial point cloud with instance IDs is generated.
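The association between a labeled image and 3D points can be sketched with a plain pinhole back-projection. This is a minimal illustration, not the voxel hashing pipeline itself: the function name, the argument layout, and the convention that a depth of 0 marks an invalid pixel are assumptions for the example.

```python
import numpy as np

def backproject_labeled_depth(depth, labels, K, T_WC):
    """Back-project a depth image into world-frame points with labels.

    depth  : (H, W) depth in metres, 0 where invalid (assumed convention)
    labels : (H, W) integer instance IDs taken from the 2D masks
    K      : (3, 3) camera intrinsic matrix
    T_WC   : (4, 4) camera-to-world pose
    Returns (N, 3) world points and (N,) labels for the valid pixels.
    """
    depth = np.asarray(depth, dtype=float)
    labels = np.asarray(labels)
    K = np.asarray(K, dtype=float)
    T_WC = np.asarray(T_WC, dtype=float)
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel column, row
    valid = depth > 0
    # Homogeneous pixel coordinates (u, v, 1) of valid pixels, shape (3, N).
    pix = np.stack([u[valid], v[valid], np.ones(valid.sum())])
    # Camera-frame points: depth * K^-1 * pixel.
    pts_c = (np.linalg.inv(K) @ pix) * depth[valid]
    # World-frame points: rotate and translate with the camera pose.
    pts_w = T_WC[:3, :3] @ pts_c + T_WC[:3, 3:4]
    return pts_w.T, labels[valid]
```

Each returned point keeps the instance ID of the pixel it came from, which is the form of labeled point cloud that the refinement step operates on.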

Instance Refinement via the Gaussian Mixture Model.
After the rough 2D-3D data association of the SLAM system, point cloud instances are initially formed, but some points are falsely matched during the projection process. In order to obtain a more accurate object representation, we optimize the objects by formulating an accelerated generative model in the form of a GMM with a highly parallel hierarchical expectation-maximization (EM) algorithm, inspired by [39]. Alternative clustering approaches, such as the ROC algorithm [12], could also be used for this optimization. As a clustering solution for 3D point cloud data, the GMM has advantages that suit our work. First, the projected data are embedded into the covariance matrices of the GMM, which provides an effective way of processing noisy data. Second, because the storage requirements of a GMM are much lower, the system's ability to perform in real time is not affected. However, due to the computational complexity of the GMM, processing is relatively slow. Normally, a k-means algorithm would first be run on the sample set. Because our system already implements the 2D-3D association using the projection method of the SLAM system, it generates the corresponding 3D point cloud with semantic and instance annotations. This is equivalent to that processing of the sample set, and therefore, we can optimize the point cloud clusters directly with the GMM.
Discrete Dynamics in Nature and Society
(1) Model Definition. After the masks $m_j^k$ produced by Mask R-CNN are integrated with the depth map $D_k$, we obtain a corresponding point cloud $X = \{x_1, \ldots, x_N\}$ of size N. We assume that there are K classes, a number that can be altered according to the demands of different scenarios. The latent variable is $Z = \{z_1, \ldots, z_N\}$, a discrete random variable associated with the sampled point cloud X. In our case, Z indicates classes: its purpose is to index which observed variable belongs to which Gaussian distribution, and its distribution is $p(Z) = \{p_1, \ldots, p_K\}$. In our formulation, the parameters to be estimated are $\Theta = \{p_k, \mu_k, \Sigma_k\}$, where $p_k \in p(Z)$ is the class probability and $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix, respectively. Our function describing the generation of the incoming point cloud data is a linear combination of Gaussians:
$$p(x \mid \Theta) = \sum_{k=1}^{K} p_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$
with $\sum_{k=1}^{K} p_k = 1$, and the points are assumed to be independent and identically distributed (iid).
(2) Executive Parameters. In our case, we are trying to maximize the overall likelihood of a set of Gaussians producing a given point cloud. The general way to compute the maximizer of a parameter is maximum likelihood estimation, but it yields an analytical solution only for problems containing a single Gaussian distribution; for a mixture, it does not provide an analytical solution. That is why we chose to solve this problem using the EM algorithm, which employs an iterative approach to find the maximizer.
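For reference, one iteration of the textbook EM algorithm for a Gaussian mixture can be written as follows. This is the standard formulation and is not necessarily the exact hierarchical variant of [39] used in this work; the responsibility symbol $\gamma_{nk}$, the count $N_k$, and the function name $Q$ are notation introduced here for illustration.

```latex
\begin{align}
% E-step: responsibility of component k for point x_n under the current parameters
\gamma_{nk}^{(t)} &= \frac{p_k^{(t)} \, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}
                         {\sum_{j=1}^{K} p_j^{(t)} \, \mathcal{N}(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})} \\
% M-step objective: expected complete-data log-likelihood
Q(\Theta, \Theta^{(t)}) &= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)}
    \left[ \log p_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right] \\
% Closed-form maximizers of Q
N_k &= \sum_{n=1}^{N} \gamma_{nk}^{(t)}, \qquad
p_k^{(t+1)} = \frac{N_k}{N}, \qquad
\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} x_n \\
\Sigma_k^{(t+1)} &= \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)}
    \left( x_n - \mu_k^{(t+1)} \right) \left( x_n - \mu_k^{(t+1)} \right)^{\top}
\end{align}
```

The weights $p_k^{(t+1)}$ sum to one by construction, which keeps the mixture constraint satisfied at every iteration.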
Given an initial value $\Theta^{(0)}$, the E-step computes the expectation of the latent class assignments under the current parameters. In the M-step, we maximize the expected log-likelihood with respect to $\Theta$. Given a fixed set of expectations, one can solve for the optimal parameters at iteration t.
Figure 1: Overview of our incremental instance-level 3D scene reconstruction method. From continuous frames of an RGB-D sensor, our system performs on-the-fly reconstruction and 3D semantic prediction. All of our processing is performed on a frame-by-frame basis in an online fashion, thereby making it useful for real-time applications.
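The refinement idea described above, in which the projected labels take the place of a k-means initialization, can be sketched with a small NumPy EM loop. This is a plain, non-hierarchical EM written for illustration, not the accelerated variant of [39]; the function names, the iteration count, and the covariance regularization term `reg` are assumptions.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("ni,ij,nj->n", diff, inv, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def refine_labels_gmm(X, labels, n_iter=20, reg=1e-6):
    """Refine per-point instance labels with EM on a Gaussian mixture.

    Each initial label defines one component; its weight, mean and
    covariance are estimated from the points carrying that label.
    Returns the refined labels (most responsible component per point).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    n, d = X.shape
    # Initial responsibilities: one-hot encoding of the projected labels.
    resp = (labels[:, None] == ids[None, :]).astype(float)
    for _ in range(n_iter):
        # M-step: weights, means and covariances from the responsibilities.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        dens = np.empty((n, len(ids)))
        for k in range(len(ids)):
            diff = X - means[k]
            cov = (resp[:, k, None] * diff).T @ diff / nk[k]
            cov += reg * np.eye(d)  # keep the covariance invertible
            dens[:, k] = weights[k] * gaussian_pdf(X, means[k], cov)
        # E-step: posterior probability of each component for each point.
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
    return ids[np.argmax(resp, axis=1)]
```

A point that was projected onto the wrong mask but lies geometrically inside another instance's cluster receives a much higher responsibility from that cluster and is relabeled accordingly.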

Voxel-Based Bayesian Class Update Approach.
Because frame-wise segmentation processes each incoming RGB-D image pair independently, it lacks any spatiotemporal information about corresponding segments and instances across different frames. Therefore, we propose an incremental voxel-based Bayesian class update approach. Following Nießner et al. [9], given a series of RGB images $I_1, \ldots, I_k$ with semantic and instance IDs, as discussed in Section 3.2.1, and the corresponding depth images $D_1, \ldots, D_k$, the volumetric representation divides the scene into small cubes called voxels, v, each of which stores information such as location, color, and class. In order to update the class distribution of each voxel according to the classes of the pixels in the 2D images, we must first find the correspondence between voxels and pixels. This is provided by the SLAM system. For the current incoming frame $I_k$, the world coordinate of the corresponding voxel, $v_k(\vec{u})$, in the 3D map is computed by backprojection, where K denotes the intrinsic camera parameters and $\dot{u}$ denotes the homogeneous coordinate of the pixel $\vec{u}$. Each voxel is then projected onto the RGB image plane via camera projection. When a new image $I_k$ comes in, the system feeds it to Mask R-CNN, which segments n masks denoted as $m_j^k$, $j = 1, 2, \ldots, n$. Mask R-CNN outputs masks that may overlap each other, so we do not directly obtain a class distribution per pixel, as in semantic segmentation. Therefore, we update the class distribution mask by mask. With the relationship between each voxel-pixel pair given by this projection, we update the class distribution with an optimized recursive Bayesian update algorithm [11], which fits our system better. The instance probability distribution is updated in a similar way. Nonetheless, the two distributions are updated independently. We store a list of instance probabilities $P(I_v = I_i)$ for each voxel v, with I representing instance IDs. We update the instance distribution according to the segmentation result given by Mask R-CNN. The general update function for the instance distribution adopts a recursive Bayesian update scheme as well.
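The mask-by-mask update can be illustrated with a generic recursive Bayesian rule, in which each voxel's stored distribution is multiplied by the class probabilities of the mask it projects into and then renormalized. This follows the general scheme of [11] and is not the optimized variant used in the system; the function names and array layouts are assumptions, and the voxel-to-pixel coordinates are taken as already computed.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Recursive Bayesian update: posterior is prior times likelihood, normalized."""
    post = prior * likelihood
    return post / post.sum(axis=-1, keepdims=True)

def update_voxels_with_mask(voxel_probs, voxel_uv, mask, class_probs):
    """Update class distributions of voxels whose pixel lies inside one mask.

    voxel_probs : (N, C) current per-voxel class distributions
    voxel_uv    : (N, 2) integer pixel (u, v) each voxel projects to
    mask        : (H, W) boolean mask of one detected instance
    class_probs : (C,) class probabilities predicted for that mask
    """
    voxel_probs = np.asarray(voxel_probs, dtype=float)
    voxel_uv = np.asarray(voxel_uv, dtype=int)
    mask = np.asarray(mask, dtype=bool)
    class_probs = np.asarray(class_probs, dtype=float)
    H, W = mask.shape
    u, v = voxel_uv[:, 0], voxel_uv[:, 1]
    # Ignore voxels that project outside the image.
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    hit = np.zeros(len(voxel_uv), dtype=bool)
    hit[inside] = mask[v[inside], u[inside]]
    updated = voxel_probs.copy()
    updated[hit] = bayes_update(voxel_probs[hit], class_probs)
    return updated
```

Calling the function once per mask and per frame accumulates evidence over time, so a voxel observed repeatedly with the same class converges toward that class. The same function can be reused for instance IDs by storing a second, independent distribution per voxel.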

Map Integration. The 3D instance segmentation described above associates class probabilities over multiple camera views. After the voxel-based class update, every voxel's instance ID has been updated as $I_v$. For map integration, we aim to integrate the 3D semantic instances into a global volumetric map at greater speed. To this end, each clustered instance is progressively integrated into a TSDF-based voxel grid, in which the measurements from a depth map, $D_k$, are fused into a volume, V. At each discrete voxel location, $v = (v_x, v_y, v_z)$, V stores the current normalized truncated signed distance value, its associated weight, and the instance class $I_v$. We use raycasting, the main method for integrating information from sensor data into the TSDF for tracking, data association, and visualization, to render depth, normals, vertices, RGB, and object indices, as shown in Figure 2. The fusion part of our system is built on Voxblox [40], a real-time 3D reconstruction framework based on a volumetric TSDF representation. The main benefit of the Voxblox framework is that it has been extended with a label volume, which can store the instance label associated with each voxel in the TSDF grid. At each view, the set of point clouds representing the 3D objects with semantics is integrated into the voxel-based representation, and our system ensures consistency among the instance labels across different frames.
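The per-voxel bookkeeping described above (normalized truncated distance, weight, and instance label) can be sketched with the usual weighted running average for TSDF fusion. This is a generic illustration and does not use the Voxblox API; the function name, the flat array layout, the NaN convention for unobserved voxels, and the simple rule of overwriting the label near the surface are assumptions, whereas in the system the label comes from the Bayesian update.

```python
import numpy as np

def integrate_tsdf(tsdf, weight, labels, sdf_obs, label_obs, trunc=0.1, w_obs=1.0):
    """Fuse one frame of per-voxel observations into a TSDF volume.

    tsdf, weight, labels : current per-voxel state (flat arrays, same length)
    sdf_obs   : signed distance of each voxel to the observed surface (metres),
                NaN where the voxel is not observed in this frame
    label_obs : instance ID observed for each voxel in this frame
    Returns updated copies of (tsdf, weight, labels).
    """
    tsdf, weight, labels = tsdf.copy(), weight.copy(), labels.copy()
    sdf_obs = np.asarray(sdf_obs, dtype=float)
    label_obs = np.asarray(label_obs)
    # Skip unobserved voxels and voxels far behind the surface.
    seen = np.isfinite(sdf_obs)
    seen[seen] = sdf_obs[seen] > -trunc
    # Normalize and truncate the signed distance to [-1, 1].
    d = np.clip(sdf_obs[seen] / trunc, -1.0, 1.0)
    w_new = weight[seen] + w_obs
    # Weighted running average of the distance values.
    tsdf[seen] = (weight[seen] * tsdf[seen] + w_obs * d) / w_new
    weight[seen] = w_new
    # Only voxels inside the truncation band take the observed label.
    near = seen.copy()
    near[seen] = np.abs(d) < 1.0
    labels[near] = label_obs[near]
    return tsdf, weight, labels
```

Because the distance is averaged with accumulated weights, noisy single-frame measurements are smoothed out over time, while the label array stays aligned with the same voxel indices.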

Results and Discussion
We evaluated the performance of our system on an Ubuntu operating system with an Intel Core i5-6500 CPU at 3.2 GHz and an Nvidia GeForce GTX 1080 Ti GPU with 11 GB of memory. Our system is built on top of the ROS open-source middleware. The core function is implemented in Python and uses TensorFlow for instance predictions.
The Mask R-CNN uses ResNet-101 based on the publicly available implementation from Matterport Inc. [41], with the pretrained weights provided for the Microsoft COCO dataset [37]. The input stream is typically a 640 × 480 resolution RGB-D video. To demonstrate the ability to progressively build instance-aware maps per frame, we run a Mask R-CNN thread simultaneously with 3D reconstruction on every frame.
Although there are many 3D datasets [42,43] for different research purposes, we chose the SceneNN dataset [21] to evaluate the 3D object accuracy of the proposed instance-level semantic mapping system. It contains 100 indoor scenes, including offices, bedrooms, living rooms, kitchens, and scenes with repetitive objects. The SceneNN dataset also provides annotations with fine-grained information, e.g., axis-aligned bounding boxes, oriented bounding boxes, and object poses, which makes it well suited to the task of reconstructing instance-oriented semantic maps.

Run-Time Performance.
To demonstrate the efficiency of our system, we analyzed its run-time performance and compared it with other state-of-the-art systems, as shown in Table 1.
These systems mainly concentrate on object-level mapping tasks. Our system achieved a speed of 10.8 Hz while performing all processing components on every input frame, thereby outperforming other similar systems in run-time tests. Compared with the way conventional methods [16,29,32] use the semantic information from the input image, the proposed system substantially reduces the computational time by exploiting a voxel-based class probability update scheme. All systems were tested on the same sequences of the SceneNN dataset. Figure 3 shows the execution times of each individual stage of the proposed incremental instance-level mapping system, averaged over five sequences in the SceneNN dataset. Input RGB-D images have 640 × 480 resolution. Mask R-CNN runs on the GPU, while the rest of the components run on the CPU. The trend lines in the figure show that the data association module runs at a low time cost, which indicates that the proposed method effectively improves the operating speed of the system; the GMM module maintains a stable running rate; and the map integration module slowed down after 500 frames while still meeting the real-time demand of the system. Note that, to speed up the system further, it is possible to switch to a faster object detection network, and map fusion and Mask R-CNN can run simultaneously.

Accuracy.
Several recent research projects have focused on semantic instance segmentation of 3D scenes. The majority of these, however, take the fully reconstructed scene as input, processing it either in chunks or as a whole. Because such methods are not constrained to progressively integrating predictions from partial observations into a global map but can learn from the entire 3D layout of the scene, they are not directly comparable with our work. Among the frameworks that study online, incremental instance-aware semantic mapping, we chose Grinvald et al. [16] for comparison. Because we relied on a Mask R-CNN model trained on the 80 Microsoft COCO [38] object classes to obtain the instance IDs, we evaluated the segmentation accuracy on the nine object categories that were common to the SceneNN dataset [21]. The proposed approach was evaluated on the same 10 indoor sequences from the SceneNN dataset for which Grinvald et al. [16] reported instance-level segmentation results. The results in Table 2 demonstrate that our approach achieves better accuracy in most sequences compared with [16], which is one of the advanced methods focused on real-time incremental instance-aware 3D mapping. It is also worth mentioning that, compared with [16], our system runs faster and is more suitable for human-robot interaction.
Table 1: Run-time comparison with other systems.
Method                 Map type             Frames processed    Speed
[34]                   Dense                Every 6 frames      3 Hz
PanopticFusion [44]    Dense                Every 10 frames     4.3 Hz
Voxblox++ [16]         Instance-oriented    Every frame         1 Hz
Pham et al. [45]       Instance-oriented    Every frame         1 Hz
Fusion++ [29]          Instance-oriented    Every frame         4 Hz
Ours                   Instance-oriented    Every frame
To extend the evaluation of the accuracy of our system, we compared class-averaged mean average precision (mAP) values over the nine evaluated categories with [16,45]. The results in Table 3 show that the proposed approach outperforms the baseline on six sequences. The work in [45] focuses on building incremental 3D semantic maps of indoor scenes. Although it differs from our system, it includes an experiment on the accuracy of instance classes, and the authors explain that they used only a simple clustering algorithm to obtain instance semantics, so it can serve as a baseline for comparison with similar systems. As shown in Table 3, our system clearly outperformed theirs in eight scenes. Compared with Voxblox++, the proposed system performed better in six sequences, which demonstrates the advantage of our system. However, it did not perform better in sequences 16, 61, 96, and 206. Analyzing the categories in those sequences suggests that objects such as beds and sofas had more cluttered appearances, so optimizing with the GMM might have caused oversegmentation, which reduced accuracy. Also, Voxblox++ uses a geometric segmentation method, which is better at segmenting objects with more details, such as chairs. We will improve the algorithm in the future.
Furthermore, we show qualitative results of the proposed framework on the SceneNN dataset. The incremental instance-oriented 3D semantic map generation process is presented in Figure 4. As can be seen, the left image shows the progressive semantic segmentation results of our method, the middle image shows the final mapping results, and the right image shows the ground truth segmentation; the 3D shapes of object instances, such as chairs, sofas, and desks, were incrementally generated by our system. Because our system is designed to segment instances from the scene, the color of each instance differs from the ground truth, in which colors are assigned according to classes. As our proposed mapping system focuses primarily on recovering the instances in the scene, we have chosen to ignore the background and floor.
Table 2: ... [21], the per-class average precision (AP) is computed using an intersection over union (IoU) threshold of 0.5 over the predicted 3D segmentation masks.
Columns: Method, Bed, Chair, Sofa, Table, Books, Refrigerator, TV, Toilet, Bag.
011  Voxblox++  -75  50  100  ----
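The IoU threshold of 0.5 used for the per-class AP can be made concrete with a small matching routine over 3D segments represented as sets of voxel or point indices. This sketch covers only the matching step that decides whether a prediction counts as a true positive; a full AP computation would additionally rank predictions by confidence. The function names and the greedy matching rule are assumptions for illustration.

```python
def iou(pred, gt):
    """Intersection over union of two sets of voxel (or point) indices."""
    pred, gt = set(pred), set(gt)
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0

def count_true_positives(pred_segments, gt_segments, threshold=0.5):
    """Greedily match predicted to ground-truth segments of one class.

    A prediction is a true positive if its best unmatched ground-truth
    segment has an IoU above the threshold.
    """
    matched = set()
    tp = 0
    for pred in pred_segments:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_segments):
            if j in matched:
                continue
            score = iou(pred, gt)
            if score > best_iou:
                best_iou, best_j = score, j
        if best_j is not None and best_iou > threshold:
            matched.add(best_j)
            tp += 1
    return tp
```

Each ground-truth segment can be matched at most once, so an oversegmented object yields at most one true positive and the remaining fragments count against precision.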

Ablation Analysis.
To further illustrate how our GMM optimizes the instance clusters, we carried out an ablation analysis to evaluate its effect on instance accuracy, as shown in Figure 5. Circle A shows that, after GMM optimization, the boundaries of the instance are clearer and the segmentation is more accurate. Circle B shows that two different instances are separated after GMM optimization. The same optimization result is shown in C, where the boundaries of different objects are clearer. This shows that clustering the point cloud based on predicted class information is valid for dense semantic instance-level mapping.

Conclusions
Our proposed system is an efficient instance-oriented semantic mapping system. We employed a projection method in the SLAM system that rapidly associates 2D masks with the corresponding depth images to generate a 3D point cloud with instance labels and then used a clustering-based optimization algorithm to resolve the confusion caused by projection mismatches. For 3D reconstruction, the resulting instance-aware, semantically annotated volumetric maps are expected to benefit navigation and manipulation planning tasks. However, as mentioned above, because our system focuses only on recovering 3D instances of an unknown scene, it overlooks the structure of the surrounding environment, such as walls and floors. In the future, we hope to develop a method that solves this problem in real time. Our system can also be used in different applications, such as [44,46,47,48]. We intend to investigate how the segmented instances can serve as semantic landmarks to improve the accuracy of the SLAM system in order to attain a full semantic SLAM system.

Data Availability
The experimental data of the SceneNN and Microsoft COCO datasets used to support the findings of this study are included within the paper.

Conflicts of Interest
The authors declare no conflicts of interest.