Symmetry-Aware 6D Object Pose Estimation via Multitask Learning

Although 6D object pose estimation has been intensively explored in the past decades, its performance is still not fully satisfactory, especially for symmetric objects. In this paper, we study the problem of 6D object pose estimation by leveraging the information of object symmetry. To this end, we propose a network that simultaneously predicts the 6D object pose, the object's reflectional symmetry, and the key points via a multitask learning scheme. Consequently, the pose estimation is aware of and regulated by the symmetry axis and the key points of the to-be-estimated objects. Moreover, we devise an optimization function to refine the predicted 6D object pose by considering the predicted symmetry. Experiments on two datasets demonstrate that the proposed symmetry-aware approach outperforms existing methods on 6D pose estimation of symmetric objects.


Introduction
6D object pose estimation is of remarkable importance to a variety of industrial applications, ranging from robotic manipulation [1,2] and autonomous navigation [3] to augmented reality [4]. Serving as the base for the perception of objects in the environment, it concerns the acquisition of the 6D pose (location and orientation) information [2,5]. The ultimate goal is to achieve real-time 6D object pose estimation with robust performance regardless of varying shape, texture, occlusion, illumination, or sensor noise. The recent development of 6D object pose estimation methods has been promising, thanks to the advancement of economical depth sensors. Existing work has explored 6D object pose estimation for both household furniture (e.g., chair or table) [6] and table-top objects (e.g., box or book) [7]. These methods are able to generate accurate object poses in many real-world scenarios by utilizing the 3D information from the depth image. Moreover, when the color image is integrated, RGB-D image-based 6D object pose estimation methods [8] achieve competitive results on objects with complex geometry and moderate occlusion.
However, previous pose estimation approaches are still hardly satisfactory when dealing with symmetric objects. The reason is that a symmetric object might correspond to multiple poses, leading to ambiguity in the training of the neural networks. On the other hand, the symmetry-related feature of the object has proven to be one of the most informative geometric clues for a variety of applications [9] and has the potential to facilitate the pose estimation task as a complementary element [10]. In this work, we tackle the problem of 6D object pose estimation for objects with reflectional symmetry, which plays a crucial role in a variety of applications [11][12][13]. Specifically, the pose is predicted jointly with the symmetry axis, so that these two related tasks boost each other.
To this end, this paper proposes an approach for 6D object pose estimation that is aware of and regulated by the symmetry axis and the key points of the to-be-estimated objects. During training, the proposed approach learns to predict the 6D object pose, the object symmetry, and the key points in a unified network. In particular, the network contains a multiscale feature extraction module to fuse the appearance feature and the geometric feature at multiple scales. The ground-truth object symmetry of the training data is generated in a self-supervised manner, so no manual data labelling is required. During testing, we use an optimization function to determine the final prediction of the 6D object pose from the predicted object symmetry. This decision-level optimization improves the 6D object pose prediction by exploiting the predicted object symmetry.
We evaluate the proposed method on two datasets: YCB-Video and ShapeNet. Experimental results demonstrate that our method outperforms the state-of-the-art methods on most of the symmetric objects. We also provide a qualitative comparison to the baseline method to demonstrate the effects of our symmetry-aware pose estimation approach. The contributions of this paper are as follows: (1) we introduce a multitask network to estimate the 6D object pose, the symmetry axis, and the key points at the same time; (2) we propose a multiscale feature extraction module to fuse the features from the color image and the depth image; (3) we devise an optimization function to refine the predicted 6D object pose by the predicted symmetry; (4) we show that our approach outperforms the existing methods on 6D object pose estimation of symmetric objects.

6D Pose Estimation.
Given a single RGB image, previous methods estimate the 6D object pose by using either template matching techniques or end-to-end data-driven neural networks. These methods are limited by various factors, such as occlusion or ambiguity along the depth direction, and are inadequate for 3D data reasoning [14]. Another type of 6D pose estimation approach is based on data from range sensors, such as a depth camera or LIDAR. Existing approaches typically address this problem by first establishing rough pose candidates using point features and then running an iterative closest point (ICP) algorithm to refine and select the optimal pose [15]. Recently, Xiang, Song and Xiao, and Li [5,7,8] integrated the features from both the RGB image and the depth map by leveraging feature fusion techniques. Wada [16] proposed an object-level volumetric fusion to reason about the 6D pose of multiple objects. These approaches have proven to be fairly robust in scenarios with poor lighting conditions or heavy occlusions. Although we utilize a similar feature fusion approach, our method improves on these approaches by introducing a symmetry detection module. We demonstrate how our method outperforms the previous works for 6D pose estimation on symmetric objects.

3D Symmetry Detection.
3D symmetry detection has received significant research attention in the computer vision and graphics communities for both synthetic and real-world applications. Conceptually, symmetry is well defined in mathematics and is geometrically measurable. Conventional symmetry detection methods [17,18] mostly use point clustering to detect symmetries of complete geometries (such as CAD models). However, 3D data acquired from sensors are often affected by noise, occlusion, or complex lighting conditions. This makes the traditional symmetry detection methods ineffective. To tackle this problem, Ecins et al. proposed to detect the symmetries of an object from an incomplete point cloud [19]. This method can detect symmetries for objects with simple geometry in occluded tabletop scenes, but its limited generality means it cannot be extended to more general object types. Another 3D symmetry detection approach is to first predict the complete geometry of the input data [20] and then apply a conventional symmetry detection method [21]. The drawback is that it requires the shape completion method to make point-level predictions with high accuracy, which is nontrivial as the training data collection and network training procedures are both effort-intensive. More recently, Shi proposed an end-to-end deep neural network that is able to predict both reflectional and rotational symmetries from RGB-D images [22]. Our method is inspired by their work. However, the output of our method is not only the symmetry but also the 6D pose.

Key Point Detection.
Efforts have been made to compute 6D pose parameters based on the detected key points via deep neural networks [14][23][24][25]. Previously, key point detection on texture-less objects was proven challenging [26][27][28]. With the recent progress of deep learning, Rad and Lepetit, Tekin et al., and Hu [24,25,29] proposed to obtain the coordinates of the 2D key points via direction regression. The methods mentioned above are designed to minimize the 2D projection errors on the objects. However, a small 2D projection error may still correspond to a large error in the 3D world. In [30], 3D poses are obtained via 3D key points from two views provided by synthetic RGB images. However, the depth information is missing when only RGB images are used. Today's economical depth sensors capture depth information and thus allow us to construct, compute, and detect key points in the real 3D world.

Multitask Learning.
Multitask learning refers to the approach where multiple objectives corresponding to different tasks with a common representation are learned in parallel [31,32]. It improves both learning efficiency and prediction accuracy because commonalities and differences across tasks are exploited [33]. In addition, it helps avoid overfitting on a specific task since the network model is regularized [34]. Wang et al. improved 6D object pose estimation performance, especially under occlusion, via a multitask learning network combining object recognition with pose estimation [35]. The issue of 6D pose estimation of multiple instances in the bin-picking situation was studied by Sock [36], who demonstrated outstanding performance of a multitask network that jointly learns depth, 2D detection, and 3D object pose estimation as three subtasks. Xiang et al. proposed PoseCNN, where the extracted feature maps are shared by three subtasks, namely, 3D rotation regression, 3D translation estimation, and semantic labelling [5].

Method
A 6D pose consists of a position and an orientation, both of which are defined with respect to the camera coordinate frame in this paper. Specifically, a pose is defined by a rotation matrix R and a translation vector t. The representation of a pose is therefore a homogeneous transformation T = [R, t]. A reflectional symmetry plane is defined by a point on the plane and its plane normal, i.e., S = [p, n], where p is the location of the point and n is the plane normal.
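As an illustration of these two representations, the following minimal NumPy sketch packs a pose T = [R, t] into a 4x4 homogeneous matrix and builds the reflection induced by a plane S = [p, n]. The function names are hypothetical and not taken from any released code, and the reflection formula assumes that n is normalized to unit length.

```python
import numpy as np

def make_pose(R, t):
    """Pack rotation R (3x3) and translation t (3,) into a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def apply_pose(T, points):
    """Transform an (M, 3) array of points: x -> R x + t."""
    return points @ T[:3, :3].T + T[:3, 3]

def reflection_transform(p, n):
    """Reflection about the plane S = [p, n], returned in the same 4x4 form as a pose."""
    n = n / np.linalg.norm(n)                # assumption: use the unit normal
    R_s = np.eye(3) - 2.0 * np.outer(n, n)   # Householder reflection
    t_s = 2.0 * np.dot(p, n) * n
    return make_pose(R_s, t_s)

# A pose: 90 degree rotation about the z axis plus a translation.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
T = make_pose(R, np.array([0.1, 0.0, 0.5]))

# A symmetry plane z = 0.5; the normal need not be unit length on input.
T_s = reflection_transform(p=np.array([0.0, 0.0, 0.5]), n=np.array([0.0, 0.0, 2.0]))

points = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 1.0]])
moved = apply_pose(T, points)
mirrored = apply_pose(T_s, points)
```

Writing the reflection in the same [R, t] form as a pose means both can be applied with the same routine; note that a reflection matrix has determinant -1 and is therefore not a rotation.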
3.1. Overview. We propose to estimate 6D object pose and object symmetry in a multitask network (see Figure 1). In particular, the network consists of a multiscale feature extraction module to fuse the features from the RGB image and the depth map. During training, 6D object pose estimation, symmetry prediction, and key point detection components are coupled with each other and trained by a multitask learning strategy.
During testing, we first predict the 6D object pose and object symmetry by a network inference. The predicted pose is then refined by an optimization process which considers the constraints provided by the predicted symmetry and the detected key points.

Multiscale Feature Extraction.
The input to our method is an RGB-D image which contains at least one object. In our problem setting, the segmentation is precomputed by a segmentation algorithm [5]. For the segmented object, we crop the pixels in the RGB-D image and compute the point cloud by using the intrinsic parameters of the camera.
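The step from a segmented depth crop to a point cloud can be sketched as follows, assuming a standard pinhole camera model with focal lengths fx, fy and principal point cx, cy. The text only states that the camera intrinsics are used, so the exact model, the function name, and the toy values below are assumptions.

```python
import numpy as np

def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked depth pixels to an (N, 3) point cloud in the camera frame.

    depth: (H, W) depth map in metres; mask: (H, W) boolean object mask.
    Pixels without a valid depth reading (depth <= 0) are dropped.
    """
    valid = mask & (depth > 0)
    v, u = np.nonzero(valid)      # row index v, column index u
    z = depth[v, u]
    x = (u - cx) * z / fx         # pinhole model (assumed)
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy 4x4 depth map with two readings, only one of which lies inside the mask.
depth = np.zeros((4, 4))
depth[1, 2] = 2.0
depth[3, 0] = 1.0
mask = np.zeros((4, 4), dtype=bool)
mask[1, 2] = True
mask[2, 2] = True                 # inside the mask but without depth, so it is dropped
cloud = backproject(depth, mask, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
```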
Our network is derived from the pixelwise dense feature extraction network introduced in [8]. First, the point cloud is fed into a geometric feature extraction network. Different from [8], which uses PointNet as its backbone, we opt to use PointNet++ [37] because of its superior ability in feature extraction for objects with complex geometry. For the RGB image, we use a ResNet-based U-Net to extract pixel-level features. The difference from [8] is that we enlarge the dilation in the convolution layers so that the network can perceive more context information. We found that this adjustment is of great significance for symmetry prediction. The multiscale features from the point cloud and the features from the RGB image are subsequently concatenated before being fed to another network to obtain the global feature by using an average pooling layer. The pixel-level feature is then concatenated with the global feature to form the overall pixelwise features, which are in the end used to predict the 6D object pose and the symmetry as well as the key points.
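The data flow of this fusion step can be sketched with plain arrays. This is only a shape-level illustration: the per-point features are random placeholders for the PointNet++ and U-Net outputs, a single random linear layer with ReLU stands in for the additional network, and all dimensions are assumed rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_features(geo_feat, rgb_feat, weight):
    """Fuse per-point features and append a pooled global feature to every point.

    geo_feat: (N, Cg) geometric features; rgb_feat: (N, Cc) colour features;
    weight: (Cg + Cc, Cglobal) placeholder for the extra network (assumption).
    """
    dense = np.concatenate([geo_feat, rgb_feat], axis=1)     # per-point concatenation
    hidden = np.maximum(dense @ weight, 0.0)                 # stand-in network: linear + ReLU
    global_feat = hidden.mean(axis=0, keepdims=True)         # average pooling over points
    tiled = np.repeat(global_feat, dense.shape[0], axis=0)   # copy the global feature to each point
    return np.concatenate([dense, tiled], axis=1)

N, Cg, Cc, Cglobal = 500, 128, 64, 256                       # assumed sizes
pixelwise = fuse_features(rng.normal(size=(N, Cg)),
                          rng.normal(size=(N, Cc)),
                          rng.normal(size=(Cg + Cc, Cglobal)))
```

Each row of `pixelwise` then carries both local evidence and a summary of the whole object, which is what allows a per-point head to predict an object-level quantity such as a pose or a symmetry plane.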

Loss Function.
The multitask learning network comprises the pose predictor, the symmetry predictor, and the key point predictor, whose losses are embedded into the overall loss function so that the symmetry and key point information can serve as additional regulation of the learning process for the pose prediction. In the end, the results of the 6D pose estimation and the symmetry estimation are output in the format of T and S. The symmetric transformation of the predicted symmetry S is the reflection about the predicted symmetry plane. The overall loss of the network training is the sum of the losses of the point-level predictions, where N is the total number of points. For each point, the loss consists of a pose estimation loss, a symmetry prediction loss, and a key point detection loss. The 6D pose estimation loss $L^i_{\mathrm{pose}}$ is defined as the average distance between the sampled points on the object transformed by the ground-truth pose and by the pose predicted at the i-th point, where M is the number of sampled points, $x_j$ is the j-th sampled point, and T = [R, t] is the ground-truth pose. The symmetry prediction loss $L^i_{\mathrm{symmetry}}$ is the average distance between the sampled points on the object transformed by the ground-truth symmetric transformation and by the predicted symmetric transformation, where $T_s = [R_s, t_s]$ is the ground-truth symmetric transformation. Similar to [38], the key point detection loss $L^i_{\mathrm{keypoint}}$ is the sum of the offset distances between the sampled points and the key points, comparing each predicted offset between the j-th point and the p-th key point with the corresponding ground truth. Note that our key point detection module differs from [38], as our key points contain not only the points selected by farthest point sampling but also their symmetric counterparts.
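One way to write out the terms described above is given below. This is a reading of the verbal definitions rather than the published formulas: hats marking predicted quantities, a unit-length normal n, Euclidean norms, and an unweighted sum of the three tasks are all assumptions. The key point term is omitted because its index ranges cannot be pinned down from the description alone.

```latex
% Reflection induced by a symmetry plane S = [p, n], assuming \|n\| = 1
T_s(x) = R_s x + t_s, \qquad R_s = I - 2\, n n^{\top}, \qquad t_s = 2\,(n^{\top} p)\, n

% Overall loss as an unweighted sum over the N points (task weights, if any, are unknown)
L = \sum_{i=1}^{N} \left( L^{i}_{\mathrm{pose}} + L^{i}_{\mathrm{symmetry}} + L^{i}_{\mathrm{keypoint}} \right)

% Pose loss of the i-th point, with ground truth T = [R, t] and prediction [\hat{R}_i, \hat{t}_i]
L^{i}_{\mathrm{pose}} = \frac{1}{M} \sum_{j=1}^{M} \left\| (R x_j + t) - (\hat{R}_i x_j + \hat{t}_i) \right\|

% Symmetry loss of the i-th point, with ground truth T_s = [R_s, t_s]
L^{i}_{\mathrm{symmetry}} = \frac{1}{M} \sum_{j=1}^{M} \left\| (R_s x_j + t_s) - (\hat{R}^{i}_{s} x_j + \hat{t}^{i}_{s}) \right\|
```

Under this reading the pose and symmetry terms have the same structure, which makes the two tasks directly comparable in scale since both are mean point distances in metres.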

Multitask Network Training.
The three subtasks, i.e., pose prediction, symmetry prediction, and key point detection, share the same pixelwise feature maps extracted beforehand and are trained jointly in parallel. The symmetry prediction task serves as an additional metric to reveal the quality of the pointwise features, hence helping to boost the accuracy of the overall pose estimation task.

3.5. Inference. During inference, we first extract the point-level features and make point-level 6D object pose and symmetry predictions. The final prediction is generated by averaging all the point-level predictions. The design of multitask learning on the 6D object pose and the symmetry makes the two subtasks regulate each other. However, we observe from the experiments that (1) the 6D object pose prediction and the symmetry prediction are not perfectly consistent with each other; (2) the error of symmetry prediction is noticeably smaller than the error of 6D object pose prediction, indicating that the predicted symmetry can be further used to refine the predicted pose. To this end, we introduce an optimization function that refines the predicted pose by considering the constraints provided by the predicted symmetry, where $T(x_j) = R x_j + t$ represents the location of $x_j$ after transformation by T. We use Ceres Solver [39] to optimize this function and take the T obtained after the optimization as the final 6D object pose of our method.
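Observation (1) can be made concrete with a simple residual. The sketch below is not the optimization function used in the paper; it only measures how far a pose and a camera-frame symmetry plane disagree, which is the kind of quantity such a refinement could drive towards zero. It assumes that the object's symmetry plane is also known in the model frame, and all names are hypothetical.

```python
import numpy as np

def transform(R, t, pts):
    """Apply the pose T = [R, t] to an (M, 3) array: x -> R x + t."""
    return pts @ R.T + t

def reflect(p, n, pts):
    """Reflect points about the plane through p with normal n."""
    n = n / np.linalg.norm(n)
    return pts - 2.0 * ((pts - p) @ n)[:, None] * n

def symmetry_inconsistency(R, t, plane_model, plane_camera, pts):
    """Mean distance between 'reflect, then pose' and 'pose, then reflect'.

    plane_model: (p, n) in the object frame (assumed known);
    plane_camera: (p, n) in the camera frame, e.g. a predicted symmetry.
    The value is zero when the pose maps the model plane onto the camera plane.
    """
    a = transform(R, t, reflect(*plane_model, pts))
    b = reflect(*plane_camera, transform(R, t, pts))
    return float(np.linalg.norm(a - b, axis=1).mean())

def rot_z(deg):
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
plane_model = (np.zeros(3), np.array([1.0, 0.0, 0.0]))
t = np.array([0.0, 0.0, 1.0])
# Camera-frame plane obtained by moving the model plane with the true pose.
plane_camera = (t, rot_z(90) @ plane_model[1])

consistent = symmetry_inconsistency(rot_z(90), t, plane_model, plane_camera, pts)
perturbed = symmetry_inconsistency(rot_z(100), t, plane_model, plane_camera, pts)
```

Here `consistent` is zero up to rounding, while the pose that is off by 10 degrees yields a clearly positive residual. If the symmetry estimate is the more accurate of the two, as reported, such a residual indicates in which direction the pose should be corrected.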

Benchmark.
We create a benchmark to evaluate our method. The benchmark is built on two datasets: YCB-Video [40] and ShapeNet [41]. YCB-Video consists of 92 RGB-D videos captured in indoor scenes with 21 different table-top objects. The images in the dataset are annotated with object pose. We compute the ground-truth symmetry for each object by using an offline symmetry detection method [21]. ShapeNet is a large-scale CAD model dataset with category-label annotations. To generate the training and testing data, we first perform a virtual scanning on the CAD model from random viewpoints around the object and then compute the ground-truth object pose and the symmetry. Note that, as in [22], we only use the objects with one reflectional symmetry. In order to generate images with occlusions and background clutter, we randomly select objects from either images of other ShapeNet objects or images of real-world scenes [42]. Examples from the two datasets are shown in Figure 2.

Evaluation Metrics.
We evaluate the 6D object pose using the average closest point distance (ADD-S) proposed in [5]. Specifically, we report the area under the ADD-S curve (AUC). Given the ground-truth pose and the predicted one, ADD-S measures the mean distance between each sample point on the object transformed by the ground-truth pose and its closest neighbouring point among the sample points transformed by the predicted pose. We set the AUC threshold as 0.1 m. We also evaluate the percentage of predictions whose ADD-S is smaller than 2 cm.
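The metrics described above can be sketched in a few lines of NumPy. The ADD-S function follows the verbal definition given here; the AUC routine is an assumed discretization (accuracy-versus-threshold curve integrated with the trapezoidal rule and normalized by the 0.1 m threshold), and the helper names are hypothetical.

```python
import numpy as np

def add_s(R_gt, t_gt, R_pred, t_pred, pts):
    """Mean distance from each ground-truth-posed point to its closest predicted-posed point."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # (M, M) pairwise distances
    return float(dists.min(axis=1).mean())

def auc(errors, max_threshold=0.1, steps=1000):
    """Normalized area under the accuracy-versus-threshold curve on [0, max_threshold]."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.linspace(0.0, max_threshold, steps)
    accuracy = (errors[None, :] <= thresholds[:, None]).mean(axis=1)
    area = np.sum((accuracy[1:] + accuracy[:-1]) / 2.0 * np.diff(thresholds))
    return float(area / max_threshold)

def fraction_below(errors, threshold=0.02):
    """Share of predictions whose ADD-S is smaller than the threshold (2 cm by default)."""
    return float((np.asarray(errors, dtype=float) < threshold).mean())

# Four points that map onto themselves under a half turn about the z axis.
pts = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
identity = np.eye(3)
half_turn = np.diag([-1.0, -1.0, 1.0])
zero = np.zeros(3)

err_symmetric = add_s(identity, zero, half_turn, zero, pts)
err_shifted = add_s(identity, zero, identity, np.array([0.0, 0.0, 0.03]), pts)
errors = [err_symmetric, err_shifted, 0.05, 0.2]
score = auc(errors)
share = fraction_below(errors)
```

The half-turn example shows why a closest-point metric suits symmetric objects: the predicted rotation differs from the ground truth by 180 degrees, yet the transformed point sets coincide, so ADD-S does not penalize the ambiguity.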

Comparison to Baselines.
We compare our method with three baselines: PointFusion [43], PoseCNN [5], and DenseFusion [8]. The quantitative results are shown in Table 1 (YCB-Video) and Table 2 (ShapeNet). Our method achieves the best results on both YCB-Video and ShapeNet. In particular, our method outperforms all the baselines on ShapeNet by a large margin. Given that most of the objects in ShapeNet are symmetric, we reckon that our proposed method is especially suitable for symmetric objects.

Qualitative Results.
To demonstrate the advantages of our method, we show the qualitative results of our method and DenseFusion on YCB-Video in Figure 3. It shows that our method is able to successfully produce accurate 6D object poses in cases where DenseFusion fails. We also visualize the predicted symmetry in Figure 4.

Figure 3: Qualitative comparison on 6D pose estimation performance between the proposed approach and previous work [8] with the YCB-Video dataset. Our method achieves more accurate pose estimation on a variety of objects.

Conclusion
In this paper, we focus on the problem of boosting 6D object pose estimation by leveraging object symmetry. We propose a network that predicts 6D object pose, object symmetry, and key points through multitask learning. The predicted 6D object pose is then refined by the predicted object symmetry via an optimization function. We evaluate our method using both quantitative and qualitative comparisons to the state-of-the-art approaches. Experimental results show that our method outperforms the three baseline approaches, particularly by a large margin in the case of ShapeNet where most objects are symmetric. For future work, we are interested in integrating other relevant geometric clues into the pose estimation network [22,44]. It may be possible to reduce the size of the network and improve accuracy simultaneously by considering relevant geometric mechanisms [44].

Data Availability
Data that support the findings of this study are available at http://www.ycbbenchmarks.com/ and https://www.shapenet.org/.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.