Refined Voting and Scene Feature Fusion for 3D Object Detection in Point Clouds

An essential task for 3D visual world understanding is 3D object detection in lidar point clouds. To directly predict bounding box parameters from point clouds, existing voting-based methods use Hough voting to obtain the centroid of each object. However, inaccurately voted centers make accurate box regression difficult and lead to redundant bounding boxes. Objects in indoor scenes exhibit several co-occurrence patterns, and the semantic relations between object layouts and scenes can be used as prior context to guide object detection. We propose a simple yet effective network, RSFF-Net, which adds refined voting and scene feature fusion for indoor 3D object detection. RSFF-Net consists of three modules: a geometric function module, a refined voting module, and a scene constraint module. First, the geometric function module captures the geometric features of the object nearest each voted point. Then, the coarse votes are re-voted by the refined voting module, based on the feature fused from the coarse votes and the geometric features. Finally, the scene constraint module adds association information between candidate objects and scenes. RSFF-Net achieves competitive results on the indoor 3D object detection benchmarks ScanNet V2 and SUN RGB-D.


Introduction
3D object detection aims to locate and recognize 3D objects in point clouds and plays an important role in visual recognition. Compared with 2D images, point clouds can describe the precise geometric shapes of 3D objects and are robust to various environmental conditions, such as changes in illumination and background. Therefore, predicting 3D bounding boxes from point clouds in real-world environments is of high practical value and is used in fields such as modal detection [1], AI navigation [2], indoor robot navigation [3], 3D cadastre models [4], and robot grasping [5, 6].
Driven by advances in 2D image understanding with convolutional neural networks (CNNs), several architectures, such as DSS [1] and the 2D-driven method [7], have been proposed for object detection in point clouds. These methods project point clouds onto multi-perspective views and apply image-based feature extraction techniques. A major limitation of multiview-based methods is the loss of height and 3D shape information, which markedly decreases the accuracy of the bounding box regression branch. Other methods, such as 3D ConvNet [8], VoxelNet [9], and PV-RCNN [10], divide point clouds into equally spaced grids, which usually contain uneven numbers of points due to the sparsity of point clouds. Inspired by the success of PointNet [11] and its variants in object classification and semantic segmentation, point-based approaches, such as VoteNet [12], MLCVNet [13], HGNet [14], and BRNet [15], process raw points directly to learn 3D representations instead of converting irregular point clouds into grids.
VoteNet [12] is a detection framework that uses the Hough voting strategy [3]. First, VoteNet samples seed points by feeding a full point cloud into PointNet++ and then uses a deep neural network to regress object centers from the seed points. Finally, the voted centers are used to generate box proposals. However, two notable limitations of VoteNet remain. (1) Voted centers are poorly located and lack geometric information. A voted center in VoteNet is a virtual point at which a bounding box is located. By contrast, the seed points generated by PointNet++ lie on object surfaces and carry the geometric information of the objects. (2) The proposal module groups and predicts every object individually, without taking into account global scene features and local relationships between objects. Compared with outdoor scenes, indoor scenes exhibit several object co-occurrence patterns, such as a bathtub, shower curtain, and toilet in a bathroom; tables and chairs in a conference room; and a bed and cabinet in a bedroom. We conclude that these scene co-occurrence patterns, which improve object detection performance, are critical for 3D understanding. A natural way to exploit them is to design a module that fuses the local relations between objects with global scene features.
In summary, we propose a novel voting-based indoor 3D object detection method, the refined voting and scene feature fusion network (RSFF-Net), which incorporates end-to-end learnable, attention-supervised feature enhancement into a voting-based framework. For each coarse vote, a fixed number of points is randomly sampled from the original point cloud to capture geometric object information. Then, a refined voting module combines the coarse votes with the features of the resampled points to take a second vote, improving voted center quality. Inspired by the idea of scene context in 2D object detection, we add global scene context to model the semantic association between scenes and objects, such as a living room having a sofa, a study room having a bookcase, and a bathroom having a toilet.
The effectiveness of refined voting is illustrated in Figure 1. In the first column (W/O Revote), red points denote ground-truth centers; blue and green points denote the vote results of VoteNet and the proposed RSFF-Net, respectively. Many blue points deviate from the center points, whereas after refined voting the votes (green points) are denser near the ground truth than those of the single-voting network. In the second column (VoteNet), red arrows mark false detections and false detection boxes. In the third column (RSFF-Net), bounding boxes of different colors represent different object categories. The predicted bounding boxes (pink and red boxes) of the proposed RSFF-Net fit better than those of VoteNet owing to the improved accuracy of the voted object centers. The number of duplicate boxes (purple boxes) is reduced, and the number of false detections (dark blue boxes) decreases.
The specific contributions of this paper are as follows: (i) We propose a novel end-to-end 3D object detection framework, RSFF-Net, to address center voting errors and small object detection. The proposed RSFF-Net generates local voting attention regions, reliably reselects voting points from the seed points, and then trains and refines object proposals to achieve more robust object classification and more accurate bounding boxes. (ii) We design three novel modules to fuse features from both seeds and votes. These modules ensure that the voted center point obtains a comprehensively merged feature. They also use the semantic relations between object layouts and scenes to refine proposals. (iii) To demonstrate the effectiveness of the proposed modules, extensive experiments were conducted on the ScanNet V2 [16] and SUN RGB-D [17] datasets. In addition, the proposed RSFF-Net performs better on small objects than VoteNet.


Related Work
Existing 3D object detection methods can be grouped by how they represent point clouds: voxel grids, bird's eye views [20-24], and raw points [12-15, 25, 26].

Voxel Grid.
Voxel-based approaches convert irregular point clouds into 3D voxels [9, 18, 19, 27, 28]. VoxelNet [9] divides a point cloud into equally spaced 3D voxels, applies multilayer perceptrons (MLPs) to the points, and obtains a unified feature representation in each voxel. In [27, 28], the authors encoded each non-empty voxel with six statistical quantities and fused multiple local statistics to represent each voxel. HVNet [19] and Voxel-FPN [18] aggregate sets of multiscale voxel features generated by voxelization at various voxel sizes. PV-RCNN [10] combines keypoint features with voxel features to obtain accurate location information. Supervoxel methods [29-31] use a novel supervoxel segmentation algorithm to enhance road boundaries in 3D point clouds. Voxel-based methods often rely on computationally inefficient 3D sparse convolutions to extract features from the voxel representation.

Bird's Eye View.
In contrast to building voxel grids, many existing studies render point clouds into 2D regular lattices [20, 21, 23], project the points onto bird's eye view (BEV) images, and extract features with 2D convolutional layers. MV3D [21] introduces a 3D object proposal generation module and a multiview encoding scheme to combine region-wise features. AVOD [32] also consists of two networks: region proposal and prediction. The region proposal network must perform multimodal feature fusion on high-resolution feature maps. In [20, 24], the authors projected a point cloud onto a 2D BEV image and built a proposal-free single-stage detector. These handcrafted BEV methods easily achieve stable, efficient speed but sacrifice accuracy, which is limited by coarse-grained point cloud representations.

Point-Based Methods.
Point-based methods [12-15, 25, 33-35] have used PointNet++ as the backbone network to directly extract features from unordered point clouds for 3D object detection. VoteNet [12] groups points by voting toward a center point based on seed features learned from PointNet++. This method and its variants [13, 14, 25] yield excellent results. MLCVNet [13] introduces multilevel contextual information into the voting stages to equip the network with the ability to learn object-level and global-level context. HGNet [14] uses a hierarchical graph network to capture the relationships between center points. By optimizing votes and feature fusion between points, FFRNet [35] improves object detection. 3DSSD [25] uses the FPS sampling strategy to decrease inference time. BRNet [15] backtraces the representative points and revisits the seed points to better capture local structural features. To equip the network with the ability to learn object-level and global-level context, Pan et al. [33] designed two transformer modules to learn context-aware representations at the object and scene levels.
H3DNet [34] defines a hybrid set of geometric primitives and refines the bounding boxes with an overcomplete set of constraints generated from those primitives. However, the performance of the above networks remains limited in complex and changeable indoor scenes containing many details.

Attention-Based Network.
Inspired by the idea of self-attention in natural language processing [36], recent studies have applied self-attention mechanisms to improve scene understanding by modeling the relationships between objects [13, 37-39]. For example, in [40], for 2D vision, an attention-based method was proposed for joint visual-language modeling. Recently, DETR [41] employed a transformer for 2D object detection and achieved excellent performance. For 3D point data processing, the work in [39] uses a point context attention network that encodes local features into global features to capture the contextual information in 3D points. PCAN [38] proposes a point attention transformer to process point clouds. For detecting 3D objects in large-scale point clouds, an attention-based PointNet was proposed in [42] to find regions of interest instead of processing the entire scene. MLCVNet [13] learns multilevel contextual information between patches, objects, and scenes. HGNet [14] uses multilevel semantic information and shape-attention graph convolution to capture shape information from the original point clouds. VoTr [43] uses a self-attention mechanism to overcome the limited receptive field size of voxel grids and establishes long-distance perceptual relationships between voxels. Based on attention networks over multiple ranges, Pointformer [33] designs a novel backbone network for 3D point clouds. Attentional-PointNet [42] uses an attention mechanism to classify each small region of three-dimensional space. 3DETR [44] applies attention operations to unordered point clouds to capture long-range context information. Previous methods mainly used attention networks to learn the relationships between points or to find local regions of interest. Our aim is to use an attention mechanism to capture the semantic information between a global scene and its objects.
In this paper, we improve the localization of the voted object centers and optimize proposals with local and global semantic association constraints. The overall architecture of the proposed RSFF-Net is shown in Figure 2.

Materials and Methods
In voting-based methods, subsampling strategies may corrupt the spatial geometry of an object in a point cloud, and the previously learned seed features strongly affect the accuracy of the voted centers. The vote clustering operation, performed per object, also ignores the relationships between objects. Therefore, to improve the accuracy of the voting centers and integrate local-global association information into the proposal network, we propose RSFF-Net, a novel voting-based method with refined voting and scene feature fusion operations. Taking an unordered set of 3D points as input, the proposed RSFF-Net outputs a set of object bounding boxes B; each box b ∈ B is associated with a predefined category label, a center location, a bounding box size, and an orientation. As shown in Figure 2, RSFF-Net consists of three primary modules: GFM, RVM, and SCM.
To extract point features from irregular point clouds, we use PointNet++ to generate seed points. Next, a Hough voting operation predicts coarse voting points. Then, GFM resamples original points near the coarse votes and learns object structural features for refined voting. RVM performs a second Hough voting after combining the coarse votes with the features of the revisited points, outputting refined virtual center points; this helps to accurately locate 3D bounding boxes and reduce overlapping boxes. SCM uses an attention network to integrate global scene context and assist proposal clustering.

Coarse Voting Module.
In a 2D image, an object center must be a real pixel with rich texture. However, in 3D point clouds, the object center is typically far from the surface of the object and cannot be scanned by the data collection device. Thus, we generate new virtual points to represent object centers using an evolved version of 3D voting, inspired by the Hough voting framework [39].
We use PointNet++ to learn multidimensional features from the initial point cloud P_input. The backbone generates the seed points S = {s_i}_{i=1}^M, where s_i is the i-th seed point. Every seed has features [x_i, f_i] ∈ R^(3+C), where 3 and C denote the three-dimensional coordinates and the feature information aggregated from the surrounding points within a radius, respectively. Specifically, the structure of PointNet++ consists of several set abstraction (SA) layers and feature propagation layers, whose parameters follow the point cloud feature learning backbone network of VoteNet. The voting block takes the point patches with seed features as input and regresses the coarse votes; coarse vote prediction is performed by a multilayer perceptron.
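The voting step itself is simple: each seed is shifted by a regressed 3D offset to produce a virtual center vote. A minimal pure-Python sketch (here the offsets are supplied explicitly for illustration, whereas in the network they are regressed by the vote MLP; all names are ours):

```python
def hough_vote(seeds, offsets):
    """Shift each seed point by its predicted 3D offset to obtain a
    coarse vote -- a virtual point near the object center."""
    return [[s[0] + o[0], s[1] + o[1], s[2] + o[2]]
            for s, o in zip(seeds, offsets)]

# Toy example: two seeds on the surface of an object centered at (1, 1, 1).
seeds   = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
offsets = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]  # stand-in for MLP output
votes = hough_vote(seeds, offsets)  # both votes land on the center
```

Because votes from different surface points of the same object converge near one location, clustering the votes localizes the object even though no scanned point lies at its center.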

Geometric Function Module.
In VoteNet, 1,024 seed points represent the characteristics of the whole input point cloud. The coarse votes are derived from the seeds, which ignore the details of a single object. Hence, to enhance the object features and learn the latent geometric features of a single object, we resample some original points around the coarse votes.
We first use farthest point sampling to sample uniform reference points R = {r_i}_{i=1}^P based on the coarse votes. Returning to the original points, we use a slightly modified K-nearest point sampling strategy, which adds the distance to the object center as a criterion for selecting the revisited points around the reference points. We obtain a local point set, P_i = {p_1, p_2, ..., p_k}, from the original points near each reference point r_i. A point is included in the revisited set P_i if its distance to the nearest object center is less than 0.3 and it is among the k points closest to the reference point r_i. After that, the network learns the geometric features of an object from the revisited points using three MLPs with ReLU activation. The module takes the coordinates of the reference points r_i and the features of the revisited point sets P_i as input and outputs the learned features f_i of r_i.
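A minimal pure-Python sketch of this resampling step, combining greedy farthest point sampling with the center-distance-filtered k-nearest selection (function names and the brute-force search are ours, chosen for clarity rather than speed):

```python
import math

def farthest_point_sampling(points, n):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [points[0]]
    while len(chosen) < n:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(far)
    return chosen

def revisit(reference, points, centers, k, max_center_dist=0.3):
    """Modified k-NN: among points whose distance to the nearest object
    center is below max_center_dist, keep the k nearest to the reference."""
    near = [p for p in points
            if min(math.dist(p, c) for c in centers) < max_center_dist]
    return sorted(near, key=lambda p: math.dist(p, reference))[:k]
```

The center-distance filter discards background points near a reference, so the subsequent MLPs see only points that plausibly belong to the object being voted for.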

Refined Voting Module.
We place the fused features f_i, the revisited points P_i, and the coarse votes R into an MLP for feature fusion. Similar to the Hough voting framework, RVM replaces set abstraction layers with self-attention feature propagation. We adjusted the number of convolutional channels and the embedding method of the original voting layer, which uses an attention mechanism to learn the local information of a point. Each refined vote is generated from a Euclidean space offset, Δx_i ∈ R³, and a feature offset. The predicted 3D offset, Δx_i, is supervised explicitly by the following regression loss:

L_revote = (1 / M_pos) Σ_i 1[s_i on object] · ( ‖Δx_i − Δx_i*‖ + α ‖Δvx_i − Δx_i*‖ + β ‖Δsx_i − Δx_i*‖ ),    (1)

where 1[s_i on object] indicates whether a seed point lies on the surface of an object; M_pos is the total number of revoting points on object surfaces; Δx_i* is the ground-truth displacement from the revote position x_i to the bounding box center of the object to which it belongs; Δvx_i is the offset of the coarse votes; Δsx_i is the offset of the seeds; and α and β are hyperparameters.
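A sketch of one plausible form of this combined offset regression loss, assuming a smooth-L1 distance and that the refined-vote, coarse-vote, and seed offsets are all supervised toward the same ground-truth displacement (the exact formulation and the values of α and β are the paper's; names are ours):

```python
def smooth_l1(x, delta=1.0):
    """Huber / smooth-L1 penalty on a scalar residual."""
    a = abs(x)
    return 0.5 * a * a / delta if a < delta else a - 0.5 * delta

def offset_dist(u, v):
    """Smooth-L1 distance between two 3D offset vectors."""
    return sum(smooth_l1(ui - vi) for ui, vi in zip(u, v))

def revote_loss(dx, dvx, dsx, dx_gt, on_object, alpha=0.5, beta=0.5):
    """Combined offset regression loss over on-surface points only.
    dx: refined-vote offsets, dvx: coarse-vote offsets, dsx: seed offsets,
    dx_gt: ground-truth displacements, on_object: 0/1 surface indicators."""
    m_pos = sum(on_object) or 1  # avoid division by zero
    total = 0.0
    for i, pos in enumerate(on_object):
        if pos:
            total += (offset_dist(dx[i], dx_gt[i])
                      + alpha * offset_dist(dvx[i], dx_gt[i])
                      + beta * offset_dist(dsx[i], dx_gt[i]))
    return total / m_pos
```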

Scene Constraint Module.
In an indoor scene, there are common patterns of object layout, such as bookcases appearing in a study room, sofas in a living room, and toilets in a bathroom. Indoor objects also have strong mutual semantic associations, which can be used as a priori information for indoor object detection. Inspired by the idea of scene context extraction in [45], we propose a scene constraint module (SCM), which uses global scene context information to improve the performance of bounding box proposal and classification. Figure 3 shows the detailed composition of SCM. An attention encoding design modified for the global scene is used to learn the semantic association between objects and scenes. SCM uses global features from the original points and object-level local feature clusters from revoting to create a new branch; the module then applies a cross-attention mechanism to model objects and scenes. Given the refined votes, we sample K of them as refined vote centers by farthest point sampling. Then, we generate K clusters by grouping the K-nearest neighbors of each cluster center and learn cluster features with several MLPs. Each cluster, C_i = [X_i, F_i], is sent to the MLPs, and 1 × 1 convolution is then used to form a single vector representing the cluster as the key and query. We introduce global feature patches, P = {p_1, p_2, ..., p_M}, from the seeds to obtain a vector after convolution and max-pooling and then feed this vector into the attention module together with the key and query values to generate a new feature map. The encoding of the supervision relationship is summarized as follows:

C_i^super = Attention(C_i, P),  i = 1, ..., K,    (2)

where C_i^super is the i-th supervised cluster and Attention(·) is the attention mapping of CGNL [46].
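The core of SCM, letting a cluster feature attend over global scene features, can be sketched in pure Python as single-query scaled dot-product attention (a simplification: the paper uses the CGNL attention mapping [46] and learned 1 × 1 convolutions, which are elided here; names are ours):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(query, keys, values):
    """Single-query scaled dot-product attention: a cluster feature
    (query) attends over global scene patch features (keys/values)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]
```

The output is a convex combination of the scene patch features, weighted by how strongly each patch matches the cluster, which is how scene layout information flows into each candidate object.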

Proposal and Classification.
After grouping, we use a network to generate bounding boxes and classification results.
Given C_i^super with Z_i ∈ R³ as the center location and H_i ∈ R^D as the cluster feature, an object proposal p(C) for this cluster is generated by passing the set input through a PointNet-like module:

p(C) = MLP_2( max_i { MLP_1([Z_i'; H_i]) } ).    (3)

In equation (3), the feature point set from each candidate is processed independently by MLP_1; the decoded information for detection and classification is then extracted, max-pooled (channel-wise) into a single vector, and passed to MLP_2 for prediction. The information of the refined voting points is further combined and scored. To obtain a standardized proposal, we convert each voting position into a local normalized coordinate system:

Z_i' = (Z_i − Z_c) / r,    (4)

where Z_c is the cluster center and r is the grouping radius. The proposal p(C) contains five parameters (center, heading, scale, objectness, and category) to describe the bounding box.
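A minimal sketch of the two generic operations in this step: channel-wise max pooling over a cluster's per-point features, and the coordinate normalization referred to as equation (4) (the learned MLPs are elided; `center` and `radius` stand for the cluster center and grouping radius, which are our reading of the normalization):

```python
def max_pool(features):
    """Channel-wise max pooling over a cluster's per-point feature vectors,
    producing the single summary vector that the prediction MLP consumes."""
    return [max(f[c] for f in features) for c in range(len(features[0]))]

def normalize(points, center, radius):
    """Map vote positions into a cluster-local standardized frame."""
    return [[(p[j] - center[j]) / radius for j in range(3)] for p in points]
```

Max pooling makes the proposal invariant to the ordering of points within a cluster, which is why a PointNet-like module can consume an unordered vote set.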
We use cross-entropy loss to supervise the objectness scores of the proposals: a proposal is treated as positive if its center is near (within 0.3 m of) a ground-truth object center and as negative if it is far (more than 0.6 m) from every center. For positive proposals, we further supervise the bounding box estimation and class prediction based on the ground truth. Specifically, we follow the method described in VoteNet, which decouples the box loss into center regression, heading angle estimation, and box size estimation. For semantic classification, we also use cross-entropy loss. For all regression terms of the detection loss, we use the Huber (smooth-L1 [47]) loss.
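The positive/negative assignment rule can be sketched directly (thresholds from the text; the `-1` label for proposals in the 0.3-0.6 m band, which receive no objectness supervision, is our reading):

```python
import math

def objectness_label(proposal_center, gt_centers, near=0.3, far=0.6):
    """Label a proposal positive if within `near` meters of some
    ground-truth center, negative if farther than `far` from all of them,
    and ignored otherwise."""
    d = min(math.dist(proposal_center, c) for c in gt_centers)
    if d < near:
        return 1    # positive: also gets box and class supervision
    if d > far:
        return 0    # negative
    return -1       # ambiguous band: no objectness supervision
```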

Results and Discussion
In this section, we first describe the datasets used in our experiments and the experimental setup. Then, several ablation studies are presented to demonstrate the effectiveness of the proposed modules in RSFF-Net. Finally, baselines and experimental results on the ScanNet V2 [16] and SUN RGB-D [17] datasets are used to demonstrate the superiority of the proposed RSFF-Net.

ScanNet V2 Dataset. ScanNet V2 [16] is a richly annotated dataset of 3D reconstructed meshes of indoor scenes, which contains about 1,200 training examples collected from hundreds of different rooms, annotated with semantic and instance segmentation for eighteen object categories. We sampled vertices from the reconstructed meshes as the point cloud input. Because ScanNet V2 does not provide oriented bounding box annotations, as in [21], we predict axis-aligned bounding boxes instead. The inputs to the proposed RSFF-Net are randomly subsampled points from the raw data (i.e., 40,000 points from a 3D mesh in the ScanNet V2 dataset).

SUN RGB-D Dataset. SUN RGB-D [17]
is a single-view RGB-D dataset for research on 3D scene understanding that contains 10,335 indoor RGB images and corresponding depth images. The RGB images are aligned with the depth channel and used to query the corresponding image area from the 3D point scene. Each point in the point cloud has a semantic label and an object bounding box. There are 37 types of annotated objects in the dataset. We trained on and report results for the ten most common categories, which are the same as those used for VoteNet.

Experimental Setup.
The inputs to RSFF-Net are randomly downsampled point clouds containing 20k points for the SUN RGB-D dataset and 40k points for the ScanNet V2 dataset. In addition to its XYZ coordinates, each point carries a height feature indicating its distance to the ground, where the floor height is estimated as the 1st percentile of all point heights. To increase the training data, we randomly subsampled points from the full point cloud on the fly. Point clouds are randomly flipped in the two horizontal directions, randomly rotated about the vertical axis by [−5°, 5°], and randomly scaled by [0.9, 1.1]. The end-to-end model, RSFF-Net, is trained using the Adam optimizer with a batch size of 8; the base learning rate was 0.005 for the ScanNet V2 dataset, and RSFF-Net was trained for 180 epochs on both datasets. For timing comparisons, we used the PyTorch platform with two NVIDIA GeForce RTX 2080 Ti GPUs; training to convergence requires approximately 4.5 hours on the ScanNet V2 dataset and approximately eleven hours on the SUN RGB-D dataset.
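The augmentation recipe can be sketched as follows (pure Python over XYZ lists; the height feature and floor estimation are omitted, and the seeded RNG is our choice for reproducibility):

```python
import math
import random

def augment(points, rng=random.Random(0)):
    """Random horizontal flips, a yaw rotation in [-5, 5] degrees, and a
    uniform scale in [0.9, 1.1], applied to a list of XYZ points."""
    fx = -1.0 if rng.random() < 0.5 else 1.0   # flip about the x axis
    fy = -1.0 if rng.random() < 0.5 else 1.0   # flip about the y axis
    theta = math.radians(rng.uniform(-5.0, 5.0))
    scale = rng.uniform(0.9, 1.1)
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x, y, z in points:
        x, y = fx * x, fy * y
        x, y = c * x - s * y, s * x + c * y    # rotate about vertical axis
        out.append([scale * x, scale * y, scale * z])
    return out
```

Flips and yaw rotations preserve point norms, so only the final uniform scale changes distances; applying the same transform to the ground-truth boxes keeps labels consistent.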

Individual and Combined Effects of Submodules.
To quantitatively evaluate the effectiveness of the proposed contextual submodules of RSFF-Net, we performed experiments with different combinations of these modules. The baseline method was VoteNet. Then, we added the proposed submodules one by one to the baseline model. Applying the GFM, RVM, and SCM modules led to improvements in mAP@0.25 of 3.0, 4.1, and 3.8 points, respectively. The results of the different combinations of the three modules are detailed in Table 1, with the highest mAP@0.25 score being 65.9.
We tested the combined effectiveness of the three modules and found that, with the cooperation of the RVM module, the performance of the network improved significantly. Supplementing the proposal with geometric information from the original point cloud improves the revoted center points by 1.8 points. SCM helps to judge the semantic category of a proposal by feeding the scene layout information into the proposal's learning of its surrounding environment and, simultaneously, improves the quality of the proposal's center point.

The Effect of Submodule Location.
In addition, we performed a detailed ablation study to analyze the effect of placing the three proposed submodules at different positions, as shown in Figure 4 and Table 2. We considered placing SCM after either the coarse voting or the refined voting and measured its performance separately. The results show that SCM is more effective after refined voting. Global feature supervision of the candidates effectively manages the features between candidates and supervises their detection and classification, eliminating results with large deviations while respecting the layout rules that objects in indoor scenes follow. The global scene also assists in object classification (e.g., a bed is typically found in a bedroom; a bathtub typically appears only in a bathroom).
There are differences in how H3DNet processed the SUN RGB-D dataset (e.g., H3DNet subsampled 40,000 points from each scene as input, while the other methods used 20,000 points). In addition, H3DNet reported per-category results only with PointNet++ as the backbone network, and the other comparison methods all used PointNet++ as the backbone. The proposed RSFF-Net still achieves an improvement of 1.5 in terms of mAP@0.25 over H3DNet, even under these differing settings.
We also compared our method with several baseline methods on the SUN RGB-D dataset. The results, given in Table 4, show that the proposed RSFF-Net achieves performance comparable with most existing methods. Using point clouds only, VoteNet obtained a detection accuracy of 57.7 in terms of mAP@0.25; the proposed RSFF-Net provides a marked absolute gain of 3.6 over VoteNet. Despite the differences between the datasets, RSFF-Net still leads, improving by 3.6% and 10.7% on mAP@0.25 and mAP@0.5, respectively. On both datasets, the advantage of RSFF-Net is larger at mAP@0.5, indicating that at an IoU threshold of 0.5 our method produces more high-quality proposals than VoteNet; the localization is more accurate and efficient, as reflected by our experimental results.
In Tables 5 and 6, we show the per-category accuracies on the respective datasets. Table 5 shows the detailed performance scores for each object category in ScanNet V2. The reason for these results may be that these objects have regular geometric edges, which can be better supplemented, making the features learned by refined voting more accurate. In addition, these objects have distinctive semantic classes and characteristic locations; the SCM is more sensitive to the layout information of these objects in the global scene and thus detects them more reliably. When presenting the 3D object detection results for the SUN RGB-D validation dataset, we assessed performance using the SUN RGB-D V1 data to make a fair comparison with existing methods. As shown in Table 6, the proposed RSFF-Net achieved the best mAP@0.25 performance on 7 out of the 10 classes from the SUN RGB-D dataset. The proposed RSFF-Net also had a visible effect on bathtub, desk, and bookshelf, which increased by 9.4, 5.6, and 7.6 points, respectively. These objects all have a strong relationship with the scene, indicating that scene-level semantic context is instructive and helpful for detection.

Results with ScanNet Dataset.
Many objects, such as windows, doors, and pictures, which are embedded in or attached to walls in indoor scenes, are typically markedly different from walls in RGB images. However, these objects lie on the surfaces of walls and are easily confused with walls in pure point clouds. Thus, they are easily misdetected by detectors without RGB image inputs. As shown in Figure 5, all doors and windows in the three scenes are embedded in the walls, and VoteNet performs relatively poorly in all three.
As shown in the first scene in Figure 5, both doors and windows are classified inaccurately, and several invalid boxes are predicted in the top right corner. In the second scene, VoteNet [12] also classifies a door as a window and misses the curtains near the window, and duplicate boxes are generated for the window and door in the third scene. Although it detects the windows and doors in the first two scenes, MLCVNet [13] misclassifies the door in the second scene and creates an additional box for the door in the third scene. Although detecting correctly in the first and third scenes, 3DETR [44] incorrectly detects the door as a locker in the second scene. In contrast, the proposed RSFF-Net successfully recognizes the doors and windows in all three scenes and also correctly detects the window curtains in the second scene, which both VoteNet and MLCVNet tend to miss.
Additionally, according to the partially enlarged image, the boxes predicted by the proposed RSFF-Net fit much better around the real objects than those of the other two methods. For example, while MLCVNet, 3DETR, and RSFF-Net all detect the glass doors in the first scenario of Figure 5, the bounding box of RSFF-Net is clearly tighter. A possible reason the proposed RSFF-Net effectively reduces duplicate and empty bounding boxes is that the refined voting module improves center point locations and directs the network to attend to the correct regions. The proposed RSFF-Net also moves adjacent points toward the centers, which helps remove duplicate boxes during the non-maximum suppression (NMS) operation. Some narrow rooms, such as living rooms and bathrooms, contain several object categories with large differences in object size and geometry.
Different objects are distributed in specific forms and have strong semantic relations with one another; objects such as sofas and coffee tables, or toilets and bathtubs, frequently appear in pairs. Figure 6 shows that RSFF-Net exhibits a strong ability to improve detection precision in these scenarios. In the first two scenes, the proposed RSFF-Net outputs a total of only nine bounding boxes for eight objects, whereas VoteNet, MLCVNet, and 3DETR output eighteen, fifteen, and twelve bounding boxes, respectively. The accuracy of the boxes is also better with the proposed RSFF-Net. Again, it is reasonable to attribute this improvement to the refined voting module.
Considering bounding box quality, the proposed RSFF-Net achieves a nearly perfect result for the sofa (light blue boxes), bathtub (yellow box), toilet (light green), and door (red box). In contrast, both VoteNet and MLCVNet produce low-quality boxes: they generate an inadequate box for the larger sofa in the first scene and duplicate boxes for the door in the second scene, and VoteNet, MLCVNet, and 3DETR produce several invalid boxes (dark blue) in empty areas. Both MLCVNet and the proposed RSFF-Net achieve acceptable results for the toilet, bathtub, and shower curtain.
Sometimes, indoor scenes contain densely packed objects in certain regions. During inference, many centers in a scene then belong to the same category, which increases the difficulty of detecting individual objects. Detailed results are shown in Figure 7. In all three scenes, both 3DETR and RSFF-Net separate the rows of chairs. However, VoteNet misses several chairs in the center region of the second scene and a few windows in the third scene. Overall, both VoteNet and MLCVNet are prone to generating redundant boxes in this situation. Also, all three methods predict two chairs in the top left of the second scene, even though they are not labeled in the ground truth.
In multifunctional scenes, the proposed RSFF-Net does not show better results than the other detectors (see Figure 8). From a functional point of view, a single room can serve as both a study and a living room. In Scenario 1, there are three tables, one sofa, several chairs, and many objects on the tables. The cluttered objects covering the tables act as noise and make it difficult to extract key features for the tables. Therefore, all three methods perform poorly on the tables. Similarly, Scenario 2 can be seen as a combination of a kitchen and a living room. Only two windows are embedded in the bottom of the scene. Owing to the bookshelf-like object between the windows, all methods generate redundant boxes for them. None of the three detectors correctly detects the furniture in the right center of the room (marked in light blue). Two possible reasons for this are the lack of training samples and the partial occlusion of the object.
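The duplicate-box removal discussed above relies on standard non-maximum suppression over the predicted 3D boxes. As a rough illustration only (not the paper's implementation; the axis-aligned corner box format, the scores, and the IoU threshold of 0.25 are illustrative assumptions), a greedy 3D NMS can be sketched as:

```python
import numpy as np

def iou_3d(a, b):
    """Axis-aligned 3D IoU between boxes given as (x1, y1, z1, x2, y2, z2)."""
    inter_min = np.maximum(a[:3], b[:3])
    inter_max = np.minimum(a[3:], b[3:])
    inter = np.clip(inter_max - inter_min, 0, None).prod()  # overlap volume
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def nms_3d(boxes, scores, iou_thresh=0.25):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    suppress remaining boxes that overlap it above iou_thresh."""
    order = np.argsort(scores)[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = np.array([iou_3d(boxes[i], boxes[j]) for j in rest])
        order = rest[ious <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```

Two near-identical votes for the same object produce two heavily overlapping boxes; whichever scores lower is suppressed, which is why pulling adjacent votes toward a common center tends to collapse duplicates in this step.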

Figure 5: Comparison of object detection results between VoteNet [12], MLCVNet [13], 3DETR [44], and the proposed RSFF-Net for objects embedded in the wall, which are similar in appearance to the wall. Bounding boxes of different colors represent different object categories.
Figure 6: Comparison of object detection results between VoteNet [12], MLCVNet [13], 3DETR [44], and the proposed RSFF-Net, which achieves acceptable results for objects of various sizes and shapes in a special room. Bounding boxes of different colors represent different object categories.

Results with SUN RGB-D.
Some qualitative results on the SUN RGB-D dataset are shown in Figure 9. Bounding boxes of different colors represent different object categories, and the bounding boxes indicated by red arrows denote correctly detected objects that exist in the RGB image but are unlabeled in the point cloud. As seen in Figure 9, the proposed RSFF-Net achieves promising results in a wide range of scenes, including the bedroom, living room, and conference room. It is also noteworthy that almost all objects in the ground truth (GT) are correctly detected by the proposed RSFF-Net in those images, whereas VoteNet still produces several missed and false-positive detections.
Additionally, many objects in the RGB image are not labeled or are missing in the GT. For instance, the TV cabinet in the first scene is unlabeled, and the chairs in the last two scenes are only partially observed by the sensor. For those objects, RSFF-Net exhibits significant improvement over VoteNet, thereby demonstrating the effectiveness of the proposed approach.

Figure 8: Comparison of object detection results between VoteNet [12], MLCVNet [13], 3DETR [44], and the proposed RSFF-Net for multifunctional scenes. Bounding boxes of different colors represent different object categories.

Conclusion
3D object detection in indoor scenes is used in various AI applications. The proposed RSFF-Net introduces three novel modules to achieve better feature learning, center voting, and bounding box regression. The geometric function module restores detailed object information for small objects that is lost during downsampling. The refined voting module improves the accuracy of center points. The scene constraint module introduces the relationships between a scene and its objects to improve classification accuracy. Compared with several existing methods, the proposed RSFF-Net achieves higher accuracy on both the ScanNet and SUN RGB-D datasets. In future work, we plan to apply these modules to other 3D scene understanding tasks, such as instance segmentation and 3D object reconstruction.

Data Availability
The SUN RGB-D data used to support the findings of this study have been deposited in the SUN RGB-D repository (https://rgbd.cs.princeton.edu/). The ScanNet data used to support the findings of this study have been deposited in the ScanNet repository (https://www.scan-net.org/).

Conflicts of Interest
The authors declare that they have no conflicts of interest.