3DGeoDet

General-purpose Geometry-aware Image-based 3D Object Detection

Under Review, 2024

1The Hong Kong Polytechnic University
†Corresponding author.

Performance of our proposed 3DGeoDet. (a) We visualize the performance of different methods in terms of mAP@0.5 and mAP@0.25 using varying numbers of views during inference on the ScanNetV2 validation split. 3DGeoDet achieves the best performance among existing methods for all view configurations. (b) 3DGeoDet is trained end-to-end with supervision from ground-truth 3D bounding boxes and, optionally, depth maps, whereas CN-RMA employs a much more complex training strategy that requires ground-truth TSDF volumes and 3D bounding boxes.
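As a rough sketch of this end-to-end setup, the snippet below shows one way the training objective could combine the 3D detection loss with an optional depth term. The function name, signature, and weighting are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Optional
import torch


def total_loss(det_loss: torch.Tensor,
               depth_loss: Optional[torch.Tensor] = None,
               depth_weight: float = 1.0) -> torch.Tensor:
    """Hypothetical combined objective: supervision from ground-truth 3D boxes,
    plus an optional depth-map term when depth supervision is available."""
    loss = det_loss
    if depth_loss is not None:  # depth supervision is optional
        loss = loss + depth_weight * depth_loss
    return loss
```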

Abstract

This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that operates on single- or multi-view RGB images of indoor scenes. The key challenge for image-based 3D object detection is the lack of 3D geometric cues, which leads to ambiguity in establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit forms based on predicted depth information. Specifically, we use the predicted depth to learn voxel occupancy and explicitly optimize the voxelized 3D feature volume through the proposed voxel occupancy attention. To further enhance the 3D awareness of the feature volume, we integrate it with an implicit 3D representation, the truncated signed distance function (TSDF). Without requiring supervision from 3D signals, we significantly improve the model's comprehension of 3D geometry by leveraging intermediate 3D representations, and the model is trained end-to-end. Our approach surpasses state-of-the-art image-based methods on both single- and multi-view benchmark datasets, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D dataset and a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset. Our image-based method also narrows the performance gap to point cloud-based approaches, achieving comparable results.
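For readers curious about the mechanism, below is a minimal PyTorch sketch of what a voxel occupancy attention step could look like: per-voxel occupancy probabilities gate the lifted feature volume so that likely-occupied voxels are emphasized. This is not the released implementation; the module name, tensor shapes, and the residual sigmoid gating are assumptions made for illustration, and in the actual method the occupancy is tied to the predicted depth.

```python
import torch
import torch.nn as nn


class VoxelOccupancyAttentionSketch(nn.Module):
    """Illustrative sketch (not the official code): re-weight a voxelized
    feature volume with predicted per-voxel occupancy probabilities."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Maps per-voxel features to a single occupancy logit.
        self.occupancy_head = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_dim // 2, 1, kernel_size=1),
        )

    def forward(self, voxel_feats: torch.Tensor):
        # voxel_feats: (B, C, X, Y, Z) feature volume lifted from image features.
        occ_prob = torch.sigmoid(self.occupancy_head(voxel_feats))  # (B, 1, X, Y, Z), in [0, 1]
        # Residual re-weighting: emphasize voxels that are likely occupied.
        attended = voxel_feats * (1.0 + occ_prob)
        return attended, occ_prob


if __name__ == "__main__":
    feats = torch.randn(1, 64, 40, 40, 16)           # toy 40x40x16 volume, 64-dim features
    module = VoxelOccupancyAttentionSketch(feat_dim=64)
    out, occ = module(feats)
    print(out.shape, occ.shape)                       # (1, 64, 40, 40, 16), (1, 1, 40, 40, 16)
```

In such a sketch, the occupancy probabilities would additionally be supervised by occupancy derived from the predicted depth, consistent with the description above.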


Results on ScanNetV2 Benchmark

Our 3DGeoDet achieves state-of-the-art performance on the ScanNetV2 benchmark. Notably, compared to the point cloud-based method VoteNet, our method achieves better results using only 50 image views. Red indicates the best performance and blue indicates the second-best performance.
ScanNetV2 Benchmark
Our 3DGeoDet outperforms state-of-the-art methods for every number of testing views (note that all methods are trained with 20 views). Red indicates the best performance and blue indicates the second-best performance.
ScanNetV2 Benchmark

ScanNetV2 Visualization Results

We visualize several representative examples of 3DGeoDet on the ScanNetV2 validation set, covering various indoor environments such as living rooms, bedrooms, bathrooms, kitchens, and libraries.

Scene042300 (Ours)

Scene042300 (Ground truth)

Scene070200 (Ours)

Scene070200 (Ground truth)

Scene066400 (Ours)

Scene066400 (Ground truth)

Scene065501 (Ours)

Scene065501 (Ground truth)

Scene065300 (Ours)

Scene065300 (Ground truth)

Scene059802 (Ours)

Scene059802 (Ground truth)

Scene058000 (Ours)

Scene058000 (Ground truth)

Scene055000 (Ours)

Scene055000 (Ground truth)


ScanNetV2 Visualization Results Compared with SOTA Approaches

We compare 3DGeoDet with CN-RMA on the ScanNetV2 validation set. The left is our result and the right is the result from CN-RMA.


(Four interactive image comparisons: Ours vs. CN-RMA)

Results on SUN RGB-D Benchmark

3DGeoDet outperforms ImVoxelNet by 25.4% and 73.5% in mAP@0.25 and mAP@0.5, respectively. Moreover, it significantly narrows the performance gap between monocular detection methods and point cloud-based detection methods (e.g., VoteNet).


SUN RGB-D Benchmark

SUN RGB-D Visualization Results Compared with SOTA Approaches

We compare 3DGeoDet with ImVoxelNet on the SUN RGB-D validation set. The left is our result and the right is the result from ImVoxelNet.


(Four interactive image comparisons: Ours vs. ImVoxelNet)

Acknowledgment

The research work was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust.

BibTeX 🙏