This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that operates on single- or multi-view RGB images of indoor scenes. The key challenge for image-based 3D object detection is the lack of 3D geometric cues, which leads to ambiguity when establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit manners based on predicted depth information. Specifically, we use the predicted depth to learn voxel occupancy and explicitly optimize the voxelized 3D feature volume with the proposed voxel occupancy attention. To further enhance the 3D awareness of the feature volume, we integrate it with an implicit 3D representation, the truncated signed distance function (TSDF). Without supervision from 3D signals, we significantly improve the model’s comprehension of 3D geometry by leveraging these intermediate 3D representations, and the model is trained end to end. Our approach surpasses state-of-the-art image-based methods on both single- and multi-view benchmark datasets, improving mAP@0.5 by 9.3 on SUN RGB-D and by 3.3 on ScanNetV2. Our image-based method thus narrows the performance gap with point cloud-based approaches, achieving comparable results.
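To illustrate the core idea of modulating a voxelized feature volume by learned occupancy, the following is a minimal NumPy sketch. The function name, tensor shapes, and the simple sigmoid gating are all illustrative assumptions for exposition; they are not the paper's actual implementation of voxel occupancy attention.

```python
import numpy as np

def voxel_occupancy_attention(feat_volume, occ_logits):
    """Illustrative sketch (hypothetical, not the paper's code):
    reweight a voxelized 3D feature volume by per-voxel occupancy
    probabilities derived from predicted depth.

    feat_volume: (C, X, Y, Z) features lifted from the image(s)
    occ_logits:  (X, Y, Z) learned occupancy logits
    """
    # Sigmoid maps logits to occupancy probabilities in (0, 1).
    occ_prob = 1.0 / (1.0 + np.exp(-occ_logits))
    # Broadcast over the channel dimension: features in voxels deemed
    # empty are attenuated, features in occupied voxels are preserved.
    return feat_volume * occ_prob[None, ...]

# Toy usage: an 8-channel 4x4x4 feature volume with random logits.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4, 4))
logits = rng.standard_normal((4, 4, 4))
attended = voxel_occupancy_attention(feats, logits)
```

Because the gate lies in (0, 1), this acts as a soft mask: the feature volume keeps its shape, but responses in likely-empty voxels are suppressed before subsequent 3D processing.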