Chen Zehui, Chen Zheng, Li Zhenyu, Zhang Shiquan, Fang Liangji, Jiang Qinhong, Wu Feng, Zhao Feng
IEEE Trans Image Process. 2024;33:4488-4500. doi: 10.1109/TIP.2024.3430473. Epub 2024 Aug 21.
Multi-view 3D object detection (MV3D) has made tremendous progress by leveraging perspective features from surrounding cameras. Despite its promise in various applications, accurately detecting objects in 3D space from camera views remains extremely difficult because monocular depth estimation is ill-posed. Recently, Graph-DETR3D introduced a graph-based 3D-2D query paradigm for aggregating multi-view images in 3D object detection and achieved competitive performance. Although it enriches the query representations with 2D image features through a learnable 3D graph, its single-frame input setting still limits its depth and velocity estimation. To solve this problem, we introduce a unified spatial-temporal graph modeling framework that fully leverages multi-view imagery cues under a multi-frame input setting. Thanks to the flexibility and sparsity of the dynamic graph architecture, we lift the original 3D graph into 4D space with an effective attention mechanism that automatically perceives imagery information at both the spatial and temporal levels. Moreover, since the main latency bottleneck lies in the image backbone, we propose a novel dense-sparse distillation framework for multi-view 3D object detection that reduces the computational budget without sacrificing detection accuracy, making the model more suitable for real-world deployment. Combining these designs, we propose Graph-DETR4D, a faster and stronger multi-view 3D object detection framework built on top of Graph-DETR3D. Extensive experiments on the nuScenes and Waymo benchmarks demonstrate the effectiveness and efficiency of Graph-DETR4D. Notably, our best model achieves 62.0% NDS on the nuScenes test leaderboard. Code is available at https://github.com/zehuichen123/Graph-DETR4D.
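To make the 4D graph aggregation concrete, below is a minimal PyTorch sketch of the idea as described in the abstract: each object query carries a small learnable graph of 3D offset nodes around its reference point; the nodes are projected into every camera of every frame, image features are sampled there, and an attention weighting fuses the sampled features back into the query. All names (project_to_image, SpatioTemporalGraphAggregation), tensor shapes, and the simple camera/frame averaging are illustrative assumptions, not the authors' released implementation.

# A minimal, illustrative sketch of spatio-temporal (4D) graph aggregation.
# Helper names, shapes, and fusion choices are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


def project_to_image(points_3d, lidar2img):
    """Project 3D points (B, Q, P, 3) into pixel coordinates with per-camera
    projection matrices (B, N, 4, 4). Hypothetical helper."""
    B, Q, P, _ = points_3d.shape
    N = lidar2img.shape[1]
    pts = F.pad(points_3d, (0, 1), value=1.0)              # homogeneous coords (B,Q,P,4)
    pts = pts.view(B, 1, Q * P, 4, 1)
    cam = lidar2img.view(B, N, 1, 4, 4)
    uvd = (cam @ pts).squeeze(-1)                          # (B,N,Q*P,4)
    uv = uvd[..., :2] / uvd[..., 2:3].clamp(min=1e-5)      # perspective divide
    return uv.view(B, N, Q, P, 2)


class SpatioTemporalGraphAggregation(nn.Module):
    """Lift a learnable 3D offset graph to 4D: sample multi-view features around
    each query's reference point in every frame, then fuse with attention.
    A sketch, not the released Graph-DETR4D code."""

    def __init__(self, embed_dims=256, num_points=4):
        super().__init__()
        self.offset = nn.Linear(embed_dims, num_points * 3)  # learnable graph edges
        self.attn = nn.Linear(embed_dims, num_points)        # per-node attention logits
        self.num_points = num_points

    def forward(self, queries, ref_points, feats_per_frame, lidar2img_per_frame):
        # queries: (B, Q, C); ref_points: (B, Q, 3) in ego coordinates
        # feats_per_frame: list over T frames of (B, N, C, H, W) multi-view features
        # lidar2img_per_frame: list over T frames of (B, N, 4, 4) ego-to-image matrices
        B, Q, _ = queries.shape
        offsets = self.offset(queries).view(B, Q, self.num_points, 3)
        nodes = ref_points.unsqueeze(2) + offsets            # graph nodes around each query

        gathered = []
        for feats, lidar2img in zip(feats_per_frame, lidar2img_per_frame):
            N, _, H, W = feats.shape[1:]
            uv = project_to_image(nodes, lidar2img)          # (B,N,Q,P,2)
            grid = uv / uv.new_tensor([W, H]) * 2 - 1        # normalize for grid_sample
            sampled = F.grid_sample(
                feats.flatten(0, 1),                         # (B*N, C, H, W)
                grid.flatten(0, 1),                          # (B*N, Q, P, 2)
                align_corners=False,
            ).view(B, N, -1, Q, self.num_points)
            # Average over cameras for simplicity; the real method would mask
            # nodes that fall outside a given view.
            gathered.append(sampled.mean(dim=1))             # (B, C, Q, P)
        stacked = torch.stack(gathered, dim=0).mean(dim=0)   # fuse frames (B,C,Q,P)

        weights = self.attn(queries).softmax(dim=-1)         # (B, Q, P)
        fused = torch.einsum('bcqp,bqp->bqc', stacked, weights)
        return queries + fused                               # enriched query representation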
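The dense-sparse distillation component can be sketched in the same hedged spirit: a heavyweight teacher backbone supervises a lightweight student both on dense image features and on the sparse decoder query embeddings, so the cheap backbone inherits the expensive one's representation. The DenseSparseDistillation module, its loss weights, and the 1x1 channel-adaptation layer below are assumptions for exposition; the paper's actual distillation targets and losses may differ.

# A hedged sketch of dense-to-sparse distillation: mimic the teacher's dense
# backbone features and its sparse query embeddings. Weights are illustrative.
import torch.nn as nn
import torch.nn.functional as F


class DenseSparseDistillation(nn.Module):
    def __init__(self, student_dims=128, teacher_dims=256, w_dense=1.0, w_sparse=0.5):
        super().__init__()
        # 1x1 conv aligns student channels with the teacher's before mimicking
        self.adapt = nn.Conv2d(student_dims, teacher_dims, kernel_size=1)
        self.w_dense = w_dense
        self.w_sparse = w_sparse

    def forward(self, s_feat, t_feat, s_queries, t_queries):
        # s_feat: (B, Cs, H, W) student feature; t_feat: (B, Ct, H, W) teacher feature
        # s_queries / t_queries: (B, Q, Ct) decoder query embeddings (dims assumed shared)
        dense = F.mse_loss(self.adapt(s_feat), t_feat.detach())    # dense feature mimic
        sparse = F.mse_loss(s_queries, t_queries.detach())         # sparse query mimic
        return self.w_dense * dense + self.w_sparse * sparse

The teacher terms are detached so gradients flow only into the student, which is what lets the lightweight backbone absorb the teacher's accuracy while cutting the inference-time latency the abstract identifies as the main bottleneck.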