CI3D: Context Interaction for Dynamic Objects and Static Map Elements in 3D Driving Scenes

Authors

Cai Feipeng, Chen Hao, Deng Liuyuan

Publication

IEEE Trans Image Process. 2024;33:2867-2879. doi: 10.1109/TIP.2023.3340607. Epub 2024 Apr 15.

Abstract

Multi-view 3D visual perception, including 3D object detection and bird's-eye-view (BEV) map segmentation, is essential for autonomous driving. However, there has been little discussion of 3D context attention between dynamic objects and static elements with multi-view camera inputs, owing to the difficulty of recovering 3D spatial information from images and performing effective 3D context interaction. 3D context information is expected to provide additional cues that enhance 3D visual perception for autonomous driving. We therefore propose a new transformer-based framework, CI3D, that implicitly models 3D context interaction between dynamic objects and static map elements. To achieve this, we use dynamic object queries and static map queries, represented sparsely in 3D space, to gather information from multi-view image features. Moreover, a dynamic 3D position encoder generates precise positional embeddings for the queries. With accurate positional embeddings, the queries effectively aggregate 3D context information via a multi-head attention mechanism to model 3D context interaction. We further show that sparse supervision signals from the limited number of queries lead to coarse and ambiguous image features. To overcome this challenge, we introduce a panoptic segmentation head as an auxiliary task and a 3D-to-2D deformable cross-attention module, greatly improving the robustness of spatial feature learning and sampling. Our approach is evaluated extensively on two large-scale datasets, nuScenes and Waymo, and significantly outperforms the baseline method on both benchmarks.
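To make the query-interaction idea concrete, here is a minimal sketch (ours, not the authors' released code) of how dynamic object queries and static map queries might exchange 3D context through multi-head attention. A small MLP over 3D reference points stands in for the paper's dynamic 3D position encoder; all module names, layer sizes, and tensor shapes are illustrative assumptions.

import torch
import torch.nn as nn

class ContextInteraction(nn.Module):
    """Hypothetical 3D context-interaction layer, sketched after the abstract."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Stand-in for the dynamic 3D position encoder: an MLP lifting a
        # 3D reference point (x, y, z) to a dim-dimensional embedding.
        self.pos_encoder = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_q, obj_xyz, map_q, map_xyz):
        # obj_q: (B, N_obj, dim) dynamic object queries; obj_xyz: (B, N_obj, 3)
        # map_q: (B, N_map, dim) static map queries;     map_xyz: (B, N_map, 3)
        q = torch.cat([obj_q, map_q], dim=1)
        pos = self.pos_encoder(torch.cat([obj_xyz, map_xyz], dim=1))
        x = q + pos  # positional embeddings make attention aware of 3D location
        out, _ = self.attn(x, x, x)  # objects and map elements exchange context
        out = self.norm(q + out)     # residual connection and normalization
        n_obj = obj_q.shape[1]
        return out[:, :n_obj], out[:, n_obj:]  # updated object / map queries

# Usage with illustrative sizes (100 object queries, 50 map queries).
B, n_obj, n_map, dim = 2, 100, 50, 256
layer = ContextInteraction(dim)
new_obj, new_map = layer(
    torch.randn(B, n_obj, dim), torch.randn(B, n_obj, 3),
    torch.randn(B, n_map, dim), torch.randn(B, n_map, 3),
)

Because object and map queries attend within a single sequence, each detected object can draw on nearby map elements (lanes, crossings) and vice versa, which is the kind of cross-category cue the abstract argues improves 3D perception.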

