CI3D: Context Interaction for Dynamic Objects and Static Map Elements in 3D Driving Scenes

Authors

Cai Feipeng, Chen Hao, Deng Liuyuan

Publication

IEEE Trans Image Process. 2024;33:2867-2879. doi: 10.1109/TIP.2023.3340607. Epub 2024 Apr 15.

Abstract

Multi-view 3D visual perception, including 3D object detection and bird's-eye-view (BEV) map segmentation, is essential for autonomous driving. However, there has been little discussion of 3D context attention between dynamic objects and static elements with multi-view camera inputs, owing to the difficulty of recovering 3D spatial information from images and performing effective 3D context interaction. 3D context information is expected to provide additional cues that enhance 3D visual perception for autonomous driving. We therefore propose a new transformer-based framework, CI3D, that implicitly models 3D context interaction between dynamic objects and static map elements. To achieve this, we use dynamic object queries and static map queries, represented sparsely in 3D space, to gather information from multi-view image features. Moreover, a dynamic 3D position encoder generates precise positional embeddings for the queries. With accurate positional embeddings, the queries effectively aggregate 3D context information via a multi-head attention mechanism to model 3D context interaction. We further show that sparse supervision signals from the limited number of queries lead to coarse and ambiguous image features. To overcome this challenge, we introduce a panoptic segmentation head as an auxiliary task and a 3D-to-2D deformable cross-attention module, greatly improving the robustness of spatial feature learning and sampling. Our approach is evaluated extensively on two large-scale datasets, nuScenes and Waymo, and significantly outperforms the baseline method on both benchmarks.
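To make the query-interaction idea concrete, here is a minimal sketch (ours, not the authors' released code) of how dynamic object queries and static map queries might exchange 3D context through multi-head attention. A small MLP over 3D reference points stands in for the paper's dynamic 3D position encoder; all module names, layer sizes, and tensor shapes are illustrative assumptions.

import torch
import torch.nn as nn

class ContextInteraction(nn.Module):
    """Hypothetical 3D context-interaction layer, sketched after the abstract."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Stand-in for the dynamic 3D position encoder: an MLP lifting a
        # 3D reference point (x, y, z) to a dim-dimensional embedding.
        self.pos_encoder = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_q, obj_xyz, map_q, map_xyz):
        # obj_q: (B, N_obj, dim) dynamic object queries; obj_xyz: (B, N_obj, 3)
        # map_q: (B, N_map, dim) static map queries;     map_xyz: (B, N_map, 3)
        q = torch.cat([obj_q, map_q], dim=1)
        pos = self.pos_encoder(torch.cat([obj_xyz, map_xyz], dim=1))
        x = q + pos  # positional embeddings make attention aware of 3D location
        out, _ = self.attn(x, x, x)  # objects and map elements exchange context
        out = self.norm(q + out)     # residual connection and normalization
        n_obj = obj_q.shape[1]
        return out[:, :n_obj], out[:, n_obj:]  # updated object / map queries

# Usage with illustrative sizes (100 object queries, 50 map queries).
B, n_obj, n_map, dim = 2, 100, 50, 256
layer = ContextInteraction(dim)
new_obj, new_map = layer(
    torch.randn(B, n_obj, dim), torch.randn(B, n_obj, 3),
    torch.randn(B, n_map, dim), torch.randn(B, n_map, 3),
)

Because object and map queries attend within a single sequence, each detected object can draw on nearby map elements (lanes, crossings) and vice versa, which is the kind of cross-category cue the abstract argues improves 3D perception.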

