Yu Jiahui, Zheng Wenli, Chen Yongquan, Zhang Yutong, Huang Rui
Shenzhen Institute of Artificial Intelligence and Robotics for Society, and the SSE/IRIM, The Chinese University of Hong Kong, Shenzhen, Guangdong, China.
The Shenzhen Academy of Inspection Quarantine, Shenzhen, Guangdong, China.
Front Neurosci. 2023 Jul 4;17:1219363. doi: 10.3389/fnins.2023.1219363. eCollection 2023.
Bird's-Eye-View (BEV) maps provide an accurate representation of sensory cues present in the surroundings, including dynamic and static elements. Generating a semantic representation of BEV maps can be a challenging task since it relies on object detection and image segmentation. Recent studies have developed Convolutional Neural Networks (CNNs) to tackle this challenge. However, current CNN-based models encounter a bottleneck in perceiving subtle nuances of information due to their limited capacity, which constrains the efficiency and accuracy of representation prediction, especially for multi-scale and multi-class elements. To address this issue, we propose novel neural networks for BEV semantic representation prediction that are built upon Transformers without convolution layers, in a way significantly different from existing pure-CNN and hybrid CNN-Transformer architectures. Given a sequence of image frames as input, the proposed neural networks directly output BEV maps with per-class probabilities in an end-to-end manner. The core innovations of this study comprise (1) a new pixel generation method powered by Transformers, (2) a novel algorithm for image-to-BEV transformation, and (3) a novel network for image feature extraction using attention mechanisms. We evaluate the proposed model's performance on two challenging benchmarks, the NuScenes dataset and the Argoverse 3D dataset, and compare it with state-of-the-art methods. Results show that the proposed model outperforms CNNs, achieving relative improvements of 2.4% and 5.2% on the NuScenes and Argoverse 3D datasets, respectively.
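To make the abstract's description more concrete, the following is a minimal, hypothetical sketch of a convolution-free, Transformer-based image-to-BEV pipeline: linear patch embedding for attention-based feature extraction, learnable BEV queries decoded against image tokens for the image-to-BEV transformation, and a per-class probability head. All module names, dimensions, and the query-based decoding scheme are assumptions for illustration and do not reproduce the authors' actual architecture.

```python
# Hypothetical sketch only; not the paper's architecture.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed them linearly
    (no convolution layers, in the spirit of the abstract)."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=256):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(in_ch * patch * patch, dim)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.proj(x)                    # (B, N, dim)

class ImageToBEV(nn.Module):
    """Attention-based feature extraction + query-driven image-to-BEV
    transformation + per-class BEV map prediction (assumed design)."""
    def __init__(self, num_classes=14, bev_size=32, dim=256):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # One learnable query per BEV cell (an illustrative choice).
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)
        self.bev_size = bev_size

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        tokens = self.embed(frames.flatten(0, 1))           # (B*T, N, dim)
        tokens = self.encoder(tokens).reshape(B, -1, tokens.shape[-1])
        queries = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev = self.decoder(queries, tokens)                  # cross-attention
        logits = self.head(bev)                              # (B, S*S, classes)
        return logits.view(B, self.bev_size, self.bev_size, -1).sigmoid()

# Usage: a batch of 2 sequences of 3 RGB frames -> per-class BEV probabilities.
probs = ImageToBEV()(torch.randn(2, 3, 3, 224, 224))
print(probs.shape)   # torch.Size([2, 32, 32, 14])
```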