Yu Jiahui, Zheng Wenli, Chen Yongquan, Zhang Yutong, Huang Rui
Shenzhen Institute of Artificial Intelligence and Robotics for Society, and the SSE/IRIM, The Chinese University of Hong Kong, Shenzhen, Guangdong, China.
The Shenzhen Academy of Inspection Quarantine, Shenzhen, Guangdong, China.
Front Neurosci. 2023 Jul 4;17:1219363. doi: 10.3389/fnins.2023.1219363. eCollection 2023.
Bird's-Eye-View (BEV) maps provide an accurate representation of sensory cues present in the surroundings, including dynamic and static elements. Generating a semantic representation of BEV maps can be a challenging task since it relies on object detection and image segmentation. Recent studies have developed Convolutional Neural Networks (CNNs) to tackle this challenge. However, current CNN-based models encounter a bottleneck in perceiving subtle nuances of information due to their limited capacity, which constrains the efficiency and accuracy of representation prediction, especially for multi-scale and multi-class elements. To address this issue, we propose novel neural networks for BEV semantic representation prediction that are built upon Transformers without convolution layers, in a way significantly different from existing pure-CNN and hybrid CNN-Transformer architectures. Given a sequence of image frames as input, the proposed neural networks directly output BEV maps with per-class probabilities in an end-to-end manner. The core innovations of this study comprise (1) a new pixel generation method powered by Transformers, (2) a novel algorithm for image-to-BEV transformation, and (3) a novel network for image feature extraction using attention mechanisms. We evaluate the proposed model's performance on two challenging benchmarks, the NuScenes dataset and the Argoverse 3D dataset, and compare it with state-of-the-art methods. Results show that the proposed model outperforms CNNs, achieving relative improvements of 2.4% and 5.2% on the NuScenes and Argoverse 3D datasets, respectively.
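To make the abstract's description more concrete, the following is a minimal, hypothetical sketch of a convolution-free, Transformer-based image-to-BEV pipeline: linear patch embedding for attention-based feature extraction, learnable BEV queries decoded against image tokens for the image-to-BEV transformation, and a per-class probability head. All module names, dimensions, and the query-based decoding scheme are assumptions for illustration and do not reproduce the authors' actual architecture.

```python
# Hypothetical sketch only; not the paper's architecture.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed them linearly
    (no convolution layers, in the spirit of the abstract)."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=256):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(in_ch * patch * patch, dim)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.proj(x)                    # (B, N, dim)

class ImageToBEV(nn.Module):
    """Attention-based feature extraction + query-driven image-to-BEV
    transformation + per-class BEV map prediction (assumed design)."""
    def __init__(self, num_classes=14, bev_size=32, dim=256):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # One learnable query per BEV cell (an illustrative choice).
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)
        self.bev_size = bev_size

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        tokens = self.embed(frames.flatten(0, 1))           # (B*T, N, dim)
        tokens = self.encoder(tokens).reshape(B, -1, tokens.shape[-1])
        queries = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev = self.decoder(queries, tokens)                  # cross-attention
        logits = self.head(bev)                              # (B, S*S, classes)
        return logits.view(B, self.bev_size, self.bev_size, -1).sigmoid()

# Usage: a batch of 2 sequences of 3 RGB frames -> per-class BEV probabilities.
probs = ImageToBEV()(torch.randn(2, 3, 3, 224, 224))
print(probs.shape)   # torch.Size([2, 32, 32, 14])
```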