Li Yangguang, Huang Bin, Chen Zeren, Cui Yufeng, Liang Feng, Shen Mingzhu, Liu Fenggang, Xie Enze, Sheng Lu, Ouyang Wanli, Shao Jing
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8665-8679. doi: 10.1109/TPAMI.2024.3414835. Epub 2024 Nov 6.
Perception tasks based on the Bird's-Eye-View (BEV) representation have recently drawn increasing attention, and the BEV representation is a promising foundation for next-generation autonomous vehicle (AV) perception. However, most existing BEV solutions either require considerable resources for on-vehicle inference or deliver only modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on on-vehicle chips. Toward this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive transformer-based view transformation or depth representation. Our Fast-BEV consists of five parts. We innovatively propose (1) a lightweight, deployment-friendly view transformation that quickly transfers 2D image features to 3D voxel space, (2) a multi-scale image encoder that leverages multi-scale information for better performance, and (3) an efficient BEV encoder specifically designed to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy in both image and BEV space to avoid over-fitting, and (5) a multi-frame feature fusion mechanism to leverage temporal information. Components (1) and (3) make Fast-BEV fast at inference and deployment-friendly on on-vehicle chips, while (2), (4), and (5) ensure that Fast-BEV has competitive performance. Together, these make Fast-BEV a solution with high performance, fast inference speed, and easy deployment on the on-vehicle chips of autonomous vehicles. In experiments on a 2080Ti platform, our R50 model runs at 52.6 FPS with 47.3% NDS on the nuScenes validation set, faster than the BEVDepth-R50 model (41.3 FPS, 47.5% NDS) (Li et al., 2022) and both faster and more accurate than the BEVDet4D-R50 model (30.2 FPS, 45.7% NDS) (J. Huang and G. Huang, 2022). Our largest model (R101@900×1600) establishes a competitive 53.5% NDS on the nuScenes validation set.
We further develop a benchmark with considerable accuracy and efficiency on currently popular on-vehicle chips.
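To make the core idea of component (1) concrete, below is a minimal sketch of a lookup-table view transformation in the spirit described by the abstract: since camera intrinsics and extrinsics are fixed for a given vehicle rig, the voxel-to-pixel correspondence can be precomputed once offline, so on-vehicle inference reduces to a cheap feature gather with no per-frame depth estimation or transformer. All function names, shapes, and the pinhole-projection setup are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a precomputed lookup-table (LUT) view transformation.
# Assumption: a single pinhole camera; voxel centers are already expressed
# in camera coordinates. The real Fast-BEV handles multiple cameras and
# multi-scale features; this only illustrates the projection-and-gather idea.
import numpy as np

def build_lut(voxel_centers, intrinsics, H, W):
    """Precompute, for each 3D voxel center, the 2D pixel it projects to.

    voxel_centers: (N, 3) voxel centers in camera coordinates.
    intrinsics:    (3, 3) pinhole intrinsic matrix.
    Returns an (N,) array of flat pixel indices, -1 where not visible.
    This runs once offline; inference only reuses the result.
    """
    pts = voxel_centers @ intrinsics.T               # (N, 3) homogeneous pixels
    z = pts[:, 2]
    u = np.round(pts[:, 0] / np.maximum(z, 1e-6)).astype(np.int64)
    v = np.round(pts[:, 1] / np.maximum(z, 1e-6)).astype(np.int64)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return np.where(valid, v * W + u, -1)

def view_transform(img_feat, lut):
    """Transfer 2D image features into the 3D voxel grid via the LUT.

    img_feat: (H*W, C) flattened per-pixel image features.
    lut:      (N,) precomputed flat pixel index per voxel (-1 = invisible).
    Returns (N, C) voxel features; invisible voxels stay zero.
    """
    vox = np.zeros((lut.shape[0], img_feat.shape[1]), dtype=img_feat.dtype)
    visible = lut >= 0
    vox[visible] = img_feat[lut[visible]]            # single gather, no depth
    return vox
```

Because `build_lut` is computed offline, the on-vehicle cost of the transformation is a single indexed gather, which is what makes this style of projection attractive for deployment compared with depth-weighted lifting or attention-based alternatives.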