Wu Yiming, Li Ruixiang, Qin Zequn, Zhao Xinhai, Li Xi
IEEE Trans Image Process. 2025;34:689-700. doi: 10.1109/TIP.2024.3427701. Epub 2025 Jan 28.
Vision-based Bird's Eye View (BEV) representation is an emerging perception formulation for autonomous driving. The core challenge is constructing the BEV space from multi-camera features, which is a one-to-many, ill-posed problem. Examining previous BEV representation generation methods, we find that most fall into two types: modeling depths in the image views or modeling heights in the BEV space, mostly in an implicit way. In this work, we propose to model heights in the BEV space explicitly; compared with modeling depths, this needs no extra data such as LiDAR and fits arbitrary camera rigs and types. Theoretically, we prove the equivalence between height-based and depth-based methods. Given this equivalence and the advantages of modeling heights, we propose HeightFormer, which models heights and uncertainties in a self-recursive way. Without any extra data, HeightFormer estimates heights in BEV accurately. Benchmark results show that HeightFormer achieves state-of-the-art (SOTA) performance among camera-only methods.
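The link between heights in BEV space and depths in image views can be illustrated with a minimal geometric sketch. This is not the paper's proof or the HeightFormer implementation; it assumes a single ideal pinhole camera looking along the ego x-axis, and the names `PinholeCamera`, `height_to_depth`, and `depth_to_height` are hypothetical, introduced only for this example. Under those assumptions, a BEV cell (x, y) with a known height z maps to one pixel with one depth, and that pixel with that depth maps back to the same (x, y, z).

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class PinholeCamera:
    """Hypothetical ideal pinhole camera (assumption, not from the paper).

    The camera looks along the world x-axis. World frame: x forward,
    y left, z up. Camera frame: x right, y down, z forward.
    """
    f: float        # focal length in pixels
    u0: float       # principal point, horizontal pixel coordinate
    v0: float       # principal point, vertical pixel coordinate
    center: Vec3    # camera center in world coordinates


def height_to_depth(cam: PinholeCamera, x: float, y: float, z: float) -> Vec3:
    """Project a BEV cell (x, y) at height z to a pixel (u, v) and its depth d."""
    dx, dy, dz = x - cam.center[0], y - cam.center[1], z - cam.center[2]
    # World offsets expressed in the camera frame.
    xc, yc, zc = -dy, -dz, dx
    if zc <= 0:
        raise ValueError("point is not in front of the camera")
    u = cam.f * xc / zc + cam.u0
    v = cam.f * yc / zc + cam.v0
    return u, v, zc


def depth_to_height(cam: PinholeCamera, u: float, v: float, d: float) -> Vec3:
    """Back-project a pixel (u, v) with depth d to a world point (x, y, z)."""
    xc = (u - cam.u0) * d / cam.f
    yc = (v - cam.v0) * d / cam.f
    # Camera-frame offsets expressed back in the world frame.
    dx, dy, dz = d, -xc, -yc
    return cam.center[0] + dx, cam.center[1] + dy, cam.center[2] + dz


if __name__ == "__main__":
    cam = PinholeCamera(f=1000.0, u0=800.0, v0=450.0, center=(0.0, 0.0, 1.5))
    u, v, d = height_to_depth(cam, x=20.0, y=2.0, z=0.5)
    print("pixel and depth:", (u, v, d))
    print("recovered point:", depth_to_height(cam, u, v, d))
```

In this toy setup, changing only the height z of a fixed BEV cell moves the projected pixel vertically, which is one way to see why the BEV-to-image mapping is ambiguous until either a height or a depth is fixed.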