Jun Woomin, Lee Sungjin
Korea Electronics Technology Institute, Seongnam 13488, Republic of Korea.
Department of Smart Automotive, Soonchunhyang University, Asan 31538, Republic of Korea.
Sensors (Basel). 2025 Apr 4;25(7):2300. doi: 10.3390/s25072300.
This study addresses the optimization of a camera-based bird's eye view (BEV) segmentation technique that operates in real time within an embedded system environment while maintaining high accuracy despite limited computational resources. Specifically, it examines three technical approaches to BEV segmentation for autonomous driving: depth-based, MLP-based, and transformer-based methods, focusing on key techniques such as lift-splat-shoot, HDMapNet, and BEVFormer. A mathematical analysis of these methods is conducted, followed by a comparative performance evaluation on the nuScenes dataset. The optimization process was carried out in three stages: accuracy improvement, latency reduction, and model size optimization. In the first stage, the three modules for BEV segmentation (encoder, view transformation, and decoder) were selected with the goal of maximizing mIoU. In the second stage, environmental variables were optimized through input resolution adaptation and data augmentation to further improve accuracy. In the third stage, model compression was applied to minimize model size and latency for efficient deployment on embedded systems. Experimental results from the first stage show that the lift-splat-shoot view transformation model with an InternImage-B encoder and EfficientNet-B0 decoder achieved the highest performance, 54.9 mIoU, at an input image size of 448×800. Notably, the lift-splat-shoot model with an InternImage-T encoder and EfficientNet-B0 decoder reached 53.1 mIoU while remaining highly efficient (51.7 ms latency and a 159.5 MB model size). The second stage revealed that increasing the input resolution does not always improve accuracy; each model has an optimal resolution, and in this study the best performance was obtained at 448×800. In the third stage, FP16 quantization halved memory size and decreased latency while maintaining similar or identical mIoU. When deployed on the power-constrained NVIDIA AGX Orin device, energy efficiency improved, although latency increased under certain power supply conditions. Overall, the InternImage encoder-based lift-splat-shoot technique achieved the highest accuracy relative to latency and model size, outperforming the original method with a 29.2% higher mIoU at similar latency and a 32.2% smaller memory footprint.
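To make the depth-based pipeline concrete, the following is a minimal sketch of the "lift" step of lift-splat-shoot in PyTorch: each pixel's 2D feature is spread across a predicted categorical depth distribution via an outer product, producing frustum features that are later "splatted" onto the BEV grid. Layer sizes, the number of depth bins, and the stride-16 feature shape are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Minimal sketch of the lift-splat-shoot "lift" step. All sizes are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class LiftHead(nn.Module):
    """Predicts a per-pixel depth distribution and lifts 2D image features
    into a frustum of 3D features via an outer product."""
    def __init__(self, in_channels=256, feat_channels=64, num_depth_bins=41):
        super().__init__()
        self.num_depth_bins = num_depth_bins
        # One 1x1 conv jointly predicts depth logits and lifted features.
        self.head = nn.Conv2d(in_channels, num_depth_bins + feat_channels, 1)

    def forward(self, x):                                   # x: (B, C, H, W)
        out = self.head(x)
        depth_logits = out[:, :self.num_depth_bins]         # (B, D, H, W)
        feats = out[:, self.num_depth_bins:]                # (B, C', H, W)
        depth_prob = depth_logits.softmax(dim=1)
        # Outer product: each pixel's feature is weighted by the probability
        # of every depth bin -> (B, D, C', H, W) frustum features.
        return depth_prob.unsqueeze(2) * feats.unsqueeze(1)

feats = torch.randn(1, 256, 28, 50)   # e.g. a 448x800 input at stride 16
frustum = LiftHead()(feats)
print(frustum.shape)                  # torch.Size([1, 41, 64, 28, 50])
```

The subsequent "splat" step would accumulate these frustum features into BEV grid cells using the camera geometry, after which the decoder produces the segmentation map.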
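Since the stage-one comparisons are reported in mIoU, a small reference implementation of the metric may be helpful. The integer class-map inputs and the convention of skipping classes absent from both prediction and ground truth are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
# Mean intersection-over-union for BEV segmentation class maps.
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> torch.Tensor:
    """pred, target: (B, H, W) integer class maps on the BEV grid."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = (p | t).sum()
        if union > 0:                      # skip classes absent from both maps
            inter = (p & t).sum()
            ious.append(inter.float() / union.float())
    return torch.stack(ious).mean()

pred = torch.randint(0, 4, (1, 200, 200))  # e.g. a 200x200 BEV grid, 4 classes
target = torch.randint(0, 4, (1, 200, 200))
print(f"mIoU: {mean_iou(pred, target, num_classes=4):.3f}")
```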
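The third-stage FP16 quantization can be illustrated with a short PyTorch sketch: casting weights from 32-bit to 16-bit floats halves parameter memory, matching the roughly 50% reduction reported above. The tiny stand-in network and the PyTorch-only casting are assumptions; actual deployment on the AGX Orin would typically go through an embedded runtime such as TensorRT with FP16 enabled.

```python
# FP16 post-training conversion, sketched in PyTorch.
import torch.nn as nn

model = nn.Sequential(                  # hypothetical stand-in, not the paper's model
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 16, 1),
)

def param_megabytes(m: nn.Module) -> float:
    """Total parameter storage in MB."""
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6

fp32_mb = param_megabytes(model)
model = model.half()                    # cast every weight from FP32 to FP16
fp16_mb = param_megabytes(model)
print(f"{fp32_mb:.3f} MB -> {fp16_mb:.3f} MB")  # FP16 uses half the bytes
```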