Cheng Haixing, Liu Chengyong, Gu Wenzhe, Wu Yuyi, Zhao Mengye, Liu Wentao, Wang Naibang
China Coal Energy Research Institute Co., Ltd., Xi'an, Shaanxi Province, China.
School of Mechanical and Electrical Engineering, China University of Mining and Technology (Beijing), Beijing, China.
PLoS One. 2025 Sep 4;20(9):e0331195. doi: 10.1371/journal.pone.0331195. eCollection 2025.
Multi-modal data fusion plays a critical role in enhancing the accuracy and robustness of perception systems for autonomous driving, especially for the detection of small objects. However, small object detection remains particularly challenging due to sparse LiDAR points and low-resolution image features, which often lead to missed or imprecise detections. Many current methods process LiDAR point clouds and visible-light camera images separately and then fuse them in the detection head. These approaches often fail to fully exploit the advantages of multi-modal sensors and overlook the potential for strengthening the correlation between modalities before feature fusion. To address this, we propose a novel LiDAR-guided multi-modal fusion framework for object detection, called LGMMfusion. This framework leverages the depth information from LiDAR to guide the generation of image Bird's Eye View (BEV) features. Specifically, LGMMfusion promotes spatial interaction between point clouds and pixels before the fusion of LiDAR BEV and image BEV features, enabling the generation of higher-quality image BEV features. To better align image and LiDAR features, we incorporate a multi-head multi-scale self-attention mechanism and a multi-head adaptive cross-attention mechanism, using the prior depth information from point clouds to generate image BEV features that better match the spatial positions of LiDAR BEV features. Finally, the LiDAR BEV features and image BEV features are fused to provide enhanced features for the detection head. Experimental results show that LGMMfusion achieves 71.1% NDS and 67.3% mAP on the nuScenes validation set, while also improving small object detection and raising accuracy for most object categories.
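The following is a minimal, self-contained sketch (not the authors' released code) of the idea summarized in the abstract: a LiDAR depth prior guides attention that lifts image features into BEV space, and the resulting image BEV map is fused with the LiDAR BEV map before the detection head. All module names, tensor shapes, and layer choices (e.g. `DepthGuidedImageBEV`, `BEVFusion`, the use of `nn.MultiheadAttention` as a stand-in for the paper's multi-scale self-attention and adaptive cross-attention) are illustrative assumptions.

```python
# Hypothetical sketch of LiDAR-depth-guided image BEV generation and BEV fusion.
# This is an assumption-laden illustration of the abstract's pipeline, not LGMMfusion itself.
import torch
import torch.nn as nn


class DepthGuidedImageBEV(nn.Module):
    """Generate image BEV features with LiDAR-depth-guided attention (illustrative)."""

    def __init__(self, dim=128, num_heads=4, bev_h=32, bev_w=32):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # One learnable query per BEV grid cell (assumption).
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim) * 0.02)
        # Projects a LiDAR depth prior rasterised onto the BEV grid into the query
        # space, so queries carry the point-cloud geometry before attending to pixels.
        self.depth_prior_proj = nn.Linear(1, dim)
        # Stand-in for the paper's multi-head multi-scale self-attention.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stand-in for the paper's multi-head adaptive cross-attention to image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img_feats, lidar_depth_bev):
        # img_feats:       (B, N_img_tokens, C) flattened multi-camera features
        # lidar_depth_bev: (B, bev_h * bev_w, 1) per-cell depth/occupancy prior from LiDAR
        B = img_feats.shape[0]
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        q = q + self.depth_prior_proj(lidar_depth_bev)          # inject LiDAR depth prior
        q = self.norm1(q + self.self_attn(q, q, q)[0])          # refine BEV queries
        q = self.norm2(q + self.cross_attn(q, img_feats, img_feats)[0])  # gather image cues
        # Reshape query sequence into a BEV feature map: (B, C, bev_h, bev_w)
        return q.transpose(1, 2).reshape(B, -1, self.bev_h, self.bev_w)


class BEVFusion(nn.Module):
    """Fuse LiDAR BEV and image BEV features for the detection head (illustrative)."""

    def __init__(self, dim=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, lidar_bev, img_bev):
        return self.fuse(torch.cat([lidar_bev, img_bev], dim=1))


if __name__ == "__main__":
    B, C, H, W = 1, 128, 32, 32
    img_feats = torch.randn(B, 6 * 25 * 15, C)     # e.g. 6 cameras, flattened feature maps
    depth_prior = torch.rand(B, H * W, 1)          # rasterised LiDAR depth prior on the BEV grid
    lidar_bev = torch.randn(B, C, H, W)            # LiDAR BEV features from a point-cloud branch
    img_bev = DepthGuidedImageBEV(C, 4, H, W)(img_feats, depth_prior)
    fused = BEVFusion(C)(lidar_bev, img_bev)
    print(fused.shape)                             # torch.Size([1, 128, 32, 32])
```

The key design point mirrored here is the ordering: the LiDAR prior conditions the image-to-BEV lifting before fusion, rather than the two modalities only meeting in the detection head.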