Shi Huisheng, Wang Xin, Zhao Jianghong, Hua Xinnan
Department of Remote Sensing Engineering, Henan College of Surveying and Mapping, Zhengzhou 451464, China.
Beiqi Foton Motor Co., Ltd., Beijing 102206, China.
Sensors (Basel). 2025 Apr 14;25(8):2474. doi: 10.3390/s25082474.
To bridge the modality gap between camera images and LiDAR point clouds in autonomous driving systems, a critical challenge exacerbated by current fusion methods' inability to effectively integrate cross-modal features, we propose the Cross-Modal Fusion (CMF) framework. This attention-driven architecture enables hierarchical multi-sensor data fusion and achieves state-of-the-art performance on semantic segmentation tasks. The CMF framework first projects the point clouds into camera coordinates via perspective projection, providing spatio-depth information for the RGB images. A two-stream feature extraction network then extracts features from the two modalities separately, and multilevel fusion of the two modalities is realized by a residual fusion module (RCF) with cross-modal attention. Finally, we design a perceptual alignment loss that integrates cross-entropy with feature-matching terms, effectively minimizing the semantic discrepancy between the camera and LiDAR representations during fusion. Experimental results on the SemanticKITTI and nuScenes benchmark datasets demonstrate that the CMF method achieves mean intersection over union (mIoU) scores of 64.2% and 79.3%, respectively, outperforming existing state-of-the-art methods in accuracy and exhibiting greater robustness in complex scenarios. Ablation studies further validate that enhancing feature interaction and fusion through cross-modal attention and the perceptually guided cross-entropy loss (Pgce) is effective in improving segmentation accuracy and robustness.
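To illustrate the first step, the sketch below shows a standard LiDAR-to-camera perspective projection in the KITTI calibration convention, assuming a 4x4 extrinsic matrix `T_cam_lidar` and a 3x4 camera projection matrix `P` (hypothetical names; the abstract does not specify the calibration format).

```python
# Minimal sketch of projecting LiDAR points into the image plane to build a
# sparse spatio-depth channel for the RGB stream. Assumes KITTI-style
# calibration matrices; names are illustrative, not from the paper.
import numpy as np

def project_lidar_to_image(points, T_cam_lidar, P, img_h, img_w):
    """Project N x 3 LiDAR points into pixel coordinates with depth."""
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])   # homogeneous coords, N x 4
    pts_cam = (T_cam_lidar @ pts_h.T).T            # transform into camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]           # keep points in front of the camera
    uvw = (P @ pts_cam.T).T                        # perspective projection
    uv = uvw[:, :2] / uvw[:, 2:3]                  # normalize by depth
    depth = pts_cam[:, 2]
    # keep only projections that land inside the image
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv[valid].astype(int), depth[valid]
```

The returned (u, v) pixels and depths can be rasterized into a sparse depth map and concatenated with the RGB image before feature extraction.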
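For the fusion step, the following PyTorch sketch shows one plausible realization of a residual fusion block with cross-modal attention in the spirit of the RCF module; the exact layer layout is an assumption, and only the core idea (attention-gated LiDAR features added residually to the camera stream) follows the abstract.

```python
# A hedged sketch of a residual cross-modal fusion block. The specific
# convolution/gating design is assumed for illustration.
import torch
import torch.nn as nn

class ResidualCrossModalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # attention map computed from the concatenated modalities
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat, lidar_feat):
        # rgb_feat, lidar_feat: B x C x H x W feature maps at the same scale
        gate = self.attn(torch.cat([rgb_feat, lidar_feat], dim=1))
        # residual connection: camera stream plus attention-gated LiDAR stream
        return rgb_feat + self.proj(gate * lidar_feat)
```

Applied at several encoder stages of the two-stream network, blocks like this realize the multilevel fusion described in the abstract.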
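Finally, a minimal sketch of a perceptual alignment loss combining cross-entropy with a feature-matching term. The weighting factor `lam` and the use of an L2 feature distance are assumptions; the abstract only states that the two terms are integrated.

```python
# Hedged sketch: segmentation cross-entropy plus a feature-matching penalty
# that pulls camera and LiDAR representations toward each other.
import torch
import torch.nn.functional as F

def perceptual_alignment_loss(logits, labels, rgb_feat, lidar_feat, lam=0.1):
    """logits: B x C x H x W fused predictions; labels: B x H x W class ids."""
    ce = F.cross_entropy(logits, labels, ignore_index=255)  # segmentation term
    fm = F.mse_loss(rgb_feat, lidar_feat)                   # feature-matching term
    return ce + lam * fm
```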