Chen Zhao, Hu Bin-Jie, Luo Chengxi, Chen Guohao, Zhu Haohui
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510640, China.
Sci Rep. 2024 Oct 8;14(1):23492. doi: 10.1038/s41598-024-74679-9.
Fusing information from LiDAR and cameras can effectively enhance the overall perception capability of autonomous vehicles in various scenarios. Although point-wise fusion and Bird's-Eye-View (BEV) fusion achieve relatively good results, they still cannot fully exploit image information and lack effective depth information. In these fusion methods, the multi-modal features are first concatenated along the channel dimension, and the fused features are then extracted with convolutional layers. This type of fusion is effective but too coarse: the fused features cannot focus on regions with important features and suffer from severe noise. To tackle these issues, we propose a Dense Projection Fusion (DPFusion) approach. It consists of two new modules: a dense depth map guided BEV transform (DGBT) module and a multi-modal feature adaptive fusion (MFAF) module. The DGBT module first quickly estimates the depth of each pixel and then projects all image features into the BEV space, making full use of the image information. The MFAF module computes an image weight and a point cloud weight for each channel in each BEV grid and then adaptively weights and fuses the image BEV features with the point cloud BEV features. Notably, the MFAF module makes the fused features pay more attention to background and object outlines. Our proposed DPFusion demonstrates competitive results in 3D object detection, achieving a mean Average Precision (mAP) of 70.4 and a nuScenes detection score (NDS) of 72.3 on the nuScenes validation set.
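The abstract does not include implementation details of the DGBT module; the following is a minimal PyTorch sketch of the general idea it describes (a per-pixel depth distribution used to lift every image feature into a 3D frustum before BEV pooling, in the spirit of lift-splat methods). The module name, shapes, depth-bin count, and the softmax depth head are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthGuidedLift(nn.Module):
    """Hypothetical sketch of a DGBT-style step: predict a depth
    distribution for each pixel and use it to lift all image features
    into a camera frustum, which a voxel-pooling step (omitted here)
    would then flatten into BEV features."""

    def __init__(self, channels: int, depth_bins: int):
        super().__init__()
        # 1x1 conv head predicting a categorical depth distribution per pixel.
        self.depth_head = nn.Conv2d(channels, depth_bins, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features from the camera backbone.
        depth = F.softmax(self.depth_head(feat), dim=1)   # (B, D, H, W)
        # Outer product: each pixel's feature is spread across its depth
        # bins, so *all* image features reach 3D space, not only the
        # pixels that happen to coincide with sparse LiDAR returns.
        frustum = depth.unsqueeze(1) * feat.unsqueeze(2)  # (B, C, D, H, W)
        return frustum
```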
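Similarly, the MFAF module is only described at the level of "per-channel, per-grid weights for each modality"; below is a minimal sketch of one plausible realization, where a small conv predicts sigmoid gates from the concatenated BEV features. The gating layer and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveBEVFusion(nn.Module):
    """Hypothetical MFAF-style fusion: predict a weight for every
    channel of every BEV grid cell in each modality, then take a
    weighted sum of the image and point-cloud BEV features."""

    def __init__(self, channels: int):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1),
            nn.Sigmoid(),  # assumption: gates in [0, 1] per channel and cell
        )

    def forward(self, img_bev: torch.Tensor, pts_bev: torch.Tensor) -> torch.Tensor:
        # img_bev, pts_bev: (B, C, H, W) BEV feature maps from each modality.
        weights = self.weight_net(torch.cat([img_bev, pts_bev], dim=1))
        w_img, w_pts = weights.chunk(2, dim=1)        # (B, C, H, W) each
        return w_img * img_bev + w_pts * pts_bev      # adaptively weighted fusion
```

Compared with plain channel concatenation followed by convolution, this kind of explicit gating lets the fused map suppress noisy grid cells in one modality while emphasizing informative regions (e.g., object outlines) in the other.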