Sharma Sachin, Meyer Richard T, Asher Zachary D
Department of Mechanical and Aerospace Engineering, Western Michigan University, 1903 West Michigan Ave, Kalamazoo, MI 49008, USA.
Sensors (Basel). 2024 Sep 9;24(17):5841. doi: 10.3390/s24175841.
Current state-of-the-art (SOTA) LiDAR-only detectors perform well on 3D object detection tasks, but point cloud data are typically sparse and lack semantic information. Detailed semantic information obtained from camera images can be fused with existing LiDAR-based detectors to create a robust 3D detection pipeline. Given two different data types, a major challenge in developing multi-modal sensor fusion networks is achieving effective data fusion while managing computational resources. With separate 2D and 3D feature extraction backbones, feature fusion can become more challenging, as these modalities generate different gradients, leading to gradient conflicts and suboptimal convergence during network optimization. To this end, we propose a 3D object detection method, Attention-Enabled Point Fusion (AEPF). AEPF takes images and voxelized point cloud data as inputs and outputs estimated 3D bounding boxes for object locations. An attention mechanism is introduced into an existing feature fusion strategy to improve 3D detection accuracy, and two variants are proposed. These two variants, AEPF-Small and AEPF-Large, address different needs. AEPF-Small, with a lightweight attention module and fewer parameters, offers fast inference. AEPF-Large, with a more complex attention module and more parameters, provides higher accuracy than baseline models. Experimental results on the KITTI validation set show that AEPF-Small maintains SOTA 3D detection accuracy while running inference at higher speeds. AEPF-Large achieves mean average precision scores of 91.13, 79.06, and 76.15 for the car class's easy, moderate, and hard targets, respectively, on the KITTI validation set. Results from ablation experiments are also presented to support the choice of model architecture.
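The attention-enabled fusion idea described above can be illustrated with a minimal channel-attention gate over concatenated per-point LiDAR and image features. This is a hypothetical NumPy sketch, not the actual AEPF module: the function name `channel_attention_fusion`, the feature dimensions, and the random stand-in weights are all illustrative assumptions (in the real network, the attention weights are learned).

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention_fusion(point_feats, image_feats):
    """Fuse per-point LiDAR features with sampled image features via a
    simple squeeze-and-excite style channel gate (illustrative sketch only).

    point_feats: (N, Cp) array of voxel/point features.
    image_feats: (N, Ci) array of image features sampled at point locations.
    Returns a (N, Cp + Ci) fused, channel-reweighted feature array.
    """
    fused = np.concatenate([point_feats, image_feats], axis=1)  # (N, Cp+Ci)
    # Squeeze: average over points to get one descriptor per channel.
    squeeze = fused.mean(axis=0)                                # (Cp+Ci,)
    # Excite: a tiny two-layer MLP with random stand-in weights
    # (learned parameters in a real network).
    c = squeeze.shape[0]
    w1 = rng.standard_normal((c, c // 2)) * 0.1
    w2 = rng.standard_normal((c // 2, c)) * 0.1
    hidden = np.maximum(squeeze @ w1, 0.0)                      # ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))                 # sigmoid, in (0, 1)
    # Reweight each channel before a detection head would consume it.
    return fused * gate

# Toy inputs: 128 points, 64 LiDAR channels, 32 image channels.
points = rng.standard_normal((128, 64))
pixels = rng.standard_normal((128, 32))
out = channel_attention_fusion(points, pixels)
print(out.shape)  # (128, 96)
```

Because the sigmoid gate lies in (0, 1), each fused channel is attenuated rather than amplified; a learned gate lets the network emphasize image-derived semantic channels where LiDAR returns are sparse.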