

Fusing Residual and Cascade Attention Mechanisms in Voxel-RCNN for 3D Object Detection.

Authors

Lu You, Zhang Yuwei, Fan Xiangsuo, Cai Dengsheng, Gong Rui

Affiliations

School of Automation, Guangxi University of Science and Technology, Liuzhou 545000, China.

Liugong Machinery Co., Ltd., Liuzhou 545000, China.

Publication

Sensors (Basel). 2025 Sep 4;25(17):5497. doi: 10.3390/s25175497.

Abstract

In this paper, a high-precision 3D object detector, Voxel-RCNN, is adopted as the baseline detector, and an improved detector named RCAVoxel-RCNN is proposed. To address various issues present in current mainstream 3D point cloud voxelisation methods, such as the suboptimal performance of Region Proposal Networks (RPNs) in generating candidate regions and the inadequate detection of small-scale objects caused by overly deep convolutional layers in both the 3D and 2D backbone networks, this paper proposes a Cascade Attention Network (CAN). The CAN is designed to progressively refine and enhance the proposed regions, thereby producing more accurate detection results. Furthermore, a 3D Residual Network is introduced, which improves the representation of small objects by reducing the number of convolutional layers while incorporating residual connections. In the Bird's-Eye View (BEV) feature extraction network, a Residual Attention Network (RAN) is developed. This follows a similar approach to the aforementioned 3D backbone network, leveraging the spatial awareness capabilities of the BEV. Additionally, the Squeeze-and-Excitation (SE) attention mechanism is incorporated to assign dynamic weights to features, allowing the network to focus more effectively on informative features. Experimental results on the KITTI validation dataset demonstrate the effectiveness of the proposed method, with detection accuracy for cars, pedestrians, and bicycles improving by 3.34%, 10.75%, and 4.61%, respectively, at the KITTI hard difficulty level. The primary evaluation metric adopted is the 3D Average Precision (AP), computed over 40 recall positions (R40). The Intersection over Union (IoU) thresholds used are 0.7 for cars and 0.5 for both pedestrians and bicycles.
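The SE attention mechanism mentioned above reweights feature channels by their global importance. The following is a minimal NumPy sketch of the generic SE operation (squeeze by global average pooling, excitation through a bottleneck, sigmoid gating); the channel count, reduction ratio, and random weights are illustrative assumptions, not the paper's RAN implementation.

```python
import numpy as np

def se_attention(feat, w1, w2):
    """Generic SE channel attention sketch.
    feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) are the
    excitation weights (bias terms omitted for brevity)."""
    # Squeeze: global average pooling over spatial dims -> per-channel descriptor
    z = feat.mean(axis=(1, 2))                    # (C,)
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gives channel weights
    s = np.maximum(w1 @ z, 0.0)                   # (C//r,)
    w = 1.0 / (1.0 + np.exp(-(w2 @ s)))           # (C,), each in (0, 1)
    # Reweight: scale every channel of the input by its learned weight
    return feat * w[:, None, None]

rng = np.random.default_rng(0)
C, r = 8, 4                                       # illustrative sizes
feat = rng.standard_normal((C, 16, 16))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = se_attention(feat, w1, w2)
print(out.shape)                                  # (8, 16, 16)
```

The output keeps the input's shape; only the relative scale of the channels changes, which is what lets the network emphasise informative features.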


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d08e/12431463/284d82f6314a/sensors-25-05497-g001.jpg
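The R40 metric used in the abstract can be sketched as interpolated Average Precision sampled at 40 evenly spaced recall levels, as in the KITTI protocol. The precision/recall values below are toy numbers chosen for illustration, not results from the paper.

```python
import numpy as np

def ap_r40(recall, precision):
    """AP interpolated at 40 recall positions r = 1/40, 2/40, ..., 1."""
    recall_levels = np.linspace(1 / 40, 1.0, 40)
    ap = 0.0
    for r in recall_levels:
        # Interpolated precision: max precision at any recall >= r (0 if none)
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 40

# Toy PR curve: perfect precision up to recall 0.5, no detections beyond it
recall = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
precision = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
print(ap_r40(recall, precision))                  # 0.5
```

With perfect precision over half the recall range, 20 of the 40 sampled levels contribute 1.0 and the rest 0, giving an AP of 0.5.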
