Yu Hongqi, Zhang Xiaoqin, Zhou Xiaolong, Chan Sixian
Key Laboratory of Intelligent Informatics for Safety & Emergency of Zhejiang Province, Wenzhou University, Wenzhou, 325035, Zhejiang, China.
The College of Electrical and Information Engineering, Quzhou University, Quzhou, 324000, Zhejiang, China.
Neural Netw. 2025 Nov;191:107818. doi: 10.1016/j.neunet.2025.107818. Epub 2025 Jul 5.
3D object detection is crucial for autonomous driving, enabling accurate object classification and localization in the real world. Existing methods typically rely on basic element-wise operations to fuse multi-modal features from point clouds and images, limiting the effective learning of camera semantics and LiDAR spatial information. Additionally, the inherent sparsity of point clouds leads to distribution imbalances in receptive fields, and the complexity of 3D objects conceals implicit relational contexts. To address these limitations, we propose CIDRA-Net, a cross-modal interaction fusion network with distribution-relation awareness. First, we introduce a region cross-modal interaction fusion (RCIF) module that combines LiDAR features with camera depth information through dual-modal attention. We then separate and enhance two distribution-level features using a dual-branch distribution perception (DBDP) module to learn point distributions. Additionally, a global-local relation mining (GLRM) strategy is employed to capture both local and global contextual information for better object understanding and refined regression tasks. Our approach achieves state-of-the-art performance on the nuScenes and KITTI benchmarks while demonstrating strong generalization across backbones and robustness against sensor errors.
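The abstract does not include implementation details, but a minimal sketch can illustrate the kind of dual-modal attention fusion it describes for the RCIF module. Everything below is an assumption for illustration only: the class name DualModalAttentionFusion, the tensor shapes, and the use of PyTorch's nn.MultiheadAttention are not taken from the paper, and the actual RCIF design may differ substantially.

import torch
import torch.nn as nn


class DualModalAttentionFusion(nn.Module):
    """Hypothetical fusion block: LiDAR tokens attend to camera tokens
    (and vice versa), and the two attended views are merged on the
    LiDAR stream with a residual connection."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention in both directions between the two modalities.
        self.lidar_to_cam = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cam_to_lidar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_feat: torch.Tensor, cam_feat: torch.Tensor) -> torch.Tensor:
        # lidar_feat: (B, N_lidar, C) flattened LiDAR (e.g. BEV) tokens
        # cam_feat:   (B, N_cam, C)   flattened camera tokens carrying depth cues
        lidar_attended, _ = self.lidar_to_cam(lidar_feat, cam_feat, cam_feat)
        cam_attended, _ = self.cam_to_lidar(cam_feat, lidar_feat, lidar_feat)
        # Aligning camera tokens to LiDAR resolution is nontrivial in practice;
        # here we simply mean-pool them into a global camera context.
        cam_context = cam_attended.mean(dim=1, keepdim=True).expand_as(lidar_attended)
        fused = self.fuse(torch.cat([lidar_attended, cam_context], dim=-1))
        return self.norm(fused + lidar_feat)  # residual on the LiDAR stream


if __name__ == "__main__":
    block = DualModalAttentionFusion(dim=256, num_heads=8)
    lidar = torch.randn(2, 1024, 256)  # e.g. flattened BEV grid tokens
    cam = torch.randn(2, 600, 256)     # e.g. flattened image feature tokens
    print(block(lidar, cam).shape)     # torch.Size([2, 1024, 256])

The design choice sketched here, keeping the LiDAR stream as the residual backbone and injecting camera information as attended context, is a common pattern in LiDAR-camera fusion; whether CIDRA-Net follows it is not stated in the abstract.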