


Applying auxiliary supervised depth-assisted transformer and cross modal attention fusion in monocular 3D object detection.

Authors

Wang Zhijian, Liu Jie, Sun Yixiao, Zhou Xiang, Sun Boyan, Kong Dehong, Xu Jay, Yue Xiaoping, Zhang Wenyu

Affiliations

School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China.

Anshan Power Supply Company, Liaoning Electric Power Limited Company of State Grid, Anshan, Liaoning, China.

Publication

PeerJ Comput Sci. 2025 Jan 28;11:e2656. doi: 10.7717/peerj-cs.2656. eCollection 2025.

DOI: 10.7717/peerj-cs.2656
PMID: 39896000
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11784746/
Abstract

Monocular 3D object detection is the most widely applied and challenging solution for autonomous driving, since 2D images lack 3D information. Existing methods are limited by inaccurate depth estimations caused by inequivalent supervision targets. Using both depth and visual features also raises the problem of heterogeneous fusion. In this article, we propose the Depth Detection Transformer (Depth-DETR), applying an auxiliary supervised depth-assisted transformer and cross-modal attention fusion to monocular 3D object detection. Depth-DETR introduces two additional depth encoders alongside the visual encoder. The two depth encoders are supervised by ground-truth depth and bounding boxes respectively, working independently to complement each other's limitations and predict more accurate target distances. Furthermore, Depth-DETR employs cross-modal attention mechanisms to effectively fuse the three different features. A parallel structure of two cross-modal transformers fuses the two depth features with the visual features. Avoiding early fusion between the two depth features strengthens the final fused feature, yielding better feature representations. Across multiple experimental validations, the Depth-DETR model achieved highly competitive results on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, with an AP score of 17.49, demonstrating its outstanding performance in 3D object detection.
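The fusion scheme the abstract describes — two depth streams each cross-attending with the visual stream in parallel, then merged only at the end — can be sketched with plain scaled dot-product attention. This is a minimal illustration under stated assumptions, not the authors' implementation: the token/feature shapes, the absence of learned projection matrices, and the elementwise-sum merge are all simplifications introduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query, key_value):
    """Scaled dot-product attention: `query` tokens attend to `key_value` tokens.

    (Projection matrices W_q, W_k, W_v are omitted for brevity.)
    """
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)  # (Nq, Nkv) similarities
    return softmax(scores) @ key_value         # (Nq, d) attended features

rng = np.random.default_rng(0)
n_tokens, d = 8, 16
visual    = rng.standard_normal((n_tokens, d))  # visual encoder output
depth_gt  = rng.standard_normal((n_tokens, d))  # depth encoder supervised by ground-truth depth
depth_box = rng.standard_normal((n_tokens, d))  # depth encoder supervised by bounding boxes

# Parallel branches: each depth feature is fused with the visual feature
# separately, so the two depth features never mix before the final merge.
branch_a = cross_modal_attention(visual, depth_gt)
branch_b = cross_modal_attention(visual, depth_box)
fused = branch_a + branch_b  # late merge (elementwise sum is an assumption)

print(fused.shape)  # (8, 16)
```

The key design point mirrored here is the late merge: fusing each depth stream with the visual stream independently avoids the "early fusion" between the two heterogeneous depth features that the paper identifies as harmful.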


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe4/11784746/76ecd25e7f07/peerj-cs-11-2656-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe4/11784746/2f4509b27ade/peerj-cs-11-2656-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe4/11784746/c0c072bb780c/peerj-cs-11-2656-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fe4/11784746/41dd5c153688/peerj-cs-11-2656-g004.jpg

Similar Articles

1. Applying auxiliary supervised depth-assisted transformer and cross modal attention fusion in monocular 3D object detection.
   PeerJ Comput Sci. 2025 Jan 28;11:e2656. doi: 10.7717/peerj-cs.2656. eCollection 2025.
2. A 3D hierarchical cross-modality interaction network using transformers and convolutions for brain glioma segmentation in MR images.
   Med Phys. 2024 Nov;51(11):8371-8389. doi: 10.1002/mp.17354. Epub 2024 Aug 13.
3. MonoAux: Fully Exploiting Auxiliary Information and Uncertainty for Monocular 3D Object Detection.
   Cyborg Bionic Syst. 2024 Mar 27;5:0097. doi: 10.34133/cbsystems.0097. eCollection 2024.
4. GFI-Net: Global Feature Interaction Network for Monocular Depth Estimation.
   Entropy (Basel). 2023 Feb 26;25(3):421. doi: 10.3390/e25030421.
5. NAN-DETR: noising multi-anchor makes DETR better for object detection.
   Front Neurorobot. 2024 Oct 14;18:1484088. doi: 10.3389/fnbot.2024.1484088. eCollection 2024.
6. Focal DETR: Target-Aware Token Design for Transformer-Based Object Detection.
   Sensors (Basel). 2022 Nov 10;22(22):8686. doi: 10.3390/s22228686.
7. GDVIFNet: A generated depth and visible image fusion network with edge feature guidance for salient object detection.
   Neural Netw. 2025 Aug;188:107445. doi: 10.1016/j.neunet.2025.107445. Epub 2025 Apr 5.
8. Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation.
   Sensors (Basel). 2024 Dec 26;25(1):80. doi: 10.3390/s25010080.
9. FSH-DETR: An Efficient End-to-End Fire Smoke and Human Detection Based on a Deformable DEtection TRansformer (DETR).
   Sensors (Basel). 2024 Jun 23;24(13):4077. doi: 10.3390/s24134077.
10. CBG-Net: Cross-modality and cross-scale balance network with global semantics for multi-modal 3D object detection.
   Neural Netw. 2024 Nov;179:106535. doi: 10.1016/j.neunet.2024.106535. Epub 2024 Jul 14.

References Cited in This Article

1. KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D.
   IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3292-3310. doi: 10.1109/TPAMI.2022.3179507. Epub 2023 Feb 3.
2. FCOS: A Simple and Strong Anchor-Free Object Detector.
   IEEE Trans Pattern Anal Mach Intell. 2022 Apr;44(4):1922-1933. doi: 10.1109/TPAMI.2020.3032166. Epub 2022 Mar 4.
3. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
   IEEE Trans Pattern Anal Mach Intell. 2017 Jun;39(6):1137-1149. doi: 10.1109/TPAMI.2016.2577031. Epub 2016 Jun 6.