Wang Zhijian, Liu Jie, Sun Yixiao, Zhou Xiang, Sun Boyan, Kong Dehong, Xu Jay, Yue Xiaoping, Zhang Wenyu
School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China.
Anshan Power Supply Company, Liaoning Electric Power Limited Company of State Grid, Anshan, Liaoning, China.
PeerJ Comput Sci. 2025 Jan 28;11:e2656. doi: 10.7717/peerj-cs.2656. eCollection 2025.
Monocular 3D object detection is the most widely applied and challenging solution for autonomous driving, because 2D images lack 3D information. Existing methods are limited by inaccurate depth estimation caused by inequivalent supervision targets, and the joint use of depth and visual features also raises the problem of heterogeneous fusion. In this article, we propose the Depth Detection Transformer (Depth-DETR), which applies an auxiliary-supervised, depth-assisted transformer and cross-modal attention fusion to monocular 3D object detection. Depth-DETR introduces two additional depth encoders alongside the visual encoder. The two depth encoders are supervised by ground-truth depth and bounding boxes respectively; working independently, they complement each other's limitations and predict more accurate target distances. Furthermore, Depth-DETR employs cross-modal attention to fuse the three different features effectively: a parallel structure of two cross-modal transformers fuses each depth feature with the visual features. Avoiding early fusion between the two depth features strengthens the final fused feature and yields better representations. Across multiple experimental validations, the Depth-DETR model achieves highly competitive results on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, with an AP score of 17.49, demonstrating its outstanding performance in 3D object detection.
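The parallel fusion scheme described above can be sketched as follows. This is a minimal, illustrative NumPy mock-up, not the authors' implementation: the function names, single-head attention, token counts, and the concatenation step used for late fusion are all assumptions introduced here to show the structure (visual queries attending separately over each depth stream, with no early depth-depth mixing).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, kv_feats):
    # Single-head scaled dot-product attention: queries come from one
    # modality, keys/values from the other (illustrative simplification).
    d = query_feats.shape[-1]
    scores = query_feats @ kv_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_feats

rng = np.random.default_rng(0)
n_tokens, d = 8, 16  # hypothetical token count and feature width
visual = rng.standard_normal((n_tokens, d))
depth_gt = rng.standard_normal((n_tokens, d))   # stand-in for the encoder supervised by ground-truth depth
depth_box = rng.standard_normal((n_tokens, d))  # stand-in for the encoder supervised by bounding boxes

# Parallel cross-modal transformers: each depth stream is fused with the
# visual features independently, avoiding early fusion between the two
# depth features.
fused_gt = cross_modal_attention(visual, depth_gt)
fused_box = cross_modal_attention(visual, depth_box)

# Late combination of the two fused streams (concatenation assumed here).
fused = np.concatenate([fused_gt, fused_box], axis=-1)
```

In this sketch the two depth streams only meet after each has been fused with the visual features, mirroring the paper's claim that skipping early depth-depth fusion improves the final representation.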