Liu Xuanxuan, Tang Shuai, Feng Mengdie, Guo Xueqi, Zhang Yanru, Wang Yan
Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, 518000, Guangdong, Shenzhen, China.
School of Future Technology, South China University of Technology, 511442, Guangdong, Guangzhou, China.
Sci Rep. 2025 Apr 15;15(1):12860. doi: 10.1038/s41598-025-97568-1.
Monocular depth estimation plays a crucial role in many downstream visual tasks. Although research on monocular depth estimation is relatively mature, existing methods commonly increase computational complexity and parameter counts to achieve superior performance. In practical applications especially, improving depth-prediction accuracy while preserving computational efficiency remains a challenging problem. To address this challenge, we propose SimMDE, a simple and novel depth estimation model that treats monocular depth estimation as an ordinal regression problem. Starting from a baseline encoder, our model is equipped with a Deformable Cross-Attention Feature Fusion (DCF) decoder with sparse attention. This decoder efficiently integrates multi-scale feature maps and markedly reduces the quadratic complexity of the Transformer. To extract finer local features, we propose a Local Multi-dimensional Convolutional Attention (LMC) module; in addition, we propose a Wavelet Attention Transformer (WAT) module to achieve pixel-level precise classification of images. We conduct extensive experiments on two widely recognized depth estimation benchmark datasets, NYU and KITTI. The results demonstrate that our model attains high depth-estimation accuracy while maintaining computational efficiency. Notably, SimMDE, which extends AdaBins, reduces the absolute relative error (AbsRel) by 11.7% on NYU and 10.3% on KITTI, with fewer parameters.
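To make the "depth estimation as ordinal regression" framing concrete, the following is a minimal sketch of bin-based depth decoding in the spirit of AdaBins-style methods: the network predicts, per pixel, a distribution over ordered depth bins, and the final depth is the probability-weighted average of the bin centers. The bin count, bin range, and softmax head here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def depth_from_bins(logits, bin_centers):
    """Convert per-pixel bin logits (H, W, K) into a depth map (H, W).

    Each pixel's depth is the probability-weighted average of the K
    candidate depth-bin centers: a soft argmax over ordered bins.
    """
    # Softmax over the bin dimension -> per-pixel distribution over bins.
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Expected depth under that per-pixel distribution.
    return (p * bin_centers).sum(axis=-1)

# Toy usage: a 2x2 "image" with 4 depth bins between 0.5 m and 10 m.
bin_centers = np.linspace(0.5, 10.0, 4)   # (K,) bin-center depths in metres
logits = np.zeros((2, 2, 4))              # uniform logits everywhere ...
logits[0, 0, 3] = 10.0                    # ... except pixel (0,0): strongly "far"
depth = depth_from_bins(logits, bin_centers)
# Pixel (0,0) is pulled toward the farthest bin center (~10 m), while
# uniform pixels land at the mean of the bin centers (5.25 m).
```

The soft (expectation-based) decoding keeps the prediction differentiable and respects the ordering of the bins, which is what distinguishes this formulation from plain per-pixel classification.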