Deep Monocular Depth Estimation Based on Content and Contextual Features.

Affiliation

Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili, Campus Sescelades, Avinguda dels Països Catalans, 26, 43007 Tarragona, Spain.

Publication information

Sensors (Basel). 2023 Mar 8;23(6):2919. doi: 10.3390/s23062919.

Abstract

Recently, significant progress has been achieved in developing deep learning-based approaches for estimating depth maps from monocular images. However, many existing methods rely on content and structure information extracted from RGB photographs, which often results in inaccurate depth estimation, particularly for regions with low texture or occlusions. To overcome these limitations, we propose a novel method that exploits contextual semantic information to predict precise depth maps from monocular images. Our approach leverages a deep autoencoder network incorporating high-quality semantic features from the state-of-the-art HRNet-v2 semantic segmentation model. By feeding these features into the autoencoder network, our method can effectively preserve the discontinuities of the depth images and enhance monocular depth estimation. Specifically, we exploit the semantic features related to the localization and boundaries of the objects in the image to improve the accuracy and robustness of the depth estimation. To validate the effectiveness of our approach, we tested our model on two publicly available datasets, NYU Depth v2 and SUN RGB-D. Our method outperformed several state-of-the-art monocular depth estimation techniques, achieving an accuracy of 85% while reducing the Rel error to 0.12, the RMS error to 0.523, and the log10 error to 0.0527. Our approach also demonstrated exceptional performance in preserving object boundaries and faithfully detecting small object structures in the scene.
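
The abstract quotes Rel, RMS, log10, and an accuracy figure without defining them. The definitions below are the standard monocular depth estimation metrics used with NYU Depth v2, which the reported values (0.12, 0.523, 0.0527, and 85%) are assumed here to follow; d_i denotes the ground-truth depth and \hat{d}_i the predicted depth at pixel i, over N valid pixels.

% Standard per-image depth metrics (assumed NYU Depth v2 evaluation protocol;
% the abstract itself does not spell these definitions out).
\begin{align*}
  \mathrm{Rel}   &= \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert d_i - \hat{d}_i \rvert}{d_i} \\
  \mathrm{RMS}   &= \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl( d_i - \hat{d}_i \bigr)^2} \\
  \mathrm{log10} &= \frac{1}{N} \sum_{i=1}^{N} \bigl\lvert \log_{10} d_i - \log_{10} \hat{d}_i \bigr\rvert \\
  \text{Accuracy} &= \frac{1}{N} \Bigl\lvert \Bigl\{\, i : \max\!\Bigl( \tfrac{d_i}{\hat{d}_i}, \tfrac{\hat{d}_i}{d_i} \Bigr) < \mathrm{thr} \,\Bigr\} \Bigr\rvert ,
  \qquad \mathrm{thr} \in \{ 1.25,\ 1.25^2,\ 1.25^3 \}
\end{align*}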

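To make the described fusion concrete, the PyTorch sketch below conditions a small depth autoencoder on an externally supplied semantic feature map, standing in for the HRNet-v2 output the paper uses. All layer sizes, the 64-channel semantic input, and the fusion-by-concatenation step are illustrative assumptions, not the authors' architecture.

# Minimal sketch only: a depth encoder-decoder conditioned on semantic features
# from an external segmentation model. A plain tensor stands in for the
# HRNet-v2 features; channel counts and layer choices are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthAutoencoder(nn.Module):
    def __init__(self, sem_channels: int = 64):
        super().__init__()
        # RGB encoder: downsample twice, 3 -> 32 -> 64 channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Fuse encoder features with the (resized) semantic features.
        self.fuse = nn.Conv2d(64 + sem_channels, 64, 1)
        # Decoder: upsample back to input resolution, predict a 1-channel depth map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, rgb: torch.Tensor, semantic_features: torch.Tensor) -> torch.Tensor:
        x = self.encoder(rgb)
        # Match the semantic map to the encoder's spatial size, then concatenate.
        sem = F.interpolate(semantic_features, size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        x = F.relu(self.fuse(torch.cat([x, sem], dim=1)))
        return self.decoder(x)


if __name__ == "__main__":
    rgb = torch.randn(1, 3, 480, 640)    # NYU Depth v2 frame size
    sem = torch.randn(1, 64, 120, 160)   # stand-in for HRNet-v2 features
    depth = DepthAutoencoder(sem_channels=64)(rgb, sem)
    print(depth.shape)                   # torch.Size([1, 1, 480, 640])

Concatenation at the bottleneck is only one plausible way to inject the semantic cues about object location and boundaries; the paper's actual fusion scheme may differ.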

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d953/10055838/4730d79ce61a/sensors-23-02919-g001.jpg
