Yin Wei, Zhang Jianming, Wang Oliver, Niklaus Simon, Chen Simon, Liu Yifan, Shen Chunhua
IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):6480-6494. doi: 10.1109/TPAMI.2022.3209968. Epub 2023 Apr 3.
Despite significant progress in the past few years, challenges remain for depth estimation from a single monocular image. First, it is nontrivial to train a metric-depth prediction model that generalizes well to diverse scenes, mainly because of limited training data. Researchers have therefore built large-scale relative depth datasets, which are much easier to collect. However, existing relative depth estimation models often fail to recover accurate 3D scene shapes because of the unknown depth shift introduced by training on relative depth data. We tackle this problem and estimate accurate scene shapes by training on large-scale relative depth data and then estimating the depth shift. To do so, we propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then exploits 3D point cloud data to predict the depth shift and the camera's focal length, which together allow us to recover 3D scene shapes. Since the two modules are trained separately, we do not need strictly paired training data. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to improve training with relative depth annotations. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation. Code is available at: https://github.com/aim-uofa/depth/.
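A minimal sketch of the shape-recovery step described in the abstract: given a depth map predicted up to an unknown scale and shift, together with a depth shift and focal length estimated by the second-stage module, the depth can be unprojected into a 3D point cloud through the standard pinhole camera model. The function name, the additive correction `depth + shift`, and the assumption that the principal point lies at the image centre are illustrative assumptions for this sketch, not the authors' exact implementation.

```python
import numpy as np

def recover_point_cloud(affine_depth, shift, focal_length, scale=1.0):
    """Unproject an affine-invariant depth map into a 3D point cloud.

    affine_depth : (H, W) depth predicted up to an unknown scale and shift.
    shift        : depth shift estimated from the point-cloud module
                   (assumed here to be applied additively).
    focal_length : focal length in pixels estimated from the point-cloud module.
    scale        : global scale; left at 1.0, since metric scale cannot be
                   recovered from a single image without metric supervision.
    """
    h, w = affine_depth.shape
    # Correct the unknown shift (and optionally scale) to obtain usable depth.
    depth = scale * (affine_depth + shift)

    # Pixel grid centred at the principal point (assumed to be the image centre).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - w / 2.0) * depth / focal_length
    y = (v - h / 2.0) * depth / focal_length

    # Stack into an (H*W, 3) point cloud.
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Usage: a random depth map stands in for the network prediction.
points = recover_point_cloud(np.random.rand(480, 640), shift=0.3, focal_length=500.0)
print(points.shape)  # (307200, 3)
```

An incorrect shift distorts the ratio between depth and the x/y terms, which is why recovering it (along with the focal length) is necessary for a faithful 3D scene shape rather than just a plausible relative depth map.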