IEEE Trans Pattern Anal Mach Intell. 2022 Mar;44(3):1623-1637. doi: 10.1109/TPAMI.2020.3019967. Epub 2022 Feb 3.
The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. Armed with these tools, we experiment with five diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of our approach, we use zero-shot cross-dataset transfer, i.e., we evaluate on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation. Our approach clearly outperforms competing methods across diverse datasets, setting a new state of the art for monocular depth estimation.
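The abstract describes a training objective that is invariant to changes in depth range and scale, which is what allows datasets with incompatible annotations to be mixed. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the prediction is aligned to the ground truth with a least-squares scale and shift before the error is measured. The function names `align_scale_shift` and `scale_shift_invariant_mse` are hypothetical, and the published objective may differ in details such as how scale and shift are estimated or how outliers are handled.

```python
import numpy as np


def align_scale_shift(pred, target, mask):
    """Least-squares scale s and shift t such that s * pred + t fits target.

    Only pixels where mask is True are used. This closed-form alignment is
    an assumption made for illustration.
    """
    p = pred[mask].astype(np.float64)
    g = target[mask].astype(np.float64)
    # Solve min over (s, t) of || s * p + t - g ||^2 as a linear system.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s, t


def scale_shift_invariant_mse(pred, target, mask=None):
    """Mean squared error after aligning pred to target in scale and shift."""
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    s, t = align_scale_shift(pred, target, mask)
    residual = s * pred[mask] + t - target[mask]
    return float(np.mean(residual ** 2))
```

Under these assumptions, a prediction that matches the ground truth up to an unknown scale and offset incurs (numerically) zero loss, and rescaling or shifting the prediction does not change the loss value. This is the property that makes annotations with different depth ranges comparable within one objective.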