Bae Jinwoo, Hwang Kyumin, Im Sunghoon
IEEE Trans Pattern Anal Mach Intell. 2024 Apr;46(4):2224-2238. doi: 10.1109/TPAMI.2023.3332407. Epub 2024 Mar 6.
Monocular depth estimation has been widely studied, and significant improvements in performance have been reported recently. However, most previous works are evaluated on a few benchmark datasets, such as the KITTI dataset, and none provide an in-depth analysis of the generalization performance of monocular depth estimation. In this paper, we thoroughly investigate various backbone networks (e.g., CNN and Transformer models) with respect to the generalization of monocular depth estimation. First, we evaluate state-of-the-art models on both in-distribution and out-of-distribution datasets, the latter never seen during network training. Then, we investigate the internal properties of the representations from the intermediate layers of CNN- and Transformer-based models using synthetic texture-shifted datasets. Through extensive experiments, we observe that Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also find that texture-biased models show worse generalization performance for monocular depth estimation than shape-biased models. We demonstrate that similar behavior is observed on real-world driving datasets captured under diverse environments. Lastly, we conduct a dense ablation study over the various backbone networks used in modern strategies. The experiments demonstrate that the intrinsic locality of CNNs and the self-attention of Transformers induce texture bias and shape bias, respectively.
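The texture-shift evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual protocol: the high-frequency noise perturbation is a simplified stand-in for the paper's synthetic texture-shifted datasets, `model` is a hypothetical callable mapping an image to a depth map, and the metric is the standard absolute relative error used in monocular depth benchmarks.

```python
import numpy as np

def texture_shift(image, strength=0.3, seed=0):
    """Apply a synthetic high-frequency perturbation that keeps global
    shape/structure but alters local texture (simplified stand-in for
    the paper's texture-shifted datasets)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(image.shape)
    return np.clip(image + strength * noise, 0.0, 1.0)

def abs_rel_error(pred, gt, eps=1e-6):
    """Standard depth metric: mean of |pred - gt| / gt over valid pixels."""
    mask = gt > eps
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def generalization_gap(model, images, gt_depths):
    """Error increase from clean to texture-shifted inputs; a
    texture-biased model is expected to show a larger gap than a
    shape-biased one."""
    clean = np.mean([abs_rel_error(model(im), d)
                     for im, d in zip(images, gt_depths)])
    shifted = np.mean([abs_rel_error(model(texture_shift(im)), d)
                       for im, d in zip(images, gt_depths)])
    return float(shifted - clean)
```

A model that relies only on global shape cues would leave `generalization_gap` near zero, while a texture-dependent model would show a clear positive gap under the shift.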