Kubota Yuki, Fukiage Taiki
Communication Science Laboratories, NTT, Inc., Kanagawa, Japan.
PLoS Comput Biol. 2025 Aug 19;21(8):e1013020. doi: 10.1371/journal.pcbi.1013020. eCollection 2025 Aug.
Human depth perception from 2D images is systematically distorted, yet the nature of these distortions is not fully understood. By examining error patterns in depth estimation for both humans and deep neural networks (DNNs), which have shown remarkable abilities in monocular depth estimation, we can gain insights into constructing functional models of human 3D vision and into designing artificial models with improved interpretability. Here, we propose a comprehensive human-DNN comparison framework for a monocular depth judgment task. Using a novel human-annotated dataset of natural indoor scenes and a systematic analysis of absolute depth judgments, we investigate error patterns in both humans and DNNs. Employing exponential-affine fitting, we decompose depth estimation errors into depth compression, per-image affine transformations (including scaling, shearing, and translation), and residual errors. Our analysis reveals that human depth judgments exhibit systematic and consistent biases, including depth compression, a vertical bias (perceiving objects in the lower visual field as closer), and per-image affine distortions that are consistent across participants. Intriguingly, we find that DNNs with higher accuracy partially recapitulate these human biases, showing greater similarity to humans in affine parameters and residual error patterns. This suggests that these seemingly suboptimal human biases may reflect efficient, ecologically adapted strategies for inferring depth from inherently ambiguous monocular images. However, while DNNs capture metric-level residual error patterns similar to humans', they fail to reproduce human-level accuracy in ordinal depth perception within the affine-invariant space. These findings underscore the importance of evaluating error patterns beyond raw accuracy and provide new insights into how humans and computational models resolve depth ambiguity. Our dataset and methodology offer a framework for evaluating the alignment between computational models and human perceptual biases, thereby advancing our understanding of visual space representation and guiding the development of models that more faithfully capture human depth perception.
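As a rough illustration of the exponential-affine fitting described above, the Python sketch below fits a compressive depth nonlinearity together with a per-image affine component (scale, shears along the image axes, translation) to one image's depth judgments, leaving residual errors. The functional form, the parameter names, and the use of scipy.optimize.least_squares are illustrative assumptions on our part, not the paper's actual implementation.

import numpy as np
from scipy.optimize import least_squares

def exp_compress(z, k):
    # Compressive depth nonlinearity; approaches the identity as k -> 0.
    return z if abs(k) < 1e-8 else (1.0 - np.exp(-k * z)) / k

def residuals(params, u, v, z_true, z_judged):
    # params: compression k, affine scale s, shears a (horizontal) and
    # b (vertical), translation t -- all hypothetical names.
    k, s, a, b, t = params
    pred = s * exp_compress(z_true, k) + a * u + b * v + t
    return pred - z_judged

def fit_exponential_affine(u, v, z_true, z_judged):
    # Fit one image's judgments; x0 starts near "no distortion"
    # (mild compression, unit scale, zero shear and translation).
    x0 = np.array([0.1, 1.0, 0.0, 0.0, 0.0])
    sol = least_squares(residuals, x0, args=(u, v, z_true, z_judged))
    k, s, a, b, t = sol.x
    return {"compression": k, "scale": s, "shear_u": a, "shear_v": b,
            "translation": t,
            "residual_errors": residuals(sol.x, u, v, z_true, z_judged)}

Under this decomposition, the fitted k captures depth compression, (s, a, b, t) capture the per-image affine distortion, and whatever the fit cannot absorb remains as the residual error pattern compared across humans and DNNs.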