Graduate Center for Vision Research, College of Optometry, State University of New York, New York, NY 10036.
Proc Natl Acad Sci U S A. 2018 Jul 24;115(30):7807-7812. doi: 10.1073/pnas.1804873115. Epub 2018 Jul 9.
Pose estimation of objects in real scenes is critically important for biological and machine visual systems, but little is known about how humans infer 3D poses from 2D retinal images. We show remarkably close agreement in the 3D poses that different observers estimate from pictures. We further show that all observers apply the same inferential rule from all viewpoints, utilizing the geometrically derived back-transform from retinal images to actual 3D scenes. Pose estimates are altered by a fronto-parallel bias, and by image distortions that appear to tilt the ground plane. We used pictures of single sticks or pairs of joined sticks taken from different camera angles. Observers viewed these from five directions and matched the perceived pose of each stick by rotating an arrow on a horizontal touchscreen. The projection of each 3D stick to the 2D picture, and then onto the retina, is described by an invertible trigonometric expression. The inverted expression yields the back-projection for each object pose, camera elevation, and observer viewpoint. We show that a model that uses the back-projection, modulated by just two free parameters, explains 560 pose estimates per observer. By considering changes in retinal image orientations due to the position and elevation of limbs, the model also explains perceived limb poses in a complex scene of two bodies lying on the ground. The inferential rules simply explain both perceptual invariance and dramatic distortions in poses of real and pictured objects, and show the benefits of incorporating the projective geometry of light into mental inferences about 3D scenes.
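The invertible trigonometric expression referred to above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes the simplest geometry in which a stick lies on the ground plane at pose angle Ω, the camera looks down at elevation angle φ, and the stick's image orientation θ satisfies tan(θ) = tan(Ω)·sin(φ); the back-projection then inverts this as Ω = arctan(tan(θ)/sin(φ)). The function names and the exact form of the expression are assumptions for illustration.

```python
import math

def image_orientation(pose_deg, elevation_deg):
    """Image orientation (deg) of a ground-plane stick at pose Omega,
    seen by a camera at elevation phi (assumed model: tan(theta) = tan(Omega) * sin(phi))."""
    t = math.tan(math.radians(pose_deg)) * math.sin(math.radians(elevation_deg))
    return math.degrees(math.atan(t))

def back_projected_pose(theta_deg, elevation_deg):
    """Back-projection: recover the 3D pose (deg) from the image orientation
    by inverting the assumed projection: Omega = arctan(tan(theta) / sin(phi))."""
    t = math.tan(math.radians(theta_deg)) / math.sin(math.radians(elevation_deg))
    return math.degrees(math.atan(t))

# Round trip under the assumed geometry: projecting a pose and then
# back-projecting it recovers the original pose (for poses in (-90, 90) deg).
theta = image_orientation(30.0, 15.0)
recovered = back_projected_pose(theta, 15.0)
```

Under this toy model, shallow camera elevations (small φ) compress image orientations toward the horizontal, which is why a veridical back-transform must expand them again; a fronto-parallel bias, as described in the abstract, would pull the recovered pose back toward the picture plane.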