Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan.
Department of Experimental Psychology, Justus Liebig University Giessen, Giessen, Germany.
J Vis. 2022 Mar 2;22(4):4. doi: 10.1167/jov.22.4.4.
Distinguishing mirror from glass is a challenging visual inference, because both materials derive their appearance from their surroundings, yet we rarely experience difficulties in telling them apart. Very few studies have investigated how the visual system distinguishes reflections from refractions, and to date there is no image-computable model that emulates human judgments. Here we sought to develop a deep neural network that reproduces the patterns of visual judgments human observers make. To do this, we trained thousands of convolutional neural networks on more than 750,000 simulated mirror and glass objects, and compared their performance with human judgments, as well as with alternative classifiers based on "hand-engineered" image features. For randomly chosen images, all classifiers and humans performed with high accuracy, and therefore correlated highly with one another. However, to assess how similar models are to humans, it is not sufficient to compare accuracy or correlation on random images. A good model should also predict the characteristic errors that humans make. We therefore painstakingly assembled a diagnostic image set for which humans make systematic errors, allowing us to isolate signatures of human-like performance. A large-scale, systematic search through feedforward neural architectures revealed that relatively shallow (three-layer) networks predicted human judgments better than any other model we tested. This is the first image-computable model that emulates human errors and succeeds in distinguishing mirror from glass, and the result hints that mid-level visual processing might be particularly important for the task.
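The argument that accuracy or correlation on random images cannot separate human-like models from merely accurate ones can be illustrated with a minimal sketch. All response vectors below are made-up toy values, not data from the study, and the simple proportion-of-matching-responses measure is an assumption chosen for illustration; the abstract does not specify the metrics actually used. Labels are coded 1 for mirror and 0 for glass.

```python
from typing import Sequence


def accuracy(responses: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of responses that match the ground-truth labels."""
    return sum(r == t for r, t in zip(responses, truth)) / len(truth)


def agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of images on which two observers give the same response."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical "random" image set: everyone is 90% correct,
# each observer erring on a different single image.
rand_truth = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
rand_human = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
rand_model_a = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
rand_model_b = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]

# Hypothetical "diagnostic" set: humans err systematically on 4 of 6 images.
# Model A (hypothetical) shares most of those errors; model B is always correct.
diag_truth = [1, 1, 1, 0, 0, 0]
diag_human = [0, 0, 1, 1, 1, 0]
diag_model_a = [0, 0, 1, 1, 0, 0]
diag_model_b = [1, 1, 1, 0, 0, 0]

# On random images both models look equally human-like.
rand_agree_a = agreement(rand_human, rand_model_a)
rand_agree_b = agreement(rand_human, rand_model_b)

# On diagnostic images only the model that shares human errors agrees with humans.
diag_agree_a = agreement(diag_human, diag_model_a)
diag_agree_b = agreement(diag_human, diag_model_b)

print("random set:     A vs human", rand_agree_a, "| B vs human", rand_agree_b)
print("diagnostic set: A vs human", diag_agree_a, "| B vs human", diag_agree_b)
```

In this toy setup both models are indistinguishable on the random set because high accuracy forces high mutual agreement, whereas the diagnostic set separates the model that reproduces human errors from the one that is simply correct.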