Mukherjee, Kushin; Rogers, Timothy T.
Department of Psychology & Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, USA.
Mem Cognit. 2025 Jan;53(1):219-241. doi: 10.3758/s13421-024-01580-1. Epub 2024 May 30.
Early in life and without special training, human beings discern resemblance between abstract visual stimuli, such as drawings, and the real-world objects they represent. We used this capacity for visual abstraction as a tool for evaluating deep neural networks (DNNs) as models of human visual perception. Contrasting five contemporary DNNs, we evaluated how well each explains human similarity judgments among line drawings of recognizable and novel objects. For object sketches, human judgments were dominated by semantic category information; DNN representations contributed little additional information. In contrast, such features explained significant unique variance in the perceived similarity of abstract drawings. In both cases, a vision transformer trained to blend representations of images and their natural language descriptions showed the greatest ability to explain human perceptual similarity, an observation consistent with contemporary views of semantic representation and processing in the human mind and brain. Together, the results suggest that the building blocks of visual similarity may arise within systems that learn to use visual information, not for specific classification, but in service of generating semantic representations of objects.