Conwell Colin, Graham Daniel, Boccagno Chelsea, Vessel Edward A
Department of Psychology, Harvard University, Cambridge, MA 02139.
Department of Psychological Science, Hobart and William Smith Colleges.
Proc Natl Acad Sci U S A. 2025 Jan 28;122(4):e2306025121. doi: 10.1073/pnas.2306025121. Epub 2025 Jan 23.
Looking at the world often involves not just seeing things, but feeling things. Modern feedforward machine vision systems that learn to perceive the world in the absence of active physiology, deliberative thought, or any form of feedback that resembles human affective experience offer tools to demystify the relationship between seeing and feeling, and to assess how much of visually evoked affective experience may be a straightforward function of representation learning over natural image statistics. In this work, we deploy a diverse sample of 180 state-of-the-art deep neural network models trained only on canonical computer vision tasks to predict human ratings of arousal, valence, and beauty for images from multiple categories (objects, faces, landscapes, art) across two datasets. Importantly, we use the features of these models without additional learning, linearly decoding human affective responses from network activity in much the same way neuroscientists decode information from neural recordings. Aggregate analysis across our survey demonstrates that predictions from purely perceptual models explain a majority of the explainable variance in average ratings of arousal, valence, and beauty alike. Finer-grained analysis within our survey (e.g., comparisons between shallower and deeper layers, or between randomly initialized, category-supervised, and self-supervised models) points to rich, preconceptual abstraction (learned from diversity of visual experience) as a key driver of these predictions. Taken together, these results provide further computational evidence for an information-processing account of visually evoked affect linked directly to efficient representation learning over natural image statistics, and hint at a computational locus of affective and aesthetic valuation immediately proximate to perception.
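The decoding approach described in the abstract — reading out affective ratings from frozen network features with a linear map, with no further training of the network itself — can be illustrated with a minimal sketch. Note this is a hypothetical illustration, not the authors' released code: the feature and rating arrays below are synthetic stand-ins for the pretrained-model activations and the behavioral datasets, and the use of cross-validated ridge regression is one common choice for such linear probes.

```python
# Minimal sketch of linear decoding ("probing") from frozen features:
# a linear regression maps fixed network activations to mean human
# ratings; the network providing the features is never fine-tuned.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

n_images, n_features = 500, 256
# Stand-in for frozen network activations (one row per image).
features = rng.standard_normal((n_images, n_features))

# Stand-in for mean affect ratings (e.g., beauty), here constructed
# with a linear dependence on the features plus rating noise.
true_weights = rng.standard_normal(n_features)
ratings = features @ true_weights + 0.5 * rng.standard_normal(n_images)

# Ridge regression with internally cross-validated regularization,
# scored by out-of-sample R^2 across 5 folds.
decoder = RidgeCV(alphas=np.logspace(-3, 3, 13))
scores = cross_val_score(decoder, features, ratings, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.3f}")
```

In this setup, the cross-validated R² plays the role of the "explainable variance explained" reported in the abstract: it measures how much of the rating variance a purely linear readout of fixed perceptual features can capture on held-out images.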