Dickson Billy, Maini Sahaj Singh, Sanders Craig, Nosofsky Robert, Tiganj Zoran
Department of Computer Science, Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, 700 N Woodlawn Ave, Bloomington, IN, 47408, USA.
Department of Psychological and Brain Sciences, Indiana University Bloomington, Bloomington, IN, USA.
Behav Res Methods. 2025 Jun 19;57(7):203. doi: 10.3758/s13428-025-02728-w.
Cognitive scientists commonly collect participants' judgments regarding perceptual characteristics of stimuli to develop and evaluate models of attention, memory, learning, and decision-making. For instance, to model human responses in tasks of category learning and item recognition, researchers often collect perceptual judgments of images in order to embed the images in multidimensional feature spaces. This process is time-consuming and costly. Recent advancements in large multimodal models (LMMs) offer a possible alternative: because such models can respond to prompts that include both text and images, they could stand in for human participants in these rating tasks. To test whether the available LMMs can indeed be useful for this purpose, we evaluated their judgments on a dataset of rock images that has been widely used by cognitive scientists. The dataset includes human perceptual judgments along 10 dimensions considered important for classifying rock images. Among the LMMs that we investigated, GPT-4o exhibited the strongest positive correlation with human responses and demonstrated promising alignment with the mean ratings from human participants, particularly for elementary dimensions such as lightness, chromaticity, shininess, and fine/coarse grain texture. However, its correlations with human ratings were lower for more abstract and rock-specific emergent dimensions such as organization and pegmatitic structure. Although there is room for further improvement, the model already appears to be approaching the level of consensus observed across human groups for the perceptual features examined here. Our study provides a benchmark for evaluating future LMMs on human perceptual judgment data.
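The comparison described in the abstract (model ratings against mean human ratings, one dimension at a time) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which correlation statistic was used, so Pearson's r is assumed here, and the function names, data layout, and toy numbers are all hypothetical rather than taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def per_dimension_correlation(human_ratings, model_ratings):
    """Correlate model ratings with mean human ratings for each dimension.

    human_ratings: dict mapping dimension name -> list (one entry per image)
                   of lists of individual participant ratings (assumed layout).
    model_ratings: dict mapping dimension name -> list of one model rating
                   per image, in the same image order (assumed layout).
    """
    result = {}
    for dim, per_image in human_ratings.items():
        # Average across participants to get one human score per image.
        human_means = [sum(r) / len(r) for r in per_image]
        result[dim] = pearson(human_means, model_ratings[dim])
    return result


# Toy, made-up numbers for three images on one dimension (not study data).
toy_human = {"lightness": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]}
toy_model = {"lightness": [2.0, 5.0, 8.0]}
toy_result = per_dimension_correlation(toy_human, toy_model)
```

Running the same function on each of the 10 dimensions would give one coefficient per dimension, which is the kind of summary the abstract refers to when it contrasts elementary dimensions with emergent ones.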