
Which AI Sees Like Us? Investigating the Cognitive Plausibility of Language and Vision Models via Eye-Tracking in Human-Robot Interaction.

Authors

Ghamati Khashayar, Dehkordi Maryam Banitalebi, Zaraki Abolfazl

Affiliations

School of Physics, Engineering and Computer Science (SPECS), University of Hertfordshire, Hatfield AL10 9AB, UK.

Robotics Research Group, University of Hertfordshire, Hatfield AL10 9AB, UK.

Publication

Sensors (Basel). 2025 Jul 29;25(15):4687. doi: 10.3390/s25154687.

Abstract

As large language models (LLMs) and vision-language models (VLMs) become increasingly used in robotics, a crucial question arises: to what extent do these models replicate human-like cognitive processes, particularly within socially interactive contexts? Whilst these models demonstrate impressive multimodal reasoning and perception capabilities, their cognitive plausibility remains underexplored. In this study, we address this gap by using human visual attention as a behavioural proxy for cognition in a naturalistic human-robot interaction (HRI) scenario. Eye-tracking data were previously collected from participants engaging in social human-human interactions, providing frame-level gaze fixations as a human attentional ground truth. We then prompted a state-of-the-art VLM (LLaVA) to generate scene descriptions, which were processed by four LLMs (DeepSeek-R1-Distill-Qwen-7B, Qwen1.5-7B-Chat, LLaMA-3.1-8b-instruct, and Gemma-7b-it) to infer saliency points. Critically, we evaluated each model in both stateless and memory-augmented (short-term memory, STM) modes to assess the influence of temporal context on saliency prediction. Our results showed that whilst stateless LLaVA most closely replicates human gaze patterns, STM confers measurable benefits only for DeepSeek, whose lexical anchoring mirrors human rehearsal mechanisms. The other models exhibited degraded performance with memory due to prompt interference or limited contextual integration. This work introduces a novel, empirically grounded framework for assessing cognitive plausibility in generative models and underscores the role of short-term memory in shaping human-like visual attention in robotic systems.
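The pipeline the abstract describes (per-frame VLM scene descriptions fed to an LLM, with or without a rolling short-term memory) can be summarised in a minimal sketch. This is not the authors' code: `describe_frame`, `query_llm`, the prompt wording, and the STM window size are all hypothetical placeholders, shown only to illustrate how the stateless and memory-augmented modes differ.

```python
from collections import deque

def describe_frame(frame_id):
    """Hypothetical stand-in for the LLaVA captioner: returns a scene
    description for one video frame."""
    return f"scene description for frame {frame_id}"

def query_llm(prompt):
    """Hypothetical stand-in for a downstream LLM call (e.g. one of the
    four models tested); would return an inferred saliency point."""
    return f"saliency point inferred from: {prompt[:50]}..."

def predict_saliency(num_frames, use_stm=False, stm_size=5):
    """Infer one saliency point per frame, optionally conditioning on a
    rolling short-term memory (STM) of recent scene descriptions."""
    stm = deque(maxlen=stm_size)  # rolling window of recent descriptions
    predictions = []
    for frame_id in range(num_frames):
        description = describe_frame(frame_id)
        if use_stm:
            # Memory-augmented mode: prepend recent temporal context.
            prompt = ("Recent context:\n" + "\n".join(stm) +
                      f"\nCurrent scene: {description}\n"
                      "Where would a human look?")
        else:
            # Stateless mode: each frame is judged in isolation.
            prompt = f"Scene: {description}\nWhere would a human look?"
        predictions.append(query_llm(prompt))
        stm.append(description)
    return predictions

if __name__ == "__main__":
    for p in predict_saliency(num_frames=3, use_stm=True):
        print(p)
```

Under this reading, the reported results amount to comparing the per-frame predictions of `predict_saliency(..., use_stm=False)` against `use_stm=True` with respect to the human fixation ground truth; the paper finds the memory-augmented mode helps only DeepSeek.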


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4245/12349560/4f9ed6f932e1/sensors-25-04687-g001.jpg
