Department of Neurology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany.
Cognition, Motion and Neuroscience, Italian Institute of Technology, Genoa, Italy.
Nat Hum Behav. 2024 Jul;8(7):1285-1295. doi: 10.1038/s41562-024-01882-z. Epub 2024 May 20.
At the core of what defines us as humans is the concept of theory of mind: the ability to track other people's mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measures designed to assess different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with that of a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or sometimes even above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of belief likelihood revealed that LLaMA2's apparent superiority was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, GPT's poor performance originated from a hyperconservative approach to committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences.