Graduate School of Business, Stanford University, Stanford, CA 94305.
Proc Natl Acad Sci U S A. 2024 Nov 5;121(45):e2405460121. doi: 10.1073/pnas.2405460121. Epub 2024 Oct 29.
Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs' improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.