
Evaluating large language models in theory of mind tasks.

Affiliations

Graduate School of Business, Stanford University, Stanford, CA 94305.

Publication information

Proc Natl Acad Sci U S A. 2024 Nov 5;121(45):e2405460121. doi: 10.1073/pnas.2405460121. Epub 2024 Oct 29.

DOI: 10.1073/pnas.2405460121
PMID: 39471222
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11551352/
Abstract

Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs' improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI-with profound positive and negative implications.

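The abstract's scoring protocol (each of the 40 tasks comprises eight scenario variants, and a task counts as solved only if the model answers every variant correctly) can be sketched as follows. This is a hypothetical illustration of the all-or-nothing rule, not the authors' actual evaluation code; the data layout is assumed.

```python
def task_solved(scenario_results):
    """A task is solved only if all 8 scenario variants are answered correctly:
    1 false-belief scenario + 3 true-belief controls, each in original and
    reversed form. scenario_results is a list of 8 booleans."""
    assert len(scenario_results) == 8
    return all(scenario_results)

def solve_rate(tasks):
    """Fraction of tasks solved (out of the 40 bespoke false-belief tasks)."""
    return sum(task_solved(t) for t in tasks) / len(tasks)

# Example: a model that passes all variants on 30 of 40 tasks, and misses
# one variant on each of the remaining 10, scores 75% -- the task-level
# rate the abstract reports for ChatGPT-4.
tasks = [[True] * 8] * 30 + [[True] * 7 + [False]] * 10
print(solve_rate(tasks))  # → 0.75
```

Note how the conjunctive rule is deliberately strict: a model that merely guesses, or that passes false-belief scenarios while failing the matched true-belief controls, gets no credit for the task.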

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/131a/11551352/1ba44feb79df/pnas.2405460121fig01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/131a/11551352/d823f7575b7c/pnas.2405460121fig02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/131a/11551352/19e62adbd314/pnas.2405460121fig03.jpg

Similar articles

1. Evaluating large language models in theory of mind tasks.
Proc Natl Acad Sci U S A. 2024 Nov 5;121(45):e2405460121. doi: 10.1073/pnas.2405460121. Epub 2024 Oct 29.
2. Exploring links between language and cognition in autism spectrum disorders: Complement sentences, false belief, and executive functioning.
J Commun Disord. 2015 Mar-Apr;54:15-31. doi: 10.1016/j.jcomdis.2014.12.001. Epub 2015 Jan 6.
3. Neural correlates of preschoolers' passive-viewing false belief: Insights into continuity and change and the function of right temporoparietal activity in theory of mind development.
Dev Sci. 2024 Nov;27(6):e13530. doi: 10.1111/desc.13530. Epub 2024 Jun 21.
4. Perceptual Access Reasoning (PAR) in Developing a Representational Theory of Mind.
Monogr Soc Res Child Dev. 2021 Sep;86(3):7-154. doi: 10.1111/mono.12432.
5. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study.
JMIR Med Inform. 2024 Sep 4;12:e59258. doi: 10.2196/59258.
6. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values.
JMIR Ment Health. 2024 Apr 9;11:e55988. doi: 10.2196/55988.
7. Testing theory of mind in large language models and humans.
Nat Hum Behav. 2024 Jul;8(7):1285-1295. doi: 10.1038/s41562-024-01882-z. Epub 2024 May 20.
8. The Impact of Multimodal Large Language Models on Health Care's Future.
J Med Internet Res. 2023 Nov 2;25:e52865. doi: 10.2196/52865.
9. Culturally constituted universals: Evidential basis of belief matters.
Dev Sci. 2024 Sep;27(5):e13398. doi: 10.1111/desc.13398. Epub 2023 Apr 16.
10. Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.
JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.

Cited by

1. Can Large Language Models Simulate Spoken Human Conversations?
Cogn Sci. 2025 Sep;49(9):e70106. doi: 10.1111/cogs.70106.
2. Neural correlates of evaluative bias against artificial intelligence-labeled versus human-labeled artworks.
Soc Cogn Affect Neurosci. 2025 Jan 18;20(1). doi: 10.1093/scan/nsaf071.
3. Reply to Pang et al.: Generalizable ability to track beliefs could be the most parsimonious explanation.

References

1. Verbal behavior and the future of social science.
Am Psychol. 2025 Apr;80(3):411-433. doi: 10.1037/amp0001319. Epub 2024 May 30.
2. Testing theory of mind in large language models and humans.
Nat Hum Behav. 2024 Jul;8(7):1285-1295. doi: 10.1038/s41562-024-01882-z. Epub 2024 May 20.
3. Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns.
Proc Natl Acad Sci U S A. 2025 Jul 15;122(28):e2511485122. doi: 10.1073/pnas.2511485122. Epub 2025 Jul 3.
4. Do large language models have a theory of mind?
Proc Natl Acad Sci U S A. 2025 Jul 15;122(28):e2507080122. doi: 10.1073/pnas.2507080122. Epub 2025 Jul 3.
5. Large language models without grounding recover non-sensorimotor but not sensorimotor features of human concepts.
Nat Hum Behav. 2025 Jun 4. doi: 10.1038/s41562-025-02203-8.
6. Large language models are proficient in solving and creating emotional intelligence tests.
Commun Psychol. 2025 May 21;3(1):80. doi: 10.1038/s44271-025-00258-x.
7. The evolving field of digital mental health: current evidence and implementation issues for smartphone apps, generative artificial intelligence, and virtual reality.
World Psychiatry. 2025 Jun;24(2):156-174. doi: 10.1002/wps.21299.
8. Kernels of selfhood: GPT-4o shows humanlike patterns of cognitive dissonance moderated by free choice.
Proc Natl Acad Sci U S A. 2025 May 20;122(20):e2501823122. doi: 10.1073/pnas.2501823122. Epub 2025 May 14.
9. Playing repeated games with large language models.
Nat Hum Behav. 2025 May 8. doi: 10.1038/s41562-025-02172-y.
10. Artificial intelligence and psychoanalysis: is it time for psychoanalyst.AI?
Front Psychiatry. 2025 Apr 7;16:1558513. doi: 10.3389/fpsyt.2025.1558513. eCollection 2025.
Nat Commun. 2024 Mar 30;15(1):2768. doi: 10.1038/s41467-024-46631-y.
4. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT.
Nat Comput Sci. 2023 Oct;3(10):833-838. doi: 10.1038/s43588-023-00527-x. Epub 2023 Oct 5.
5. Do Large Language Models Know What Humans Know?
Cogn Sci. 2023 Jul;47(7):e13309. doi: 10.1111/cogs.13309.
6. Overlap in meaning is a stronger predictor of semantic activation in GPT-3 than in humans.
Sci Rep. 2023 Mar 28;13(1):5035. doi: 10.1038/s41598-023-32248-6.
7. The knowledge ("true belief") error in 4- to 6-year-old children: When are agents aware of what they have in view?
Cognition. 2023 Jan;230:105255. doi: 10.1016/j.cognition.2022.105255. Epub 2022 Sep 8.
8. Shared computational principles for language processing in humans and deep language models.
Nat Neurosci. 2022 Mar;25(3):369-380. doi: 10.1038/s41593-022-01026-4. Epub 2022 Mar 7.
9. Visual behavior modelling for robotic theory of mind.
Sci Rep. 2021 Jan 11;11(1):424. doi: 10.1038/s41598-020-77918-x.
10. The grand challenges of ….
Sci Robot. 2018 Jan 31;3(14). doi: 10.1126/scirobotics.aar7650.