

A publicly available benchmark for assessing large language models' ability to predict how humans balance self-interest and the interest of others.

Authors

Capraro Valerio, Di Paolo Roberto, Pizziol Veronica

Affiliations

Department of Psychology, University of Milan Bicocca, 20126, Milan, Italy.

Department of Economics and Management, University of Parma, 43121, Parma, Italy.

Publication

Sci Rep. 2025 Jul 1;15(1):21428. doi: 10.1038/s41598-025-01715-7.

DOI: 10.1038/s41598-025-01715-7
PMID: 40595689
Abstract

Large language models (LLMs) hold enormous potential to assist humans in decision-making processes, from everyday to high-stakes scenarios. However, as many human decisions carry social implications, a necessary prerequisite for LLMs to be reliable assistants is that they are able to capture how humans balance self-interest and the interest of others. Here we introduce a novel, publicly available benchmark to test LLMs' ability to predict how humans balance monetary self-interest and the interest of others. This benchmark consists of 106 textual instructions from dictator game experiments conducted with human participants from 12 countries, alongside a compendium of actual human behavior in each experiment. We evaluate four advanced chatbots against this benchmark. We find that none of these chatbots meet the benchmark. In particular, only GPT-4 and GPT-4o (not Bard nor Bing) correctly capture qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 and GPT-4o consistently underestimate self-interest while overestimating altruistic behavior. In sum, this article introduces a publicly available resource for testing the capacity of LLMs to estimate human other-regarding preferences in economic decisions and reveals an "optimistic bias" in current versions of GPT.
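The evaluation the abstract describes can be pictured concretely: collect offers in a dictator game (the fraction of an endowment given away), bucket them into the three behavioral classes the paper names, and compare the class shares an LLM predicts against the shares humans actually produced. The following is a minimal Python sketch of that idea with illustrative numbers; the helper names (`classify_offer`, `class_shares`, `mean_gap`) are hypothetical and not part of the released benchmark.

```python
# Hypothetical sketch: scoring predicted dictator-game giving against
# observed human behavior. All data values here are illustrative,
# not taken from the benchmark described in the paper.

def classify_offer(fraction_given):
    """Map a dictator's offer (fraction of the endowment given away)
    to one of the three behavioral classes named in the abstract."""
    if fraction_given == 0.0:
        return "self-interested"   # keeps everything
    if fraction_given == 0.5:
        return "inequity-averse"   # splits equally
    if fraction_given == 1.0:
        return "fully-altruistic"  # gives everything
    return "other"

def class_shares(offers):
    """Proportion of offers falling in each behavioral class."""
    counts = {}
    for offer in offers:
        label = classify_offer(offer)
        counts[label] = counts.get(label, 0) + 1
    return {label: n / len(offers) for label, n in counts.items()}

def mean_gap(predicted, observed):
    """Mean absolute gap between predicted and observed class shares.
    A model with an 'optimistic bias' would show predicted shares
    shifted away from self-interest and toward altruism."""
    classes = set(predicted) | set(observed)
    return sum(abs(predicted.get(c, 0.0) - observed.get(c, 0.0))
               for c in classes) / len(classes)
```

For example, `mean_gap(class_shares([0.0, 0.5, 0.5, 1.0, 1.0, 1.0]), class_shares([0.0, 0.0, 0.5, 0.5, 0.5, 1.0]))` quantifies how far a prediction that overweights full altruism sits from the observed distribution, mirroring the qualitative pattern the authors report for GPT-4 and GPT-4o.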


Similar Articles

1
A publicly available benchmark for assessing large language models' ability to predict how humans balance self-interest and the interest of others.
Sci Rep. 2025 Jul 1;15(1):21428. doi: 10.1038/s41598-025-01715-7.
2
Use of Large Language Models to Classify Epidemiological Characteristics in Synthetic and Real-World Social Media Posts About Conjunctivitis Outbreaks: Infodemiology Study.
J Med Internet Res. 2025 Jul 2;27:e65226. doi: 10.2196/65226.
3
The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study.
Lancet Digit Health. 2025 Jan;7(1):e35-e43. doi: 10.1016/S2589-7500(24)00246-2.
4
Adapting Safety Plans for Autistic Adults with Involvement from the Autism Community.
Autism Adulthood. 2025 May 28;7(3):293-302. doi: 10.1089/aut.2023.0124. eCollection 2025 Jun.
5
The experience of adults who choose watchful waiting or active surveillance as an approach to medical treatment: a qualitative systematic review.
JBI Database System Rev Implement Rep. 2016 Feb;14(2):174-255. doi: 10.11124/jbisrir-2016-2270.
6
Comparison of ChatGPT and Internet Research for Clinical Research and Decision-Making in Occupational Medicine: Randomized Controlled Trial.
JMIR Form Res. 2025 May 20;9:e63857. doi: 10.2196/63857.
7
Parents' and informal caregivers' views and experiences of communication about routine childhood vaccination: a synthesis of qualitative evidence.
Cochrane Database Syst Rev. 2017 Feb 7;2(2):CD011787. doi: 10.1002/14651858.CD011787.pub2.
8
Quality assessment of large language models' output in maternal health.
Sci Rep. 2025 Jul 2;15(1):22474. doi: 10.1038/s41598-025-03501-x.
9
Stigma Management Strategies of Autistic Social Media Users.
Autism Adulthood. 2025 May 28;7(3):273-282. doi: 10.1089/aut.2023.0095. eCollection 2025 Jun.
10
Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.
Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3.

Cited By

1
Evaluating the ability of large language models to predict human social decisions.
Sci Rep. 2025 Sep 2;15(1):32290. doi: 10.1038/s41598-025-17188-7.

References

1
Playing repeated games with large language models.
Nat Hum Behav. 2025 May 8. doi: 10.1038/s41562-025-02172-y.
2
GPT-3.5 altruistic advice is sensitive to reciprocal concerns but not to strategic risk.
Sci Rep. 2024 Sep 27;14(1):22274. doi: 10.1038/s41598-024-73306-x.
3
The political preferences of LLMs.
PLoS One. 2024 Jul 31;19(7):e0306621. doi: 10.1371/journal.pone.0306621. eCollection 2024.
4
The impact of generative artificial intelligence on socioeconomic inequalities and policy making.
PNAS Nexus. 2024 Jun 11;3(6):pgae191. doi: 10.1093/pnasnexus/pgae191. eCollection 2024 Jun.
5
Addressing climate change with behavioral science: A global intervention tournament in 63 countries.
Sci Adv. 2024 Feb 9;10(6):eadj5778. doi: 10.1126/sciadv.adj5778. Epub 2024 Feb 7.
6
The emergence of economic rationality of GPT.
Proc Natl Acad Sci U S A. 2023 Dec 19;120(51):e2316205120. doi: 10.1073/pnas.2316205120. Epub 2023 Dec 12.
7
Morality beyond the WEIRD: How the nomological network of morality varies across cultures.
J Pers Soc Psychol. 2023 Nov;125(5):1157-1188. doi: 10.1037/pspp0000470. Epub 2023 Aug 17.
8
Experimental evidence on the productivity effects of generative artificial intelligence.
Science. 2023 Jul 14;381(6654):187-192. doi: 10.1126/science.adh2586. Epub 2023 Jul 13.
9
Mitigating bias in AI at the point of care.
Science. 2023 Jul 14;381(6654):150-152. doi: 10.1126/science.adh2713. Epub 2023 Jul 13.
10
Art and the science of generative AI.
Science. 2023 Jun 16;380(6650):1110-1111. doi: 10.1126/science.adh4451. Epub 2023 Jun 15.