Capraro Valerio, Di Paolo Roberto, Pizziol Veronica
Department of Psychology, University of Milan Bicocca, 20126, Milan, Italy.
Department of Economics and Management, University of Parma, 43121, Parma, Italy.
Sci Rep. 2025 Jul 1;15(1):21428. doi: 10.1038/s41598-025-01715-7.
Large language models (LLMs) hold enormous potential to assist humans in decision-making, from everyday to high-stakes scenarios. However, as many human decisions carry social implications, a necessary prerequisite for LLMs to be reliable assistants is that they capture how humans balance self-interest and the interest of others. Here we introduce a novel, publicly available benchmark to test LLMs' ability to predict how humans balance monetary self-interest and the interest of others. The benchmark consists of 106 textual instructions from dictator game experiments conducted with human participants from 12 countries, together with a compendium of the actual human behavior observed in each experiment. We evaluate four advanced chatbots against this benchmark and find that none of them meets it. In particular, only GPT-4 and GPT-4o (not Bard or Bing) correctly capture qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 and GPT-4o consistently underestimate self-interest while overestimating altruistic behavior. In sum, this article introduces a publicly available resource for testing the capacity of LLMs to estimate human other-regarding preferences in economic decisions, and it reveals an "optimistic bias" in current versions of GPT.
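
As an illustration of how such a benchmark comparison might be scored, the following is a minimal Python sketch, not the authors' released code: for each instruction, it compares an LLM's predicted distribution of dictator allocations with the observed human distribution. The file name, the JSON fields, and the query_llm placeholder are assumptions introduced here for illustration only.

    # Hypothetical sketch: score an LLM's predicted dictator-game giving
    # against observed human behavior. File name, record fields, and the
    # query_llm placeholder are assumptions, not the authors' pipeline.
    import json

    def query_llm(instruction: str) -> dict[int, float]:
        """Stand-in for a chatbot call; should return a predicted probability
        for each share (in %) that the dictator keeps for themselves."""
        shares = list(range(0, 101, 10))            # placeholder: uniform guess
        return {s: 1 / len(shares) for s in shares}  # replace with a real call

    def total_variation(p: dict[int, float], q: dict[int, float]) -> float:
        """Total variation distance between two allocation distributions."""
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    with open("dictator_benchmark.json") as f:  # assumed: list of experiments,
        benchmark = json.load(f)                # each with an "instruction" and
                                                # a "human_distribution" mapping

    scores = []
    for item in benchmark:
        # JSON keys are strings; convert to ints to match the prediction.
        human = {int(k): v for k, v in item["human_distribution"].items()}
        scores.append(total_variation(query_llm(item["instruction"]), human))

    print(f"mean TV distance across {len(scores)} experiments: "
          f"{sum(scores) / len(scores):.3f}")

A systematic underestimate of the probability mass on self-interested allocations (high shares kept) relative to the human data would surface in such a comparison as the "optimistic bias" the abstract describes.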