Capraro Valerio, Di Paolo Roberto, Pizziol Veronica
Department of Psychology, University of Milan Bicocca, 20126, Milan, Italy.
Department of Economics and Management, University of Parma, 43121, Parma, Italy.
Sci Rep. 2025 Jul 1;15(1):21428. doi: 10.1038/s41598-025-01715-7.
Large language models (LLMs) hold enormous potential to assist humans in decision-making, from everyday to high-stakes scenarios. However, as many human decisions carry social implications, a necessary prerequisite for LLMs to be reliable assistants is that they capture how humans balance self-interest and the interest of others. Here we introduce a novel, publicly available benchmark to test LLMs' ability to predict how humans balance monetary self-interest and the interest of others. The benchmark consists of 106 textual instructions from dictator game experiments conducted with human participants from 12 countries, together with a compendium of the actual human behavior observed in each experiment. We evaluate four advanced chatbots against this benchmark and find that none of them meets it. In particular, only GPT-4 and GPT-4o (not Bard or Bing) correctly capture qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 and GPT-4o consistently underestimate self-interest while overestimating altruistic behavior. In sum, this article introduces a publicly available resource for testing the capacity of LLMs to estimate human other-regarding preferences in economic decisions, and it reveals an "optimistic bias" in current versions of GPT.
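
As an illustration of how such a benchmark comparison might be scored, the following is a minimal Python sketch, not the authors' released code: for each instruction, it compares an LLM's predicted distribution of dictator allocations with the observed human distribution. The file name, the JSON fields, and the query_llm placeholder are assumptions introduced here for illustration only.

    # Hypothetical sketch: score an LLM's predicted dictator-game giving
    # against observed human behavior. File name, record fields, and the
    # query_llm placeholder are assumptions, not the authors' pipeline.
    import json

    def query_llm(instruction: str) -> dict[int, float]:
        """Stand-in for a chatbot call; should return a predicted probability
        for each share (in %) that the dictator keeps for themselves."""
        shares = list(range(0, 101, 10))            # placeholder: uniform guess
        return {s: 1 / len(shares) for s in shares}  # replace with a real call

    def total_variation(p: dict[int, float], q: dict[int, float]) -> float:
        """Total variation distance between two allocation distributions."""
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    with open("dictator_benchmark.json") as f:  # assumed: list of experiments,
        benchmark = json.load(f)                # each with an "instruction" and
                                                # a "human_distribution" mapping

    scores = []
    for item in benchmark:
        # JSON keys are strings; convert to ints to match the prediction.
        human = {int(k): v for k, v in item["human_distribution"].items()}
        scores.append(total_variation(query_llm(item["instruction"]), human))

    print(f"mean TV distance across {len(scores)} experiments: "
          f"{sum(scores) / len(scores):.3f}")

A systematic underestimate of the probability mass on self-interested allocations (high shares kept) relative to the human data would surface in such a comparison as the "optimistic bias" the abstract describes.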