Gao Yuan, Lee Dokyun, Burtch Gordon, Fazelpour Sina
Department of Information Systems, Questrom School of Business, Boston University, Boston, MA 02215.
Faculty of Computing and Data Sciences, Boston University, Boston, MA 02215.
Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2501660122. doi: 10.1073/pnas.2501660122. Epub 2025 Jun 13.
Recent studies suggest that large language models (LLMs) can generate human-like responses, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can serve as surrogates for, or simulations of, humans in social science research. However, LLMs differ fundamentally from humans: they rely on probabilistic patterns and lack the embodied experiences and survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Across many models, nearly all advanced approaches fail to replicate the distribution of human behavior. The causes of failure are diverse and unpredictable, relating to input language, roles, safeguarding, and more. These results warrant caution in using LLMs as surrogates for, or as simulations of, human behavior in research.
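For context, the 11-20 money request game (Arad and Rubinstein, 2012) asks each of two players to request between 11 and 20 shekels; each keeps what they request, and a player who asks for exactly one shekel less than the opponent earns a 20-shekel bonus, so deeper iterated reasoning pushes requests downward. The sketch below illustrates this payoff rule and the level-k undercutting logic it elicits; it assumes only the standard formulation of the game, not the specific prompts, models, or comparison procedure used in the paper.

# Minimal sketch of the 11-20 money request game, assuming the standard
# Arad-Rubinstein (2012) payoff rule; the paper's prompts and human baseline
# distribution are not reproduced here.

def payoff(my_request: int, other_request: int, bonus: int = 20) -> int:
    """Each player keeps what they request (11-20); undercutting the
    opponent's request by exactly 1 earns an extra bonus of 20."""
    assert 11 <= my_request <= 20 and 11 <= other_request <= 20
    return my_request + (bonus if my_request == other_request - 1 else 0)

def best_response(other_request: int) -> int:
    """Best reply to a known opponent request, found by enumeration."""
    return max(range(11, 21), key=lambda r: payoff(r, other_request))

# Level-k reasoning: a level-0 player naively asks for 20; a level-k player
# best-responds to the level-(k-1) choice, undercutting by one each step.
request = 20
for k in range(1, 4):
    request = best_response(request)
    print(f"level-{k} request: {request}")  # prints 19, 18, 17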