A comparison of human and GPT-4 use of probabilistic phrases in a coordination game.

Affiliations

Department of Psychology, New York University, 6 Washington Place, Room 574, New York, NY, 10012, USA.

Center for Neural Science, New York University, 6 Washington Place, New York, NY, 10012, USA.

Publication information

Sci Rep. 2024 Mar 21;14(1):6835. doi: 10.1038/s41598-024-56740-9.

Abstract

English speakers use probabilistic phrases such as "likely" to communicate information about the probability or likelihood of events. Communication is successful to the extent that the listener grasps what the speaker means to convey; when it succeeds, individuals can potentially coordinate their actions based on shared knowledge about uncertainty. We first assessed human ability to estimate the probability and the ambiguity (imprecision) of twenty-three probabilistic phrases in a coordination game in two different contexts: investment advice and medical advice. We then had GPT-4 (OpenAI), a large language model, complete the same tasks as the human participants. We found that GPT-4's estimates of probability in both the Investment and Medical Contexts were at least as close to those of the human participants as the human participants' estimates were to one another. However, further analyses of residuals revealed small but significant differences between human and GPT-4 performance. Human probability estimates were compressed relative to those of GPT-4. Estimates of probability for both the human participants and GPT-4 were little affected by context. We propose that evaluation methods based on coordination games provide a systematic way to assess what GPT-4 and similar programs can and cannot do.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b5ee/10958015/6ac5a3b3ae3d/41598_2024_56740_Fig1_HTML.jpg
