A comparison of human and GPT-4 use of probabilistic phrases in a coordination game.

Affiliations

Department of Psychology, New York University, 6 Washington Place, Room 574, New York, NY, 10012, USA.

Center for Neural Science, New York University, 6 Washington Place, New York, NY, 10012, USA.

Publication information

Sci Rep. 2024 Mar 21;14(1):6835. doi: 10.1038/s41598-024-56740-9.

DOI:10.1038/s41598-024-56740-9

PMID:38514688

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10958015/

Abstract

English speakers use probabilistic phrases such as "likely" to communicate information about the probability or likelihood of events. Communication is successful to the extent that the listener grasps what the speaker means to convey and, if communication is successful, individuals can potentially coordinate their actions based on shared knowledge about uncertainty. We first assessed human ability to estimate the probability and the ambiguity (imprecision) of twenty-three probabilistic phrases in a coordination game in two different contexts, investment advice and medical advice. We then had GPT-4 (OpenAI), a Large Language Model, complete the same tasks as the human participants. We found that GPT-4's estimates of probability in both the Investment and Medical Contexts were as close or closer to those of the human participants as the human participants' estimates were to one another. However, further analyses of residuals disclosed small but significant differences between human and GPT-4 performance. Human probability estimates were compressed relative to those of GPT-4. Estimates of probability for both the human participants and GPT-4 were little affected by context. We propose that evaluation methods based on coordination games provide a systematic way to assess what GPT-4 and similar programs can and cannot do.
