Can large language models help predict results from a complex behavioural science study?

Authors

Lippert Steffen, Dreber Anna, Johannesson Magnus, Tierney Warren, Cyrus-Lai Wilson, Uhlmann Eric Luis, Pfeiffer Thomas

Affiliations

Department of Economics, University of Auckland, Auckland, New Zealand.

Department of Economics, Stockholm School of Economics, Stockholm, Sweden.

Publication information

R Soc Open Sci. 2024 Sep 25;11(9):240682. doi: 10.1098/rsos.240682. eCollection 2024 Sep.

DOI: 10.1098/rsos.240682
PMID: 39323554
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11421891/
Abstract

We tested whether large language models (LLMs) can help predict results from a complex behavioural science experiment. In study 1, we investigated the performance of the widely used LLMs GPT-3.5 and GPT-4 in forecasting the empirical findings of a large-scale experimental study of emotions, gender, and social perceptions. We found that GPT-4, but not GPT-3.5, matched the performance of a cohort of 119 human experts, with correlations of 0.89 (GPT-4), 0.07 (GPT-3.5) and 0.87 (human experts) between aggregated forecasts and realized effect sizes. In study 2, providing participants from a university subject pool the opportunity to query a GPT-4 powered chatbot significantly increased the accuracy of their forecasts. Results indicate promise for artificial intelligence (AI) to help anticipate, at scale and minimal cost, which claims about human behaviour will find empirical support and which ones will not. Our discussion focuses on avenues for human-AI collaboration in science.
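
The accuracy measure named in the abstract, a correlation between aggregated forecasts and realized effect sizes, can be illustrated with a minimal sketch. The abstract does not state how forecasts were aggregated or which correlation coefficient was used, so the simple mean across forecasters and the Pearson coefficient below are assumptions. The function names and all numbers are hypothetical, made up for illustration, and are not data from the study.

```python
from math import sqrt
from statistics import mean


def aggregate_forecasts(forecasts_by_forecaster):
    """Average forecasts across forecasters, one value per tested effect.

    Each inner list holds one forecaster's predicted effect sizes, in the
    same effect order. A simple mean is an assumption made for this sketch.
    """
    return [mean(column) for column in zip(*forecasts_by_forecaster)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical example: three forecasters, four tested effects.
forecasts = [
    [0.10, 0.30, 0.50, 0.20],
    [0.20, 0.25, 0.40, 0.10],
    [0.00, 0.35, 0.60, 0.30],
]
realized = [0.05, 0.30, 0.55, 0.15]  # made-up realized effect sizes

aggregated = aggregate_forecasts(forecasts)
r = pearson(aggregated, realized)
print(f"correlation between aggregated forecasts and realized effects: {r:.2f}")
```

The same two functions apply whether the forecasts come from human experts or from repeated queries to a language model, which is what makes the three reported correlations directly comparable.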


