Establishing best practices in large language model research: an application to repeat prompting.

Authors

Gallo Robert J, Baiocchi Michael, Savage Thomas R, Chen Jonathan H

Affiliations

Center for Innovation to Implementation, VA Palo Alto Health Care System, Menlo Park, CA 94025, United States.

Department of Health Policy, Stanford University, Stanford, CA 94305, United States.

Publication details

J Am Med Inform Assoc. 2025 Feb 1;32(2):386-390. doi: 10.1093/jamia/ocae294.

DOI:10.1093/jamia/ocae294
PMID:39656836
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11756642/
Abstract

OBJECTIVES

We aimed to demonstrate the importance of establishing best practices in large language model research, using repeat prompting as an illustrative example.

MATERIALS AND METHODS

Using data from a prior study investigating potential model bias in peer review of medical abstracts, we compared methods that ignore correlation in model outputs from repeated prompting with a random effects method that accounts for this correlation.
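The contrast between the two kinds of analysis can be illustrated with a small simulation. This is a minimal sketch, not the authors' analysis: the two-arm layout, the item and repeat counts, and the variance components are all hypothetical, and a comparison of per-item means stands in for the random effects model named in the abstract (for balanced data the two approaches lead to similar standard errors for a between-item comparison).

```python
import math
import random
import statistics

# All settings below are hypothetical illustration values, not study data.
N_ITEMS_PER_ARM = 20   # distinct prompts (e.g. abstracts) per arm
N_REPEATS = 100        # times each prompt is re-submitted to the model
SD_BETWEEN = 1.5       # spread of the true score across items
SD_WITHIN = 1.0        # run-to-run noise for the same item


def simulate_arm(rng, n_items, n_repeats, sd_between, sd_within):
    """Return one list of repeated scores per item, sharing an item-level effect."""
    arm = []
    for _ in range(n_items):
        item_effect = rng.gauss(0.0, sd_between)
        arm.append([item_effect + rng.gauss(0.0, sd_within) for _ in range(n_repeats)])
    return arm


def se_of_mean_difference(a, b):
    """Standard error of the difference in means of two independent samples."""
    return math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))


rng = random.Random(0)
arm_a = simulate_arm(rng, N_ITEMS_PER_ARM, N_REPEATS, SD_BETWEEN, SD_WITHIN)
arm_b = simulate_arm(rng, N_ITEMS_PER_ARM, N_REPEATS, SD_BETWEEN, SD_WITHIN)

# Naive analysis: pool every output as if all were independent observations.
flat_a = [score for item in arm_a for score in item]
flat_b = [score for item in arm_b for score in item]
naive_se = se_of_mean_difference(flat_a, flat_b)

# Cluster-aware analysis: collapse each item to its mean, then compare items.
means_a = [statistics.fmean(item) for item in arm_a]
means_b = [statistics.fmean(item) for item in arm_b]
cluster_se = se_of_mean_difference(means_a, means_b)

print(f"naive SE:         {naive_se:.3f}")
print(f"cluster-aware SE: {cluster_se:.3f}")
```

Both arms are drawn from the same distribution, so there is no true difference; the naive standard error is nonetheless several times smaller than the cluster-aware one, which is how pooled repeat prompts can produce spuriously significant results.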

RESULTS

High correlation within groups was found when repeatedly prompting the model, with intraclass correlation coefficient of 0.69. Ignoring the inherent correlation in the data led to over 100-fold inflation of effective sample size. After appropriately accounting for this issue, the authors' results reverse from a small but highly significant finding to no evidence of model bias.
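The link between intraclass correlation and effective sample size can be sketched with the standard design-effect formula for equal-sized clusters, 1 + (m - 1) × ICC. Only the ICC of 0.69 comes from the abstract; the number of repeats per item and the number of items below are hypothetical, chosen to show how a large inflation can arise, and are not the study's actual design.

```python
def design_effect(icc: float, cluster_size: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_total: float, icc: float, cluster_size: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(icc, cluster_size)


ICC = 0.69               # reported in the abstract
REPEATS_PER_ITEM = 200   # hypothetical: repeat prompts per item
N_ITEMS = 50             # hypothetical: number of distinct items

n_total = REPEATS_PER_ITEM * N_ITEMS
deff = design_effect(ICC, REPEATS_PER_ITEM)
n_eff = effective_sample_size(n_total, ICC, REPEATS_PER_ITEM)

print(f"nominal sample size:   {n_total}")
print(f"design effect:         {deff:.2f}")
print(f"effective sample size: {n_eff:.1f}")
```

Under these assumed counts, treating every output as independent overstates the sample size by a factor equal to the design effect, which exceeds 100 once each item is prompted a few hundred times at this ICC.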

DISCUSSION

The establishment of best practices for LLM research is urgently needed, as demonstrated in this case where accounting for repeat prompting in analyses was critical for accurate study conclusions.

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5ecf/11756642/0484bda9f2a9/ocae294f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5ecf/11756642/254754f0afea/ocae294f2.jpg
