Gallo Robert J, Baiocchi Michael, Savage Thomas R, Chen Jonathan H
Center for Innovation to Implementation, VA Palo Alto Health Care System, Menlo Park, CA 94025, United States.
Department of Health Policy, Stanford University, Stanford, CA 94305, United States.
J Am Med Inform Assoc. 2025 Feb 1;32(2):386-390. doi: 10.1093/jamia/ocae294.
We aimed to demonstrate the importance of establishing best practices in large language model (LLM) research, using repeat prompting as an illustrative example.
Using data from a prior study investigating potential model bias in peer review of medical abstracts, we compared analyses that treat repeated model outputs as independent observations with a random-effects method that accounts for their correlation.
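A minimal sketch of the comparison described above, assuming simulated data rather than the study's dataset; the column names, repeat counts, and binary group attribute are hypothetical, and the mixed model shown is one standard way to implement a random-intercept analysis, not necessarily the authors' exact specification.

```python
# Sketch: naive analysis vs. random-effects analysis of repeat-prompt data.
# All data here are simulated; names and design parameters are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n_abstracts, n_repeats = 50, 30                    # hypothetical design
abstract_effect = rng.normal(0, 1.5, n_abstracts)  # shared per-abstract signal

rows = []
for i in range(n_abstracts):
    group = i % 2                                  # hypothetical binary attribute
    for _ in range(n_repeats):
        # No true group effect: scores vary only by abstract and noise.
        score = 5 + abstract_effect[i] + rng.normal(0, 1)
        rows.append({"abstract_id": i, "group": group, "score": score})
df = pd.DataFrame(rows)

# Naive analysis: treats all 1,500 outputs as independent observations.
a = df.loc[df.group == 0, "score"]
b = df.loc[df.group == 1, "score"]
print("naive t-test p =", stats.ttest_ind(a, b).pvalue)

# Random-effects analysis: a random intercept per abstract absorbs the
# within-abstract correlation induced by repeated prompting.
m = smf.mixedlm("score ~ group", df, groups=df["abstract_id"]).fit()
print("mixed model p =", m.pvalues["group"])
```

Because the simulated scores share a per-abstract component, the naive test can report spurious significance that the mixed model correctly discounts.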
Repeatedly prompting the model produced highly correlated outputs within groups, with an intraclass correlation coefficient of 0.69. Ignoring this inherent correlation led to more than 100-fold inflation of the effective sample size. After appropriately accounting for this issue, the original study's result reversed from a small but highly statistically significant finding to no evidence of model bias.
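The inflation can be seen with the standard design-effect formula for clustered data, n_eff = N / (1 + (m - 1) * ICC). This is a textbook calculation consistent with the reported numbers, not the paper's own code; the repeat count and number of abstracts below are assumptions chosen so the design effect exceeds 100.

```python
# Design-effect arithmetic for repeat-prompt (clustered) data.
icc = 0.69           # intraclass correlation reported in the study
m = 150              # hypothetical number of prompts per abstract
n_abstracts = 50     # hypothetical number of abstracts

n_total = n_abstracts * m             # 7,500 outputs treated as independent
design_effect = 1 + (m - 1) * icc     # variance inflation from clustering
n_effective = n_total / design_effect # ~72: barely more than n_abstracts

print(f"design effect = {design_effect:.1f}x")  # ~103.8x, i.e. >100-fold
print(f"effective n   = {n_effective:.0f} (vs {n_total} naive)")
```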
Best practices for LLM research are urgently needed, as demonstrated by this case, in which accounting for repeat prompting in the analysis was critical to reaching accurate study conclusions.