
Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment.

Authors

Savage Thomas, Wang John, Gallo Robert, Boukil Abdessalem, Patel Vishwesh, Safavi-Naini Seyed Amir Ahmad, Soroush Ali, Chen Jonathan H

Affiliations

Department of Medicine, Stanford University, Stanford, CA 94304, United States.

Division of Hospital Medicine, Stanford University, Stanford, CA 94304, United States.

Publication information

J Am Med Inform Assoc. 2025 Jan 1;32(1):139-149. doi: 10.1093/jamia/ocae254.

DOI: 10.1093/jamia/ocae254
PMID: 39396184
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11648734/

Abstract

INTRODUCTION

The inability of large language models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to estimate uncertainty in ways that are useful to physician-users.

OBJECTIVE

Evaluate the ability for uncertainty proxies to quantify LLM confidence when performing diagnosis and treatment selection tasks by assessing the properties of discrimination and calibration.

METHODS

We examined confidence elicitation (CE), token-level probability (TLP), and sample consistency (SC) proxies across GPT3.5, GPT4, Llama2, and Llama3. Uncertainty proxies were evaluated against 3 datasets of open-ended patient scenarios.
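
As a rough illustration of the sample consistency (SC) idea, the sketch below asks the same question several times and treats the share of samples that agree with the most frequent answer as a confidence score. The abstract describes SC by sentence embedding and by GPT annotation; the normalized exact-match comparison used here is a simplifying assumption, and the function names and sample answers are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def sample_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and the fraction of samples agreeing with it.

    Assumption: agreement is judged by normalized exact match, which is a
    stand-in for the embedding- or annotation-based comparison in the abstract.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical answers from five repeated samples of one patient scenario.
samples = [
    "Pulmonary embolism",
    "pulmonary embolism",
    "Pneumonia",
    "Pulmonary  embolism",
    "Myocardial infarction",
]
answer, confidence = sample_consistency(samples)
print(answer, confidence)
```

Exact matching would undercount agreement between synonymous diagnoses, which is presumably why a semantic comparison is preferable in practice.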

RESULTS

SC discrimination outperformed TLP and CE methods. SC by sentence embedding achieved the highest discriminative performance (ROC AUC 0.68-0.79), yet with poor calibration. SC by GPT annotation achieved the second-best discrimination (ROC AUC 0.66-0.74) with accurate calibration. Verbalized confidence (CE) was found to consistently overestimate model confidence.
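
The two properties reported above can be computed from a list of confidence scores and correctness labels. The sketch below is a minimal pure-Python version: ROC AUC for discrimination, and expected calibration error (ECE) as one possible calibration summary. The abstract does not state which calibration metric was used, so ECE, the bin count, and the toy data are assumptions for illustration only.

```python
def roc_auc(scores: list[float], labels: list[int]) -> float:
    """Probability that a correct answer outscores an incorrect one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect cases")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    scores: list[float], labels: list[int], n_bins: int = 5
) -> float:
    """Weighted mean gap between stated confidence and observed accuracy per bin."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(s for s, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / len(scores) * abs(mean_conf - accuracy)
    return ece


# Toy data (made up): confidence scores and whether each answer was correct.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(roc_auc(scores, labels))
print(expected_calibration_error(scores, labels))
```

A proxy can rank correct answers above incorrect ones (high AUC) while its raw scores still sit far from observed accuracy (high ECE), which is the pattern the results describe for SC by sentence embedding.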

DISCUSSION AND CONCLUSIONS

SC is the most effective method for estimating LLM uncertainty of the proxies evaluated. SC by sentence embedding can effectively estimate uncertainty if the user has a set of reference cases with which to re-calibrate their results, while SC by GPT annotation is the more effective method if the user does not have reference cases and requires accurate raw calibration. Our results confirm LLMs are consistently over-confident when verbalizing their confidence (CE).
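
Re-calibration against reference cases can be sketched with histogram binning: raw scores from cases with known outcomes are grouped into bins, and each bin's observed accuracy replaces the raw score. The abstract does not say which re-calibration method the authors have in mind, so the method, function names, and reference data below are all assumptions.

```python
def fit_histogram_binning(
    ref_scores: list[float], ref_labels: list[int], n_bins: int = 5
) -> list[float]:
    """Learn the observed accuracy in each score bin from reference cases."""
    if not ref_scores:
        raise ValueError("need at least one reference case")
    correct = [0] * n_bins
    counts = [0] * n_bins
    for s, y in zip(ref_scores, ref_labels):
        idx = min(int(s * n_bins), n_bins - 1)
        correct[idx] += y
        counts[idx] += 1
    overall = sum(ref_labels) / len(ref_labels)
    # Bins with no reference cases fall back to the overall accuracy.
    return [correct[i] / counts[i] if counts[i] else overall for i in range(n_bins)]


def recalibrate(score: float, bin_accuracy: list[float]) -> float:
    """Map a raw score to the accuracy observed for its bin in the reference set."""
    n_bins = len(bin_accuracy)
    return bin_accuracy[min(int(score * n_bins), n_bins - 1)]


# Hypothetical reference cases: raw proxy scores and whether the answer was correct.
ref_scores = [0.95, 0.92, 0.97, 0.91, 0.55, 0.52]
ref_labels = [1, 1, 1, 0, 0, 0]
table = fit_histogram_binning(ref_scores, ref_labels)
print(recalibrate(0.93, table))
```

Because the mapping only reassigns scores within bins, it can improve calibration without much affecting how well the proxy separates correct from incorrect answers.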

Similar articles

1. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment.
   J Am Med Inform Assoc. 2025 Jan 1;32(1):139-149. doi: 10.1093/jamia/ocae254.
2. Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
   Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.
3. Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.
   Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3.
4. Are Current Survival Prediction Tools Useful When Treating Subsequent Skeletal-related Events From Bone Metastases?
   Clin Orthop Relat Res. 2024 Sep 1;482(9):1710-1721. doi: 10.1097/CORR.0000000000003030. Epub 2024 Mar 22.
5. Diagnostic test accuracy and cost-effectiveness of tests for codeletion of chromosomal arms 1p and 19q in people with glioma.
   Cochrane Database Syst Rev. 2022 Mar 2;3(3):CD013387. doi: 10.1002/14651858.CD013387.pub2.
6. Use of Large Language Models to Classify Epidemiological Characteristics in Synthetic and Real-World Social Media Posts About Conjunctivitis Outbreaks: Infodemiology Study.
   J Med Internet Res. 2025 Jul 2;27:e65226. doi: 10.2196/65226.
7. A dataset and benchmark for hospital course summarization with adapted large language models.
   J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.
8. Algorithmic Classification of Psychiatric Disorder-Related Spontaneous Communication Using Large Language Model Embeddings: Algorithm Development and Validation.
   JMIR AI. 2025 May 30;4:e67369. doi: 10.2196/67369.
9. Surveillance of Barrett's oesophagus: exploring the uncertainty through systematic review, expert workshop and economic modelling.
   Health Technol Assess. 2006 Mar;10(8):1-142, iii-iv. doi: 10.3310/hta10080.
10. Can We Enhance Shared Decision-making for Periacetabular Osteotomy Surgery? A Qualitative Study of Patient Experiences.
    Clin Orthop Relat Res. 2025 Jan 1;483(1):120-136. doi: 10.1097/CORR.0000000000003198. Epub 2024 Jul 23.

Cited by

1. Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study.
   J Med Internet Res. 2025 Aug 29;27:e64348. doi: 10.2196/64348.
2. Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency.
   Eur Radiol Exp. 2025 Aug 19;9(1):79. doi: 10.1186/s41747-025-00591-0.
3. A study of calibration as a measurement of trustworthiness of large language models in biomedical natural language processing.
   JAMIA Open. 2025 Jul 11;8(4):ooaf058. doi: 10.1093/jamiaopen/ooaf058. eCollection 2025 Aug.
4. From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis.
   medRxiv. 2025 Jun 8:2025.06.07.25329176. doi: 10.1101/2025.06.07.25329176.
5. Current Applications, Challenges, and Future Directions of Artificial Intelligence in Emergency Medicine: A Narrative Review.
   Arch Acad Emerg Med. 2025 Apr 15;13(1):e45. doi: 10.22037/aaemj.v13i1.2712. eCollection 2025.
6. Identifying Deprescribing Opportunities With Large Language Models in Older Adults: Retrospective Cohort Study.
   JMIR Aging. 2025 Apr 11;8:e69504. doi: 10.2196/69504.
7. Employing large language models safely and effectively as a practicing neurosurgeon.
   Acta Neurochir (Wien). 2025 Apr 9;167(1):101. doi: 10.1007/s00701-025-06515-6.
8. Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability.
   JAMIA Open. 2025 Jan 10;8(1):ooae154. doi: 10.1093/jamiaopen/ooae154. eCollection 2025 Feb.
9. Leveraging artificial intelligence to reduce diagnostic errors in emergency medicine: Challenges, opportunities, and future directions.
   Acad Emerg Med. 2025 Mar;32(3):327-339. doi: 10.1111/acem.15066. Epub 2024 Dec 15.
10. Establishing best practices in large language model research: an application to repeat prompting.
    J Am Med Inform Assoc. 2025 Feb 1;32(2):386-390. doi: 10.1093/jamia/ocae294.
