
Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment.

Author Information

Savage Thomas, Wang John, Gallo Robert, Boukil Abdessalem, Patel Vishwesh, Safavi-Naini Seyed Amir Ahmad, Soroush Ali, Chen Jonathan H

Affiliations

Department of Medicine, Stanford University, Stanford, CA 94304, United States.

Division of Hospital Medicine, Stanford University, Stanford, CA 94304, United States.

Publication Information

J Am Med Inform Assoc. 2025 Jan 1;32(1):139-149. doi: 10.1093/jamia/ocae254.

Abstract

INTRODUCTION

The inability of large language models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to estimate uncertainty in ways that are useful to physician-users.

OBJECTIVE

Evaluate the ability of uncertainty proxies to quantify LLM confidence when performing diagnosis and treatment selection tasks by assessing the properties of discrimination and calibration.

METHODS

We examined confidence elicitation (CE), token-level probability (TLP), and sample consistency (SC) proxies across GPT3.5, GPT4, Llama2, and Llama3. Uncertainty proxies were evaluated against 3 datasets of open-ended patient scenarios.
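The three proxy families can be illustrated with a minimal Python sketch, shown below under stated assumptions: the model's verbalized reply, its answer-token log-probabilities, and a set of repeated samples are assumed to have already been collected, and the exact-match agreement measure stands in for the study's embedding- and GPT-annotation-based consistency variants.

```python
import math
import re
from collections import Counter

def confidence_elicitation(verbalized_reply: str) -> float:
    """CE: parse a verbalized 0-100 confidence from the model's own reply.
    Assumes the prompt asked the model to state its confidence as a percentage."""
    match = re.search(r"(\d{1,3})\s*%?", verbalized_reply)
    return min(int(match.group(1)), 100) / 100 if match else float("nan")

def token_level_probability(token_logprobs: list[float]) -> float:
    """TLP: geometric-mean probability of the answer tokens,
    i.e. exp of the average token log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def sample_consistency(sampled_answers: list[str]) -> float:
    """SC: fraction of repeated samples agreeing with the modal answer.
    Exact string match is a simplification; the paper's variants judge
    agreement via sentence embeddings or GPT annotation instead."""
    normalized = [a.strip().lower() for a in sampled_answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)

# Illustrative values only (not from the paper):
print(confidence_elicitation("I am about 85% confident the diagnosis is PE."))
print(token_level_probability([-0.10, -0.25, -0.05]))
print(sample_consistency(["pulmonary embolism"] * 7 + ["pneumonia"] * 3))
```

Each proxy returns a score in [0, 1], so all three can be compared on the same discrimination and calibration footing.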

RESULTS

SC discrimination outperformed TLP and CE methods. SC by sentence embedding achieved the highest discriminative performance (ROC AUC 0.68-0.79), yet with poor calibration. SC by GPT annotation achieved the second-best discrimination (ROC AUC 0.66-0.74) with accurate calibration. Verbalized confidence (CE) was found to consistently overestimate model confidence.
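As a rough illustration of how discrimination and calibration are typically quantified, the sketch below computes ROC AUC and a binned expected calibration error on hypothetical confidence scores and correctness labels; the data values and the five-bin ECE are assumptions for illustration, not figures from the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical evaluation data: 1 = the LLM answer was judged correct, paired
# with the proxy's confidence score for each case (values are made up).
correct = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
proxy_confidence = np.array([0.9, 0.4, 0.8, 0.7, 0.5, 0.85, 0.3, 0.6, 0.75, 0.45])

# Discrimination: does higher confidence separate correct from incorrect answers?
print("ROC AUC:", roc_auc_score(correct, proxy_confidence))

# Calibration: do stated confidence values match observed accuracy within bins?
def expected_calibration_error(conf, labels, n_bins=5):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - conf[mask].mean())
    return ece

print("ECE:", expected_calibration_error(proxy_confidence, correct))
```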

DISCUSSION AND CONCLUSIONS

SC is the most effective method for estimating LLM uncertainty of the proxies evaluated. SC by sentence embedding can effectively estimate uncertainty if the user has a set of reference cases with which to re-calibrate their results, while SC by GPT annotation is the more effective method if the user does not have reference cases and requires accurate raw calibration. Our results confirm LLMs are consistently over-confident when verbalizing their confidence (CE).
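For a user who does hold a set of labeled reference cases, re-calibrating raw SC-by-embedding scores could look like the sketch below, which fits a monotone mapping from proxy score to observed accuracy; isotonic regression and the example values are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Reference cases with known outcomes (illustrative values): raw SC-by-embedding
# scores and whether the corresponding LLM answer was actually correct.
reference_scores = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
reference_correct = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])

# Fit a monotone mapping from raw proxy score to observed accuracy.
recalibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
recalibrator.fit(reference_scores, reference_correct)

# Apply to new cases: calibrated values can now be read as estimated accuracy.
new_scores = np.array([0.88, 0.55, 0.25])
print(recalibrator.predict(new_scores))
```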


