Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment.

Author information

Savage Thomas, Wang John, Gallo Robert, Boukil Abdessalem, Patel Vishwesh, Safavi-Naini Seyed Amir Ahmad, Soroush Ali, Chen Jonathan H

Affiliations

Department of Medicine, Stanford University, Stanford, CA 94304, United States.

Division of Hospital Medicine, Stanford University, Stanford, CA 94304, United States.

Publication information

J Am Med Inform Assoc. 2025 Jan 1;32(1):139-149. doi: 10.1093/jamia/ocae254.

DOI: 10.1093/jamia/ocae254

PMID: 39396184

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11648734/

Abstract

INTRODUCTION

The inability of large language models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to estimate uncertainty in ways that are useful to physician-users.

OBJECTIVE

To evaluate the ability of uncertainty proxies to quantify LLM confidence in diagnosis and treatment selection tasks by assessing two properties: discrimination and calibration.

METHODS

We examined confidence elicitation (CE), token-level probability (TLP), and sample consistency (SC) proxies across GPT3.5, GPT4, Llama2, and Llama3. Uncertainty proxies were evaluated against 3 datasets of open-ended patient scenarios.
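
To make two of the proxies named above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract describes SC by sentence embedding and by GPT annotation, whereas this sketch substitutes simple normalized string matching, and it treats TLP as the geometric mean of per-token probabilities. The function names `sample_consistency` and `mean_token_probability` and the example answers are hypothetical.

```python
import math
from collections import Counter


def sample_consistency(answers):
    """Return the most common answer and the fraction of samples that agree with it.

    Assumption: agreement is judged by case-insensitive exact match, a crude
    stand-in for the embedding- or annotation-based comparison in the abstract.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


def mean_token_probability(token_logprobs):
    """Geometric-mean token probability: exp of the average log-probability.

    Assumption: this is one common way to aggregate token-level probabilities;
    the abstract does not specify the aggregation used.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical example: five sampled answers to the same patient scenario.
samples = [
    "Pulmonary embolism",
    "pulmonary embolism ",
    "Pneumonia",
    "Pulmonary embolism",
    "Aortic dissection",
]
answer, sc = sample_consistency(samples)

# Hypothetical per-token log-probabilities for one generated answer.
tlp = mean_token_probability([math.log(0.9), math.log(0.8), math.log(0.5)])
```

In this example three of the five samples agree, so the SC score is 0.6. Free-text diagnoses rarely match character for character, which is presumably why a semantic comparison is needed in practice.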

RESULTS

SC discrimination outperformed TLP and CE methods. SC by sentence embedding achieved the highest discriminative performance (ROC AUC 0.68-0.79), yet with poor calibration. SC by GPT annotation achieved the second-best discrimination (ROC AUC 0.66-0.74) with accurate calibration. Verbalized confidence (CE) was found to consistently overestimate model confidence.
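
The two properties reported above can be computed from a list of confidence scores and a matching list of correct/incorrect labels. The sketch below is a plain-Python illustration, assuming confidence scores in [0, 1]: discrimination as ROC AUC via pairwise comparison, and calibration summarized as expected calibration error (ECE) over equal-width bins. ECE is an assumption here, since the abstract does not name the calibration metric used, and the function names and data are hypothetical.

```python
def roc_auc(confidences, correct):
    """ROC AUC as the probability that a correct answer outranks an incorrect one.

    Ties between a correct and an incorrect answer count as half a win.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    if not confidences:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, y in b if y) / len(b)
            ece += len(b) / total * abs(mean_conf - accuracy)
    return ece


# Hypothetical scores and correctness labels (1 = correct, 0 = incorrect).
confidences = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
correct = [1, 1, 0, 1, 0, 0]
auc = roc_auc(confidences, correct)
ece = expected_calibration_error(confidences, correct)
```

A proxy can rank correct answers above incorrect ones (high AUC) while its raw scores still sit far from the observed accuracy (high ECE), which is the pattern the abstract reports for SC by sentence embedding.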

DISCUSSION AND CONCLUSIONS

Of the proxies evaluated, SC is the most effective method for estimating LLM uncertainty. SC by sentence embedding can effectively estimate uncertainty if the user has a set of reference cases with which to re-calibrate the results, while SC by GPT annotation is the more effective method if the user has no reference cases and requires accurate raw calibration. Our results confirm that LLMs are consistently over-confident when verbalizing their confidence (CE).
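
As an illustration of what re-calibrating against reference cases could look like, the sketch below uses histogram binning: raw scores from reference cases with known outcomes are grouped into bins, and a new score is replaced by the observed accuracy of its bin. This is one possible technique chosen for illustration; the abstract does not state which recalibration method the authors used. The function names and reference data are hypothetical, and scores are assumed to lie in [0, 1].

```python
def fit_histogram_binning(ref_scores, ref_correct, n_bins=5):
    """Learn per-bin accuracy from reference cases with known correctness."""
    if not ref_scores:
        raise ValueError("need at least one reference case")
    totals = [0] * n_bins
    hits = [0] * n_bins
    for s, y in zip(ref_scores, ref_correct):
        idx = min(int(s * n_bins), n_bins - 1)
        totals[idx] += 1
        hits[idx] += 1 if y else 0
    overall = sum(hits) / sum(totals)
    # Empty bins fall back to the overall reference accuracy.
    return [hits[i] / totals[i] if totals[i] else overall for i in range(n_bins)]


def recalibrate(score, bin_accuracy):
    """Map a raw uncertainty-proxy score to the accuracy observed in its bin."""
    n_bins = len(bin_accuracy)
    idx = min(int(score * n_bins), n_bins - 1)
    return bin_accuracy[idx]


# Hypothetical reference cases: raw proxy scores and whether the answer was correct.
ref_scores = [0.95, 0.92, 0.90, 0.85, 0.55, 0.50, 0.45, 0.30]
ref_correct = [1, 1, 0, 1, 1, 0, 0, 0]
table = fit_histogram_binning(ref_scores, ref_correct, n_bins=5)

# A new raw score of 0.93 maps to the accuracy seen among reference cases in its bin.
calibrated = recalibrate(0.93, table)
```

Here a raw score of 0.93 maps to 0.75, because three of the four reference cases in the top bin were correct. With only a handful of reference cases the per-bin estimates are noisy, so the number of bins should be kept small relative to the reference set.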
