Savage Thomas, Wang John, Gallo Robert, Boukil Abdessalem, Patel Vishwesh, Safavi-Naini Seyed Amir Ahmad, Soroush Ali, Chen Jonathan H
Department of Medicine, Stanford University, Stanford, CA 94304, United States.
Division of Hospital Medicine, Stanford University, Stanford, CA 94304, United States.
J Am Med Inform Assoc. 2025 Jan 1;32(1):139-149. doi: 10.1093/jamia/ocae254.
The inability of large language models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to estimate uncertainty in ways that are useful to physician-users.
Evaluate the ability of uncertainty proxies to quantify LLM confidence in diagnosis and treatment selection tasks, assessing the properties of discrimination and calibration.
We examined confidence elicitation (CE), token-level probability (TLP), and sample consistency (SC) proxies across GPT3.5, GPT4, Llama2, and Llama3. Uncertainty proxies were evaluated against 3 datasets of open-ended patient scenarios.
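The abstract does not include implementation details; as a rough, hedged illustration (not the authors' code), the token-level probability and sample-consistency proxies could be computed along the following lines, with the model outputs, resampled answers, and the agreement predicate all assumed to come from the reader's own LLM client (confidence elicitation is simply a prompt asking the model to verbalize a 0-100 score, so it is omitted here):

```python
# Minimal sketch of two generic uncertainty proxies for a free-text LLM answer.
# `token_logprobs` and `sample_answers` are assumed to be obtained separately
# from an LLM API; this is illustrative, not the paper's implementation.
import math

def tlp_confidence(token_logprobs: list[float]) -> float:
    """Token-level probability (TLP) proxy: geometric-mean token probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def sc_confidence(sample_answers: list[str], agree) -> float:
    """Sample-consistency (SC) proxy: fraction of resampled answers that agree
    with the primary answer under a user-supplied agree(a, b) predicate
    (e.g., sentence-embedding similarity above a threshold, or a GPT judge)."""
    primary, rest = sample_answers[0], sample_answers[1:]
    return sum(agree(primary, other) for other in rest) / len(rest)

# Toy usage with a hypothetical exact-match agreement function:
answers = ["acute pancreatitis", "acute pancreatitis", "cholecystitis", "acute pancreatitis"]
print(sc_confidence(answers, agree=lambda a, b: a == b))  # 0.666...
```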
SC discrimination outperformed the TLP and CE methods. SC by sentence embedding achieved the highest discriminative performance (ROC AUC 0.68-0.79) but was poorly calibrated. SC by GPT annotation achieved the second-best discrimination (ROC AUC 0.66-0.74) with accurate calibration. Verbalized confidence (CE) consistently overestimated model confidence.
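For readers unfamiliar with the two evaluation properties, a sketch of how discrimination (ROC AUC against per-case correctness) and calibration (a simple binned expected calibration error) might be scored is shown below; the data are illustrative only and the metrics are standard choices, not necessarily the exact ones used in the paper.

```python
# Hedged sketch: score a proxy's discrimination (ROC AUC) and calibration (ECE)
# given per-case confidence estimates and whether each answer was correct.
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: bin-weighted gap between mean confidence and accuracy."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf >= lo) & (conf < hi) if hi < 1.0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

conf = [0.9, 0.6, 0.4, 0.95, 0.3, 0.7]   # proxy confidence per case (toy values)
correct = [1, 1, 0, 1, 0, 0]             # whether the model's answer was correct
print("ROC AUC:", roc_auc_score(correct, conf))
print("ECE:", expected_calibration_error(conf, correct))
```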
Of the proxies evaluated, SC is the most effective method for estimating LLM uncertainty. SC by sentence embedding can estimate uncertainty effectively if the user has a set of reference cases with which to re-calibrate the results, while SC by GPT annotation is the more effective method if the user lacks reference cases and requires accurate raw calibration. Our results confirm that LLMs are consistently overconfident when verbalizing their confidence (CE).
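The "re-calibrate with reference cases" idea can be illustrated as fitting a monotone mapping from raw proxy scores to observed accuracy on a labeled reference set and applying it to new cases; one common choice is isotonic regression, sketched below with hypothetical data (this is an assumption about how such re-calibration could be done, not the authors' procedure).

```python
# Hedged sketch: re-calibrate raw proxy confidences using labeled reference cases.
from sklearn.isotonic import IsotonicRegression

ref_conf    = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4]  # raw proxy scores on reference cases
ref_correct = [1,    1,   1,    0,   1,   0,   0,   0]    # correctness labels on reference cases

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(ref_conf, ref_correct)

new_conf = [0.92, 0.55]                   # raw scores on new, unlabeled cases
print(calibrator.predict(new_conf))       # calibrated confidence estimates
```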