Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability.

Authors

Gao Yanjun, Myers Skatje, Chen Shan, Dligach Dmitriy, Miller Timothy, Bitterman Danielle S, Chen Guanhua, Mayampurath Anoop, Churpek Matthew M, Afshar Majid

Affiliations

Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States.

Department of Medicine, University of Wisconsin-Madison, Madison, WI 53792, United States.

Publication information

JAMIA Open. 2025 Jan 10;8(1):ooae154. doi: 10.1093/jamiaopen/ooae154. eCollection 2025 Feb.

Abstract

OBJECTIVE

To evaluate large language models (LLMs) for pre-test diagnostic probability estimation and compare their uncertainty estimation performance with a traditional machine learning classifier.

MATERIALS AND METHODS

We assessed 2 instruction-tuned LLMs, Mistral-7B-Instruct and Llama3-70B-chat-hf, on predicting binary outcomes for Sepsis, Arrhythmia, and Congestive Heart Failure (CHF) using electronic health record (EHR) data from 660 patients. Three uncertainty estimation methods (Verbalized Confidence, Token Logits, and LLM Embedding+XGB) were compared against an eXtreme Gradient Boosting (XGB) classifier trained on raw EHR data. Performance metrics included AUROC and Pearson correlation between predicted probabilities.
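As a rough illustration of the Token Logits idea and the two metrics named above, the sketch below turns a pair of answer-token logits into a probability and then scores it with AUROC and Pearson correlation. It is only a minimal sketch under stated assumptions: the abstract does not describe the authors' exact procedure, so the "yes"/"no" softmax, the function names (yes_probability, auroc, pearson), and all numbers are hypothetical and are not taken from the study.

```python
import math

def yes_probability(logit_yes, logit_no):
    """Softmax over two answer-token logits; returns P("yes").
    Assumption: the diagnosis question is answered with a single yes/no token."""
    m = max(logit_yes, logit_no)          # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def auroc(labels, scores):
    """AUROC as the chance that a random positive case outscores a random
    negative case, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of probabilities."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: six patients, made-up logits and baseline probabilities.
labels = [1, 0, 1, 0, 1, 0]
logit_pairs = [(2.0, 0.0), (0.0, 1.0), (0.5, 0.5), (1.0, 0.0), (3.0, 1.0), (-1.0, 1.0)]
llm_probs = [yes_probability(a, b) for a, b in logit_pairs]
baseline_probs = [0.90, 0.10, 0.70, 0.30, 0.80, 0.05]  # stand-in for a classifier's output

print("LLM AUROC:", round(auroc(labels, llm_probs), 3))
print("Baseline AUROC:", round(auroc(labels, baseline_probs), 3))
print("Pearson r (LLM vs baseline):", round(pearson(llm_probs, baseline_probs), 3))
```

AUROC measures only how well the scores rank positive cases above negative ones, while the Pearson correlation shows whether two sets of predicted probabilities move together, so a method can do reasonably on one and poorly on the other.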

RESULTS

The XGB classifier outperformed the LLM-based methods across all tasks. LLM Embedding+XGB showed the closest performance to the XGB baseline, while Verbalized Confidence and Token Logits underperformed.

DISCUSSION

These findings, consistent across multiple models and demographic groups, highlight the limitations of current LLMs in providing reliable pre-test probability estimations and underscore the need for improved calibration and bias mitigation strategies. Future work should explore hybrid approaches that integrate LLMs with numerical reasoning modules and calibrated embeddings to enhance diagnostic accuracy and ensure fairer predictions across diverse populations.

CONCLUSIONS

LLMs demonstrate potential but currently fall short in estimating diagnostic probabilities compared to traditional machine learning classifiers trained on structured EHR data. Further improvements are needed for reliable clinical use.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f852/11723528/c4200fdc1ade/ooae154f1.jpg
