Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability.

Authors

Gao Yanjun, Myers Skatje, Chen Shan, Dligach Dmitriy, Miller Timothy, Bitterman Danielle S, Chen Guanhua, Mayampurath Anoop, Churpek Matthew M, Afshar Majid

Affiliations

Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States.

Department of Medicine, University of Wisconsin-Madison, Madison, WI 53792, United States.

Publication information

JAMIA Open. 2025 Jan 10;8(1):ooae154. doi: 10.1093/jamiaopen/ooae154. eCollection 2025 Feb.

DOI:10.1093/jamiaopen/ooae154
PMID:39802674
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11723528/
Abstract

OBJECTIVE

To evaluate large language models (LLMs) for pre-test diagnostic probability estimation and compare their uncertainty estimation performance with a traditional machine learning classifier.

MATERIALS AND METHODS

We assessed 2 instruction-tuned LLMs, Mistral-7B-Instruct and Llama3-70B-chat-hf, on predicting binary outcomes for Sepsis, Arrhythmia, and Congestive Heart Failure (CHF) using electronic health record (EHR) data from 660 patients. Three uncertainty estimation methods (Verbalized Confidence, Token Logits, and LLM Embedding+XGB) were compared against an eXtreme Gradient Boosting (XGB) classifier trained on raw EHR data. Performance metrics included AUROC and Pearson correlation between predicted probabilities.
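
As a rough illustration of the Token Logits method and the two metrics named above, the sketch below turns a pair of answer-token logits into a probability and scores the result with AUROC and Pearson correlation. It is not the authors' code. The function names, the "yes"/"no" answer tokens, and all numbers are hypothetical assumptions made for the example; the abstract does not specify how the logits were extracted or normalized.

```python
import math

def yes_probability(logit_yes, logit_no):
    """Softmax over two answer-token logits; returns P("yes").
    Assumption: the diagnosis question is answered with a yes/no token."""
    m = max(logit_yes, logit_no)          # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: six patients, one binary outcome.
labels = [1, 0, 1, 0, 1, 0]
token_logits = [(2.0, 0.5), (1.5, 1.0), (0.2, 0.1), (0.3, 0.9), (1.0, -1.0), (-0.5, 0.5)]
xgb_probs = [0.9, 0.2, 0.7, 0.1, 0.8, 0.3]   # stand-in for a baseline classifier

llm_probs = [yes_probability(y, n) for y, n in token_logits]
print("LLM AUROC:", auroc(labels, llm_probs))
print("XGB AUROC:", auroc(labels, xgb_probs))
print("Pearson(LLM, XGB):", pearson(llm_probs, xgb_probs))
```

The softmax output is a normalized next-token score. Whether it behaves like a pre-test probability is the question the study examines, so the value should not be read as a clinical probability without that kind of check.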

RESULTS

The XGB classifier outperformed the LLM-based methods across all tasks. LLM Embedding+XGB showed the closest performance to the XGB baseline, while Verbalized Confidence and Token Logits underperformed.

DISCUSSION

These findings, consistent across multiple models and demographic groups, highlight the limitations of current LLMs in providing reliable pre-test probability estimations and underscore the need for improved calibration and bias mitigation strategies. Future work should explore hybrid approaches that integrate LLMs with numerical reasoning modules and calibrated embeddings to enhance diagnostic accuracy and ensure fairer predictions across diverse populations.
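
One generic way to check the calibration mentioned above is to compare predicted probabilities with observed outcome rates. The sketch below computes a Brier score and a binned expected calibration error. Both are standard measures, but the abstract does not say the authors used them, and the function names, bin count, and inputs are assumptions chosen for illustration.

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

def expected_calibration_error(labels, probs, n_bins=5):
    """Weighted mean gap between mean predicted probability and observed
    event rate, computed over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for _, p in members) / len(members)
        event_rate = sum(y for y, _ in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - event_rate)
    return ece

# Hypothetical example: an overconfident model that predicts 0.9 for every
# patient although only half of them have the condition.
labels = [1, 1, 0, 0]
probs = [0.9, 0.9, 0.9, 0.9]
print("Brier score:", brier_score(labels, probs))
print("ECE:", expected_calibration_error(labels, probs))
```

A model can rank patients well (high AUROC) and still be poorly calibrated, which is why discrimination and calibration are usually reported separately.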

CONCLUSIONS

LLMs demonstrate potential but currently fall short in estimating diagnostic probabilities compared to traditional machine learning classifiers trained on structured EHR data. Further improvements are needed for reliable clinical use.

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f852/11723528/c4200fdc1ade/ooae154f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f852/11723528/4a62d7bba6f7/ooae154f2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f852/11723528/94c591be4503/ooae154f3.jpg
