Probabilistic medical predictions of large language models.

Authors

Gu Bowen, Desai Rishi J, Lin Kueiyu Joshua, Yang Jie

Affiliations

Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.

Harvard Data Science Initiative, Harvard University, Cambridge, MA, USA.

Publication information

NPJ Digit Med. 2024 Dec 19;7(1):367. doi: 10.1038/s41746-024-01366-4.

DOI:10.1038/s41746-024-01366-4
PMID:39702641
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11659327/
Abstract

Large Language Models (LLMs) have shown promise in clinical applications through prompt engineering, allowing flexible clinical predictions. However, they struggle to produce reliable prediction probabilities, which are crucial for transparency and decision-making. While explicit prompts can lead LLMs to generate probability estimates, their numerical reasoning limitations raise concerns about reliability. We compared explicit probabilities from text generation to implicit probabilities derived from the likelihood of predicting the correct label token. Across six advanced open-source LLMs and five medical datasets, explicit probabilities consistently underperformed implicit probabilities in discrimination, precision, and recall. This discrepancy is more pronounced with smaller LLMs and imbalanced datasets, highlighting the need for cautious interpretation, improved probability estimation methods, and further research for clinical use of LLMs.
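The two kinds of probability compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the "yes"/"no" label tokens, the number-parsing rule, and the use of AUROC as the discrimination measure are all assumptions made for illustration. The implicit probability is taken here as a softmax over the logits the model assigns to the candidate label tokens, while the explicit probability is simply a number parsed out of the generated text.

```python
import math
import re


def implicit_probability(label_logits, positive="yes"):
    """Softmax over the candidate label tokens' logits; returns P(positive).

    `label_logits` is a hypothetical dict such as {"yes": 2.1, "no": 0.3},
    holding the model's logits for each label token at the answer position.
    """
    top = max(label_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - top) for tok, v in label_logits.items()}
    return exps[positive] / sum(exps.values())


def explicit_probability(generated_text):
    """Parse the first number in the generated text as a probability in [0, 1].

    Returns None when the text contains no number (an assumed convention).
    """
    match = re.search(r"\d+(?:\.\d+)?", generated_text)
    if match is None:
        return None
    value = float(match.group())
    if generated_text[match.end():match.end() + 1] == "%":
        value /= 100  # "80%" is read as 0.8
    return min(max(value, 0.0), 1.0)


def auroc(scores, labels):
    """Chance that a random positive case outranks a random negative one.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With both scores computed for the same cases, calling `auroc` once on the implicit scores and once on the explicit scores gives a like-for-like comparison of discrimination. Precision and recall would additionally require choosing a decision threshold, which this sketch leaves out.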


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6038/11659327/dd423338c8c5/41746_2024_1366_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6038/11659327/1be1b4aa8274/41746_2024_1366_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6038/11659327/0e650c853397/41746_2024_1366_Fig3_HTML.jpg

