

Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

Authors

Mahmud Omar, Reem Agbareia, Benjamin S Glicksberg, Girish N Nadkarni, Eyal Klang

Affiliations

Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, Gustave L. Levy Place, New York, NY 10029, United States.

Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel.

Publication

JMIR Med Inform. 2025 May 16;13:e66917. doi: 10.2196/66917.

DOI:10.2196/66917
PMID:40378406
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12101789/
Abstract

BACKGROUND

The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored.

OBJECTIVE

This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses.

METHODS

We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests.
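The two analyses described above (a Pearson correlation between per-model accuracy and mean confidence on correct answers, and a two-sample, two-tailed t test comparing confidence on correct versus incorrect answers) can be sketched as follows. All numbers here are illustrative placeholders, not the study's data:

```python
import numpy as np
from scipy import stats

# Hypothetical per-model summaries that mirror the study's pattern
# (lower-accuracy models reporting higher confidence); not the paper's data.
accuracy = np.array([0.74, 0.68, 0.61, 0.55, 0.50, 0.46])
mean_confidence = np.array([0.63, 0.66, 0.70, 0.72, 0.74, 0.76])

# Correlation between mean confidence for correct answers and model accuracy
r, p_corr = stats.pearsonr(mean_confidence, accuracy)
print(f"r = {r:.2f}, P = {p_corr:.3f}")  # inverse correlation, as in the study

# Two-sample, two-tailed t test on one hypothetical model's confidence
# scores (in %) for its correct vs incorrect answers
conf_correct = np.array([71.0, 68.0, 74.0, 70.0, 73.0, 69.0])
conf_incorrect = np.array([67.0, 66.0, 70.0, 65.0, 68.0, 64.0])
t, p_ttest = stats.ttest_ind(conf_correct, conf_incorrect)
gap = conf_correct.mean() - conf_incorrect.mean()
print(f"mean confidence gap = {gap:.1f} points, P = {p_ttest:.3f}")
```

With `stats.ttest_ind` defaulting to equal variances and a two-sided alternative, this matches the abstract's description of a 2-sample, 2-tailed t test.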

RESULTS

The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model-GPT-4o-had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model-Qwen2-7B-showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003).

CONCLUSIONS

Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d14/12101789/be4135a2ea29/medinform-v13-e66917-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d14/12101789/dbb1065400d0/medinform-v13-e66917-g002.jpg

