

Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study.

Author Information

Omar Mahmud, Agbareia Reem, Glicksberg Benjamin S, Nadkarni Girish N, Klang Eyal

Affiliations

Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, Gustave L. Levy Place, New York, NY, 10029, United States, 1 212 241 6500.

Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel.

Publication Information

JMIR Med Inform. 2025 May 16;13:e66917. doi: 10.2196/66917.

Abstract

BACKGROUND

The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored.

OBJECTIVE

This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses.

METHODS

We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to report their confidence that each answer was correct (score range: 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests.
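The two analyses described above (the accuracy-versus-confidence correlation across models, and the correct-versus-incorrect t test within a model) can be sketched in a few lines. The numbers below are made up for illustration, not the study's data, and `scipy` is assumed to be available:

```python
# Sketch of the study's two statistical analyses with made-up numbers
# (not the paper's data); assumes scipy is installed.
from scipy.stats import pearsonr, ttest_ind

# Hypothetical per-model summaries: overall accuracy and the mean
# confidence each model assigned to its correct answers (one pair per model).
accuracy = [0.74, 0.69, 0.61, 0.55, 0.46]
mean_conf_correct = [0.63, 0.66, 0.70, 0.73, 0.76]

# Analysis 1: Pearson correlation between mean confidence on correct
# answers and overall model accuracy.
r, p = pearsonr(mean_conf_correct, accuracy)
print(f"r = {r:.2f}, P = {p:.3f}")  # r is negative here: the less accurate models report higher confidence

# Analysis 2: two-sample, two-tailed t test on the confidence scores a
# single hypothetical model gave for its correct vs incorrect answers.
conf_correct = [0.65, 0.70, 0.62, 0.68, 0.66, 0.71]
conf_incorrect = [0.61, 0.64, 0.60, 0.63, 0.62, 0.65]
t, p_diff = ttest_ind(conf_correct, conf_incorrect)
gap = sum(conf_correct) / len(conf_correct) - sum(conf_incorrect) / len(conf_incorrect)
print(f"mean confidence gap = {gap:.3f}, P = {p_diff:.3f}")
```

A small mean gap with a nonsignificant P value would mirror the paper's central finding: the models' confidence barely distinguishes their right answers from their wrong ones.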

RESULTS

The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model, GPT-4o, had a mean accuracy of 74% (SD 9.4%) with a mean confidence of 63% (SD 8.3%), whereas a low-performing model, Qwen2-7B, showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003).

CONCLUSIONS

Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d14/12101789/be4135a2ea29/medinform-v13-e66917-g001.jpg
