Griot Maxime, Hemptinne Coralie, Vanderdonckt Jean, Yuksel Demet
Institute of NeuroScience, Université catholique de Louvain, Brussels, Belgium.
Louvain Research Institute in Management and Organizations, Université catholique de Louvain, Louvain-la-Neuve, Belgium.
Nat Commun. 2025 Jan 14;16(1):642. doi: 10.1038/s41467-024-55628-6.
Large Language Models have demonstrated expert-level accuracy on medical board examinations, suggesting potential for clinical decision support systems. However, their metacognitive abilities, crucial for medical decision-making, remain largely unexplored. To address this gap, we developed MetaMedQA, a benchmark incorporating confidence scores and metacognitive tasks into multiple-choice medical questions. We evaluated twelve models on dimensions including confidence-based accuracy, missing answer recall, and unknown recall. Despite high accuracy on multiple-choice questions, our study revealed significant metacognitive deficiencies across all tested models. Models consistently failed to recognize their knowledge limitations and provided confident answers even when correct options were absent. In this work, we show that current models exhibit a critical disconnect between perceived and actual capabilities in medical reasoning, posing significant risks in clinical settings. Our findings emphasize the need for more robust evaluation frameworks that incorporate metacognitive abilities, essential for developing reliable Large Language Model-enhanced clinical decision support systems.
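As an illustration of the recall-style metrics named above, the following is a minimal sketch of how "missing answer recall" and "unknown recall" could be computed over labeled questions. The labels (`"MISSING"` for questions whose correct option was removed, `"UNKNOWN"` for questions the model should decline to answer) and the data are illustrative assumptions, not the actual MetaMedQA schema.

```python
# Hypothetical sketch: recall over a special gold label.
# Labels and data below are assumptions for illustration,
# not the MetaMedQA benchmark's actual format.

def recall(predictions, gold, target_label):
    """Fraction of questions whose gold label is `target_label`
    that the model also answered with that label."""
    relevant = [p for p, g in zip(predictions, gold) if g == target_label]
    if not relevant:
        return 0.0
    return sum(p == target_label for p in relevant) / len(relevant)

# Toy data: "MISSING" marks questions with the correct option absent,
# "UNKNOWN" marks questions the model should acknowledge it cannot answer.
gold  = ["A", "MISSING", "UNKNOWN", "MISSING", "B"]
preds = ["A", "MISSING", "C",       "B",       "B"]

missing_answer_recall = recall(preds, gold, "MISSING")  # 1 of 2 -> 0.5
unknown_recall = recall(preds, gold, "UNKNOWN")         # 0 of 1 -> 0.0
```

Under this framing, a model that confidently picks an option whenever one is offered, as the abstract reports, scores near zero on both metrics regardless of its multiple-choice accuracy.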