Dinc Mehmed T, Bardak Ali E, Bahar Furkan, Noronha Craig
Department of Medicine, Boston Medical Center, Boston, MA 02118, United States.
Department of Medicine, St Elizabeth's Medical Center, Boston, MA 02135, United States.
JAMIA Open. 2025 Jun 12;8(3):ooaf055. doi: 10.1093/jamiaopen/ooaf055. eCollection 2025 Jun.
This study aimed to systematically evaluate and compare the diagnostic performance of leading large language models (LLMs) in common and complex clinical scenarios, and to assess their potential to enhance clinical reasoning and diagnostic accuracy in authentic clinical decision-making.
Diagnostic capabilities of advanced LLMs (Anthropic's Claude, OpenAI's GPT variants, Google's Gemini) were assessed using 60 common cases and 104 complex, real-world cases from Clinical Problem Solvers' morning rounds. Clinical details were disclosed in stages, mirroring authentic clinical decision-making. Models were evaluated on primary and differential diagnosis accuracy at each stage.
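The staged-disclosure protocol described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the stage list, the `query_model` callable, and the naive string-match scoring stand in for the study's actual cases, API calls, and grading.

```python
# Hypothetical sketch of staged information disclosure: the model is
# queried after each new block of clinical information is revealed,
# mirroring how details accumulate in real decision-making.

# Illustrative stage labels (the study's cases contain real clinical text).
CASE_STAGES = [
    "Chief complaint and history of present illness",
    "Past medical history and medications",
    "Physical examination findings",
    "Laboratory and imaging results",
]


def evaluate_case(stages, final_diagnosis, query_model):
    """Query the model at each disclosure stage and record its answers.

    `query_model` is any callable that takes a prompt string and returns
    the model's text response (e.g., a wrapper around an LLM API).
    """
    results = []
    disclosed = []
    for stage_text in stages:
        disclosed.append(stage_text)
        prompt = (
            "Given the clinical information so far, state your single "
            "most likely diagnosis and a differential diagnosis list.\n\n"
            + "\n".join(disclosed)
        )
        answer = query_model(prompt)
        results.append({
            "stage": len(disclosed),
            "response": answer,
            # Naive substring match as a stand-in for the study's
            # LLM-based grading of primary-diagnosis accuracy.
            "primary_correct": final_diagnosis.lower() in answer.lower(),
        })
    return results
```

Passing the query function as a parameter keeps the sketch model-agnostic, so the same loop could drive Claude, GPT, or Gemini behind a common wrapper.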
Advanced LLMs showed high diagnostic accuracy (>90%) in common scenarios, with Claude 3.7 achieving perfect accuracy (100%) in certain conditions. In complex cases, Claude 3.7 achieved the highest accuracy (83.3%) at the final diagnostic stage, significantly outperforming smaller models. Smaller models notably performed well in common scenarios, matching the performance of larger models.
This study evaluated leading LLMs for diagnostic accuracy using staged information disclosure, mirroring real-world practice. Notably, Claude 3.7 Sonnet was the top performer. Employing a novel LLM-based evaluation method for large-scale analysis, the research highlights artificial intelligence's (AI's) potential to enhance diagnostics. It underscores the need for practical frameworks to translate diagnostic accuracy into clinical impact and to integrate AI into medical education.
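One common way to realize an LLM-based evaluation method like the one mentioned above is LLM-as-judge grading, where a grader model compares a candidate answer against the reference diagnosis. The prompt wording and verdict parsing below are illustrative assumptions, not the study's actual grading code.

```python
# Hedged sketch of LLM-as-judge grading: a grader model receives the
# reference diagnosis and the candidate answer, and returns a verdict
# that is parsed into a boolean score for large-scale aggregation.

def build_judge_prompt(model_answer: str, reference_diagnosis: str) -> str:
    """Construct a grading prompt for a judge LLM (wording is illustrative)."""
    return (
        "You are grading a diagnostic answer.\n"
        f"Reference diagnosis: {reference_diagnosis}\n"
        f"Candidate answer: {model_answer}\n"
        "Reply with exactly CORRECT or INCORRECT, judging whether the "
        "candidate names the same disease entity as the reference."
    )


def parse_verdict(judge_reply: str) -> bool:
    """Map the grader's free-text reply to a boolean score.

    startswith() avoids the substring pitfall that 'INCORRECT'
    contains 'CORRECT'.
    """
    return judge_reply.strip().upper().startswith("CORRECT")
```

Automating the grading step this way is what makes evaluation across all 164 cases and multiple disclosure stages tractable at scale.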
Leading LLMs show remarkable diagnostic accuracy in diverse clinical cases. To fully realize their potential for improving patient care, we must now focus on creating practical implementation frameworks and translational research to integrate these powerful AI tools into medicine.