Dinc Mehmed T, Bardak Ali E, Bahar Furkan, Noronha Craig
Department of Medicine, Boston Medical Center, Boston, MA 02118, United States.
Department of Medicine, St Elizabeth's Medical Center, Boston, MA 02135, United States.
JAMIA Open. 2025 Jun 12;8(3):ooaf055. doi: 10.1093/jamiaopen/ooaf055. eCollection 2025 Jun.
This study aimed to systematically evaluate and compare the diagnostic performance of leading large language models (LLMs) in common and complex clinical scenarios, and to assess their potential to enhance clinical reasoning and diagnostic accuracy in authentic clinical decision-making.
Diagnostic capabilities of advanced LLMs (Anthropic's Claude, OpenAI's GPT variants, Google's Gemini) were assessed using 60 common cases and 104 complex, real-world cases from Clinical Problem Solvers' morning rounds. Clinical details were disclosed in stages, mirroring authentic clinical decision-making. Models were evaluated on primary and differential diagnosis accuracy at each stage.
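The staged-disclosure protocol described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the stage list, the `query_model` callable, and the naive string-match scoring stand in for the study's actual cases, API calls, and grading.

```python
# Hypothetical sketch of staged information disclosure: the model is
# queried after each new block of clinical information is revealed,
# mirroring how details accumulate in real decision-making.

# Illustrative stage labels (the study's cases contain real clinical text).
CASE_STAGES = [
    "Chief complaint and history of present illness",
    "Past medical history and medications",
    "Physical examination findings",
    "Laboratory and imaging results",
]


def evaluate_case(stages, final_diagnosis, query_model):
    """Query the model at each disclosure stage and record its answers.

    `query_model` is any callable that takes a prompt string and returns
    the model's text response (e.g., a wrapper around an LLM API).
    """
    results = []
    disclosed = []
    for stage_text in stages:
        disclosed.append(stage_text)
        prompt = (
            "Given the clinical information so far, state your single "
            "most likely diagnosis and a differential diagnosis list.\n\n"
            + "\n".join(disclosed)
        )
        answer = query_model(prompt)
        results.append({
            "stage": len(disclosed),
            "response": answer,
            # Naive substring match as a stand-in for the study's
            # LLM-based grading of primary-diagnosis accuracy.
            "primary_correct": final_diagnosis.lower() in answer.lower(),
        })
    return results
```

Passing the query function as a parameter keeps the sketch model-agnostic, so the same loop could drive Claude, GPT, or Gemini behind a common wrapper.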
Advanced LLMs showed high diagnostic accuracy (>90%) in common scenarios, with Claude 3.7 achieving perfect accuracy (100%) in certain conditions. In complex cases, Claude 3.7 achieved the highest accuracy (83.3%) at the final diagnostic stage, significantly outperforming smaller models. Smaller models notably performed well in common scenarios, matching the performance of larger models.
This study evaluated leading LLMs for diagnostic accuracy using staged information disclosure, mirroring real-world practice. Notably, Claude 3.7 Sonnet was the top performer. Employing a novel LLM-based evaluation method for large-scale analysis, the research highlights artificial intelligence's (AI's) potential to enhance diagnostics. It underscores the need for practical frameworks to translate diagnostic accuracy into clinical impact and to integrate AI into medical education.
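One common way to realize an LLM-based evaluation method like the one mentioned above is LLM-as-judge grading, where a grader model compares a candidate answer against the reference diagnosis. The prompt wording and verdict parsing below are illustrative assumptions, not the study's actual grading code.

```python
# Hedged sketch of LLM-as-judge grading: a grader model receives the
# reference diagnosis and the candidate answer, and returns a verdict
# that is parsed into a boolean score for large-scale aggregation.

def build_judge_prompt(model_answer: str, reference_diagnosis: str) -> str:
    """Construct a grading prompt for a judge LLM (wording is illustrative)."""
    return (
        "You are grading a diagnostic answer.\n"
        f"Reference diagnosis: {reference_diagnosis}\n"
        f"Candidate answer: {model_answer}\n"
        "Reply with exactly CORRECT or INCORRECT, judging whether the "
        "candidate names the same disease entity as the reference."
    )


def parse_verdict(judge_reply: str) -> bool:
    """Map the grader's free-text reply to a boolean score.

    startswith() avoids the substring pitfall that 'INCORRECT'
    contains 'CORRECT'.
    """
    return judge_reply.strip().upper().startswith("CORRECT")
```

Automating the grading step this way is what makes evaluation across all 164 cases and multiple disclosure stages tractable at scale.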
Leading LLMs show remarkable diagnostic accuracy in diverse clinical cases. To fully realize their potential for improving patient care, we must now focus on creating practical implementation frameworks and translational research to integrate these powerful AI tools into medicine.