
Comparative analysis of large language models in clinical diagnosis: performance evaluation across common and complex medical cases.

Authors

Dinc Mehmed T, Bardak Ali E, Bahar Furkan, Noronha Craig

Affiliations

Department of Medicine, Boston Medical Center, Boston, MA 02118, United States.

Department of Medicine, St Elizabeth's Medical Center, Boston, MA 02135, United States.

Publication information

JAMIA Open. 2025 Jun 12;8(3):ooaf055. doi: 10.1093/jamiaopen/ooaf055. eCollection 2025 Jun.

DOI:10.1093/jamiaopen/ooaf055
PMID:40510808
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12161448/

Abstract

OBJECTIVES

This study aimed to systematically evaluate and compare the diagnostic performance of leading large language models (LLMs) in common and complex clinical scenarios, assessing their potential for enhancing clinical reasoning and diagnostic accuracy in authentic clinical decision-making processes.

MATERIALS AND METHODS

Diagnostic capabilities of advanced LLMs (Anthropic's Claude, OpenAI's GPT variants, Google's Gemini) were assessed using 60 common cases and 104 complex, real-world cases from Clinical Problem Solvers' morning rounds. Clinical details were disclosed in stages, mirroring authentic clinical decision-making. Models were evaluated on primary and differential diagnosis accuracy at each stage.
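
As a rough illustration of stage-wise scoring, the sketch below computes primary and differential diagnosis accuracy at each disclosure stage. It is a minimal sketch, not the authors' pipeline: the `StageResult` record, the `accuracy_by_stage` helper, and the four toy rows are hypothetical and carry no data from the study. It also assumes each model answer has already been judged correct or incorrect against the reference diagnosis.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    """One judged model answer (hypothetical record layout)."""
    case_id: str
    stage: int                  # disclosure stage, 1 = earliest
    primary_correct: bool       # top diagnosis matches the reference
    differential_correct: bool  # reference appears in the differential list


def accuracy_by_stage(results):
    """Return {stage: (primary_accuracy, differential_accuracy)}."""
    totals = {}
    for r in results:
        n, p, d = totals.get(r.stage, (0, 0, 0))
        totals[r.stage] = (n + 1, p + r.primary_correct, d + r.differential_correct)
    return {s: (p / n, d / n) for s, (n, p, d) in sorted(totals.items())}


# Invented toy judgments for two cases over two stages (not study data).
results = [
    StageResult("case-1", 1, False, True),
    StageResult("case-1", 2, True, True),
    StageResult("case-2", 1, False, False),
    StageResult("case-2", 2, False, True),
]
summary = accuracy_by_stage(results)
for stage, (primary, differential) in summary.items():
    print(f"stage {stage}: primary={primary:.2f}, differential={differential:.2f}")
```

Reporting accuracy per stage rather than only at the end shows how quickly a model converges on the correct diagnosis as more clinical detail becomes available.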

RESULTS

Advanced LLMs showed high diagnostic accuracy (>90%) in common scenarios, with Claude 3.7 achieving perfect accuracy (100%) in certain conditions. In complex cases, Claude 3.7 achieved the highest accuracy (83.3%) at the final diagnostic stage, significantly outperforming smaller models. Smaller models notably performed well in common scenarios, matching the performance of larger models.

DISCUSSION

This study evaluated leading LLMs for diagnostic accuracy using staged information disclosure, mirroring real-world practice. Notably, Claude 3.7 Sonnet was the top performer. Employing a novel LLM-based evaluation method for large-scale analysis, the research highlights artificial intelligence's (AI's) potential to enhance diagnostics. It underscores the need for useful frameworks to translate accuracy into clinical impact and integrate AI into medical education.

CONCLUSION

Leading LLMs show remarkable diagnostic accuracy in diverse clinical cases. To fully realize their potential for improving patient care, we must now focus on creating practical implementation frameworks and translational research to integrate these powerful AI tools into medicine.
