Suppr 超能文献


Assessing the Accuracy of Diagnostic Capabilities of Large Language Models.

Authors

Urda-Cîmpean Andrada Elena, Leucuța Daniel-Corneliu, Drugan Cristina, Duțu Alina-Gabriela, Călinici Tudor, Drugan Tudor

Affiliations

Department of Medical Informatics and Biostatistics, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania.

Department of Medical Biochemistry, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania.

Publication

Diagnostics (Basel). 2025 Jun 29;15(13):1657. doi: 10.3390/diagnostics15131657.

DOI: 10.3390/diagnostics15131657
PMID: 40647657
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12248924/
Abstract

In recent years, numerous artificial intelligence applications, especially generative large language models, have evolved in the medical field. This study conducted a structured comparative analysis of four leading generative large language models (LLMs), namely ChatGPT-4o (OpenAI), Grok-3 (xAI), Gemini-2.0 Flash (Google), and DeepSeek-V3 (DeepSeek), to evaluate their diagnostic performance in clinical case scenarios. We assessed medical knowledge recall and clinical reasoning capabilities through staged, progressively complex cases, with responses graded by expert raters on a 0-5 scale. All models performed better on knowledge-based questions than on reasoning tasks, highlighting ongoing limitations in contextual diagnostic synthesis. Overall, DeepSeek outperformed the other models, achieving significantly higher scores across all evaluation dimensions (p < 0.05), particularly on medical reasoning tasks. While these findings support the feasibility of using LLMs for medical training and decision support, the study emphasizes the need for improved interpretability, prompt optimization, and rigorous benchmarking to ensure clinical reliability. This structured, comparative approach contributes to ongoing efforts to establish standardized evaluation frameworks for integrating LLMs into diagnostic workflows.
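The comparison reported above (significantly higher scores, p < 0.05) can be illustrated with a minimal sketch. The abstract does not state which statistical test the authors used, so the following assumes a generic two-sided permutation test on the difference in mean 0-5 expert ratings between two models; the rating values below are invented for illustration only.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Repeatedly shuffles the pooled ratings into two groups of the
    original sizes and counts how often the shuffled mean difference
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    combined = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(combined)
        diff = mean(combined[:len(a)]) - mean(combined[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_iter

# Hypothetical 0-5 expert ratings for two models on the same ten cases
# (not the study's data; chosen only to show the mechanics).
model_a = [5, 4, 5, 4, 5, 4, 5, 3, 5, 4]
model_b = [3, 3, 4, 2, 4, 3, 3, 2, 4, 3]
p = permutation_test(model_a, model_b)
```

A permutation test is a convenient choice here because 0-5 ordinal ratings are unlikely to be normally distributed, so it avoids the distributional assumptions of a t-test; a rank-based test such as Mann-Whitney U would be a comparable alternative.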


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0691/12248924/1a2871e6170a/diagnostics-15-01657-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0691/12248924/524af6c60682/diagnostics-15-01657-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0691/12248924/d403a821c559/diagnostics-15-01657-g002.jpg

Similar Articles

1. Assessing the Accuracy of Diagnostic Capabilities of Large Language Models.
   Diagnostics (Basel). 2025 Jun 29;15(13):1657. doi: 10.3390/diagnostics15131657.
2. Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China's Rare Disease Catalog: Comparative Study.
   J Med Internet Res. 2025 Jun 18;27:e69929. doi: 10.2196/69929.
3. Stench of Errors or the Shine of Potential: The Challenge of (Ir)Responsible Use of ChatGPT in Speech-Language Pathology.
   Int J Lang Commun Disord. 2025 Jul-Aug;60(4):e70088. doi: 10.1111/1460-6984.70088.
4. A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection.
   BMC Med Educ. 2025 Jul 1;25(1):903. doi: 10.1186/s12909-025-07493-0.
5. Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models.
   Diagnostics (Basel). 2025 Jun 6;15(12):1451. doi: 10.3390/diagnostics15121451.
6. Assessment of Recommendations Provided to Athletes Regarding Sleep Education by GPT-4o and Google Gemini: Comparative Evaluation Study.
   JMIR Form Res. 2025 Jul 8;9:e71358. doi: 10.2196/71358.
7. Large Language Models and Empathy: Systematic Review.
   J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.
8. A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions.
   Sci Rep. 2025 Jul 2;15(1):23101. doi: 10.1038/s41598-025-08601-2.
9. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review.
   J Med Internet Res. 2024 Nov 7;26:e22769. doi: 10.2196/22769.
10. Performance analysis of large language models Chatgpt-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study.
   Clin Exp Med. 2025 Jun 20;25(1):213. doi: 10.1007/s10238-025-01743-7.

References Cited in This Article

1. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis.
   JMIR Med Inform. 2025 Apr 25;13:e64963. doi: 10.2196/64963.
2. Extracting Pulmonary Embolism Diagnoses From Radiology Impressions Using GPT-4o: Large Language Model Evaluation Study.
   JMIR Med Inform. 2025 Apr 9;13:e67706. doi: 10.2196/67706.
3. Large Language Models for Pediatric Differential Diagnoses in Rural Health Care: Multicenter Retrospective Cohort Study Comparing GPT-3 With Pediatrician Performance.
   JMIRx Med. 2025 Mar 19;6:e65263. doi: 10.2196/65263.
4. Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models.
   JMIR Form Res. 2025 Mar 10;9:e66207. doi: 10.2196/66207.
5. Leveraging Medical Knowledge Graphs Into Large Language Models for Diagnosis Prediction: Design and Application Study.
   JMIR AI. 2025 Feb 24;4:e58670. doi: 10.2196/58670.
6. Developing Effective Frameworks for Large Language Model-Based Medical Chatbots: Insights From Radiotherapy Education With ChatGPT.
   JMIR Cancer. 2025 Feb 18;11:e66633. doi: 10.2196/66633.
7. Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders.
   J Allergy Clin Immunol. 2025 Feb 14. doi: 10.1016/j.jaci.2025.02.004.
8. Towards evaluating and building versatile large language models for medicine.
   NPJ Digit Med. 2025 Jan 27;8(1):58. doi: 10.1038/s41746-024-01390-4.
9. Evaluation of the ability of large language models to self-diagnose oral diseases.
   iScience. 2024 Nov 29;27(12):111495. doi: 10.1016/j.isci.2024.111495. eCollection 2024 Dec 20.
10. Application of Large Language Models in Medical Training Evaluation: Using ChatGPT as a Standardized Patient: Multimetric Assessment.
   J Med Internet Res. 2025 Jan 1;27:e59435. doi: 10.2196/59435.