A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians.

Authors

Takita Hirotaka, Kabata Daijiro, Walston Shannon L, Tatekawa Hiroyuki, Saito Kenichi, Tsujimoto Yasushi, Miki Yukio, Ueda Daiju

Affiliations

Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.

Center for Mathematical and Data Science, Kobe University, Kobe, Japan.

Publication information

NPJ Digit Med. 2025 Mar 22;8(1):175. doi: 10.1038/s41746-025-01543-z.

DOI: 10.1038/s41746-025-01543-z

PMID: 40121370

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11929846/

Abstract

While generative artificial intelligence (AI) has shown potential in medical diagnostics, comprehensive evaluation of its diagnostic performance and comparison with physicians has not been extensively explored. We conducted a systematic review and meta-analysis of studies validating generative AI models for diagnostic tasks published between June 2018 and June 2024. Analysis of 83 studies revealed an overall diagnostic accuracy of 52.1%. No significant performance difference was found between AI models and physicians overall (p = 0.10) or non-expert physicians (p = 0.93). However, AI models performed significantly worse than expert physicians (p = 0.007). Several models demonstrated slightly higher performance compared to non-experts, although the differences were not significant. Generative AI demonstrates promising diagnostic capabilities with accuracy varying by model. Although it has not yet achieved expert-level reliability, these findings suggest potential for enhancing healthcare delivery and medical education when implemented with appropriate understanding of its limitations.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc7e/11929846/347499b03276/41746_2025_1543_Fig1_HTML.jpg
