Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine.

Authors

Jin Qiao, Chen Fangyuan, Zhou Yiliang, Xu Ziyang, Cheung Justin M, Chen Robert, Summers Ronald M, Rousseau Justin F, Ni Peiyun, Landsman Marc J, Baxter Sally L, Al'Aref Subhi J, Li Yijia, Chen Alexander, Brejt Josef A, Chiang Michael F, Peng Yifan, Lu Zhiyong

Affiliations

National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

University of Pittsburgh, Pittsburgh, PA, USA.

Publication information

NPJ Digit Med. 2024 Jul 23;7(1):190. doi: 10.1038/s41746-024-01185-7.

DOI:10.1038/s41746-024-01185-7
PMID:39043988
Full text:
https://pmc.ncbi.nlm.nih.gov/articles/PMC11266508/
Abstract

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a303/11266508/24ab3bd4ed47/41746_2024_1185_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a303/11266508/9555b10ed3e3/41746_2024_1185_Fig1_HTML.jpg
