

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine.

Author information

Jin Qiao, Chen Fangyuan, Zhou Yiliang, Xu Ziyang, Cheung Justin M, Chen Robert, Summers Ronald M, Rousseau Justin F, Ni Peiyun, Landsman Marc J, Baxter Sally L, Al'Aref Subhi J, Li Yijia, Chen Alexander, Brejt Josef A, Chiang Michael F, Peng Yifan, Lu Zhiyong

Affiliations

National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

University of Pittsburgh, Pittsburgh, PA, USA.

Publication information

NPJ Digit Med. 2024 Jul 23;7(1):190. doi: 10.1038/s41746-024-01185-7.

Abstract

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges-an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominently in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a303/11266508/9555b10ed3e3/41746_2024_1185_Fig1_HTML.jpg
