Jin Qiao, Chen Fangyuan, Zhou Yiliang, Xu Ziyang, Cheung Justin M, Chen Robert, Summers Ronald M, Rousseau Justin F, Ni Peiyun, Landsman Marc J, Baxter Sally L, Al'Aref Subhi J, Li Yijia, Chen Alexander, Brejt Josef A, Chiang Michael F, Peng Yifan, Lu Zhiyong

National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

University of Pittsburgh, Pittsburgh, PA, USA.

NPJ Digit Med. 2024 Jul 23;7(1):190. doi: 10.1038/s41746-024-01185-7.


Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multiple-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.



Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine.
