Jin Qiao, Chen Fangyuan, Zhou Yiliang, Xu Ziyang, Cheung Justin M, Chen Robert, Summers Ronald M, Rousseau Justin F, Ni Peiyun, Landsman Marc J, Baxter Sally L, Al'Aref Subhi J, Li Yijia, Chen Alexander, Brejt Josef A, Chiang Michael F, Peng Yifan, Lu Zhiyong
National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
University of Pittsburgh, Pittsburgh, PA, USA.
NPJ Digit Med. 2024 Jul 23;7(1):190. doi: 10.1038/s41746-024-01185-7.
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations focused primarily on multiple-choice accuracy alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well on cases that physicians answer incorrectly, with over 78% accuracy. However, we found that GPT-4V frequently presents flawed rationales even in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the need for further in-depth evaluation of its rationales before integrating such multimodal AI models into clinical workflows.