Jonas Roos, Ron Martin, Robert Kaczmarczyk
Department of Orthopedics and Trauma Surgery, University Hospital of Bonn, Venusberg-Campus 1, 53127 Bonn, Germany, +49 228 287 14170.
Department of Plastic and Hand Surgery, Burn Center, BG Clinic Bergmannstrost, Halle (Saale), Germany.
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
The rapid development of large language models (LLMs) such as OpenAI's ChatGPT has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological imaging interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities.
This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations.
This study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform: 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret the medical images and provide the most likely diagnosis. Student performance data, including the "student passed mean" and "majority vote" metrics, were obtained from AMBOSS. Statistical analysis was conducted in Python (Python Software Foundation), using standard libraries for data manipulation and visualization.
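As a minimal sketch of how such responses could be scored, the snippet below computes overall accuracy, accuracy restricted to answered questions, and a per-language breakdown with pandas. The file name and column names ("question_id", "language", "model_answer", "correct_answer") are illustrative assumptions; the study's actual data layout is not published here.

```python
import pandas as pd

# Hypothetical layout: one row per question with the model's answer
# and the AMBOSS answer key (column names are assumptions)
df = pd.read_csv("responses.csv")
df["correct"] = df["model_answer"] == df["correct_answer"]
df["answered"] = df["model_answer"].notna()

# Overall accuracy vs. accuracy among answered questions only,
# mirroring the two denominators reported in the results
overall = df["correct"].mean()
answered_only = df.loc[df["answered"], "correct"].mean()

# Per-language breakdown, matching the English/German split of the item pool
by_language = df.groupby("language")["correct"].mean()
print(overall, answered_only, by_language, sep="\n")
```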
GPT-4 1106 Vision Preview (OpenAI) outperformed Bard Gemini Pro (Google), correctly answering 56.9% (609/1070) of questions compared with Bard's 44.6% (477/1070), a statistically significant difference (χ²₁=32.1, P<.001). However, GPT-4 1106 left 16.1% (172/1070) of questions unanswered, significantly more than Bard's 4.1% (44/1070; χ²₁=83.1, P<.001). When only answered questions were considered, GPT-4 1106's accuracy rose to 67.8% (609/898), surpassing both Bard (477/1026, 46.5%; χ²₁=87.7, P<.001) and the student passed mean of 63% (674/1070, SE 1.48%; χ²₁=4.8, P=.03). Language-specific analysis revealed that both models performed better in German than in English, with GPT-4 1106 showing greater accuracy in German (282/465, 60.6% vs 327/605, 54.1%; χ²₁=4.4, P=.04) and Bard Gemini Pro exhibiting a similar trend (255/465, 54.8% vs 222/605, 36.7%; χ²₁=34.3, P<.001). The student majority vote achieved an overall accuracy of 94.5% (1011/1070), significantly outperforming both artificial intelligence models (GPT-4 1106: χ²₁=408.5, P<.001; Bard Gemini Pro: χ²₁=626.6, P<.001).
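The headline model comparison can be reproduced from the reported counts alone. The sketch below uses scipy.stats.chi2_contingency; the paper names Python but not its statistical libraries, so the choice of SciPy here is an assumption.

```python
from scipy.stats import chi2_contingency

# 2x2 table of correct vs. incorrect answers over all 1070 questions,
# built from the counts reported above (609/1070 for GPT-4 1106,
# 477/1070 for Bard Gemini Pro)
table = [
    [609, 1070 - 609],  # GPT-4 1106 Vision Preview
    [477, 1070 - 477],  # Bard Gemini Pro
]

# Yates continuity correction is applied by default for 2x2 tables
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, P = {p:.2g}")  # approx. chi2(1) = 32.1, P < .001
```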
Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro show potential in medical visual question-answering tasks and as a support tool for students. However, their performance varies with the language of the question, with both models performing better in German, and limitations remain in handling non-English content. The accuracy rates, particularly when compared with student performance, highlight the potential of these models in medical education; however, further optimization and a better understanding of their limitations in diverse linguistic contexts remain critical.