


Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.

Authors

Roos Jonas, Martin Ron, Kaczmarczyk Robert

Affiliations

Department of Orthopedics and Trauma Surgery, University Hospital of Bonn, Venusberg-Campus 1, 53127 Bonn, Germany. Phone: +49 228-287-14170.

Department of Plastic and Hand Surgery, Burn Center, BG Clinic Bergmannstrost, Halle (Saale), Germany.

Publication

JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.

DOI: 10.2196/57592
PMID: 39714199
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11683658/
Abstract

BACKGROUND

The rapid development of large language models (LLMs) such as OpenAI's ChatGPT has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological imaging interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities.

OBJECTIVE

This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations.

METHODS

This study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform, divided into 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret medical images and provide the most likely diagnosis. Student performance data were obtained from AMBOSS, including metrics such as the "student passed mean" and "majority vote." Statistical analysis was conducted using Python (Python Software Foundation), with key libraries for data manipulation and visualization.

RESULTS

GPT-4 1106 Vision Preview (OpenAI) outperformed Bard Gemini Pro (Google), correctly answering 56.9% (609/1070) of questions compared to Bard's 44.6% (477/1070), a statistically significant difference (χ²₁=32.1, P<.001). However, GPT-4 1106 left 16.1% (172/1070) of questions unanswered, significantly higher than Bard's 4.1% (44/1070; χ²₁=83.1, P<.001). When considering only answered questions, GPT-4 1106's accuracy increased to 67.8% (609/898), surpassing both Bard (477/1026, 46.5%; χ²₁=87.7, P<.001) and the student passed mean of 63% (674/1070, SE 1.48%; χ²₁=4.8, P=.03). Language-specific analysis revealed both models performed better in German than English, with GPT-4 1106 showing greater accuracy in German (282/465, 60.65% vs 327/605, 54.1%; χ²₁=4.4, P=.04) and Bard Gemini Pro exhibiting a similar trend (255/465, 54.8% vs 222/605, 36.7%; χ²₁=34.3, P<.001). The student majority vote achieved an overall accuracy of 94.5% (1011/1070), significantly outperforming both artificial intelligence models (GPT-4 1106: χ²₁=408.5, P<.001; Bard Gemini Pro: χ²₁=626.6, P<.001).

CONCLUSIONS

Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro have potential in medical visual question-answering tasks and to serve as a support for students. However, their performance varies depending on the language used, with a preference for German. They also have limitations in responding to non-English content. The accuracy rates, particularly when compared to student responses, highlight the potential of these models in medical education, yet the need for further optimization and understanding of their limitations in diverse linguistic contexts remains critical.
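The pairwise comparisons in the Results are two-proportion chi-square tests on 2×2 tables with 1 degree of freedom. As a rough check, the headline GPT-4 1106 vs Bard comparison (609/1070 vs 477/1070 correct) can be reproduced from the reported counts alone. The helper below is a hypothetical pure-Python reimplementation (the paper states only that Python was used, not which test routine); it applies Yates' continuity correction, under which the statistic matches the reported χ²₁=32.1.

```python
import math

def chi2_yates(a_correct, a_total, b_correct, b_total):
    """Pearson chi-square with Yates continuity correction for a 2x2 table
    comparing two proportions (correct vs not correct for two models)."""
    obs = [[a_correct, a_total - a_correct],
           [b_correct, b_total - b_correct]]
    n = a_total + b_total
    col = [obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]]
    row = [a_total, b_total]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n  # expected count under independence
            chi2 += (abs(obs[i][j] - e) - 0.5) ** 2 / e
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# GPT-4 1106: 609/1070 correct; Bard Gemini Pro: 477/1070 correct
chi2, p = chi2_yates(609, 1070, 477, 1070)
print(round(chi2, 1), p)  # chi2 rounds to 32.1; p is far below .001
```

The same helper reproduces the unanswered-question comparison as well (172/1070 vs 44/1070 gives χ²₁≈83.1), matching the abstract's figures.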

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fde5/11683658/918128e4a8da/formative-v8-e57592-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fde5/11683658/74d2792c5c0d/formative-v8-e57592-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fde5/11683658/d29483d6aee3/formative-v8-e57592-g002.jpg

Similar Articles

1
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
2
Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.
J Med Internet Res. 2025 Feb 7;27:e65146. doi: 10.2196/65146.
3
Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.
J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
4
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
5
Role of Artificial Intelligence in Surgical Training by Assessing GPT-4 and GPT-4o on the Japan Surgical Board Examination With Text-Only and Image-Accompanied Questions: Performance Evaluation Study.
JMIR Med Educ. 2025 Jul 30;11:e69313. doi: 10.2196/69313.
6
Large Language Models and Empathy: Systematic Review.
J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.
7
Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers.
J Med Internet Res. 2025 Jul 14;27:e64452. doi: 10.2196/64452.
8
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.
J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.
9
Google Gemini and Bard artificial intelligence chatbot performance in ophthalmology knowledge assessment.
Eye (Lond). 2024 Sep;38(13):2530-2535. doi: 10.1038/s41433-024-03067-4. Epub 2024 Apr 13.
10
Comparative Performance of Medical Students, ChatGPT-3.5 and ChatGPT-4.0 in Answering Questions From a Brazilian National Medical Exam: Cross-Sectional Questionnaire Study.
JMIR AI. 2025 May 8;4:e66552. doi: 10.2196/66552.

Cited By

1
Evaluation of deepseek, gemini, ChatGPT-4o, and perplexity in responding to salivary gland cancer.
BMC Oral Health. 2025 Aug 23;25(1):1358. doi: 10.1186/s12903-025-06726-4.
2
Enhancing the Accuracy of Human Phenotype Ontology Identification: Comparative Evaluation of Multimodal Large Language Models.
J Med Internet Res. 2025 Jun 2;27:e73233. doi: 10.2196/73233.
3
[Potential applications of large language models in trauma surgery: Opportunities, risks and perspectives].
Unfallchirurgie (Heidelb). 2025 May 12. doi: 10.1007/s00113-025-01581-y.

References

1
Image Recognition Performance of GPT-4V(ision) and GPT-4o in Ophthalmology: Use of Images in Clinical Questions.
Clin Ophthalmol. 2025 May 8;19:1557-1564. doi: 10.2147/OPTH.S494480. eCollection 2025.
2
Early automated detection system for skin cancer diagnosis using artificial intelligent techniques.
Sci Rep. 2024 Apr 28;14(1):9749. doi: 10.1038/s41598-024-59783-0.
3
Can ChatGPT vision diagnose melanoma? An exploratory diagnostic accuracy study.
J Am Acad Dermatol. 2024 May;90(5):1057-1059. doi: 10.1016/j.jaad.2023.12.062. Epub 2024 Jan 19.
4
The role of large language models in medical image processing: a narrative review.
Quant Imaging Med Surg. 2024 Jan 3;14(1):1108-1121. doi: 10.21037/qims-23-892. Epub 2023 Nov 23.
5
The Use of ChatGPT for Education Modules on Integrated Pharmacotherapy of Infectious Disease: Educators' Perspectives.
JMIR Med Educ. 2024 Jan 12;10:e47339. doi: 10.2196/47339.
6
Harnessing ChatGPT and GPT-4 for evaluating the rheumatology questions of the Spanish access exam to specialized medical training.
Sci Rep. 2023 Dec 13;13(1):22129. doi: 10.1038/s41598-023-49483-6.
7
Automatic Skin Cancer Detection Using Clinical Images: A Comprehensive Review.
Life (Basel). 2023 Oct 26;13(11):2123. doi: 10.3390/life13112123.
8
Analysis of Artificial Intelligence-Based Approaches Applied to Non-Invasive Imaging for Early Detection of Melanoma: A Systematic Review.
Cancers (Basel). 2023 Sep 23;15(19):4694. doi: 10.3390/cancers15194694.
9
Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany.
JMIR Med Educ. 2023 Sep 4;9:e46482. doi: 10.2196/46482.
10
Using ChatGPT and Google Bard to improve the readability of written patient information: a proof of concept.
Eur J Cardiovasc Nurs. 2024 Mar 12;23(2):122-126. doi: 10.1093/eurjcn/zvad087.