• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

通过在日本外科医师资格考试中使用纯文本和图文并茂的问题评估GPT-4和GPT-4o来研究人工智能在外科培训中的作用:性能评估研究

Role of Artificial Intelligence in Surgical Training by Assessing GPT-4 and GPT-4o on the Japan Surgical Board Examination With Text-Only and Image-Accompanied Questions: Performance Evaluation Study.

作者信息

Maruyama Hiroki, Toyama Yoshitaka, Takanami Kentaro, Takase Kei, Kamei Takashi

机构信息

Department of Surgery, Tohoku University Graduate School of Medicine, Sendai, Japan.

Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, Japan, Sendai, 980-8575, Japan, 81 227177312.

出版信息

JMIR Med Educ. 2025 Jul 30;11:e69313. doi: 10.2196/69313.

DOI:10.2196/69313
PMID:40737609
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC12310146/
Abstract

BACKGROUND

Artificial intelligence and large language models (LLMs)-particularly GPT-4 and GPT-4o-have demonstrated high correct-answer rates in medical examinations. GPT-4o has enhanced diagnostic capabilities, advanced image processing, and updated knowledge. Japanese surgeons face critical challenges, including a declining workforce, regional health care disparities, and work-hour-related challenges. Nonetheless, although LLMs could be beneficial in surgical education, no studies have yet assessed GPT-4o's surgical knowledge or its performance in the field of surgery.

OBJECTIVE

This study aims to evaluate the potential of GPT-4 and GPT-4o in surgical education by using them to take the Japan Surgical Board Examination (JSBE), which includes both textual questions and medical images-such as surgical and computed tomography scans-to comprehensively assess their surgical knowledge.

METHODS

We used 297 multiple-choice questions from the 2021-2023 JSBEs. The questions were in Japanese, and 104 of them included images. First, the GPT-4 and GPT-4o responses to only the textual questions were collected via OpenAI's application programming interface to evaluate their correct-answer rate. Subsequently, the correct-answer rate of their responses to questions that included images was assessed by inputting both text and images.

RESULTS

The overall correct-answer rates of GPT-4o and GPT-4 for the text-only questions were 78% (231/297) and 55% (163/297), respectively, with GPT-4o outperforming GPT-4 by 23% (P=<.01). By contrast, there was no significant improvement in the correct-answer rate for questions that included images compared with the results for the text-only questions.

CONCLUSIONS

GPT-4o outperformed GPT-4 on the JSBE. However, the results of the LLMs were lower than those of the examinees. Despite the capabilities of LLMs, image recognition remains a challenge for them, and their clinical application requires caution owing to the potential inaccuracy of their results.

摘要

背景

人工智能和大语言模型(LLMs)——尤其是GPT - 4和GPT - 4o——在医学考试中已展现出较高的正确答案率。GPT - 4o具有增强的诊断能力、先进的图像处理技术以及更新的知识。日本外科医生面临着严峻挑战,包括劳动力减少、地区医疗保健差异以及与工作时长相关的问题。尽管如此,虽然大语言模型在外科教育中可能有益,但尚无研究评估GPT - 4o的外科知识或其在外科领域的表现。

目的

本研究旨在通过让GPT - 4和GPT - 4o参加日本外科医师资格考试(JSBE)来评估它们在外科教育中的潜力,该考试包括文本问题和医学图像(如手术和计算机断层扫描),以全面评估它们的外科知识。

方法

我们使用了2021 - 2023年JSBE中的297道多项选择题。这些问题为日语,其中104道包含图像。首先,通过OpenAI的应用程序编程接口收集GPT - 4和GPT - 4o仅对文本问题的回答,以评估其正确答案率。随后,通过输入文本和图像来评估它们对包含图像问题的回答的正确答案率。

结果

对于仅文本问题,GPT - 4o和GPT - 4的总体正确答案率分别为78%(231/297)和55%(163/297),GPT - 4o比GPT - 4表现好23%(P = <.01)。相比之下,与仅文本问题的结果相比,包含图像问题的正确答案率没有显著提高。

结论

在JSBE中,GPT - 4o的表现优于GPT - 4。然而,大语言模型的结果低于考生。尽管大语言模型有能力,但图像识别对它们来说仍然是一个挑战,并且由于其结果可能不准确,其临床应用需要谨慎。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ee8d/12310146/4b3b30c2b5ad/mededu-v11-e69313-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ee8d/12310146/4b3b30c2b5ad/mededu-v11-e69313-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ee8d/12310146/4b3b30c2b5ad/mededu-v11-e69313-g001.jpg

相似文献

1
Role of Artificial Intelligence in Surgical Training by Assessing GPT-4 and GPT-4o on the Japan Surgical Board Examination With Text-Only and Image-Accompanied Questions: Performance Evaluation Study.通过在日本外科医师资格考试中使用纯文本和图文并茂的问题评估GPT-4和GPT-4o来研究人工智能在外科培训中的作用:性能评估研究
JMIR Med Educ. 2025 Jul 30;11:e69313. doi: 10.2196/69313.
2
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.在医学视觉问答中评估Bard Gemini Pro和GPT-4 Vision对学生表现的影响:比较案例研究
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
3
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.使用标准化多项选择题评估大型语言模型在精神病学中的准确性和可靠性:横断面研究
J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.
4
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.ChatGPT 在全球医学执照考试不同版本中的表现:系统评价和荟萃分析。
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
5
Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.揭示GPT-4V在美国医师执照考试(USMLE)问题上高精度背后的隐藏挑战:观察性研究。
J Med Internet Res. 2025 Feb 7;27:e65146. doi: 10.2196/65146.
6
Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.大型语言模型在外科检查问题中的视觉能力基准测试
J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
7
Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.评估推理大型语言模型在日本放射学委员会考试问题上的表现。
Acad Radiol. 2025 May 17. doi: 10.1016/j.acra.2025.04.060.
8
Large Language Models and Empathy: Systematic Review.大语言模型与同理心:系统综述
J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.
9
Enhancing Magnetic Resonance Imaging (MRI) Report Comprehension in Spinal Trauma: Readability Analysis of AI-Generated Explanations for Thoracolumbar Fractures.提高脊柱创伤磁共振成像(MRI)报告的理解:胸腰椎骨折人工智能生成解释的可读性分析
JMIR AI. 2025 Jul 1;4:e69654. doi: 10.2196/69654.
10
Evaluating a Large Language Model in Translating Patient Instructions to Spanish Using a Standardized Framework.使用标准化框架评估大型语言模型在将患者指导说明翻译成西班牙语方面的表现。
JAMA Pediatr. 2025 Jul 7. doi: 10.1001/jamapediatrics.2025.1729.

本文引用的文献

1
Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan.GPT-4在多种语言中的回答准确性:来自日本专家级诊断放射学考试的见解。
Jpn J Radiol. 2025 Feb;43(2):319-329. doi: 10.1007/s11604-024-01673-6. Epub 2024 Oct 28.
2
Performance of ChatGPT 4.0 on Japan's National Physical Therapist Examination: A Comprehensive Analysis of Text and Visual Question Handling.ChatGPT 4.0在日本国家物理治疗师考试中的表现:文本和视觉问题处理的综合分析
Cureus. 2024 Aug 20;16(8):e67347. doi: 10.7759/cureus.67347. eCollection 2024 Aug.
3
Custom GPTs Enhancing Performance and Evidence Compared with GPT-3.5, GPT-4, and GPT-4o? A Study on the Emergency Medicine Specialist Examination.
与GPT-3.5、GPT-4和GPT-4o相比,定制生成式预训练变换器(Custom GPTs)在提升性能和证据方面如何?一项关于急诊医学专科考试的研究。
Healthcare (Basel). 2024 Aug 30;12(17):1726. doi: 10.3390/healthcare12171726.
4
Analysis of Responses of GPT-4 V to the Japanese National Clinical Engineer Licensing Examination.GPT-4V 对日本全国临床工程师执照考试的反应分析。
J Med Syst. 2024 Sep 11;48(1):83. doi: 10.1007/s10916-024-02103-w.
5
Performance of GPT-4 with Vision on Text- and Image-based ACR Diagnostic Radiology In-Training Examination Questions.GPT-4 在基于文本和图像的放射科住院医师诊断考试中的表现。
Radiology. 2024 Sep;312(3):e240153. doi: 10.1148/radiol.240153.
6
Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations.视觉语言模型在日本放射诊断学、核医学和介入放射学专业委员会考试中的诊断准确性。
Jpn J Radiol. 2024 Dec;42(12):1392-1398. doi: 10.1007/s11604-024-01633-0. Epub 2024 Jul 20.
7
OpenAI's GPT-4o in surgical oncology: Revolutionary advances in generative artificial intelligence.OpenAI的GPT-4o在外科肿瘤学中的应用:生成式人工智能的革命性进展。
Eur J Cancer. 2024 Jul;206:114132. doi: 10.1016/j.ejca.2024.114132. Epub 2024 May 26.
8
Performance of ChatGPT on American Board of Surgery In-Training Examination Preparation Questions.ChatGPT 在美外科学院住院医师考试备考问题上的表现。
J Surg Res. 2024 Jul;299:329-335. doi: 10.1016/j.jss.2024.04.060. Epub 2024 May 23.
9
The Performance of ChatGPT-4V in Interpreting Images and Tables in the Japanese Medical Licensing Exam.ChatGPT-4V在日本医师执照考试中对图像和表格的解读表现。
JMIR Med Educ. 2024 May 23;10:e54283. doi: 10.2196/54283.
10
Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board-style Examination.GPT-3.5 和 GPT-4 在放射学 Board 式考试中的可靠性、可重复性、稳健性和置信度评估。
Radiology. 2024 May;311(2):e232715. doi: 10.1148/radiol.232715.