
Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.

Author Information

Sanli Ahmet Necati, Tekcan Sanli Deniz Esin, Karabulut Ali

Affiliations

Department of General Surgery, Abdulkadir Yuksel State Hospital, Gaziantep, Turkey.

Department of Radiology, School of Medicine, Gaziantep University, Gaziantep, Turkey.

Publication Information

Am Surg. 2025 May 12:31348251341956. doi: 10.1177/00031348251341956.

DOI: 10.1177/00031348251341956
PMID: 40353502
Abstract

Objective

This study aimed to evaluate the performance of large language models (LLMs) in answering questions from the American Board of Surgery In-Training Examination (ABSITE).

Methods

Multiple-choice ABSITE quiz questions were entered as prompts into three popular LLMs: ChatGPT-4 (OpenAI), Copilot (Microsoft), and Gemini (Google). The study comprised 170 questions from 2017 to 2022, divided into four subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. All questions were queried between October 1, 2024, and October 5, 2024, and the correct-answer rates of the LLMs were evaluated.

Results

Across all questions, the correct response rates were 79.4% for ChatGPT, 77.6% for Copilot, and 52.9% for Gemini, with Gemini significantly lower than both other LLMs (P < 0.001). In the Definitions category, the correct response rates were 93.5% for ChatGPT, 90.3% for Copilot, and 64.5% for Gemini, with Gemini significantly lower (P = 0.005 and P = 0.015, respectively). In the Biochemistry/Pharmaceutical category, the correct response rates were equal in all three groups (83.3%). In the Case Scenario category, the correct response rates were 76.3% for ChatGPT, 72.8% for Copilot, and 46.5% for Gemini, with Gemini significantly lower (P < 0.001). In the Treatment & Surgical Procedures category, the correct response rates were 69.2% for ChatGPT, 84.6% for Copilot, and 53.8% for Gemini; although Gemini had the lowest accuracy, the difference was not statistically significant (P = 0.236).

Conclusion

In the ABSITE quiz, ChatGPT and Copilot achieved similar success, whereas Gemini lagged significantly behind.
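The abstract reports pairwise significance but does not name the statistical test used. As an illustrative sketch only, not the authors' actual analysis, the following Python snippet shows how such pairwise comparisons of correct/incorrect counts could be run as chi-squared tests, with the counts back-calculated from the reported overall accuracies on 170 questions (a reconstruction, not study data):

from scipy.stats import chi2_contingency

TOTAL = 170  # number of questions reported in the abstract
# Correct-answer counts back-calculated from the reported accuracies
# (79.4%, 77.6%, 52.9% of 170); hypothetical reconstruction, not study data.
correct = {"ChatGPT": 135, "Copilot": 132, "Gemini": 90}

for a, b in [("ChatGPT", "Gemini"), ("Copilot", "Gemini"), ("ChatGPT", "Copilot")]:
    # 2x2 contingency table of correct vs. incorrect answers for the two models;
    # chi2_contingency applies Yates' continuity correction by default for 2x2 tables.
    table = [[correct[a], TOTAL - correct[a]],
             [correct[b], TOTAL - correct[b]]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.2f}, p = {p:.4g}")

Run this way, both Gemini comparisons come out at p < 0.001 while ChatGPT vs. Copilot does not reach significance, consistent with the pattern the abstract reports.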


Similar Articles

1
Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.
Am Surg. 2025 May 12:31348251341956. doi: 10.1177/00031348251341956.
2
Comparison of ChatGPT-4o, Google Gemini 1.5 Pro, Microsoft Copilot Pro, and Ophthalmologists in the management of uveitis and ocular inflammation: A comparative study of large language models.
J Fr Ophtalmol. 2025 Apr;48(4):104468. doi: 10.1016/j.jfo.2025.104468. Epub 2025 Mar 13.
3
Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis.
Clin Anat. 2025 Mar;38(2):200-210. doi: 10.1002/ca.24244. Epub 2024 Nov 21.
4
Can Artificial Intelligence Language Models Effectively Address Dental Trauma Questions?
Dent Traumatol. 2025 Apr 1. doi: 10.1111/edt.13063.
5
Assessing the Responses of Large Language Models (ChatGPT-4, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Breast Imaging: A Study on Readability and Accuracy.
Cureus. 2024 May 9;16(5):e59960. doi: 10.7759/cureus.59960. eCollection 2024 May.
6
Microsoft Copilot Provides More Accurate and Reliable Information About Anterior Cruciate Ligament Injury and Repair Than ChatGPT and Google Gemini; However, No Resource Was Overall the Best.
Arthrosc Sports Med Rehabil. 2024 Nov 19;7(2):101043. doi: 10.1016/j.asmr.2024.101043. eCollection 2025 Apr.
7
Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study.
BMC Med Educ. 2024 Jun 26;24(1):694. doi: 10.1186/s12909-024-05630-9.
8
Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis.
BMC Oral Health. 2025 Apr 15;25(1):573. doi: 10.1186/s12903-025-05926-2.
9
Comparison of ChatGPT-4, Copilot, Bard and Gemini Ultra on an Otolaryngology Question Bank.
Clin Otolaryngol. 2025 Jul;50(4):704-711. doi: 10.1111/coa.14302. Epub 2025 Mar 13.
10
Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5 edition.
Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.