
Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.

Author Information

Sanli Ahmet Necati, Tekcan Sanli Deniz Esin, Karabulut Ali

Affiliations

Department of General Surgery, Abdulkadir Yuksel State Hospital, Gaziantep, Turkey.

Department of Radiology, School of Medicine, Gaziantep University, Gaziantep, Turkey.

Publication Information

Am Surg. 2025 May 12:31348251341956. doi: 10.1177/00031348251341956. Online ahead of print.

Abstract

Objective

This study aimed to evaluate the performance of large language models (LLMs) in answering questions from the American Board of Surgery In-Training Examination (ABSITE).

Methods

Multiple-choice ABSITE quiz questions were entered as prompts into three of the most popular LLMs: ChatGPT-4 (OpenAI), Copilot (Microsoft), and Gemini (Google). The study comprised 170 questions from 2017 to 2022, divided into four subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. All questions were queried in the LLMs between October 1, 2024, and October 5, 2024, and the correct-answer rates of the LLMs were evaluated.

Results

The correct response rates across all questions were 79.4% for ChatGPT, 77.6% for Copilot, and 52.9% for Gemini, with Gemini significantly lower than both other LLMs (P < 0.001). In the Definitions category, the correct response rates were 93.5% for ChatGPT, 90.3% for Copilot, and 64.5% for Gemini, with Gemini significantly lower (P = 0.005 and P = 0.015 vs ChatGPT and Copilot, respectively). In the Biochemistry/Pharmaceutical category, the correct response rates were equal across all three models (83.3%). In the Case Scenario category, the correct response rates were 76.3% for ChatGPT, 72.8% for Copilot, and 46.5% for Gemini, with Gemini significantly lower (P < 0.001). In the Treatment & Surgical Procedures category, the correct response rates were 69.2% for ChatGPT, 84.6% for Copilot, and 53.8% for Gemini; although Gemini had the lowest accuracy, the difference was not statistically significant (P = 0.236).

Conclusion

On the ABSITE quiz, ChatGPT and Copilot achieved similar success, whereas Gemini lagged significantly behind.
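The abstract reports pairwise significance values but does not name the statistical test used; a chi-square test of independence is a common choice for comparing correct-response proportions like these. The Python sketch below back-calculates correct/incorrect counts from the reported overall rates (of 170 questions: 79.4% ≈ 135 correct for ChatGPT, 77.6% ≈ 132 for Copilot, 52.9% ≈ 90 for Gemini) and runs the pairwise comparisons. The test choice and the reconstructed counts are assumptions for illustration, not the authors' published analysis.

    # Illustrative sketch only: the test (chi-square) and the raw counts are
    # assumptions back-calculated from the percentages reported in the abstract,
    # not the authors' actual data or analysis code.
    from scipy.stats import chi2_contingency

    TOTAL = 170
    correct = {"ChatGPT": 135, "Copilot": 132, "Gemini": 90}  # 79.4%, 77.6%, 52.9%

    def pairwise_p(model_a, model_b):
        """P-value of a 2x2 chi-square test comparing two models' accuracy."""
        table = [
            [correct[model_a], TOTAL - correct[model_a]],  # correct, incorrect
            [correct[model_b], TOTAL - correct[model_b]],
        ]
        _, p, _, _ = chi2_contingency(table)
        return p

    for a, b in [("ChatGPT", "Gemini"), ("Copilot", "Gemini"), ("ChatGPT", "Copilot")]:
        print(f"{a} vs {b}: P = {pairwise_p(a, b):.4f}")

With these reconstructed counts, both Gemini comparisons come out at P < 0.001 and ChatGPT vs Copilot is non-significant, matching the pattern reported in the Results.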

