
Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis

Authors

Mavrych Volodymyr, Ganguly Paul, Bolgova Olena

Affiliation

College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia.

Publication

Clin Anat. 2025 Mar;38(2):200-210. doi: 10.1002/ca.24244. Epub 2024 Nov 21.

DOI: 10.1002/ca.24244
PMID: 39573871
Abstract

The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiencies and accuracies of six different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs), and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. Selected chatbots were tested, answering 50 USMLE-style MCQs. The questions were randomly selected from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts to answer each set of questions by the chatbots were evaluated in terms of accuracy, relevance, and comprehensiveness. The best result was provided by ChatGPT-4, which answered 60.5% ± 1.9% of questions accurately, then Copilot (42.0% ± 0.0%) and ChatGPT-3.5 (41.0% ± 5.3%), followed by ChatGPT-3.5-turbo (38.5% ± 5.7%). Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to those of Copilot, GPT-3.5, GPT-Turbo, PaLM2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of the three randomly selected topics-anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa-and three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded 216 records received (0-5 scale). The best results were recorded for ChatGPT-4, then for Gemini, ChatGPT-3.5, and ChatGPT-3.5-turbo, Copilot, followed by Google PaLM 2; Copilot had the lowest grade. 
Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.


Similar Articles

1
Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis.
Clin Anat. 2025 Mar;38(2):200-210. doi: 10.1002/ca.24244. Epub 2024 Nov 21.
2
Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience.
Adv Physiol Educ. 2025 Jun 1;49(2):430-437. doi: 10.1152/advan.00093.2024. Epub 2025 Jan 17.
3
Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.
JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.
4
Comparison of ChatGPT-4, Copilot, Bard and Gemini Ultra on an Otolaryngology Question Bank.
Clin Otolaryngol. 2025 Jul;50(4):704-711. doi: 10.1111/coa.14302. Epub 2025 Mar 13.
5
Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care.
Medicine (Baltimore). 2024 Aug 16;103(33):e39305. doi: 10.1097/MD.0000000000039305.
6
Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education.
J Periodontal Res. 2025 Feb;60(2):121-133. doi: 10.1111/jre.13323. Epub 2024 Jul 18.
7
Comparison of ChatGPT-4o, Google Gemini 1.5 Pro, Microsoft Copilot Pro, and Ophthalmologists in the management of uveitis and ocular inflammation: A comparative study of large language models.
J Fr Ophtalmol. 2025 Apr;48(4):104468. doi: 10.1016/j.jfo.2025.104468. Epub 2025 Mar 13.
8
Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot.
Anat Sci Educ. 2025 May 11. doi: 10.1002/ase.70044.
9
Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists' Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study.
JMIR Form Res. 2025 Feb 5;9:e56126. doi: 10.2196/56126.
10
Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.
Am Surg. 2025 May 12:31348251341956. doi: 10.1177/00031348251341956.

Cited By

1
Comparative evaluation of AI platforms "Google Gemini 2.5 Flash, Google Gemini 2.0 Flash, DeepSeek V3 and ChatGPT 4o" in solving multiple-choice questions from different subtopics of anatomy.
Surg Radiol Anat. 2025 Aug 30;47(1):193. doi: 10.1007/s00276-025-03707-8.
2
Comparative evaluation of large language models performance in medical education using urinary system histology assessment.
Sci Rep. 2025 Aug 29;15(1):31933. doi: 10.1038/s41598-025-17571-4.
3
Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders.
Med Educ Online. 2025 Dec;30(1):2550751. doi: 10.1080/10872981.2025.2550751. Epub 2025 Aug 24.
4
Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents.
BMC Med Educ. 2025 Aug 23;25(1):1193. doi: 10.1186/s12909-025-07856-7.
5
Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.
Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.
6
The Roles of Artificial Intelligence in Teaching Anatomy: A Systematic Review.
Clin Anat. 2025 Apr 23. doi: 10.1002/ca.24272.
7
Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.
JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.
8
Artificial intelligence-large language models (AI-LLMs) for reliable and accurate cardiotocography (CTG) interpretation in obstetric practice.
Comput Struct Biotechnol J. 2025 Mar 18;27:1140-1147. doi: 10.1016/j.csbj.2025.03.026. eCollection 2025.
9
Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance.
J Med Syst. 2025 Mar 25;49(1):39. doi: 10.1007/s10916-025-02170-7.