
Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis

Authors

Mavrych Volodymyr, Ganguly Paul, Bolgova Olena

Affiliation

College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia.

Publication

Clin Anat. 2025 Mar;38(2):200-210. doi: 10.1002/ca.24244. Epub 2024 Nov 21.

DOI: 10.1002/ca.24244
PMID: 39573871
Abstract

The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiencies and accuracies of six different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs), and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. Selected chatbots were tested, answering 50 USMLE-style MCQs. The questions were randomly selected from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts to answer each set of questions by the chatbots were evaluated in terms of accuracy, relevance, and comprehensiveness. The best result was provided by ChatGPT-4, which answered 60.5% ± 1.9% of questions accurately, then Copilot (42.0% ± 0.0%) and ChatGPT-3.5 (41.0% ± 5.3%), followed by ChatGPT-3.5-turbo (38.5% ± 5.7%). Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to those of Copilot, GPT-3.5, GPT-Turbo, PaLM2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of the three randomly selected topics-anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa-and three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded 216 records received (0-5 scale). The best results were recorded for ChatGPT-4, then for Gemini, ChatGPT-3.5, and ChatGPT-3.5-turbo, Copilot, followed by Google PaLM 2; Copilot had the lowest grade. 
Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.


Similar Articles

1
Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis.
Clin Anat. 2025 Mar;38(2):200-210. doi: 10.1002/ca.24244. Epub 2024 Nov 21.
2
Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience.
Adv Physiol Educ. 2025 Jun 1;49(2):430-437. doi: 10.1152/advan.00093.2024. Epub 2025 Jan 17.
3
Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.
JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.
4
Comparison of ChatGPT-4, Copilot, Bard and Gemini Ultra on an Otolaryngology Question Bank.
Clin Otolaryngol. 2025 Jul;50(4):704-711. doi: 10.1111/coa.14302. Epub 2025 Mar 13.
5
Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care.
Medicine (Baltimore). 2024 Aug 16;103(33):e39305. doi: 10.1097/MD.0000000000039305.
6
Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education.
J Periodontal Res. 2025 Feb;60(2):121-133. doi: 10.1111/jre.13323. Epub 2024 Jul 18.
7
Comparison of ChatGPT-4o, Google Gemini 1.5 Pro, Microsoft Copilot Pro, and Ophthalmologists in the management of uveitis and ocular inflammation: A comparative study of large language models.
J Fr Ophtalmol. 2025 Apr;48(4):104468. doi: 10.1016/j.jfo.2025.104468. Epub 2025 Mar 13.
8
Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot.
Anat Sci Educ. 2025 May 11. doi: 10.1002/ase.70044.
9
Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists' Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study.
JMIR Form Res. 2025 Feb 5;9:e56126. doi: 10.2196/56126.
10
Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.
Am Surg. 2025 May 12:31348251341956. doi: 10.1177/00031348251341956.

Cited By

1
Comparative evaluation of AI platforms "Google Gemini 2.5 Flash, Google Gemini 2.0 Flash, DeepSeek V3 and ChatGPT 4o" in solving multiple-choice questions from different subtopics of anatomy.
Surg Radiol Anat. 2025 Aug 30;47(1):193. doi: 10.1007/s00276-025-03707-8.
2
Comparative evaluation of large language models performance in medical education using urinary system histology assessment.
Sci Rep. 2025 Aug 29;15(1):31933. doi: 10.1038/s41598-025-17571-4.
3
Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders.
Med Educ Online. 2025 Dec;30(1):2550751. doi: 10.1080/10872981.2025.2550751. Epub 2025 Aug 24.
4
Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents.
BMC Med Educ. 2025 Aug 23;25(1):1193. doi: 10.1186/s12909-025-07856-7.
5
Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.
Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.
6
The Roles of Artificial Intelligence in Teaching Anatomy: A Systematic Review.
Clin Anat. 2025 Apr 23. doi: 10.1002/ca.24272.
7
Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.
JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.
8
Artificial intelligence-large language models (AI-LLMs) for reliable and accurate cardiotocography (CTG) interpretation in obstetric practice.
Comput Struct Biotechnol J. 2025 Mar 18;27:1140-1147. doi: 10.1016/j.csbj.2025.03.026. eCollection 2025.
9
Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance.
J Med Syst. 2025 Mar 25;49(1):39. doi: 10.1007/s10916-025-02170-7.