

Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.

Author Information

Mavrych Volodymyr, Yousef Einas M, Yaqinuddin Ahmed, Bolgova Olena

Affiliations

College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia.

Publication Information

Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.

Abstract

Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in the basic medical sciences remains incompletely characterized. Medical histology, which requires both factual knowledge and interpretative skill, provides a unique domain for evaluating AI capabilities in medical education. This study aimed to evaluate and compare the performance of five current LLMs (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1) in correctly answering medical histology multiple-choice questions (MCQs). This cross-sectional comparative study used 200 USMLE-style histology MCQs covering 20 topics. Each LLM completed all questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (intraclass correlation coefficient, ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests, and two-way mixed ANOVA for system-by-topic interactions. All LLMs achieved exceptionally high accuracy (mean 91.1%, SD 7.2%). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4.1 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (p > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4.1 (ICC = 0.882). Complete accuracy and reproducibility (100%) were observed in Histological Methods, Blood and Hemopoiesis, and Circulatory System, while Muscle Tissue (76.0%) and Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histology MCQs, significantly exceeding their reported performance in other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate a continued need for human expertise. These findings reflect rapid AI advancement and identify histology as a discipline particularly suitable for AI-assisted medical education. Clinical trial number: not pertinent to this study, as it does not involve medicinal products or therapeutic interventions.
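The abstract names the statistical toolkit (one-way ANOVA with post-hoc Tukey's tests, and ICC for test-retest reliability) but not the analysis code. The following Python sketch illustrates how such an analysis could be run; it is not the authors' code, and all data are simulated placeholders seeded from the means and SD quoted in the abstract. The `scipy` and `pingouin` libraries are assumed to be available.

```python
# A minimal sketch, assuming simulated per-topic accuracy data; NOT the
# authors' analysis. It mimics the statistics named in the abstract:
# one-way ANOVA with post-hoc Tukey HSD across the five systems, and ICC
# across three attempts as a test-retest reliability measure.
import numpy as np
import pandas as pd
import pingouin as pg          # pip install pingouin
from scipy import stats

rng = np.random.default_rng(42)
systems = ["GPT-4.1", "Claude 3.7 Sonnet", "Gemini 2.0 Flash",
           "Copilot", "DeepSeek R1"]
means = [90.8, 91.5, 92.0, 91.0, 90.3]   # per-system means from the abstract

# Hypothetical per-topic accuracy (%): 20 topics per system, SD 7.2% as reported.
acc = {s: np.clip(rng.normal(m, 7.2, size=20), 0, 100)
       for s, m in zip(systems, means)}

# One-way ANOVA across the five systems; the study reports p > 0.05 here.
f_stat, p_val = stats.f_oneway(*acc.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.3f}")

# Post-hoc pairwise Tukey HSD (scipy >= 1.8).
print(stats.tukey_hsd(*acc.values()))

# Test-retest reliability: ICC with topics as targets and the three
# attempts as "raters", on simulated repeat scores for one system.
long = pd.DataFrame({
    "topic":   np.tile(np.arange(20), 3),
    "attempt": np.repeat([1, 2, 3], 20),
    "score":   np.concatenate([acc["Claude 3.7 Sonnet"] + rng.normal(0, 1.5, 20)
                               for _ in range(3)]),
})
icc = pg.intraclass_corr(data=long, targets="topic",
                         raters="attempt", ratings="score")
print(icc[["Type", "ICC"]])
```

With the small repeat-attempt noise simulated above, the ICC comes out high, consistent in spirit with the near-unity reliabilities the abstract reports; the exact values depend entirely on the placeholder data.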


Figure 1 (graphical abstract): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/465f/12258195/e466631a96b6/ZMEO_A_2534065_F0001_OC.jpg
