

AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study.

Author Affiliations

Misr University for Science and Technology, 6th of October, Egypt.

Medical Research Platform (MRP), Giza, Egypt.

Publication Information

Sci Rep. 2024 Aug 14;14(1):18859. doi: 10.1038/s41598-024-68996-2.

Abstract

Large language models (LLMs) like ChatGPT have potential applications in medical education, such as helping students study for their licensing exams by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. A total of 423 board-style questions from 9 UK exams (MRCS, MRCP, etc.) were answered by seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant). There were 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering topics in surgery, pediatrics, and other disciplines. Responses were graded for accuracy, and statistical tests were used to analyze differences among LLMs. Leaked questions were excluded from the primary analysis. ChatGPT-4.0 scored highest (78.2%), followed by Bing (67.2%), Claude (64.4%), and Claude Instant (62.9%). Perplexity scored the lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice than on true/false or "choose N" questions. LLMs demonstrated limitations in answering certain questions, indicating refinements are needed before primary reliance in medical education. However, their expanding capabilities suggest a potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and optimal integration into medical curricula.
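The abstract reports an overall significant difference between models (p < 0.001) as well as significant pairwise differences, but does not specify which statistical test was used. As an illustration only, a two-proportion z-test on correct-answer counts reconstructed from the reported accuracies and the 423-question total shows how one such pairwise comparison could be run; the counts below are a hypothetical reconstruction, not figures from the paper:

```python
import math

TOTAL = 423  # questions in the primary analysis

# Correct-answer counts reconstructed from the reported accuracies
# (hypothetical reconstruction; the paper's exact counts may differ).
correct = {
    "ChatGPT-4.0": round(0.782 * TOTAL),  # 78.2%, highest scorer
    "Perplexity":  round(0.561 * TOTAL),  # 56.1%, lowest scorer
}

def two_proportion_z_test(c1, c2, n):
    """Two-sided two-proportion z-test for two samples of equal size n."""
    p1, p2 = c1 / n, c2 / n
    pooled = (c1 + c2) / (2 * n)          # pooled success proportion
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

z, p = two_proportion_z_test(correct["ChatGPT-4.0"], correct["Perplexity"], TOTAL)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With this reconstruction, the gap between the best and worst performers (78.2% vs 56.1% over 423 items) yields a p-value far below 0.001, consistent with the significant pairwise differences the abstract reports.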


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b698/11324724/f5ae6571e69b/41598_2024_68996_Fig1_HTML.jpg
