

Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions.

Author Information

Tarabanis Constantine, Zahid Sohail, Mamalis Marios, Zhang Kevin, Kalampokis Evangelos, Jankelson Lior

Affiliations

Leon H. Charney Division of Cardiology, NYU Langone Health, New York University School of Medicine, New York, New York, United States of America.

Information Systems Laboratory, University of Macedonia, Thessaloniki, Greece.

Publication Information

PLOS Digit Health. 2024 Sep 17;3(9):e0000604. doi: 10.1371/journal.pdig.0000604. eCollection 2024 Sep.

DOI: 10.1371/journal.pdig.0000604
PMID: 39288137
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11407633/
Abstract

Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on how knowledge supplied to the models, derived from medical texts, improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations to 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanation to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There is a 3.2-5.3% decrease in performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot, and a 4.5-7.5% increase in performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting with domain-specific information improved performance, rendering Retrieval Augmented Generation a possible technique for improving accuracy in medical examination LLM responses.
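The augmentation step the abstract describes — retrieving relevant textbook passages and prepending them to the question prompt before querying the model — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the passages, question, and simple term-overlap scorer are hypothetical stand-ins for Harrison's Principles of Internal Medicine, an MKSAP item, and the embedding-based retriever a production RAG pipeline would use.

```python
import re

def tokenize(text):
    """Lowercase and split text into a set of alphabetic terms."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(passages, question, k=2):
    """Rank passages by count of terms shared with the question; return top k."""
    q_terms = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(tokenize(p) & q_terms), reverse=True)
    return ranked[:k]

def build_prompt(passages, question):
    """Prepend the retrieved context to the question, RAG-style."""
    context = "\n".join(retrieve(passages, question))
    return f"Context:\n{context}\n\nQuestion:\n{question}\nAnswer:"

# Hypothetical textbook snippets standing in for the reference corpus.
passages = [
    "Community acquired pneumonia is commonly caused by Streptococcus pneumoniae.",
    "Rheumatoid arthritis is a chronic inflammatory disorder of the joints.",
    "Empiric treatment of community acquired pneumonia includes a macrolide.",
]
question = "What is first line empiric treatment for community acquired pneumonia?"
prompt = build_prompt(passages, question)
```

The augmented prompt is then sent to the model through its API; the only irrelevant passage (on rheumatoid arthritis) scores lowest on term overlap and is excluded from the context.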


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2322/11407633/898f827915ba/pdig.0000604.g001.jpg

Similar Articles

1. Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions.
   PLOS Digit Health. 2024 Sep 17;3(9):e0000604. doi: 10.1371/journal.pdig.0000604. eCollection 2024 Sep.
2. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
   J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
3. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations.
   Neurosurgery. 2023 Dec 1;93(6):1353-1365. doi: 10.1227/neu.0000000000002632. Epub 2023 Aug 15.
4. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.
   Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
5. Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.
   ArXiv. 2024 Jan 23:arXiv:2402.01693v1.
6. Performance of Large Language Models on a Neurology Board-Style Examination.
   JAMA Netw Open. 2023 Dec 1;6(12):e2346721. doi: 10.1001/jamanetworkopen.2023.46721.
7. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank.
   Neurosurgery. 2023 Nov 1;93(5):1090-1098. doi: 10.1227/neu.0000000000002551. Epub 2023 Jun 12.
8. Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study.
   JMIR Med Educ. 2023 Sep 19;9:e50514. doi: 10.2196/50514.
9. Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model.
   PLOS Digit Health. 2024 Aug 21;3(8):e0000568. doi: 10.1371/journal.pdig.0000568. eCollection 2024 Aug.
10. Performance of Large Language Models on Medical Oncology Examination Questions.
   JAMA Netw Open. 2024 Jun 3;7(6):e2417641. doi: 10.1001/jamanetworkopen.2024.17641.

Cited By

1. Challenges of Implementing LLMs in Clinical Practice: Perspectives.
   J Clin Med. 2025 Sep 1;14(17):6169. doi: 10.3390/jcm14176169.
2. Development and Evaluation of an Artificial Intelligence-Powered Surgical Oral Examination Simulator: A Pilot Study.
   Mayo Clin Proc Digit Health. 2025 Jun 9;3(3):100241. doi: 10.1016/j.mcpdig.2025.100241. eCollection 2025 Sep.
3. Harnessing advanced large language models in otolaryngology board examinations: an investigation using python and application programming interfaces.
   Eur Arch Otorhinolaryngol. 2025 Apr 25. doi: 10.1007/s00405-025-09404-x.
4. Evaluating large language models as patient education tools for inflammatory bowel disease: A comparative study.
   World J Gastroenterol. 2025 Feb 14;31(6):102090. doi: 10.3748/wjg.v31.i6.102090.
5. Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders.
   J Allergy Clin Immunol. 2025 Feb 14. doi: 10.1016/j.jaci.2025.02.004.
6. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines.
   J Am Med Inform Assoc. 2025 Apr 1;32(4):605-615. doi: 10.1093/jamia/ocaf008.

References

1. Performance of Three Large Language Models on Dermatology Board Examinations.
   J Invest Dermatol. 2024 Feb;144(2):398-400. doi: 10.1016/j.jid.2023.06.208. Epub 2023 Aug 2.
2. Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination.
   JAMA Pediatr. 2023 Sep 1;177(9):977-979. doi: 10.1001/jamapediatrics.2023.2373.
3. Performance of Generative Large Language Models on Ophthalmology Board-Style Questions.
   Am J Ophthalmol. 2023 Oct;254:141-149. doi: 10.1016/j.ajo.2023.05.024. Epub 2023 Jun 18.
4. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine.
   N Engl J Med. 2023 Mar 30;388(13):1233-1239. doi: 10.1056/NEJMsr2214184.
5. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.
   PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.
6. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
   JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
7. Do USMLE steps, and ITE score predict the American Board of Internal Medicine Certifying Exam results?
   BMC Med Educ. 2020 Mar 18;20(1):79. doi: 10.1186/s12909-020-1974-3.