Tarabanis Constantine, Zahid Sohail, Mamalis Marios, Zhang Kevin, Kalampokis Evangelos, Jankelson Lior
Leon H. Charney Division of Cardiology, NYU Langone Health, New York University School of Medicine, New York, New York, United States of America.
Information Systems Laboratory, University of Macedonia, Thessaloniki, Greece.
PLOS Digit Health. 2024 Sep 17;3(9):e0000604. doi: 10.1371/journal.pdig.0000604. eCollection 2024 Sep.
Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on how knowledge supplied to the models from medical texts improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA, and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using Retrieval Augmented Generation. LLM-generated explanations for 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanations to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA, and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There was a 3.2-5.3% decrease in the performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot, and a 4.5-7.5% increase in the performance of both models accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting model inputs with domain-specific information improved performance, making Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses to medical examination questions.
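To illustrate the Retrieval Augmented Generation step described in the abstract, the sketch below shows one plausible way to retrieve relevant textbook passages and prepend them to a board-style question before querying a model API. This is a minimal sketch under stated assumptions: the abstract does not specify the authors' chunking scheme, retriever, or API parameters, so TF-IDF retrieval and the query_llm wrapper here are illustrative placeholders, not the study's actual pipeline.

```python
# Minimal RAG sketch (assumptions, not the authors' implementation):
# textbook passages are pre-split into chunks, retrieval uses TF-IDF
# cosine similarity, and query_llm() stands in for a provider-specific
# chat/completions API call.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_context(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k textbook chunks most similar to the question."""
    vectorizer = TfidfVectorizer(stop_words="english")
    chunk_matrix = vectorizer.fit_transform(chunks)
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, chunk_matrix)[0]
    top_idx = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top_idx]


def answer_with_rag(question: str, chunks: list[str]) -> str:
    """Prepend retrieved textbook passages to the board-style question."""
    context = "\n\n".join(retrieve_context(question, chunks))
    prompt = (
        "Use the following excerpts from an internal medicine textbook "
        "to answer the multiple-choice question.\n\n"
        f"{context}\n\nQuestion:\n{question}\n\nAnswer with the best option."
    )
    return query_llm(prompt)  # hypothetical wrapper around the model's API


def query_llm(prompt: str) -> str:
    """Placeholder for the provider-specific API call (e.g., a chat completion)."""
    raise NotImplementedError("Substitute the actual LLM API call here.")
```

In this framing, the input augmentation the abstract reports simply means that the model sees the retrieved excerpts alongside the question; whether that helps depends on the quality of the retrieved passages.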