Department of Computer Science, Brown University, Providence, RI, United States.
Center for Computational Molecular Biology, Brown University, Providence, RI, United States.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
BACKGROUND: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge-aggregation tools with applications in clinical decision support and education.

OBJECTIVE: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases, while investigating the impact of prompt engineering on their performance.

METHODS: To achieve these objectives, we conducted experiments on publicly available complex and rare cases. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority-voting strategy to leverage diverse reasoning paths within the language models, aiming to enhance their reliability. Finally, we compared their performance with that of human respondents and of MedAlpaca, a generative LLM specifically designed for medical tasks.

RESULTS: Notably, all LLMs outperformed the average human consensus and MedAlpaca, by margins of at least 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category, with minimum accuracy scores of 28% and 11%, respectively. The majority-voting strategy, particularly with GPT-4, achieved the highest overall score across all cases from the diagnostic complex case collection, surpassing the other LLMs. On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores with multiple-choice prompts, scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results show that there is no one-size-fits-all prompting approach: no single strategy improves performance uniformly across all LLMs.

CONCLUSIONS: Our findings shed light on the diagnostic capabilities of LLMs and on the challenges of identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. They highlight the significance of prompt engineering and provide valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for effective educational tools and accurate diagnostic aids that improve patient care and outcomes.
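The majority-voting strategy described in the METHODS section (sampling several answers per case and keeping the most frequent one) can be sketched as below. This is a minimal illustration, not the study's implementation; the function name and the sampled diagnoses are hypothetical.

```python
from collections import Counter

def majority_vote(diagnoses):
    """Return the most frequent diagnosis among repeated model samples,
    together with its vote share. Ties break by first occurrence."""
    normalized = [d.strip().lower() for d in diagnoses]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)

# Hypothetical example: five sampled answers from one model for a single case
samples = [
    "Takayasu arteritis",
    "giant cell arteritis",
    "Takayasu arteritis",
    "Takayasu arteritis",
    "polyarteritis nodosa",
]
print(majority_vote(samples))  # ('takayasu arteritis', 0.6)
```

Aggregating over diverse reasoning paths in this way tends to smooth out the variance of any single sampled answer, which is the reliability gain the study attributes to the strategy.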