Department of Computer Science, Brown University, Providence, RI, United States.
Center for Computational Molecular Biology, Brown University, Providence, RI, United States.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
BACKGROUND: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge-aggregation tools with applications in clinical decision support and education.

OBJECTIVE: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases, while investigating the impact of prompt engineering on their performance.

METHODS: To achieve these objectives, we conducted experiments on publicly available complex and rare cases. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority-voting strategy to leverage diverse reasoning paths within the language models, aiming to enhance their reliability. Finally, we compared their performance with that of human respondents and of MedAlpaca, a generative LLM specifically designed for medical tasks.

RESULTS: Notably, all LLMs outperformed the average human consensus and MedAlpaca, by margins of at least 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category, with minimum accuracy scores of 28% and 11%, respectively. The majority-voting strategy, particularly with GPT-4, achieved the highest overall score across all cases from the diagnostic complex case collection, surpassing the other LLMs. On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores with multiple-choice prompts, scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results show that there is no one-size-fits-all prompting approach: no single strategy improves performance uniformly across all LLMs.

CONCLUSIONS: Our findings shed light on the diagnostic capabilities of LLMs and on the challenges of identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. They highlight the significance of prompt engineering and provide valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for effective educational tools and accurate diagnostic aids that improve patient care and outcomes.
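The majority-voting strategy described in the METHODS section (sampling several answers per case and keeping the most frequent one) can be sketched as below. This is a minimal illustration, not the study's implementation; the function name and the sampled diagnoses are hypothetical.

```python
from collections import Counter

def majority_vote(diagnoses):
    """Return the most frequent diagnosis among repeated model samples,
    together with its vote share. Ties break by first occurrence."""
    normalized = [d.strip().lower() for d in diagnoses]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)

# Hypothetical example: five sampled answers from one model for a single case
samples = [
    "Takayasu arteritis",
    "giant cell arteritis",
    "Takayasu arteritis",
    "Takayasu arteritis",
    "polyarteritis nodosa",
]
print(majority_vote(samples))  # ('takayasu arteritis', 0.6)
```

Aggregating over diverse reasoning paths in this way tends to smooth out the variance of any single sampled answer, which is the reliability gain the study attributes to the strategy.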