Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.

Affiliations

Department of Computer Science, Brown University, Providence, RI, United States.

Center for Computational Molecular Biology, Brown University, Providence, RI, United States.

Publication Information

JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.


DOI: 10.2196/51391
PMID: 38349725
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10900078/
Abstract

BACKGROUND: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains.

OBJECTIVE: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance.

METHODS: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks.

RESULTS: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs. On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs.

CONCLUSIONS: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.
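The abstract names two prompt formats (open-ended and multiple-choice) and a majority-voting strategy over diverse reasoning paths, but gives no implementation details. The Python sketch below is a minimal illustration under stated assumptions: query_model stands in for whichever LLM API is queried, and the prompt wording, option labeling, and n_samples=5 are hypothetical choices rather than the authors' settings.

    from collections import Counter
    from typing import Callable, Sequence

    def open_ended_prompt(case_description: str) -> str:
        """Open-ended format: the model must produce a free-text diagnosis."""
        return (
            f"Case: {case_description}\n"
            "What is the most likely diagnosis? Answer with the diagnosis only."
        )

    def multiple_choice_prompt(case_description: str, options: Sequence[str]) -> str:
        """Multiple-choice format: the model selects among candidate diagnoses."""
        choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
        return (
            f"Case: {case_description}\n"
            f"Candidate diagnoses:\n{choices}\n"
            "Answer with the letter of the most likely diagnosis."
        )

    def majority_vote(query_model: Callable[[str], str],
                      prompt: str, n_samples: int = 5) -> str:
        """Sample several reasoning paths and keep the most frequent answer
        (ties broken by first occurrence)."""
        answers = [query_model(prompt).strip().lower() for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]

Note that sampling the same prompt repeatedly only yields diverse reasoning paths when the model is decoded with nonzero temperature; under greedy decoding all samples coincide and the vote degenerates to a single query.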


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f624/10900078/dbcd904915f0/mededu_v10i1e51391_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f624/10900078/f28e475b2413/mededu_v10i1e51391_fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f624/10900078/9c82723eed28/mededu_v10i1e51391_fig3.jpg

Similar Articles

[1]
Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.

JMIR Med Educ. 2024-2-13

[2]
Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.

JMIR Med Educ. 2024-2-21

[3]
An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.

JMIR Med Inform. 2024-4-8

[4]
Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.

Cureus. 2023-8-4

[5]
Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.

J Med Internet Res. 2023-12-28

[6]
Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.

J Med Internet Res. 2024-4-17

[7]
Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.

Cureus. 2024-3-11

[8]
Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study.

JMIR Dermatol. 2024-5-16

[9]
Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.

Eur J Orthod. 2024-4-13

[10]
Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models.

J Cardiothorac Vasc Anesth. 2024-5

Cited By

[1]
Large language models for disease diagnosis: a scoping review.

NPJ Artif Intell. 2025

[2]
Large Language Models in Medicine: Applications, Challenges, and Future Directions.

Int J Med Sci. 2025-5-31

[3]
Conversion of Mixed-Language Free-Text CT Reports of Pancreatic Cancer to National Comprehensive Cancer Network Structured Reporting Templates by Using GPT-4.

Korean J Radiol. 2025-6

[4]
Large language models in critical care.

J Intensive Med. 2024-12-24

[5]
A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians.

NPJ Digit Med. 2025-3-22

[6]
The large language model diagnoses tuberculous pleural effusion in pleural effusion patients through clinical feature landscapes.

Respir Res. 2025-2-12

[7]
Evaluation of the Accuracy of Artificial Intelligence (AI) Models in Dermatological Diagnosis and Comparison With Dermatology Specialists.

Cureus. 2025-1-7

[8]
Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability.

JAMIA Open. 2025-1-10

[9]
Evaluating multimodal AI in medical diagnostics.

NPJ Digit Med. 2024-8-7

[10]
Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools.

medRxiv. 2024-11-7

