Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study.

Affiliations

Eye Center, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany.

Medical Graduate Center, School of Medicine, Technical University of Munich, Munich, Germany.

Publication Information

J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324.

Abstract

BACKGROUND

As advancements in artificial intelligence (AI) continue, large language models (LLMs) have emerged as promising tools for generating medical information. Their rapid adoption and potential benefits in health care require rigorous assessment of the quality, accuracy, and safety of the generated information across diverse medical specialties.

OBJECTIVE

This study aimed to evaluate the performance of 4 prominent LLMs, namely, Claude-instant-v1.0, GPT-3.5-Turbo, Command-xlarge-nightly, and Bloomz, in generating medical content spanning the clinical specialties of ophthalmology, orthopedics, and dermatology.

METHODS

Three domain-specific physicians evaluated the AI-generated therapeutic recommendations for a diverse set of 60 diseases. The evaluation criteria involved the mDISCERN score, correctness, and potential harmfulness of the recommendations. ANOVA and pairwise t tests were used to explore discrepancies in content quality and safety across models and specialties. Additionally, using the capabilities of OpenAI's most advanced model, GPT-4, an automated evaluation of each model's responses to the diseases was performed using the same criteria and compared to the physicians' assessments through Pearson correlation analysis.
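As a rough illustration of this analysis pipeline, the Python sketch below runs a one-way ANOVA across models, unadjusted pairwise t tests (the abstract does not state a correction method), and a Pearson correlation between physician and GPT-4 ratings. The file names, column names, and long-format layout are assumptions made for illustration, not the study's actual data structures.

import pandas as pd
from itertools import combinations
from scipy import stats

# Hypothetical long-format table: one row per (model, disease) rating.
ratings = pd.read_csv("ratings.csv")  # assumed columns: model, specialty, mdiscern

# One-way ANOVA: does the mean mDISCERN score differ across models?
groups = [g["mdiscern"].to_numpy() for _, g in ratings.groupby("model")]
f_stat, p_anova = stats.f_oneway(*groups)

# Pairwise t tests between models (unadjusted in this sketch).
for (m1, g1), (m2, g2) in combinations(ratings.groupby("model"), 2):
    t, p = stats.ttest_ind(g1["mdiscern"], g2["mdiscern"])
    print(f"{m1} vs {m2}: t={t:.2f}, p={p:.4f}")

# Pearson correlation between physician and GPT-4 scores on the same items
# (assumes a second table aligning the two raters per response).
paired = pd.read_csv("paired_scores.csv")  # assumed columns: physician, gpt4
r, p_corr = stats.pearsonr(paired["physician"], paired["gpt4"])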

RESULTS

Claude-instant-v1.0 emerged with the highest mean mDISCERN score (3.35, 95% CI 3.23-3.46). In contrast, Bloomz lagged with the lowest score (1.07, 95% CI 1.03-1.10). Our analysis revealed significant differences among the models in terms of quality (P<.001). When reliability was assessed, the models displayed strong contrasts in their falseness ratings, with variations both across models (P<.001) and specialties (P<.001). Distinct error patterns emerged, such as confusing diagnoses; providing vague, ambiguous advice; or omitting critical treatments, such as antibiotics for infectious diseases. Regarding potential harm, GPT-3.5-Turbo was found to be the safest, with the lowest harmfulness rating. All models fell short in detailing the risks associated with treatment procedures, explaining the effects of therapies on quality of life, and offering additional sources of information. Pearson correlation analysis showed substantial alignment between physician assessments and GPT-4's evaluations across all established criteria (P<.01).
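The abstract does not state how the reported confidence intervals were derived; assuming a standard t-based interval for a mean, they could be reproduced as in the sketch below. The score values here are illustrative placeholders, not data from the study.

import numpy as np
from scipy import stats

scores = np.array([3.2, 3.4, 3.1, 3.6, 3.3, 3.5])  # illustrative values only
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"mean {mean:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")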

CONCLUSIONS

This study, while comprehensive, was limited by the small number of specialties and physician evaluators involved. The straightforward prompting strategy ("How to treat…") and the assessment benchmarks, originally conceived for human-authored content, may not fully capture the nuances of AI-generated information. The LLMs evaluated showed a notable capability to generate valuable medical content; however, evident lapses in content quality and potential harm signal the need for further refinement. Given the dynamic landscape of LLMs, these findings emphasize the need for regular, methodical assessment, oversight, and fine-tuning of these AI tools to ensure they produce consistently trustworthy and clinically safe medical advice. Notably, the auto-evaluation mechanism using GPT-4 detailed in this study provides a scalable, transferable method for domain-agnostic evaluations that extends beyond therapy recommendation assessments.
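As a sketch of how such an auto-evaluation mechanism might look, the snippet below asks GPT-4 to rate one therapy recommendation on the study's three criteria via the legacy (pre-1.0) openai Python client. The prompt wording, rating-scale handling, and output format are assumptions, not the study's exact protocol.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def auto_evaluate(disease: str, recommendation: str) -> str:
    """Ask GPT-4 to rate a therapy recommendation on the study's criteria."""
    prompt = (
        f"Disease: {disease}\n"
        f"Therapy recommendation: {recommendation}\n\n"
        "Rate this recommendation on: (1) the mDISCERN quality criteria (1-5), "
        "(2) correctness, and (3) potential harmfulness. "
        "Return the three ratings with a one-sentence justification each."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # keep scoring as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content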

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f37/10644179/47502e510ec8/jmir_v25i1e49324_fig1.jpg
