Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study.

Affiliations

Eye Center, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany.

Medical Graduate Center, School of Medicine, Technical University of Munich, Munich, Germany.

Publication Information

J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324.

Abstract

BACKGROUND

As advancements in artificial intelligence (AI) continue, large language models (LLMs) have emerged as promising tools for generating medical information. Their rapid adoption and potential benefits in health care require rigorous assessment of the quality, accuracy, and safety of the generated information across diverse medical specialties.

OBJECTIVE

This study aimed to evaluate the performance of 4 prominent LLMs, namely, Claude-instant-v1.0, GPT-3.5-Turbo, Command-xlarge-nightly, and Bloomz, in generating medical content spanning the clinical specialties of ophthalmology, orthopedics, and dermatology.

METHODS

Three domain-specific physicians evaluated the AI-generated therapeutic recommendations for a diverse set of 60 diseases. The evaluation criteria involved the mDISCERN score, correctness, and potential harmfulness of the recommendations. ANOVA and pairwise t tests were used to explore discrepancies in content quality and safety across models and specialties. Additionally, using the capabilities of OpenAI's most advanced model, GPT-4, an automated evaluation of each model's responses to the diseases was performed using the same criteria and compared to the physicians' assessments through Pearson correlation analysis.
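As a rough illustration of this analysis pipeline, the Python sketch below runs a one-way ANOVA across models, unadjusted pairwise t tests (the abstract does not state a correction method), and a Pearson correlation between physician and GPT-4 ratings. The file names, column names, and long-format layout are assumptions made for illustration, not the study's actual data structures.

import pandas as pd
from itertools import combinations
from scipy import stats

# Hypothetical long-format table: one row per (model, disease) rating.
ratings = pd.read_csv("ratings.csv")  # assumed columns: model, specialty, mdiscern

# One-way ANOVA: does the mean mDISCERN score differ across models?
groups = [g["mdiscern"].to_numpy() for _, g in ratings.groupby("model")]
f_stat, p_anova = stats.f_oneway(*groups)

# Pairwise t tests between models (unadjusted in this sketch).
for (m1, g1), (m2, g2) in combinations(ratings.groupby("model"), 2):
    t, p = stats.ttest_ind(g1["mdiscern"], g2["mdiscern"])
    print(f"{m1} vs {m2}: t={t:.2f}, p={p:.4f}")

# Pearson correlation between physician and GPT-4 scores on the same items
# (assumes a second table aligning the two raters per response).
paired = pd.read_csv("paired_scores.csv")  # assumed columns: physician, gpt4
r, p_corr = stats.pearsonr(paired["physician"], paired["gpt4"])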

RESULTS

Claude-instant-v1.0 emerged with the highest mean mDISCERN score (3.35, 95% CI 3.23-3.46). In contrast, Bloomz lagged with the lowest score (1.07, 95% CI 1.03-1.10). Our analysis revealed significant differences among the models in terms of quality (P<.001). When reliability was assessed, the models displayed strong contrasts in their falseness ratings, with variations both across models (P<.001) and specialties (P<.001). Distinct error patterns emerged, such as confusing diagnoses; providing vague, ambiguous advice; or omitting critical treatments, such as antibiotics for infectious diseases. Regarding potential harm, GPT-3.5-Turbo was found to be the safest, with the lowest harmfulness rating. All models fell short in detailing the risks associated with treatment procedures, explaining the effects of therapies on quality of life, and offering additional sources of information. Pearson correlation analysis showed substantial alignment between physician assessments and GPT-4's evaluations across all established criteria (P<.01).
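The abstract does not state how the reported confidence intervals were derived; assuming a standard t-based interval for a mean, they could be reproduced as in the sketch below. The score values here are illustrative placeholders, not data from the study.

import numpy as np
from scipy import stats

scores = np.array([3.2, 3.4, 3.1, 3.6, 3.3, 3.5])  # illustrative values only
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"mean {mean:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")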

CONCLUSIONS

This study, while comprehensive, was limited by the small number of specialties and physician evaluators involved. The straightforward prompting strategy ("How to treat…") and the assessment benchmarks, originally conceived for human-authored content, may not fully capture the nuances of AI-generated information. The LLMs evaluated showed a notable capability to generate valuable medical content; however, evident lapses in content quality and potential harm signal the need for further refinement. Given the dynamic landscape of LLMs, these findings emphasize the need for regular, methodical assessment, oversight, and fine-tuning of these AI tools to ensure they produce consistently trustworthy and clinically safe medical advice. Notably, the auto-evaluation mechanism using GPT-4 detailed in this study provides a scalable, transferable method for domain-agnostic evaluations that extends beyond therapy recommendation assessments.
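As a sketch of how such an auto-evaluation mechanism might look, the snippet below asks GPT-4 to rate one therapy recommendation on the study's three criteria via the legacy (pre-1.0) openai Python client. The prompt wording, rating-scale handling, and output format are assumptions, not the study's exact protocol.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def auto_evaluate(disease: str, recommendation: str) -> str:
    """Ask GPT-4 to rate a therapy recommendation on the study's criteria."""
    prompt = (
        f"Disease: {disease}\n"
        f"Therapy recommendation: {recommendation}\n\n"
        "Rate this recommendation on: (1) the mDISCERN quality criteria (1-5), "
        "(2) correctness, and (3) potential harmfulness. "
        "Return the three ratings with a one-sentence justification each."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # keep scoring as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content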

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f37/10644179/47502e510ec8/jmir_v25i1e49324_fig1.jpg
