Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery.

Authors

Buhr Christoph Raphael, Ernst Benjamin Philipp, Blaikie Andrew, Smith Harry, Kelsey Tom, Matthias Christoph, Fleischmann Maximilian, Jungmann Florian, Alt Jürgen, Brandts Christian, Kämmerer Peer W, Foersch Sebastian, Kuhn Sebastian, Eckrich Jonas

Affiliations

Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, 55131, Mainz, Germany.

School of Medicine, University of St Andrews, St Andrews, UK.

Publication information

Eur Arch Otorhinolaryngol. 2025 Mar;282(3):1593-1607. doi: 10.1007/s00405-024-09153-3. Epub 2025 Jan 10.

Abstract

INTRODUCTION

Tumor boards are a cornerstone of modern cancer treatment. Given the advanced capabilities of Large Language Models (LLMs), their role in generating tumor board decisions for otorhinolaryngology (ORL), head and neck surgery is gaining increasing attention. However, concerns over data protection and the use of confidential patient information in web-based LLMs have restricted their widespread adoption and hindered the exploration of their full potential. In this first study of its kind, we compared the recommendations of a standard human multidisciplinary tumor board (MDT) with those of a web-based LLM (ChatGPT-4o) and a locally run LLM (Llama 3), the latter addressing data protection concerns.

MATERIAL AND METHODS

Twenty-five simulated tumor board cases were presented to an MDT composed of specialists from otorhinolaryngology, craniomaxillofacial surgery, medical oncology, radiology, radiation oncology, and pathology. This multidisciplinary team provided a comprehensive analysis of the cases. The same cases were input into ChatGPT-4o and Llama 3 using structured prompts, and the concordance between the LLMs' and MDT's recommendations was assessed. Four MDT members evaluated the LLMs' recommendations in terms of medical adequacy (using a six-point Likert scale) and whether the information provided could have influenced the MDT's original recommendations.

RESULTS

ChatGPT-4o showed 84% concordance (21 out of 25 cases) and Llama 3 demonstrated 92% concordance (23 out of 25 cases) with the MDT in distinguishing between curative and palliative treatment strategies. ChatGPT-4o identified all first-line therapy options considered by the MDT in 64% of cases (16/25), and Llama 3 did so in 60% of cases (15/25), though with varying priority. ChatGPT-4o presented all the MDT's first-line therapies in 52% of cases (13/25), while Llama 3 offered a homologous treatment strategy in 48% of cases (12/25). Additionally, both models proposed at least one of the MDT's first-line therapies as their top recommendation in 28% of cases (7/25). The ratings for medical adequacy yielded a mean score of 4.7 (IQR: 4-6) for ChatGPT-4o and 4.3 (IQR: 3-5) for Llama 3. In 17% of the assessments (33/200), MDT members indicated that the LLM recommendations could potentially enhance the MDT's decisions.
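
The percentages above follow directly from the reported counts. As a minimal sketch (not part of the study's own analysis), the arithmetic can be reproduced as follows. The labels and the `percent` helper are illustrative names, and the 200 assessments are taken as reported; the abstract does not state how that total breaks down.

```python
# Counts as reported in the Results paragraph: (numerator, denominator).
# Labels are illustrative and do not come from the study itself.
reported = {
    "ChatGPT-4o curative/palliative concordance": (21, 25),
    "Llama 3 curative/palliative concordance": (23, 25),
    "ChatGPT-4o all first-line options identified": (16, 25),
    "Llama 3 all first-line options identified": (15, 25),
    "Assessments rating LLM output as potentially helpful": (33, 200),
}


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


percentages = {label: percent(n, d) for label, (n, d) in reported.items()}

for label, value in percentages.items():
    print(f"{label}: {value:.1f}%")
```

Note that 33/200 is 16.5%, which the abstract rounds to 17%.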

DISCUSSION

This study demonstrates the capability of both LLMs to provide viable therapeutic recommendations in ORL head and neck surgery. Llama 3, operating locally, bypasses many data protection issues and shows promise as a clinical tool to support MDT decisions. However, at present, LLMs should augment rather than replace human decision-making.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c874/11890241/c54a33d1e61a/405_2024_9153_Fig1_HTML.jpg