Accuracy and consistency of publicly available Large Language Models as clinical decision support tools for the management of colon cancer.

Author Information

Kaiser Kristen N, Hughes Alexa J, Yang Anthony D, Turk Anita A, Mohanty Sanjay, Gonzalez Andrew A, Patzer Rachel E, Bilimoria Karl Y, Ellis Ryan J

Affiliations

Department of Surgery, Indiana University School of Medicine, Surgical Outcomes and Quality Improvement Center (SOQIC), Indianapolis, Indiana, USA.

Department of Surgery, Division of Surgical Oncology, Indiana University School of Medicine, Indianapolis, Indiana, USA.

Publication Information

J Surg Oncol. 2024 Oct;130(5):1104-1110. doi: 10.1002/jso.27821. Epub 2024 Aug 19.

Abstract

BACKGROUND

Large Language Models (LLMs; e.g., ChatGPT) may be used to assist clinicians and form the basis of future clinical decision support (CDS) for colon cancer. The objectives of this study were to (1) evaluate the response accuracy of two LLM-powered interfaces in identifying guideline-based care in simulated clinical scenarios and (2) define response variation between and within LLMs.

METHODS

Clinical scenarios with "next steps in management" queries were developed based on National Comprehensive Cancer Network (NCCN) guidelines. Prompts were entered into OpenAI ChatGPT and Microsoft Copilot in independent sessions, yielding four responses per scenario. Responses were compared to clinician-developed responses and assessed for accuracy, consistency, and verbosity.
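The study entered prompts by hand into the public ChatGPT and Copilot chat interfaces. As a rough illustration of this repeated, independent-session prompting protocol, the sketch below uses the OpenAI Python API as a stand-in; the SCENARIOS contents and the model name are hypothetical placeholders, not details from the paper.

```python
# Illustrative sketch only: the study used the public chat interfaces;
# this substitutes the OpenAI API. SCENARIOS and the model name are
# hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCENARIOS = [
    "A 62-year-old with newly diagnosed stage II colon adenocarcinoma. "
    "What are the next steps in management per NCCN guidelines?",
    # ... one prompt per clinical scenario (27 in the study)
]

RUNS_PER_SCENARIO = 2  # two platforms x two runs = four responses per scenario

responses = []
for prompt in SCENARIOS:
    for run in range(RUNS_PER_SCENARIO):
        # A fresh request with no shared chat history approximates the
        # study's independent sessions.
        reply = client.chat.completions.create(
            model="gpt-4",  # assumed model; the paper used the public UI
            messages=[{"role": "user", "content": prompt}],
        )
        text = reply.choices[0].message.content
        responses.append({
            "prompt": prompt,
            "run": run,
            "text": text,
            "word_count": len(text.split()),  # verbosity metric
        })
```

Each stored response can then be graded against the clinician-developed answer for accuracy and compared across runs for consistency.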

RESULTS

Across 108 responses to 27 prompts, both platforms yielded completely correct responses to 36% of scenarios (n = 39). For ChatGPT, 39% (n = 21) were missing information and 24% (n = 14) contained inaccurate/misleading information. Copilot performed similarly, with 37% (n = 20) having missing information and 28% (n = 15) containing inaccurate/misleading information (p = 0.96). Clinician responses were significantly shorter (34 ± 15.5 words) than both ChatGPT (251 ± 86 words) and Copilot (271 ± 67 words; both p < 0.01).
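As a sanity check on the word-count comparison, the reported means and standard deviations can be plugged into a two-sample t-test. This is a sketch with assumed group sizes (27 clinician responses; 2 runs × 27 prompts = 54 per platform); the paper's exact test may differ.

```python
# Back-of-the-envelope check of the verbosity comparison from the
# reported summary statistics. Group sizes are assumptions, and the
# paper's exact statistical test may differ.
from scipy.stats import ttest_ind_from_stats

# clinician vs. ChatGPT word counts: (mean, SD, n) per group
t, p = ttest_ind_from_stats(34, 15.5, 27, 251, 86, 54, equal_var=False)
print(f"clinician vs ChatGPT: t={t:.1f}, p={p:.2g}")

# clinician vs. Copilot
t, p = ttest_ind_from_stats(34, 15.5, 27, 271, 67, 54, equal_var=False)
print(f"clinician vs Copilot: t={t:.1f}, p={p:.2g}")
```

With these assumed group sizes, both comparisons yield p-values far below 0.01, consistent with the reported result.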

CONCLUSIONS

Publicly available LLM applications often provide verbose responses with vague or inaccurate information regarding colon cancer management. Significant optimization is required before use in formal CDS.

