Kaiser Kristen N, Hughes Alexa J, Yang Anthony D, Turk Anita A, Mohanty Sanjay, Gonzalez Andrew A, Patzer Rachel E, Bilimoria Karl Y, Ellis Ryan J
Department of Surgery, Indiana University School of Medicine, Surgical Outcomes and Quality Improvement Center (SOQIC), Indianapolis, Indiana, USA.
Department of Surgery, Division of Surgical Oncology, Indiana University School of Medicine, Indianapolis, Indiana, USA.
J Surg Oncol. 2024 Oct;130(5):1104-1110. doi: 10.1002/jso.27821. Epub 2024 Aug 19.
Large language models (LLMs; e.g., ChatGPT) may be used to assist clinicians and could form the basis of future clinical decision support (CDS) tools for colon cancer. The objectives of this study were to (1) evaluate the response accuracy of two LLM-powered interfaces in identifying guideline-based care in simulated clinical scenarios and (2) characterize response variation between and within LLMs.
Clinical scenarios with "next steps in management" queries were developed based on National Comprehensive Cancer Network (NCCN) guidelines. Each prompt was entered into OpenAI ChatGPT and Microsoft Copilot in independent sessions, yielding four responses per scenario (two per platform). Responses were compared with clinician-developed answers and assessed for accuracy, consistency, and verbosity.
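For readers who wish to reproduce this style of repeated-prompt data collection programmatically, the sketch below shows one possible approach using the OpenAI Python SDK. This is a minimal illustration, not the study's actual procedure: the study queried the consumer web interfaces, and the model name, scenario text, and collect_responses helper here are assumptions.

```python
# Minimal sketch of collecting repeated LLM responses to a clinical
# scenario prompt. The study used consumer web interfaces, not an API;
# the model name and prompt text below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCENARIO = (
    "A 62-year-old patient has newly diagnosed stage II colon "
    "adenocarcinoma. What are the next steps in management?"
)  # hypothetical example; not one of the study's 27 scenarios


def collect_responses(prompt: str, n_sessions: int = 2) -> list[str]:
    """Query the model once per independent session (fresh context each time)."""
    responses = []
    for _ in range(n_sessions):
        completion = client.chat.completions.create(
            model="gpt-4",  # assumed model; the study's version is unspecified
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(completion.choices[0].message.content)
    return responses


if __name__ == "__main__":
    for i, text in enumerate(collect_responses(SCENARIO), start=1):
        # Word count is a simple proxy for the verbosity measure used in the study.
        print(f"--- Session {i} ({len(text.split())} words) ---")
        print(text)
```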
Across 108 responses to 27 prompts, the two platforms combined yielded completely correct responses in 36% of cases (n = 39). For ChatGPT, 39% (n = 21) of responses were missing information and 24% (n = 14) contained inaccurate or misleading information. Copilot performed similarly, with 37% (n = 20) missing information and 28% (n = 15) containing inaccurate or misleading information (p = 0.96). Clinician responses were significantly shorter (34 ± 15.5 words) than both ChatGPT (251 ± 86 words) and Copilot (271 ± 67 words) responses (both p < 0.01).
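The between-platform comparison (p = 0.96) and the verbosity comparisons (both p < 0.01) can be approximated from the reported summary data. Below is a minimal sketch, assuming a chi-square test of the accuracy-category distribution and Welch's t-tests computed from the reported means and standard deviations; the authors' exact statistical methods are not specified in the abstract. Note that the inferred fully-correct counts (19 per platform) sum to 38, slightly below the reported 39, suggesting rounding or non-exclusive categories.

```python
# Approximate re-derivation of the abstract's comparisons. Assumptions:
# chi-square for category counts, Welch's t-test for word counts, one
# clinician response per scenario (n = 27), and 54 responses per platform.
import numpy as np
from scipy import stats

# 2 x 3 table: fully correct / missing info / inaccurate-misleading.
# Fully-correct counts are inferred as 54 minus the other two categories.
table = np.array([
    [19, 21, 14],  # ChatGPT
    [19, 20, 15],  # Copilot
])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"Platform comparison: chi2 = {chi2:.2f}, p = {p:.2f}")  # ~0.96 per abstract

# Welch's t-tests on verbosity, from the reported mean +/- SD word counts.
for name, mean, sd, n in [("ChatGPT", 251, 86, 54), ("Copilot", 271, 67, 54)]:
    t, p = stats.ttest_ind_from_stats(
        mean1=34, std1=15.5, nobs1=27,  # clinician responses (assumed n)
        mean2=mean, std2=sd, nobs2=n,   # LLM responses (assumed n)
        equal_var=False,
    )
    print(f"Clinician vs {name}: t = {t:.1f}, p = {p:.2g}")  # both p < 0.01
```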
Publicly available LLM applications often provide verbose responses with vague or inaccurate information regarding colon cancer management. Significant optimization is required before use in formal CDS.