Accuracy and consistency of publicly available Large Language Models as clinical decision support tools for the management of colon cancer.

Author Information

Kaiser Kristen N, Hughes Alexa J, Yang Anthony D, Turk Anita A, Mohanty Sanjay, Gonzalez Andrew A, Patzer Rachel E, Bilimoria Karl Y, Ellis Ryan J

Affiliations

Department of Surgery, Indiana University School of Medicine, Surgical Outcomes and Quality Improvement Center (SOQIC), Indianapolis, Indiana, USA.

Department of Surgery, Division of Surgical Oncology, Indiana University School of Medicine, Indianapolis, Indiana, USA.

Publication Information

J Surg Oncol. 2024 Oct;130(5):1104-1110. doi: 10.1002/jso.27821. Epub 2024 Aug 19.

Abstract

BACKGROUND

Large Language Models (LLMs; e.g., ChatGPT) may be used to assist clinicians and form the basis of future clinical decision support (CDS) for colon cancer. The objectives of this study were to (1) evaluate the response accuracy of two LLM-powered interfaces in identifying guideline-based care in simulated clinical scenarios and (2) define response variation between and within LLMs.

METHODS

Clinical scenarios with "next steps in management" queries were developed based on National Comprehensive Cancer Network (NCCN) guidelines. Prompts were entered into OpenAI ChatGPT and Microsoft Copilot in independent sessions, yielding four responses per scenario. Responses were compared to clinician-developed responses and assessed for accuracy, consistency, and verbosity.
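The study entered prompts by hand into the public ChatGPT and Copilot chat interfaces. As a rough illustration of this repeated, independent-session prompting protocol, the sketch below uses the OpenAI Python API as a stand-in; the SCENARIOS contents and the model name are hypothetical placeholders, not details from the paper.

```python
# Illustrative sketch only: the study used the public chat interfaces;
# this substitutes the OpenAI API. SCENARIOS and the model name are
# hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCENARIOS = [
    "A 62-year-old with newly diagnosed stage II colon adenocarcinoma. "
    "What are the next steps in management per NCCN guidelines?",
    # ... one prompt per clinical scenario (27 in the study)
]

RUNS_PER_SCENARIO = 2  # two platforms x two runs = four responses per scenario

responses = []
for prompt in SCENARIOS:
    for run in range(RUNS_PER_SCENARIO):
        # A fresh request with no shared chat history approximates the
        # study's independent sessions.
        reply = client.chat.completions.create(
            model="gpt-4",  # assumed model; the paper used the public UI
            messages=[{"role": "user", "content": prompt}],
        )
        text = reply.choices[0].message.content
        responses.append({
            "prompt": prompt,
            "run": run,
            "text": text,
            "word_count": len(text.split()),  # verbosity metric
        })
```

Each stored response can then be graded against the clinician-developed answer for accuracy and compared across runs for consistency.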

RESULTS

Across 108 responses to 27 prompts, both platforms yielded completely correct responses to 36% of scenarios (n = 39). For ChatGPT, 39% (n = 21) were missing information and 24% (n = 14) contained inaccurate/misleading information. Copilot performed similarly, with 37% (n = 20) having missing information and 28% (n = 15) containing inaccurate/misleading information (p = 0.96). Clinician responses were significantly shorter (34 ± 15.5 words) than both ChatGPT (251 ± 86 words) and Copilot (271 ± 67 words; both p < 0.01).
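As a sanity check on the word-count comparison, the reported means and standard deviations can be plugged into a two-sample t-test. This is a sketch with assumed group sizes (27 clinician responses; 2 runs × 27 prompts = 54 per platform); the paper's exact test may differ.

```python
# Back-of-the-envelope check of the verbosity comparison from the
# reported summary statistics. Group sizes are assumptions, and the
# paper's exact statistical test may differ.
from scipy.stats import ttest_ind_from_stats

# clinician vs. ChatGPT word counts: (mean, SD, n) per group
t, p = ttest_ind_from_stats(34, 15.5, 27, 251, 86, 54, equal_var=False)
print(f"clinician vs ChatGPT: t={t:.1f}, p={p:.2g}")

# clinician vs. Copilot
t, p = ttest_ind_from_stats(34, 15.5, 27, 271, 67, 54, equal_var=False)
print(f"clinician vs Copilot: t={t:.1f}, p={p:.2g}")
```

With these assumed group sizes, both comparisons yield p-values far below 0.01, consistent with the reported result.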

CONCLUSIONS

Publicly available LLM applications often provide verbose responses with vague or inaccurate information regarding colon cancer management. Significant optimization is required before use in formal CDS.

