Bresler Tamir E, Wilson Tyler, Makaryan Tadevos, Pandya Shivam, Palmer Kevin, Meyer Ryan, Htway Zin M, Fujita Manabu
Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, California, USA.
Department of Laboratory, Los Robles Regional Medical Center, Thousand Oaks, California, USA.
J Surg Oncol. 2025 Jun 19. doi: 10.1002/jso.70005.
We explored the ability of the large language models (LLMs) ChatGPT-4 and Gemini 1.0 Ultra to guide clinical decision-making for six gastrointestinal cancers using the National Comprehensive Cancer Network (NCCN) Clinical Practice Guidelines.
We reviewed the NCCN Guidelines for anal squamous cell carcinoma; small bowel, ampullary, and pancreatic adenocarcinoma; and biliary tract and gastric cancers. Clinical questions were designed and categorized by type, queried up to three times, and rated on a Likert scale: (5) Correct; (4) Correct following clarification; (3) Correct but incomplete; (2) Partially incorrect; (1) Absolutely incorrect. Subgroup analysis was conducted on Correctness (scores 3-5) and Accuracy (scores 4-5).
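As a minimal sketch of how the two subgroup measures follow from the Likert ratings, assuming each response has already been assigned an integer score from 1 to 5, the percentages can be computed as below. The function name and the example scores are hypothetical and are not data from the study.

```python
from typing import Iterable, Tuple


def correctness_and_accuracy(scores: Iterable[int]) -> Tuple[float, float]:
    """Return (Correctness %, Accuracy %) for a set of Likert ratings.

    Correctness counts scores 3-5; Accuracy counts scores 4-5,
    following the definitions given in the abstract.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("scores must be integers from 1 to 5")

    n = len(scores)
    correct = sum(1 for s in scores if s >= 3)   # scores 3, 4, 5
    accurate = sum(1 for s in scores if s >= 4)  # scores 4, 5
    return 100 * correct / n, 100 * accurate / n


# Hypothetical ratings for eight responses (illustration only).
example_scores = [5, 4, 3, 2, 1, 5, 4, 3]
correctness_pct, accuracy_pct = correctness_and_accuracy(example_scores)
print(f"Correctness: {correctness_pct:.2f}%  Accuracy: {accuracy_pct:.2f}%")
```

Because Accuracy uses a stricter cutoff than Correctness, every response counted as accurate is also counted as correct, so the Accuracy percentage can never exceed the Correctness percentage.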
A total of 270 questions were generated (range per cancer, 32-68). Score differences between ChatGPT-4 and Gemini 1.0 Ultra were not statistically significant (mean rank 278.30 vs. 262.70, p = 0.222). Correctness was seen in 77.78% versus 75.93% of responses, and Accuracy in 64.81% versus 57.41%. There were no statistically significant differences in Correctness or Accuracy between the LLMs by question type or cancer type.
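The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that each model answered all 270 questions, so that each percentage has a denominator of 270; the response counts it prints are inferred from the percentages and are not stated in the abstract.

```python
# Assumption: 270 responses per model (one per question).
N_PER_MODEL = 270

# Percentages as reported in the abstract.
reported = {
    ("ChatGPT-4", "Correctness"): 77.78,
    ("Gemini 1.0 Ultra", "Correctness"): 75.93,
    ("ChatGPT-4", "Accuracy"): 64.81,
    ("Gemini 1.0 Ultra", "Accuracy"): 57.41,
}


def implied_count(pct: float, n: int = N_PER_MODEL) -> int:
    """Nearest whole number of responses that a percentage of n implies."""
    return round(pct / 100 * n)


for (model, measure), pct in reported.items():
    k = implied_count(pct)
    # Recompute the percentage from the inferred count to confirm it
    # rounds back to the reported value.
    print(f"{model:17s} {measure:11s} {k}/{N_PER_MODEL} = {100 * k / N_PER_MODEL:.2f}%")

# For two groups of equal size n ranked together, the two mean ranks
# sum to 2n + 1, which gives a second consistency check.
mean_rank_sum = 278.30 + 262.70
print(f"Sum of mean ranks: {mean_rank_sum:.2f} (2n + 1 = {2 * N_PER_MODEL + 1})")
```

Under this assumption the percentages and the mean ranks are mutually consistent with 540 rated responses in total, 270 per model.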
Both LLMs demonstrated a limited capacity to assist with complex clinical decision-making. Their current Accuracy level falls below the acceptable threshold for clinical use. Future studies exploring LLMs in the healthcare domain are warranted.