AI at the Forefront: Navigating Oncologic Care for Six Gastrointestinal Cancers According to the NCCN Guidelines Utilizing Gemini-1.0 Ultra and ChatGPT-4.

Author information

Bresler Tamir E, Wilson Tyler, Makaryan Tadevos, Pandya Shivam, Palmer Kevin, Meyer Ryan, Htway Zin M, Fujita Manabu

Affiliations

Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, California, USA.

Department of Laboratory, Los Robles Regional Medical Center, Thousand Oaks, California, USA.

Publication information

J Surg Oncol. 2025 Jun 19. doi: 10.1002/jso.70005.

PMID:40536141

Abstract

BACKGROUND AND OBJECTIVES

We explored the ability of two large language models (LLMs), ChatGPT-4 and Gemini 1.0 Ultra, to guide clinical decision-making for six gastrointestinal cancers using the National Comprehensive Cancer Network (NCCN) Clinical Practice Guidelines.

METHODS

We reviewed the NCCN Guidelines for anal squamous cell carcinoma, small bowel, ampullary, and pancreatic adenocarcinoma, and biliary tract and gastric cancers. Clinical questions were designed and categorized by type, queried up to three times, and rated on a Likert scale: (5) Correct; (4) Correct following clarification; (3) Correct but incomplete; (2) Partially incorrect; (1) Absolutely incorrect. Subgroup analysis was conducted on Correctness (scores 3-5) and Accuracy (scores 4-5).
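
As a minimal sketch of how the two subgroup measures follow from the Likert ratings, the Python snippet below computes Correctness (scores 3-5) and Accuracy (scores 4-5) as percentages of all rated responses. The function name `subgroup_rates` and the sample scores are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def subgroup_rates(scores: Sequence[int]) -> Dict[str, float]:
    """Return Correctness (scores 3-5) and Accuracy (scores 4-5) as percentages."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("scores must be on the 1-5 Likert scale")
    n = len(scores)
    correct = sum(1 for s in scores if s >= 3)   # Correct, clarified, or incomplete
    accurate = sum(1 for s in scores if s >= 4)  # Correct or correct after clarification
    return {"correctness": 100 * correct / n, "accuracy": 100 * accurate / n}


# Hypothetical ratings for illustration only (not study data)
example_scores = [5, 4, 3, 2, 1, 5, 5, 3]
rates = subgroup_rates(example_scores)
```

Because every response that counts toward Accuracy also counts toward Correctness, the Accuracy rate can never exceed the Correctness rate.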

RESULTS

A total of 270 questions were generated (range per cancer: 32-68). Score differences between ChatGPT-4 and Gemini 1.0 Ultra were not statistically significant (mean rank 278.30 vs. 262.70, p = 0.222). Correctness was seen in 77.78% of ChatGPT-4 responses versus 75.93% of Gemini 1.0 Ultra responses, and Accuracy in 64.81% versus 57.41%, respectively. There were no statistically significant differences in Correctness or Accuracy between the LLMs by question type or cancer type.
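
The reported figures can be cross-checked with simple arithmetic. The sketch below rests on two assumptions that the abstract does not state explicitly: that each model was scored once on all 270 questions (so every percentage is a count out of 270), and that the mean ranks come from a two-group rank-based comparison over the 540 pooled responses. The response counts it derives are inferred, not reported.

```python
import math

N_PER_MODEL = 270  # assumption: every question was scored once per model

# Percentages as reported in the abstract
reported = {
    "ChatGPT-4 correctness": 77.78,
    "Gemini 1.0 Ultra correctness": 75.93,
    "ChatGPT-4 accuracy": 64.81,
    "Gemini 1.0 Ultra accuracy": 57.41,
}


def implied_count(percent: float, n: int = N_PER_MODEL) -> int:
    """Nearest whole number of responses that a percentage of n implies."""
    return round(percent * n / 100)


implied = {name: implied_count(pct) for name, pct in reported.items()}

# Each inferred count should round back to the reported percentage
round_trip_ok = all(
    math.isclose(round(100 * count / N_PER_MODEL, 2), reported[name])
    for name, count in implied.items()
)

# With two equal groups, the mean ranks must average to (N + 1) / 2,
# where N is the pooled number of responses
expected_avg_rank = (2 * N_PER_MODEL + 1) / 2
observed_avg_rank = (278.30 + 262.70) / 2
ranks_consistent = math.isclose(expected_avg_rank, observed_avg_rank)
```

Under these assumptions the percentages correspond to 210 and 205 responses meeting the Correctness criterion and 175 and 155 meeting the Accuracy criterion, and the two mean ranks average to 270.5, as expected for 540 pooled responses.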

CONCLUSIONS

Both LLMs demonstrated a limited capacity to assist with complex clinical decision-making. Their current Accuracy level falls below the acceptable threshold for clinical use. Future studies exploring LLMs in the healthcare domain are warranted.
