

Performance of Large Language Models (ChatGPT and Gemini Advanced) in Gastrointestinal Pathology and Clinical Review of Applications in Gastroenterology.

Author Information

Jain Swachi, Chakraborty Baidarbhi, Agarwal Ashish, Sharma Rashi

Affiliations

Pathology and Laboratory Medicine, Icahn School of Medicine at Mount Sinai, New York, USA.

Pathology, St. Clare Hospital, Denville, USA.

Publication Information

Cureus. 2025 Apr 2;17(4):e81618. doi: 10.7759/cureus.81618. eCollection 2025 Apr.

Abstract

Introduction: Artificial intelligence (AI) chatbots have been widely tested on various examinations, but data on their performance in clinical scenarios remain limited. Chat Generative Pre-Trained Transformer (ChatGPT) (OpenAI, San Francisco, California, United States) and Gemini Advanced (Google LLC, Mountain View, California, United States) have shown some promise in multiple aspects of gastroenterology, including answering patient questions, providing medical advice, and potentially assisting healthcare providers, though with many limitations. We aimed to study the performance of ChatGPT-4.0, ChatGPT-3.5, and Gemini Advanced across 20 clinicopathologic scenarios in the unexplored realm of gastrointestinal pathology.

Materials and Methods: Twenty clinicopathological scenarios in gastrointestinal pathology were provided to these three large language models. Two fellowship-trained pathologists independently assessed their responses, evaluating both the diagnostic accuracy and the confidence of the models. The results were then compared using the chi-squared test. The study also evaluated each model's ability in four key areas, namely, (1) ability to provide differential diagnoses, (2) interpretation of immunohistochemical stains, (3) ability to deliver a concise final diagnosis, and (4) the explanation provided for the thought process, using a five-point scoring system. The mean, median, standard deviation (SD), and interquartile range of the scores were calculated. A comparative analysis of these four parameters across ChatGPT-4.0, ChatGPT-3.5, and Gemini Advanced was conducted using the Mann-Whitney U test. A p-value of <0.05 was considered statistically significant. Other parameters evaluated were the ability to provide a tumor, node, and metastasis (TNM) stage and the incidence of pseudo-references ("hallucinations") when citing reference material.
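The paper does not publish its analysis code, but the per-model rubric comparison described above (five-point scores compared pairwise with a Mann-Whitney U test) can be sketched as follows. The score lists are hypothetical illustrations, not the study's data, and the normal-approximation p-value here omits the tie correction for brevity:

```python
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Ties receive average ranks; the tie correction to the variance is
    omitted for brevity, so p-values are approximate.
    """
    combined = sorted(x + y)
    # Assign each distinct value the average of the 1-based ranks it spans.
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    r1 = sum(ranks[v] for v in x)        # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)            # report the smaller U statistic
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma                 # z <= 0 because u <= mu
    p = min(2 * NormalDist().cdf(z), 1.0)
    return u, p

# Hypothetical five-point rubric scores for two models on 20 scenarios.
gpt4_scores = [4, 3, 4, 5, 3, 4, 4, 3, 5, 4, 3, 4, 4, 3, 4, 5, 3, 4, 4, 3]
gpt35_scores = [3, 2, 3, 3, 2, 3, 4, 2, 3, 3, 2, 3, 3, 2, 3, 4, 2, 3, 3, 2]
u_stat, p_value = mann_whitney_u(gpt4_scores, gpt35_scores)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

A library implementation such as SciPy's `mannwhitneyu` would normally be preferred in practice, since it applies the tie correction and can compute exact p-values for small samples.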
Results: Gemini Advanced (diagnostic accuracy: p=0.01; providing differential diagnosis: p=0.03) and ChatGPT-4.0 (interpretation of immunohistochemistry (IHC) stains: p=0.001; providing differential diagnosis: p=0.002) performed significantly better than ChatGPT-3.5 in certain areas, indicating continuously improving training data sets. However, the mean scores of ChatGPT-4.0 and Gemini Advanced ranged between 3.0 and 3.7 and were at best classified as average. None of the models could provide accurate TNM staging for these clinical scenarios, with 25-50% citing references that do not exist (hallucinations).

Conclusion: This study indicates that although these models are evolving, they require human supervision and definite improvement before being used in clinical medicine. To the best of our knowledge, this is the first study of its kind in gastrointestinal pathology.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9305/12048130/4606d46d6b62/cureus-0017-00000081618-i01.jpg
