A Comparative Study of Five Large Language Models' Response for Liver Cancer Comprehensive Treatment.

Authors

Zhong Deyuan, Liang Yuxin, Yan Hong-Tao, Chen Xinpei, Yang Qinyan, Ma Shuoshuo, Su Yuhao, Chen YaHui, Huang Xiaolun, Wang Ming

Affiliation

Department of Liver Transplantation Center and HBP Surgery, Sichuan Clinical Research Center for Cancer, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, School of Medicine, University of Electronic Science and Technology of China, Chengdu, People's Republic of China.

Publication information

J Hepatocell Carcinoma. 2025 Aug 20;12:1861-1871. doi: 10.2147/JHC.S531642. eCollection 2025.

DOI: 10.2147/JHC.S531642
PMID: 40861309
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12375359/

Abstract

INTRODUCTION

Large language models (LLMs) are increasingly used in healthcare, yet their reliability in specialized clinical fields remains uncertain. Liver cancer, as a complex and high-burden disease, poses unique challenges for AI-based tools. This study aimed to evaluate the comprehensibility and clinical applicability of five mainstream LLMs in addressing liver cancer-related clinical questions.

METHODS

We developed 90 standardized questions covering multiple aspects of liver cancer management. Five LLMs (GPT-4, Gemini, Copilot, Kimi, and Ernie Bot) were evaluated in a blinded fashion by three independent hepatobiliary experts. Responses were scored using predefined criteria for comprehensibility and clinical applicability. Overall group comparisons were conducted using the Fisher-Freeman-Halton test (for categorical data) and the Kruskal-Wallis test (for ordinal scores), followed by Dunn's post-hoc test or Fisher's exact test with Bonferroni correction. Inter-rater reliability was assessed using Fleiss' kappa.
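
As an illustration of the inter-rater reliability step, the following is a minimal sketch of Fleiss' kappa in plain Python. It is not the authors' analysis code: the three-level applicability scale (`full`, `partial`, `none`), the function name `fleiss_kappa`, and the toy ratings are assumptions made for the example, and the numbers are not data from the study.

```python
from collections import Counter
from math import nan

# Hypothetical rating scale; the abstract does not list the exact levels.
LEVELS = ["full", "partial", "none"]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that are each rated by the same number of raters.

    ratings: list of items; each item is a list holding one label per rater.
    categories: every label the raters were allowed to use.
    """
    if not ratings:
        raise ValueError("at least one rated item is required")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_observed = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / pairs
        for c in counts
    ) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return nan
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (illustrative only): four responses, three raters each.
toy_ratings = [
    ["full", "full", "full"],
    ["full", "partial", "full"],
    ["none", "none", "partial"],
    ["partial", "partial", "partial"],
]
print(f"Fleiss' kappa: {fleiss_kappa(toy_ratings, LEVELS):.3f}")
```

Values near 1 indicate agreement well beyond chance, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.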

RESULTS

Kimi and GPT-4 achieved the highest proportions of fully applicable responses (68% and 62%, respectively), while Ernie Bot and Copilot showed the lowest proportions. Comprehensibility was generally high, with Kimi and Ernie Bot scoring over 98%. However, none of the LLMs consistently provided guideline-concordant answers to all questions. Performance on professional-level questions was significantly lower than on common-sense questions, highlighting deficiencies in complex clinical reasoning.

CONCLUSION

LLMs demonstrate varied performance in liver cancer-related queries. While GPT-4 and Kimi show promise in clinical applicability, limitations in accuracy and consistency, particularly for complex medical decisions, underscore the need for domain-specific optimization before clinical integration.

TRIAL REGISTRATION

Not applicable.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e08/12375359/2503e1b5df09/JHC-12-1861-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e08/12375359/1364ed4e59e4/JHC-12-1861-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e08/12375359/bbf1bcc13ff6/JHC-12-1861-g0002.jpg
