Zhong Deyuan, Liang Yuxin, Yan Hong-Tao, Chen Xinpei, Yang Qinyan, Ma Shuoshuo, Su Yuhao, Chen YaHui, Huang Xiaolun, Wang Ming
Department of Liver Transplantation Center and HBP Surgery, Sichuan Clinical Research Center for Cancer, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, School of Medicine, University of Electronic Science and Technology of China, Chengdu, People's Republic of China.
J Hepatocell Carcinoma. 2025 Aug 20;12:1861-1871. doi: 10.2147/JHC.S531642. eCollection 2025.
Large language models (LLMs) are increasingly used in healthcare, yet their reliability in specialized clinical fields remains uncertain. Liver cancer, as a complex and high-burden disease, poses unique challenges for AI-based tools. This study aimed to evaluate the comprehensibility and clinical applicability of five mainstream LLMs in addressing liver cancer-related clinical questions.
We developed 90 standardized questions covering multiple aspects of liver cancer management. Five LLMs (GPT-4, Gemini, Copilot, Kimi, and Ernie Bot) were evaluated in a blinded fashion by three independent hepatobiliary experts. Responses were scored using predefined criteria for comprehensibility and clinical applicability. Overall group comparisons were conducted using the Fisher-Freeman-Halton test (for categorical data) and the Kruskal-Wallis test (for ordinal scores), followed by Dunn's post-hoc test or Fisher's exact test with Bonferroni correction. Inter-rater reliability was assessed using Fleiss' kappa.
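The core of this pipeline, a Kruskal-Wallis comparison of ordinal scores across the five models and a Fleiss' kappa check of agreement among the three raters, can be sketched as follows. This is a minimal illustration only: the per-question scores below are synthetic (the abstract does not report raw data), and `fleiss_kappa` is a direct implementation of the standard formula, not the authors' code.

```python
import numpy as np
from scipy.stats import kruskal

def fleiss_kappa(counts):
    """Fleiss' kappa for an N x k table of per-subject category counts.

    counts[i, j] = number of raters assigning subject i to category j.
    Every row must sum to the same number of raters n.
    """
    counts = np.asarray(counts, dtype=float)
    N, _ = counts.shape
    n = counts.sum(axis=1)[0]                       # raters per subject
    p_j = counts.sum(axis=0) / (N * n)              # marginal category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()   # observed vs. chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Synthetic ordinal applicability scores (1-4) for 90 questions per model.
rng = np.random.default_rng(0)
scores = {name: rng.integers(1, 5, size=90)
          for name in ["GPT-4", "Gemini", "Copilot", "Kimi", "Ernie Bot"]}

# Omnibus comparison across the five models.
H, p = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {H:.2f}, p = {p:.3f}")

# Three raters in perfect agreement on four subjects -> kappa = 1.0
perfect = np.array([[3, 0], [3, 0], [0, 3], [0, 3]])
print(f"Fleiss' kappa = {fleiss_kappa(perfect):.2f}")
```

In practice the omnibus Kruskal-Wallis result would then gate the Dunn's post-hoc pairwise comparisons mentioned above (e.g. via `scikit-posthocs`), with Bonferroni correction applied to the pairwise p-values.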
Kimi and GPT-4 achieved the highest proportions of fully applicable responses (68% and 62%, respectively), while Ernie Bot and Copilot showed the lowest. Comprehensibility was generally high, with Kimi and Ernie Bot scoring over 98%. However, none of the LLMs consistently provided guideline-concordant answers to all questions. Performance on professional-level questions was significantly lower than on common-sense ones, highlighting deficiencies in complex clinical reasoning.
LLMs demonstrate varied performance on liver cancer-related queries. While GPT-4 and Kimi show promise in clinical applicability, limitations in accuracy and consistency, particularly for complex medical decisions, underscore the need for domain-specific optimization before clinical integration.
Not applicable.