Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation.

Authors

Lin Shih-Yi, Jiang Chang-Cheng, Law Kin-Man, Yeh Pei-Chun, Tsai Min-Kuang, Chou Chu-Lin, Wang I-Kuan, Ting I-Wen, Chen Yu-Wei, Chou Che-Yi, Hsieh Ming-Han, Pan Heng-Chih, Hsieh Sung-Lin, Chiu Chien-Hua, Lee Pei-Wen, Hong Yu-Cyuan, Hsu Ying-Yu, Kuo Huey-Liang, Ju Shu-Woei, Kao Chia-Hung

Affiliations

Graduate Institute of Biomedical Sciences, College of Medicine, China Medical University, Taichung, Taiwan.

Division of Nephrology and Kidney Institute, China Medical University Hospital, Taichung, Taiwan.

Publication information

Digit Health. 2025 Jun 2;11:20552076251342067. doi: 10.1177/20552076251342067. eCollection 2025 Jan-Dec.

DOI:10.1177/20552076251342067

PMID:40469778

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12134521/

Abstract

IMPORTANCE

This study compares the responses of four AI models to common nephrology-related questions encountered in clinical settings.

OBJECTIVE

To evaluate generative AI models in enhancing nephrology patient communication and education.

DESIGN

Generative AI in Nephrology.

SETTING

In a study conducted on December 8-12, 2023, and October 21-23, 2024, IT engineers evaluated GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 for nephrology patient communication and education, querying each model with 21 nephrology questions and three renal biopsy reports and repeating the queries to check consistency.

INTERVENTIONS FOR CLINICAL TRIALS OR EXPOSURES FOR OBSERVATIONAL STUDIES

None.

MAIN OUTCOMES AND MEASURES

Fifteen nephrologists and one nephrology researcher rated the responses for Appropriateness, Helpfulness, Consistency, and human-like empathy on a 1-4 scale. Statistical analysis used Shapiro-Wilk and Mann-Whitney tests with Holm correction; text similarity was measured with TF-IDF, BERTScore, and ROUGE. The study compared the performance of GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 across 24 nephrology-related queries (21 questions and three renal biopsy reports).
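The Holm correction named above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function name `holm_adjust` and the six p-values are hypothetical, chosen only because four models give six pairwise comparisons. The raw p-values would come from the pairwise Mann-Whitney tests, which are not reproduced here.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for the six model pairs (illustration only).
raw_p = [0.001, 0.020, 0.030, 0.004, 0.250, 0.600]
for raw, adj in zip(raw_p, holm_adjust(raw_p)):
    print(f"raw={raw:.3f}  holm={adj:.3f}")
```

A comparison would then be called significant when its adjusted p-value falls below the chosen alpha level, which the abstract does not state.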

RESULTS

GPT-4o consistently achieved high scores in Appropriateness (3.39 ± 0.7) and Helpfulness (3.24 ± 0.73), while PaLM 2 demonstrated the highest consistency score (3.0 ± 0.86). In empathy, GPT-4 achieved the highest overall score (80.73%), excelling in patient-centric scenarios, followed by GPT-4o (76.56%). PaLM 2 showed competitive empathy in specific cases, despite scoring lower in consistency and Appropriateness. For kidney-related queries, GPT-4o excelled in relevance metrics, achieving the highest BERTScore (0.57) and the highest ROUGE score for single-word overlap (0.54). Gemini 1.0 Ultra led in generating coherent responses for renal biopsy reports, with the highest TF-IDF score (0.56) and the highest ROUGE score for longest matching sequences (0.47). All 101 references provided by GPT-4 were accurate.
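To make the two ROUGE figures concrete, here is a minimal Python sketch of word-overlap scoring. It assumes the abstract's two ROUGE descriptions correspond to ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence), which the abstract does not state explicitly. The function names, the whitespace tokenization, and the example sentences are hypothetical and are not taken from the study.

```python
from collections import Counter


def _f_measure(overlap, ref_len, cand_len):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f(reference, candidate):
    """Unigram-overlap F-measure on lowercased, whitespace-split tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return _f_measure(overlap, len(ref), len(cand))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference, candidate):
    """Longest-common-subsequence F-measure on the same tokenization."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return _f_measure(lcs_length(ref, cand), len(ref), len(cand))


# Hypothetical reference and model answer (illustration only).
reference = "the kidney filters blood"
answer = "blood the kidney"
print(rouge_1_f(reference, answer), rouge_l_f(reference, answer))
```

The example shows why the two scores can diverge: reordering words leaves the unigram overlap intact but shortens the longest common subsequence. Published ROUGE implementations typically add stemming and more careful tokenization, so values from this sketch would not be comparable to the scores reported above.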

CONCLUSIONS AND RELEVANCE

GPT-4o emerged as the most accurate and consistent model across most evaluation categories, while GPT-4 demonstrated superior empathy and balanced performance. PaLM 2 and Gemini 1.0 Ultra showed strengths in specific areas, highlighting the potential for tailored applications of generative AI in nephrology clinical practice.
