Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation.

Authors

Lin Shih-Yi, Jiang Chang-Cheng, Law Kin-Man, Yeh Pei-Chun, Tsai Min-Kuang, Chou Chu-Lin, Wang I-Kuan, Ting I-Wen, Chen Yu-Wei, Chou Che-Yi, Hsieh Ming-Han, Pan Heng-Chih, Hsieh Sung-Lin, Chiu Chien-Hua, Lee Pei-Wen, Hong Yu-Cyuan, Hsu Ying-Yu, Kuo Huey-Liang, Ju Shu-Woei, Kao Chia-Hung

Affiliations

Graduate Institute of Biomedical Sciences, College of Medicine, China Medical University, Taichung, Taiwan.

Division of Nephrology and Kidney Institute, China Medical University Hospital, Taichung, Taiwan.

Publication information

Digit Health. 2025 Jun 2;11:20552076251342067. doi: 10.1177/20552076251342067. eCollection 2025 Jan-Dec.

Abstract

IMPORTANCE

This study compares the responses of four AI models to common nephrology-related questions encountered in clinical settings.

OBJECTIVE

To evaluate generative AI models in enhancing nephrology patient communication and education.

DESIGN

Generative AI in Nephrology.

SETTING

In a study conducted from December 8-12, 2023, and October 21-23, 2024, IT engineers evaluated GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 for nephrology patient communication and education, querying each with 21 nephrology questions and three renal biopsy reports, repeated for consistency.

INTERVENTIONS FOR CLINICAL TRIALS OR EXPOSURES FOR OBSERVATIONAL STUDIES

None.

MAIN OUTCOMES AND MEASURES

Fifteen nephrologists and one nephrology researcher rated responses for Appropriateness, Helpfulness, Consistency, and human-like empathy on a 1-4 scale. Shapiro-Wilk and Mann-Whitney tests with Holm correction were used, along with TF-IDF, BertScore, and ROUGE. The study compared the performance of GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 across 24 nephrology-related questions.
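As an illustration of the Holm correction named above, the following is a minimal sketch in plain Python. It is not the authors' analysis code: the function name `holm_adjust` and the p-values are hypothetical, chosen only to show how raw p-values from several pairwise tests (for example, the six possible pairings of four models) would be adjusted.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons (not study data).
raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

A comparison is then declared significant when its adjusted p-value falls below the chosen threshold (commonly 0.05).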

RESULTS

GPT-4o consistently achieved high scores in Appropriateness (3.39 ± 0.7) and Helpfulness (3.24 ± 0.73), while PaLM 2 demonstrated the highest consistency score (3.0 ± 0.86). In empathy, GPT-4 achieved the highest overall score (80.73%), excelling in patient-centric scenarios, followed by GPT-4o (76.56%). PaLM 2 showed competitive empathy in specific cases, despite scoring lower in consistency and Appropriateness. For kidney-related queries, GPT-4o excelled in relevance metrics, achieving the highest BertScore (0.57) and ROUGE for one-word metrics (0.54). Gemini 1.0 Ultra led in generating coherent responses for renal biopsy reports, with the highest TF-IDF (0.56) and ROUGE for longest similar sentences (0.47). All 101 references provided by GPT-4 were accurate.
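The two ROUGE figures above appear to correspond to ROUGE-1 (single-word overlap) and ROUGE-L (longest common subsequence); that reading is an assumption, since the abstract does not name the variants or state which reference texts were used. The sketch below shows how the two scores could be computed as F1 values on whitespace tokens. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def f1(overlap, cand_len, ref_len):
    """Combine precision and recall into an F1 score."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f1(candidate, reference):
    """ROUGE-1: overlap of single words, ignoring word order."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L: longest in-order word sequence shared by both texts."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


# Hypothetical texts, not taken from the study.
reference = "the kidney filters waste from blood"
candidate = "blood the kidney filters"
print(rouge_1_f1(candidate, reference), rouge_l_f1(candidate, reference))
```

Because ROUGE-1 ignores word order while ROUGE-L rewards matching order, a response can score well on one and lower on the other, which is consistent with different models leading on different metrics.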

CONCLUSIONS AND RELEVANCE

GPT-4o emerged as the most accurate and consistent model across most evaluation categories, while GPT-4 demonstrated superior empathy and balanced performance. PaLM 2 and Gemini 1.0 Ultra showed strengths in specific areas, highlighting the potential for tailored applications of generative AI in nephrology clinical practice.
