Lin Shih-Yi, Jiang Chang-Cheng, Law Kin-Man, Yeh Pei-Chun, Tsai Min-Kuang, Chou Chu-Lin, Wang I-Kuan, Ting I-Wen, Chen Yu-Wei, Chou Che-Yi, Hsieh Ming-Han, Pan Heng-Chih, Hsieh Sung-Lin, Chiu Chien-Hua, Lee Pei-Wen, Hong Yu-Cyuan, Hsu Ying-Yu, Kuo Huey-Liang, Ju Shu-Woei, Kao Chia-Hung
Graduate Institute of Biomedical Sciences, College of Medicine, China Medical University, Taichung, Taiwan.
Division of Nephrology and Kidney Institute, China Medical University Hospital, Taichung, Taiwan.
Digit Health. 2025 Jun 2;11:20552076251342067. doi: 10.1177/20552076251342067. eCollection 2025 Jan-Dec.
Compares the responses of four AI models to common nephrology-related questions encountered in clinical settings.
To evaluate generative AI models in enhancing nephrology patient communication and education.
Generative AI in Nephrology.
In a study conducted from December 8-12, 2023, and October 21-23, 2024, IT engineers evaluated GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 for nephrology patient communication and education, querying each with 21 nephrology questions and three renal biopsy reports, repeated for consistency.
None.
Fifteen nephrologists and one nephrology researcher rated responses for Appropriateness, Helpfulness, Consistency, and human-like empathy on a 1-4 scale. Shapiro-Wilk and Mann-Whitney tests with Holm correction were applied, along with TF-IDF, BertScore, and ROUGE similarity metrics. The study compared the performance of GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 across 24 nephrology-related questions.
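The analysis pipeline described above can be sketched in Python. This is a minimal illustration with hypothetical ratings and sentences, not the authors' actual data or code: a Shapiro-Wilk normality check motivating the nonparametric Mann-Whitney U test, Holm correction across multiple comparisons, and a TF-IDF cosine similarity between a model answer and a reference text.

```python
# Hedged sketch of the evaluation pipeline; all scores below are hypothetical.
from scipy.stats import shapiro, mannwhitneyu
from statsmodels.stats.multitest import multipletests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 1-4 Appropriateness ratings for two models (illustrative only).
model_a = [4, 3, 4, 3, 4, 3, 3, 4, 4, 3]
model_b = [3, 2, 3, 2, 3, 3, 2, 3, 2, 3]

# Shapiro-Wilk: ordinal 1-4 ratings are rarely normal, so a
# nonparametric comparison is appropriate.
_, p_norm = shapiro(model_a)

# Mann-Whitney U compares the two rating distributions.
_, p_mw = mannwhitneyu(model_a, model_b, alternative="two-sided")

# Holm correction across multiple pairwise comparisons
# (a second dummy p-value stands in for another model pair).
reject, p_adj, _, _ = multipletests([p_mw, 0.20], method="holm")

# TF-IDF cosine similarity between a model answer and a reference answer.
docs = [
    "chronic kidney disease is staged by eGFR",    # model answer
    "eGFR stages chronic kidney disease severity",  # reference
]
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"Mann-Whitney p={p_mw:.3f}, Holm-adjusted={p_adj[0]:.3f}, TF-IDF sim={sim:.2f}")
```

BertScore and ROUGE would be computed analogously with their respective packages (e.g. `bert-score`, `rouge-score`); they are omitted here to keep the sketch self-contained.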
GPT-4o consistently achieved high scores in Appropriateness (3.39 ± 0.7) and Helpfulness (3.24 ± 0.73), while PaLM 2 demonstrated the highest Consistency score (3.0 ± 0.86). In empathy, GPT-4 achieved the highest overall score (80.73%), excelling in patient-centric scenarios, followed by GPT-4o (76.56%). PaLM 2 showed competitive empathy in specific cases, despite scoring lower in Consistency and Appropriateness. For kidney-related queries, GPT-4o excelled in relevance metrics, achieving the highest BertScore (0.57) and the highest ROUGE score for one-word matches (0.54). Gemini 1.0 Ultra led in generating coherent responses to renal biopsy reports, with the highest TF-IDF (0.56) and the highest ROUGE score for longest similar sentences (0.47). All 101 references provided by GPT-4 were accurate.
GPT-4o emerged as the most accurate and consistent model across most evaluation categories, while GPT-4 demonstrated superior empathy and balanced performance. PaLM 2 and Gemini 1.0 Ultra showed strengths in specific areas, highlighting the potential for tailored applications of generative AI in nephrology clinical practice.