Sav Nadide Melike
Department of Pediatric Nephrology, Duzce University, Duzce, Turkey.
Pediatr Nephrol. 2025 Mar 5. doi: 10.1007/s00467-025-06723-3.
Artificial intelligence (AI) has emerged as a transformative tool in healthcare, offering significant advances in the delivery of accurate clinical information. However, the performance and applicability of AI models in specialized fields such as pediatric nephrology remain underexplored. This study aimed to evaluate the ability of two AI-based language models, GPT-3.5 and GPT-4, to provide accurate and reliable clinical information in pediatric nephrology. The models were evaluated on four criteria: accuracy, scope, patient friendliness, and clinical applicability.
Forty pediatric nephrology specialists with ≥ 5 years of experience rated GPT-3.5 and GPT-4 responses to 10 clinical questions using a 1-5 scale via Google Forms. Ethical approval was obtained, and informed consent was secured from all participants.
Both GPT-3.5 and GPT-4 demonstrated comparable performance across all criteria, with no statistically significant differences observed (p > 0.05). GPT-4 exhibited slightly higher mean scores on all parameters, but the differences were negligible (Cohen's d < 0.1 for all criteria). Reliability analysis revealed low internal consistency for both models (Cronbach's alpha between 0.019 and 0.162). Correlation analysis indicated no significant relationship between participants' years of professional experience and their evaluations of GPT-3.5 (correlation coefficients from -0.026 to 0.074).
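The effect-size and reliability statistics reported above can be reproduced from rater data with standard formulas. The sketch below is illustrative only (it is not the study's analysis code, and the example ratings are hypothetical): Cohen's d as a pooled-standard-deviation standardized mean difference, and Cronbach's alpha from per-item variances versus the variance of raters' total scores.

```python
import statistics as st

def cohens_d(a, b):
    """Standardized mean difference using the pooled (equal-variance) SD."""
    na, nb = len(a), len(b)
    pooled_sd = (((na - 1) * st.variance(a) + (nb - 1) * st.variance(b))
                 / (na + nb - 2)) ** 0.5
    return (st.mean(a) - st.mean(b)) / pooled_sd

def cronbach_alpha(items):
    """Internal consistency; `items` is a list of per-item score lists,
    each of the same length (one score per rater)."""
    k = len(items)                                  # number of items
    totals = [sum(scores) for scores in zip(*items)]  # each rater's total
    item_var_sum = sum(st.variance(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / st.variance(totals))

# Hypothetical 1-5 ratings from five raters on two questions per model
gpt4_scores = [4, 3, 5, 4, 3]
gpt35_scores = [3, 3, 4, 4, 3]
print(round(cohens_d(gpt4_scores, gpt35_scores), 3))
print(round(cronbach_alpha([gpt4_scores, gpt35_scores]), 3))
```

A d near 0 indicates a negligible mean difference between the models' ratings, and an alpha well below the conventional 0.7 threshold (as in the 0.019-0.162 range reported) signals that raters did not score the items consistently.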
While GPT-3.5 and GPT-4 provided a foundational level of clinical information support, neither model exhibited superior performance in addressing the unique challenges of pediatric nephrology. The findings highlight the need for domain-specific training and integration of updated clinical guidelines to enhance the applicability and reliability of AI models in specialized fields. This study underscores the potential of AI in pediatric nephrology while emphasizing the importance of human oversight and the need for further refinements in AI applications.