Pham Justin H, Thongprayoon Charat, Suppadungsuk Supawadee, Miao Jing, Craici Iasmina M, Cheungpasitporn Wisit
Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA.
Department of Nephrology and Hypertension, Mayo Clinic, Rochester, MN, USA.
Digit Health. 2024 Aug 28;10:20552076241277458. doi: 10.1177/20552076241277458. eCollection 2024 Jan-Dec.
Professional opinion polling has become a popular means of seeking advice for complex nephrology questions in the #AskRenal community on X. ChatGPT is a large language model with remarkable problem-solving capabilities, but its ability to provide solutions for real-world clinical scenarios remains unproven. This study seeks to evaluate how closely ChatGPT's responses align with current prevailing medical opinions in nephrology.
Nephrology polls from X were submitted to ChatGPT-4, which generated answers without prior knowledge of the poll outcomes. Its responses were compared with the poll results (inter-rater agreement) and with a second set of responses generated after a one-week interval (intra-rater agreement), using Cohen's kappa statistic (κ). Subgroup analysis was performed by question subject matter.
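For readers unfamiliar with the agreement statistic, the sketch below illustrates how Cohen's kappa can be computed for paired categorical answers. The toy labels and the use of scikit-learn are illustrative assumptions only, not the study's actual analysis code.

```python
# Illustrative sketch: Cohen's kappa for paired categorical answers.
# The labels below are made-up toy data, not the study's poll/ChatGPT responses.
from sklearn.metrics import cohen_kappa_score

poll_winners    = ["A", "B", "A", "C", "B", "A", "D", "B"]  # option favored by each poll
chatgpt_answers = ["A", "B", "C", "C", "B", "A", "B", "B"]  # ChatGPT's pick for the same polls

# Observed agreement: simple proportion of matching answers
p_o = sum(p == c for p, c in zip(poll_winners, chatgpt_answers)) / len(poll_winners)

# Cohen's kappa corrects observed agreement for agreement expected by chance
kappa = cohen_kappa_score(poll_winners, chatgpt_answers)

print(f"observed agreement = {p_o:.2f}, kappa = {kappa:.2f}")
```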
Our analysis comprised two rounds of testing ChatGPT on 271 nephrology-related questions. In the first round, ChatGPT's responses agreed with poll results for 163 of the 271 questions (60.2%; κ = 0.42, 95% CI: 0.38-0.46). In the second round, conducted to assess reproducibility, agreement improved slightly to 171 out of 271 questions (63.1%; κ = 0.46, 95% CI: 0.42-0.50). Comparison of ChatGPT's responses between the two rounds demonstrated high internal consistency, with agreement in 245 out of 271 responses (90.4%; κ = 0.86, 95% CI: 0.82-0.90). Subgroup analysis revealed stronger performance in the combined areas of homeostasis, nephrolithiasis, and pharmacology (κ = 0.53, 95% CI: 0.47-0.59 in both rounds), compared to other nephrology subfields.
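As a rough sanity check on how the reported percentage agreement and κ relate, the standard definition of Cohen's kappa can be rearranged to back out the implied chance agreement for the first round; the figure below is our own calculation from the reported numbers, not a value stated by the study.

$$\kappa = \frac{p_o - p_e}{1 - p_e} \;\Rightarrow\; p_e = \frac{p_o - \kappa}{1 - \kappa} = \frac{0.602 - 0.42}{1 - 0.42} \approx 0.31$$

That is, the first-round results imply roughly 31% agreement expected by chance alone.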
ChatGPT-4 demonstrates modest capability in replicating prevailing professional opinion in nephrology polls overall, with performance varying across question topics and excellent internal consistency. This study provides insights into the potential and limitations of using ChatGPT in medical decision-making.