Division of Nephrology and Hypertension, Department of Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-Ku, Kawasaki, Kanagawa, 216-8511, Japan.
Clin Exp Nephrol. 2024 May;28(5):465-469. doi: 10.1007/s10157-023-02451-w. Epub 2024 Feb 14.
Large language models (LLMs) have driven recent advances in artificial intelligence. While LLMs have demonstrated high performance in general medical examinations, their performance in specialized areas such as nephrology is unclear. This study aimed to evaluate the potential of ChatGPT and Bard for applications in nephrology.
Ninety-nine questions from the Self-Assessment Questions for Nephrology Board Renewal from 2018 to 2022 were presented to two versions of ChatGPT (GPT-3.5 and GPT-4) and to Bard. We calculated the correct answer rates for the five years combined, for each year, and for each question category, and checked whether they exceeded the pass criterion. The correct answer rates were also compared with those of nephrology residents.
The overall correct answer rates for GPT-3.5, GPT-4, and Bard were 31.3% (31/99), 54.5% (54/99), and 32.3% (32/99), respectively. GPT-4 thus significantly outperformed both GPT-3.5 (p < 0.01) and Bard (p < 0.01). GPT-4 passed in three years, barely meeting the minimum threshold in two. GPT-4 performed significantly better than GPT-3.5 and Bard on problem-solving, clinical, and non-image questions. GPT-4's performance fell between that of third- and fourth-year nephrology residents.
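The overall rates and pairwise comparisons reported above can be reproduced approximately with a short calculation. The sketch below is illustrative only: the abstract does not state which statistical test was used, so the pooled two-proportion z-test here is an assumption, and it ignores the fact that all three models answered the same 99 questions (a paired test would need the per-question results, which are not given). The names `correct_rate`, `two_proportion_p_value`, and `scores` are hypothetical.

```python
import math

# Overall counts of correct answers out of 99 questions, as reported in the abstract.
TOTAL_QUESTIONS = 99
scores = {"GPT-3.5": 31, "GPT-4": 54, "Bard": 32}


def correct_rate(correct: int, total: int) -> float:
    """Return the correct answer rate as a percentage."""
    return 100.0 * correct / total


def two_proportion_p_value(c1: int, n1: int, c2: int, n2: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test.

    Assumption: this treats the two samples as independent, which is a
    simplification; the study's actual test is not stated in the abstract.
    """
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


for model, correct in scores.items():
    print(f"{model}: {correct_rate(correct, TOTAL_QUESTIONS):.1f}%")

for other in ("GPT-3.5", "Bard"):
    p = two_proportion_p_value(
        scores["GPT-4"], TOTAL_QUESTIONS, scores[other], TOTAL_QUESTIONS
    )
    print(f"GPT-4 vs {other}: p < 0.01 is {p < 0.01}")
```

Under this simplified test, both GPT-4 comparisons fall below the 0.01 level, which is consistent with the reported results, while the difference between GPT-3.5 and Bard does not.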
GPT-4 outperformed GPT-3.5 and Bard and met the Nephrology Board renewal standards in specific years, albeit marginally. These results highlight LLMs' potential and limitations in nephrology. As LLMs advance, nephrologists should understand their performance for future applications.