Hirosawa Takanobu, Harada Yukinori, Mizuta Kazuya, Sakamoto Tetsu, Tokumasu Kazuki, Shimizu Taro
Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan.
Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan.
JMIR Form Res. 2024 Jun 26;8:e59267. doi: 10.2196/59267.
The potential of artificial intelligence (AI) chatbots, particularly ChatGPT with GPT-4 (OpenAI), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential diagnosis lists.
This study aims to assess the capability of GPT-4 in identifying the final diagnosis from differential diagnosis lists and to compare its performance with that of physicians on a series of case reports.
We used a database of differential diagnosis lists generated from case reports in the American Journal of Case Reports, each paired with its final diagnosis. These lists were generated by 3 AI systems: GPT-4, Google Bard (currently Google Gemini), and Large Language Model Meta AI 2 (LLaMA 2). The primary outcome was whether GPT-4's evaluations identified the final diagnosis within these lists. None of these AIs received additional medical training or reinforcement. For comparison, 2 independent physicians also evaluated the lists, with any discrepancies resolved by a third physician.
The 3 AIs generated a total of 1176 differential diagnosis lists from 392 case descriptions. GPT-4's evaluations concurred with those of the physicians in 966 of the 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating fair to good agreement between GPT-4 and the physicians' evaluations.
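For context (this derivation is not part of the article), Cohen's κ adjusts the raw agreement rate for the agreement expected by chance:

\kappa = \frac{p_o - p_e}{1 - p_e}

where p_o is the observed agreement (here 966/1176 ≈ 0.821) and p_e is the chance-expected agreement derived from each rater's marginal frequencies. The marginals are not reported in the abstract, but with p_o ≈ 0.821 and κ = 0.63 the implied p_e is roughly 0.52.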
GPT-4 demonstrated fair to good agreement in identifying the final diagnosis from differential diagnosis lists, comparable to that of physicians for case report series. Its ability to compare differential diagnosis lists with final diagnoses suggests its potential to support clinical decision-making through diagnostic feedback. While GPT-4 showed fair to good agreement in this evaluation, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.