Hirosawa Takanobu, Kawamura Ren, Harada Yukinori, Mizuta Kazuya, Tokumasu Kazuki, Kaji Yuki, Suzuki Tomoharu, Shimizu Taro
Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan.
Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan.
JMIR Med Inform. 2023 Oct 9;11:e48808. doi: 10.2196/48808.
The diagnostic accuracy of differential diagnoses generated by artificial intelligence chatbots, including ChatGPT models, for complex clinical vignettes derived from general internal medicine (GIM) department case reports is unknown.
This study aims to evaluate the accuracy of the differential diagnosis lists generated by both third-generation ChatGPT (ChatGPT-3.5) and fourth-generation ChatGPT (ChatGPT-4) by using case vignettes from case reports published by the Department of GIM of Dokkyo Medical University Hospital, Japan.
We searched PubMed for case reports. Upon identification, physicians selected diagnostic cases, determined the final diagnosis, and converted the cases into clinical vignettes. Physicians entered a fixed prompt text together with each clinical vignette into ChatGPT-3.5 and ChatGPT-4 to generate the top 10 differential diagnoses. The ChatGPT models were not specially trained or further reinforced for this task. Three GIM physicians from other medical institutions created differential diagnosis lists by reading the same clinical vignettes. We measured the rate of correct diagnosis within the top 10 differential diagnosis lists, the top 5 differential diagnosis lists, and the top diagnosis.
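The models were queried through the ChatGPT interface rather than programmatically; purely as a rough illustration, the Python sketch below shows how an equivalent query could be issued through the OpenAI API. The model names, prompt wording, and helper function are assumptions for illustration, not the authors' exact protocol.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumed prompt wording; the exact text entered by the physicians is not given in this abstract.
PROMPT = "Tell me the top 10 suspected illnesses for the following clinical vignette:\n\n{vignette}"

def top10_differentials(vignette: str, model: str = "gpt-4") -> str:
    """Query one model (e.g., gpt-3.5-turbo or gpt-4) for a top 10 differential diagnosis list."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(vignette=vignette)}],
    )
    return response.choices[0].message.content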
In total, 52 case reports were analyzed. The rates of correct diagnosis by ChatGPT-4 within the top 10 differential diagnosis lists, top 5 differential diagnosis lists, and top diagnosis were 83% (43/52), 81% (42/52), and 60% (31/52), respectively. The rates of correct diagnosis by ChatGPT-3.5 within the top 10 differential diagnosis lists, top 5 differential diagnosis lists, and top diagnosis were 73% (38/52), 65% (34/52), and 42% (22/52), respectively. The rates of correct diagnosis by ChatGPT-4 were comparable to those of the physicians within the top 10 (43/52, 83% vs 39/52, 75%; P=.47) and top 5 (42/52, 81% vs 35/52, 67%; P=.18) differential diagnosis lists and for the top diagnosis (31/52, 60% vs 26/52, 50%; P=.43); none of these differences were statistically significant. The diagnostic accuracy of the ChatGPT models did not vary significantly by open access status or publication date (before 2011 vs 2022).
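As a hedged illustration of how such paired accuracy rates can be compared, the sketch below applies a Fisher exact test to 2x2 tables built from the counts reported in this abstract; the abstract does not state which statistical test the authors used, so the computed P values are not guaranteed to reproduce those reported above.

from scipy.stats import fisher_exact

N = 52  # number of case reports analyzed
# (ChatGPT-4 correct, physicians correct) at each evaluation level, taken from the abstract
comparisons = {
    "top 10 list": (43, 39),
    "top 5 list": (42, 35),
    "top diagnosis": (31, 26),
}

for label, (gpt4, phys) in comparisons.items():
    table = [[gpt4, N - gpt4], [phys, N - phys]]  # rows: ChatGPT-4, physicians; columns: correct, incorrect
    _, p = fisher_exact(table)  # assumed test for illustration only
    print(f"{label}: ChatGPT-4 {gpt4}/{N} ({gpt4/N:.0%}) vs physicians {phys}/{N} ({phys/N:.0%}), P={p:.2f}")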
This study demonstrates the potential diagnostic accuracy of differential diagnosis lists generated using ChatGPT-3.5 and ChatGPT-4 for complex clinical vignettes from case reports published by the GIM department. The rate of correct diagnoses within the top 10 and top 5 differential diagnosis lists generated by ChatGPT-4 exceeded 80%. Although derived from a limited data set of case reports from a single department, our findings highlight the potential utility of ChatGPT-4 as a supplementary tool for physicians, particularly for those affiliated with GIM departments. Further investigations should explore the diagnostic accuracy of ChatGPT by using distinct case materials beyond its training data. Such efforts will provide a comprehensive insight into the role of artificial intelligence in enhancing clinical decision-making.