Hirosawa Takanobu, Harada Yukinori, Mizuta Kazuya, Sakamoto Tetsu, Tokumasu Kazuki, Shimizu Taro
Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan.
Department of General Medicine, Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama University, Okayama, Japan.
Digit Health. 2024 Jul 21;10:20552076241265215. doi: 10.1177/20552076241265215. eCollection 2024 Jan-Dec.
The diagnostic performance of generative artificial intelligences (AIs) using large language models (LLMs) across comprehensive medical specialties remains unknown.
We aimed to evaluate the diagnostic performance of generative AIs using LLMs in complex case series across comprehensive medical fields.
We analyzed published case reports from January 2022 to March 2023. We excluded pediatric cases and those primarily focused on management. We utilized three generative AIs to generate the top 10 differential-diagnosis (DDx) lists from case descriptions: the fourth-generation chat generative pre-trained transformer (ChatGPT-4), Google Gemini (previously Bard), and the LLM Meta AI 2 (LLaMA2) chatbot. Two independent physicians assessed whether the final diagnosis was included in the lists generated by the AIs.
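For illustration, the sketch below shows how a case description might be submitted to an LLM through an API to obtain a top-10 DDx list. This is a minimal sketch, not the authors' protocol: the prompt wording, the use of the OpenAI Python SDK, and the "gpt-4" model name are all assumptions for the example.

```python
# Minimal sketch: asking an LLM for a top-10 differential-diagnosis list.
# Assumptions (not from the paper): the OpenAI Python SDK, the "gpt-4"
# model name, and this prompt wording.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def top10_ddx(case_description: str) -> str:
    """Ask the model for a ranked top-10 differential-diagnosis (DDx) list."""
    prompt = (
        "Read the following case description and list the 10 most likely "
        "differential diagnoses, ranked from most to least likely.\n\n"
        f"Case: {case_description}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep output as deterministic as possible for evaluation
    )
    return response.choices[0].message.content


# Example usage with a toy vignette (not a case from the study):
# print(top10_ddx("A 45-year-old man presents with fever, night sweats, ..."))
```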
Out of 557 consecutive case reports, 392 were included. The inclusion rates of the final diagnosis within the top 10 DDx lists were 86.7% (340/392) for ChatGPT-4, 68.6% (269/392) for Google Gemini, and 54.6% (214/392) for the LLaMA2 chatbot. The top diagnosis matched the final diagnosis in 54.6% (214/392) of cases for ChatGPT-4, 31.4% (123/392) for Google Gemini, and 23.0% (90/392) for the LLaMA2 chatbot. ChatGPT-4 showed higher diagnostic accuracy than Google Gemini (P < 0.001) and the LLaMA2 chatbot (P < 0.001). Additionally, Google Gemini outperformed the LLaMA2 chatbot within the top 10 DDx lists (P < 0.001) and as the top diagnosis (P = 0.010).
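Because each of the 392 cases yields a paired hit/miss outcome for every model, a test on discordant pairs is a natural way to compute P-values like those above. The sketch below assumes McNemar's exact test; the abstract does not name the test actually used, and the data here are illustrative placeholders, not study data.

```python
# Sketch of a paired model-vs-model comparison on per-case hit/miss outcomes.
# Assumption: McNemar's exact test (the abstract does not name the test used).
from statsmodels.stats.contingency_tables import mcnemar


def compare_models(hits_a: list[bool], hits_b: list[bool]) -> float:
    """McNemar's test for two models scored on the same cases.

    hits_a, hits_b: one boolean per case, True if that model's DDx list
    contained the final diagnosis.
    """
    both = sum(a and b for a, b in zip(hits_a, hits_b))
    only_a = sum(a and not b for a, b in zip(hits_a, hits_b))
    only_b = sum(b and not a for a, b in zip(hits_a, hits_b))
    neither = sum(not a and not b for a, b in zip(hits_a, hits_b))
    # 2x2 table: rows = model A hit/miss, columns = model B hit/miss.
    table = [[both, only_a], [only_b, neither]]
    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    return result.pvalue


# Illustrative usage with fabricated outcomes (not the study data):
# p = compare_models(hits_chatgpt4, hits_gemini)
```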
This study demonstrated the diagnostic performance of three generative AIs: ChatGPT-4, Google Gemini, and the LLaMA2 chatbot. ChatGPT-4 exhibited higher diagnostic accuracy than the other platforms. These findings underscore the importance of understanding the differences in diagnostic performance among generative AIs, especially for complex case series spanning comprehensive medical fields such as general medicine.