Ishida Masanori, Gonoi Wataru, Nyunoya Keisuke, Abe Hiroyuki, Shirota Go, Okimoto Naomasa, Fujimoto Kotaro, Kurokawa Mariko, Nakai Motoki, Saito Kazuhiro, Ushiku Tetsuo, Abe Osamu
Department of Radiology, Tokyo Medical University Hospital, Tokyo, JPN.
Department of Radiology, The University of Tokyo Hospital, Tokyo, JPN.
Cureus. 2024 Aug 20;16(8):e67306. doi: 10.7759/cureus.67306. eCollection 2024 Aug.
This study evaluates the diagnostic performance of the latest large language models (LLMs), GPT-4o (OpenAI, San Francisco, CA, USA) and Claude 3 Opus (Anthropic, San Francisco, CA, USA), in determining causes of death from medical histories and postmortem CT findings.
We included 100 adult cases whose postmortem CT scans were diagnosable for the causes of death using the gold standard of autopsy results. Their medical histories and postmortem CT findings were compiled, and clinical and imaging diagnoses of both the underlying and immediate causes of death, as well as their personal information, were carefully separated from the database to be shown to the LLMs. Both GPT-4o and Claude 3 Opus generated the top three differential diagnoses for each of the underlying or immediate causes of death based on the following three prompts: 1) medical history only; 2) postmortem CT findings only; and 3) both medical history and postmortem CT findings. The diagnostic performance of the LLMs was compared using McNemar's test.
For the underlying cause of death, GPT-4o achieved primary diagnostic accuracy rates of 78%, 72%, and 78%, while Claude 3 Opus achieved 72%, 56%, and 75% for prompts 1, 2, and 3, respectively. Including any of the top three differential diagnoses, GPT-4o's accuracy rates were 92%, 90%, and 92%, while Claude 3 Opus's rates were 93%, 69%, and 93% for prompts 1, 2, and 3, respectively. For the immediate cause of death, GPT-4o's primary diagnostic accuracy rates were 55%, 58%, and 62%, while Claude 3 Opus's rates were 60%, 62%, and 63% for prompts 1,2, and 3, respectively. For any of the top three differential diagnoses, GPT-4o's accuracy rates were 88% for prompt 1 and 91% for prompts 2 and 3, whereas Claude 3 Opus's rates were 92% for all three prompts. Significant differences between the models were observed for prompt two in diagnosing the underlying cause of death (p = 0.03 and <0.01 for the primary and top three differential diagnoses, respectively).
Both GPT-4o and Claude 3 Opus demonstrated relatively high performance in diagnosing both the underlying and immediate causes of death using medical histories and postmortem CT findings.
本研究评估了最新的大语言模型GPT-4o(美国加利福尼亚州旧金山OpenAI公司)和Claude 3 Opus(美国加利福尼亚州旧金山Anthropic公司)根据病史和尸检CT结果确定死亡原因的诊断性能。
我们纳入了100例成年病例,其尸检CT扫描结果可根据尸检结果的金标准诊断出死亡原因。整理了他们的病史和尸检CT结果,并将死亡的潜在原因和直接原因的临床及影像诊断以及他们的个人信息从数据库中仔细分离出来,展示给大语言模型。GPT-4o和Claude 3 Opus根据以下三个提示,针对每个潜在或直接死亡原因生成了前三种鉴别诊断:1)仅病史;2)仅尸检CT结果;3)病史和尸检CT结果。使用McNemar检验比较大语言模型的诊断性能。
对于潜在死亡原因,GPT-4o在提示1、2和3下的初步诊断准确率分别为78%、72%和78%,而Claude 3 Opus的准确率分别为72%、56%和75%。包括任何一种前三种鉴别诊断,GPT-4o在提示1、2和3下的准确率分别为92%、90%和92%,而Claude 3 Opus的准确率分别为93%、69%和93%。对于直接死亡原因,GPT-4o在提示1、2和3下的初步诊断准确率分别为55%、58%和62%,而Claude 3 Opus的准确率分别为60%、62%和63%。对于任何一种前三种鉴别诊断,GPT-4o在提示1下的准确率为88%,在提示2和3下的准确率为91%,而Claude 3 Opus在所有三种提示下的准确率均为92%。在诊断潜在死亡原因的提示2方面,观察到模型之间存在显著差异(初步诊断和前三种鉴别诊断的p值分别为0.03和<0.01)。
GPT-4o和Claude 3 Opus在使用病史和尸检CT结果诊断潜在和直接死亡原因方面均表现出相对较高的性能。