Takita Hirotaka, Walston Shannon L, Mitsuyama Yasuhito, Watanabe Ko, Ishimaru Shoya, Ueda Daiju
Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-Machi, Abeno-ku, Osaka, 545-8585, Japan.
Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-Machi, Abeno-ku, Osaka, 545-8585, Japan.
Jpn J Radiol. 2025 May 14. doi: 10.1007/s11604-025-01799-1.
To compare the diagnostic performance of three proprietary large language models (LLMs), Claude, GPT, and Gemini, in structuring free-text Japanese radiology reports for intracranial hemorrhage and skull fractures, and to assess the impact of three different prompting approaches on model accuracy.
In this retrospective study, head CT reports from the Japan Medical Imaging Database between 2018 and 2023 were collected. Two board-certified radiologists established the ground truth for intracranial hemorrhage and skull fractures through independent review and consensus. Each radiology report was analyzed by the three LLMs using three prompting strategies: Standard, Chain-of-Thought, and Self-Consistency prompting. Diagnostic performance (accuracy, precision, recall, and F1-score) was calculated for each LLM-prompt combination and compared using McNemar tests with Bonferroni correction. Misclassified cases underwent qualitative error analysis.
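To make the evaluation concrete, the following is a minimal Python sketch of such a pipeline, not the authors' code: the prompt wording, the hypothetical `call_llm(model, prompt)` helper, and the use of scikit-learn and SciPy are assumptions for illustration. Self-Consistency is sketched in its usual form (sampling several reasoning paths and taking a majority vote), and the McNemar comparison uses the exact binomial form on discordant pairs.

```python
from itertools import combinations
from scipy.stats import binomtest
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative prompt templates for the three strategies (wording assumed, not from the paper).
PROMPTS = {
    "standard": (
        "Does the following head CT report describe an intracranial hemorrhage? "
        "Answer yes or no.\n\nReport:\n{report}"
    ),
    "chain_of_thought": (
        "Read the following head CT report, reason step by step about the findings, "
        "then answer yes or no: does it describe an intracranial hemorrhage?\n\nReport:\n{report}"
    ),
    # Self-Consistency reuses a step-by-step prompt; the difference is repeated sampling + voting.
    "self_consistency": (
        "Read the following head CT report, reason step by step about the findings, "
        "then answer yes or no: does it describe an intracranial hemorrhage?\n\nReport:\n{report}"
    ),
}

def classify(model, strategy, report, call_llm, n_samples=5):
    """Binary prediction for one report under one model/strategy pair.
    `call_llm` is a hypothetical client returning the model's text answer."""
    prompt = PROMPTS[strategy].format(report=report)
    if strategy == "self_consistency":
        votes = [call_llm(model, prompt).strip().lower().startswith("yes")
                 for _ in range(n_samples)]
        return int(sum(votes) > n_samples / 2)  # majority vote over sampled answers
    return int(call_llm(model, prompt).strip().lower().startswith("yes"))

def diagnostic_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for one LLM-prompt combination."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

def mcnemar_p(y_true, pred_a, pred_b):
    """Exact McNemar test: binomial test on the two discordant-pair counts."""
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))  # A right, B wrong
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))  # A wrong, B right
    if b + c == 0:
        return 1.0
    return binomtest(b, b + c, 0.5).pvalue

def compare_all(y_true, predictions):
    """Pairwise McNemar tests across combinations, with a Bonferroni-corrected alpha."""
    pairs = list(combinations(predictions, 2))
    alpha = 0.05 / len(pairs)  # Bonferroni correction for multiple comparisons
    return {(a, b): (mcnemar_p(y_true, predictions[a], predictions[b]), alpha)
            for a, b in pairs}
```

In this sketch, `predictions` would map each of the nine LLM-prompt combinations to its list of binary outputs over the same reports, so every pairwise p-value is compared against the same corrected threshold.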
A total of 3949 head CT reports from 3949 patients (mean age 59 ± 25 years, 56.2% male) were enrolled. Across all institutions, 856 patients (21.6%) had intracranial hemorrhage and 264 patients (6.6%) had skull fractures. All nine LLM-prompt combinations achieved very high accuracy. Claude demonstrated significantly higher accuracy for intracranial hemorrhage than GPT and Gemini, and also outperformed Gemini for skull fractures (p < 0.0001). Gemini's performance improved notably with Chain-of-Thought prompting. Error analysis revealed common challenges, including ambiguous phrases and findings unrelated to intracranial hemorrhage or skull fractures, underscoring the importance of careful prompt design.
All three proprietary LLMs exhibited strong performance in structuring free-text head CT reports for intracranial hemorrhage and skull fractures. While the choice of prompting method influenced accuracy, all models demonstrated robust potential for clinical and research applications. Future work should refine the prompts and validate these approaches in prospective, multilingual settings.