Bulut Bensu, Öz Medine Akkan, Genç Murat, Gür Ayşenur, Yortanlı Mehmet, Yortanlı Betül Çiğdem, Sariyildiz Oguz, Yazıcı Ramiz, Mutlu Hüseyin, Kotanoglu Mustafa Sirri, Cinar Eray, Uykan Zekeriya
Department of Emergency Medicine, Ankara Gulhane Training and Research Hospital, Health Science University, Ankara, Turkey.
Department of Emergency Medicine, Ankara Training and Research Hospital, Ankara, Turkey.
PLoS One. 2025 Sep 12;20(9):e0331962. doi: 10.1371/journal.pone.0331962. eCollection 2025.
This study evaluates the diagnostic performance of three multimodal large language models (LLMs), ChatGPT-4o, Gemini 2.0, and Claude 3.5, in identifying pneumothorax from chest radiographs.
In this retrospective analysis, 172 pneumothorax cases (148 patients aged >12 years, 24 patients aged ≤12 years) with both chest radiographs and confirmatory thoracic CT were included from a tertiary emergency department. Patients were categorized by age and pneumothorax size (small/large). Each radiograph was presented to all three LLMs together with basic symptoms (dyspnea or chest pain), and each model analyzed each image three times. Diagnostic accuracy was evaluated at three thresholds: overall accuracy (all three responses correct), strict accuracy (≥2 responses correct), and ideal accuracy (≥1 response correct); response consistency was assessed with Fleiss' kappa.
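The three accuracy thresholds defined above can be sketched directly from per-image response data. This is a minimal illustration, not the authors' analysis code; the response tuples below are invented for the example (True means the model's response correctly identified the pneumothorax).

```python
# Each entry holds the three repeated responses a model gave for one
# radiograph (True = pneumothorax correctly identified). Illustrative
# data only, not values from the study.
responses = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (False, False, False),
]

def overall_accuracy(results):
    # "Overall": all three responses for an image must be correct.
    return sum(all(r) for r in results) / len(results)

def strict_accuracy(results):
    # "Strict": at least two of the three responses are correct.
    return sum(sum(r) >= 2 for r in results) / len(results)

def ideal_accuracy(results):
    # "Ideal": at least one of the three responses is correct.
    return sum(any(r) for r in results) / len(results)

print(overall_accuracy(responses))  # 0.25
print(strict_accuracy(responses))   # 0.5
print(ideal_accuracy(responses))    # 0.75
```

By construction, overall accuracy ≤ strict accuracy ≤ ideal accuracy for any dataset, which matches how the three thresholds progressively relax the requirement.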
In patients older than 12 years, ChatGPT-4o demonstrated the highest overall accuracy (69.6%), followed by Claude 3.5 (64.9%) and Gemini 2.0 (57.4%). Performance was significantly poorer in pediatric patients across all models (20.8%, 12.5%, and 20.8%, respectively). For large pneumothorax in adults, ChatGPT-4o showed significantly higher accuracy compared to small pneumothorax (81.6% vs. 42.2%; p < 0.001). Regarding consistency, Gemini 2.0 demonstrated excellent reliability for large pneumothorax (Kappa = 1.00), while Claude 3.5 showed moderate consistency across both pneumothorax sizes.
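Fleiss' kappa, used above to quantify how consistently a model answered across its three repeated reads of the same image, can be computed from first principles. The sketch below is a generic implementation under the standard formula (per-subject agreement minus chance agreement, normalized), not the study's code; the rating tuples are hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: one tuple per subject, one rating
    per rater (here, per repeated model response)."""
    k = len(ratings[0])   # raters (repeated responses) per subject
    n = len(ratings)      # number of subjects (radiographs)
    p_bar = 0.0
    counts_total = Counter()
    for subject in ratings:
        c = Counter(subject)
        counts_total.update(c)
        # Per-subject agreement: proportion of agreeing rater pairs.
        p_bar += (sum(v * v for v in c.values()) - k) / (k * (k - 1))
    p_bar /= n
    # Chance agreement from marginal category proportions.
    p_e = sum((counts_total[cat] / (n * k)) ** 2 for cat in categories)
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement across all subjects yields kappa = 1.0, matching the
# value reported for Gemini 2.0 on large pneumothorax.
print(fleiss_kappa([(1, 1, 1), (0, 0, 0)], [0, 1]))  # 1.0
```

A kappa of 1.00 therefore means the model gave the identical answer on all three reads of every image in that subgroup, while "moderate" consistency corresponds to intermediate kappa values.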
This study, the first to evaluate these three current multimodal LLMs in pneumothorax identification across different age groups, demonstrates promising results for potential clinical applications, particularly for adult patients with large pneumothorax. However, performance limitations in pediatric cases and with small pneumothoraces highlight the need for further validation before clinical implementation.