

New frontiers in radiologic interpretation: evaluating the effectiveness of large language models in pneumothorax diagnosis.

Authors

Bulut Bensu, Öz Medine Akkan, Genç Murat, Gür Ayşenur, Yortanlı Mehmet, Yortanlı Betül Çiğdem, Sariyildiz Oguz, Yazıcı Ramiz, Mutlu Hüseyin, Kotanoglu Mustafa Sirri, Cinar Eray, Uykan Zekeriya

Affiliations

Department of Emergency Medicine, Ankara Gulhane Training and Research Hospital, Health Science University, Ankara, Turkey.

Department of Emergency Medicine, Ankara Training and Research Hospital, Ankara, Turkey.

Publication

PLoS One. 2025 Sep 12;20(9):e0331962. doi: 10.1371/journal.pone.0331962. eCollection 2025.

DOI: 10.1371/journal.pone.0331962
PMID: 40938938
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12431401/
Abstract

BACKGROUND

This study evaluates the diagnostic performance of three multimodal large language models (LLMs), ChatGPT-4o, Gemini 2.0, and Claude 3.5, in identifying pneumothorax from chest radiographs.

METHODS

In this retrospective analysis, 172 pneumothorax cases (148 patients aged >12 years, 24 patients aged ≤12 years) with both chest radiographs and confirmatory thoracic CT were included from a tertiary emergency department. Patients were categorized by age and pneumothorax size (small/large). Each radiograph was presented to all three LLMs accompanied by basic symptoms (dyspnea or chest pain), with each model analyzing each image three times. Diagnostic accuracy was evaluated using overall accuracy (all three responses correct), strict accuracy (≥2 responses correct), and ideal accuracy (≥1 response correct), alongside response consistency assessment using Fleiss' Kappa.
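The three accuracy tiers are easy to misread, so here is a minimal sketch of how they could be computed from the three repeated responses per case. This is a hypothetical illustration, not the authors' analysis code; the `tier_accuracies` helper and the toy data are assumptions.

```python
def tier_accuracies(responses_per_case):
    """responses_per_case: list of 3-tuples of booleans (True = correct response).

    Returns (overall, strict, ideal) accuracy as defined in the Methods:
    overall = all three responses correct, strict = at least two correct,
    ideal = at least one correct.
    """
    n = len(responses_per_case)
    overall = sum(all(r) for r in responses_per_case) / n       # all 3 correct
    strict = sum(sum(r) >= 2 for r in responses_per_case) / n   # >= 2 correct
    ideal = sum(any(r) for r in responses_per_case) / n         # >= 1 correct
    return overall, strict, ideal

# Toy example with four cases:
cases = [(True, True, True), (True, True, False),
         (True, False, False), (False, False, False)]
print(tier_accuracies(cases))  # (0.25, 0.5, 0.75)
```

By construction, overall accuracy can never exceed strict accuracy, which can never exceed ideal accuracy.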

RESULTS

In patients older than 12 years, ChatGPT-4o demonstrated the highest overall accuracy (69.6%), followed by Claude 3.5 (64.9%) and Gemini 2.0 (57.4%). Performance was significantly poorer in pediatric patients across all models (20.8%, 12.5%, and 20.8%, respectively). For large pneumothorax in adults, ChatGPT-4o showed significantly higher accuracy compared to small pneumothorax (81.6% vs. 42.2%; p < 0.001). Regarding consistency, Gemini 2.0 demonstrated excellent reliability for large pneumothorax (Kappa = 1.00), while Claude 3.5 showed moderate consistency across both pneumothorax sizes.
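The consistency values above come from Fleiss' kappa over the three repeated responses per image. A minimal sketch of the standard Fleiss' kappa computation follows; the `fleiss_kappa` helper name and the toy table are assumptions, not the authors' code.

```python
def fleiss_kappa(table):
    """table[i][j] = number of the n ratings of subject i falling in category j.

    Here a subject is one radiograph, the n = 3 'raters' are the model's
    three repeated responses, and the categories are the possible answers
    (e.g. pneumothorax present / absent).
    """
    N = len(table)          # number of subjects (images)
    n = sum(table[0])       # ratings per subject (3 repeats here)
    k = len(table[0])       # number of categories
    # Mean observed per-subject agreement.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in table) / N
    # Expected chance agreement from the marginal category proportions.
    p = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# Toy table: four images, three repeats each, two answer categories.
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(table))  # ~0.333
```

A kappa of 1.00, as reported for Gemini 2.0 on large pneumothorax, means the three repeated responses agreed on every image in that subgroup.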

CONCLUSION

This study, the first to evaluate these three current multimodal LLMs in pneumothorax identification across different age groups, demonstrates promising results for potential clinical applications, particularly for adult patients with large pneumothorax. However, performance limitations in pediatric cases and with small pneumothoraces highlight the need for further validation before clinical implementation.


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3dfa/12431401/e91f53a924f0/pone.0331962.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3dfa/12431401/c0ce38f263b1/pone.0331962.g002.jpg

