Accuracy of large language models in generating differential diagnosis from clinical presentation and imaging findings in pediatric cases.

Authors

Jung Jinho, Phillipi Michael, Tran Bryant, Chen Kasha, Chan Nathan, Ho Erwin, Sun Shawn, Houshyar Roozbeh

Affiliations

University of California, Irvine, 101 The City Drive South, Rt. 140, 5005, Orange, CA 92868, USA.

California University of Science and Medicine, Colton, USA.

Publication

Pediatr Radiol. 2025 Jul 12. doi: 10.1007/s00247-025-06317-z.

DOI: 10.1007/s00247-025-06317-z
PMID: 40650735
Abstract

BACKGROUND

Large language models (LLMs) have shown promise in assisting medical decision-making. However, there is limited literature exploring the diagnostic accuracy of LLMs in generating differential diagnoses from text-based image descriptions and clinical presentations in pediatric radiology.

OBJECTIVE

To examine the performance of multiple proprietary LLMs in producing accurate differential diagnoses for text-based pediatric radiological cases without imaging.

MATERIALS AND METHODS

One hundred sixty-four cases were retrospectively selected from a pediatric radiology textbook and converted into two formats: (1) image description only, and (2) image description with clinical presentation. These inputs were given to ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro, which were tasked with providing a top-1 diagnosis and a top-3 differential. Accuracy of responses was assessed by comparison with the original literature: top-1 accuracy was defined as whether the top-1 diagnosis matched the textbook, and top-3 differential accuracy as the number of diagnoses in the model-generated top-3 differential that matched any of the textbook's top 3 diagnoses. McNemar's test, Cochran's Q test, the Friedman test, and the Wilcoxon signed-rank test were used to compare the algorithms and to assess the impact of added clinical information.
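The per-case scoring rules above can be sketched in code. This is an illustrative reduction only, assuming diagnoses are comparable as normalized strings (the study compared model output against the textbook by expert review; the function name and exact-match logic here are hypothetical):

```python
def score_case(model_top3, textbook_top3):
    """Score one case under the study's two metrics:
    top-1 accuracy: does the model's first diagnosis match the
    textbook's first diagnosis? (0 or 1)
    top-3 differential accuracy: how many model diagnoses appear
    anywhere in the textbook's top 3? (0 to 3)"""
    norm = lambda d: d.strip().lower()
    top1 = int(norm(model_top3[0]) == norm(textbook_top3[0]))
    top3_overlap = len({norm(d) for d in model_top3} &
                       {norm(d) for d in textbook_top3})
    return top1, top3_overlap

# First diagnosis matches; two of three model diagnoses appear
# in the textbook's top 3.
score_case(
    ["intussusception", "appendicitis", "volvulus"],
    ["intussusception", "malrotation", "appendicitis"],
)  # (1, 2)
```

Summing these per-case scores over the 164 cases yields the proportions and mean overlap counts reported in the Results.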

RESULTS

There was no significant difference in top-1 accuracy between ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro when only image descriptions were provided (56.1% [95% CI 48.4-63.5], 64.6% [95% CI 57.1-71.5], and 61.6% [95% CI 54.0-68.7]; P = 0.11). Adding clinical presentation to the image description significantly improved top-1 accuracy for ChatGPT-4V (64.0% [95% CI 56.4-71.0], P = 0.02) and Claude 3.5 Sonnet (80.5% [95% CI 73.8-85.8], P < 0.001). On cases with both image description and clinical presentation, Claude 3.5 Sonnet significantly outperformed both ChatGPT-4V and Gemini 1.5 Pro (P < 0.001). For top-3 differential accuracy, no significant differences were observed between the three models, whether the cases included only image descriptions (1.29 [95% CI 1.16-1.41], 1.35 [95% CI 1.23-1.48], and 1.37 [95% CI 1.25-1.49]; P = 0.60) or both image descriptions and clinical presentations (1.33 [95% CI 1.20-1.45], 1.52 [95% CI 1.41-1.64], and 1.48 [95% CI 1.36-1.59]; P = 0.72). On this metric, only Claude 3.5 Sonnet improved significantly when clinical presentation was added (P < 0.001).
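The paired comparisons above rest on McNemar's test, which looks only at discordant pairs: cases where one model (or input condition) is correct and the other is wrong. A minimal stdlib sketch of the exact (binomial) form, assuming the discordant counts b and c have already been tallied from the 164 paired cases (the study's actual software and test variant are not stated in the abstract):

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts:
    b = cases correct under condition A only,
    c = cases correct under condition B only.
    Concordant cases (both right or both wrong) do not enter the test."""
    n = b + c
    if n == 0:
        return 1.0
    # Binomial tail P(X <= min(b, c)) under the null p = 0.5, doubled.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)

# Heavily unbalanced discordant counts give a small p-value.
p = mcnemar_exact(3, 20)
```

With counts of 3 vs. 20 this returns p < 0.001, the kind of imbalance behind the significant gains reported for Claude 3.5 Sonnet.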

CONCLUSION

Commercial LLMs performed similarly on pediatric radiology cases in top-1 accuracy and top-3 differential accuracy when only a text-based image description was used. Adding clinical presentation significantly improved top-1 accuracy for ChatGPT-4V and Claude 3.5 Sonnet, with Claude showing the largest improvement. Claude 3.5 Sonnet outperformed both ChatGPT-4V and Gemini 1.5 Pro in top-1 accuracy when both image and clinical data were provided. No significant differences were found in top-3 differential accuracy across models in any condition.

Similar Articles

1. An Institutional Large Language Model for Musculoskeletal MRI Improves Protocol Adherence and Accuracy.
J Bone Joint Surg Am. 2025 Jul 8. doi: 10.2106/JBJS.24.01429.
2. Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency.
Eur Radiol Exp. 2025 Aug 19;9(1):79. doi: 10.1186/s41747-025-00591-0.
3. Enhancing the Readability of Online Patient Education Materials Using Large Language Models: Cross-Sectional Study.
J Med Internet Res. 2025 Jun 4;27:e69955. doi: 10.2196/69955.
4. Information from digital and human sources: A comparison of chatbot and clinician responses to orthodontic questions.
Am J Orthod Dentofacial Orthop. 2025 May 6. doi: 10.1016/j.ajodo.2025.04.008.
5. Synthetic Patient-Physician Conversations Simulated by Large Language Models: A Multi-Dimensional Evaluation.
Sensors (Basel). 2025 Jul 10;25(14):4305. doi: 10.3390/s25144305.
6. Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3.
Eye (Lond). 2025 Apr;39(6):1132-1137. doi: 10.1038/s41433-024-03545-9. Epub 2024 Dec 17.
7. Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses.
JAMA Netw Open. 2025 May 1;8(5):e2512994. doi: 10.1001/jamanetworkopen.2025.12994.
8. Prescription of Controlled Substances: Benefits and Risks.
9. Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5 edition.
Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.
