Jung Jinho, Phillipi Michael, Tran Bryant, Chen Kasha, Chan Nathan, Ho Erwin, Sun Shawn, Houshyar Roozbeh
University of California, Irvine, 101 The City Drive South, Rt. 140, 5005, Orange, CA 92868, USA.
California University of Science and Medicine, Colton, USA.
Pediatr Radiol. 2025 Jul 12. doi: 10.1007/s00247-025-06317-z.
Large language models (LLMs) have shown promise in assisting medical decision-making. However, there is limited literature on the diagnostic accuracy of LLMs in generating differential diagnoses from text-based image descriptions and clinical presentations in pediatric radiology.
To examine the performance of multiple proprietary LLMs in producing accurate differential diagnoses for text-based pediatric radiology cases without accompanying images.
One hundred sixty-four cases were retrospectively selected from a pediatric radiology textbook and converted into two formats: (1) image description only, and (2) image description with clinical presentation. ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro were given these inputs and tasked with providing a top 1 diagnosis and a top 3 differential diagnosis. Accuracy of responses was assessed by comparison with the original literature. Top 1 accuracy was defined as whether the top 1 diagnosis matched the textbook, and top 3 differential accuracy was defined as the number of diagnoses in the model-generated top 3 differential that matched any of the top 3 diagnoses in the textbook. Cochran's Q and Friedman tests were used to compare the models, and McNemar's and Wilcoxon signed-rank tests were used to assess the impact of added clinical information.
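The sketch below illustrates one way the scoring scheme and the four statistical tests described above could be implemented. It is not the authors' code: the normalization-based string matching stands in for the study's comparison against the textbook answers, and the score arrays are random placeholders standing in for the per-case results.

```python
"""Illustrative sketch of the evaluation described above (assumed, not the
authors' published code): score each model's top-1 and top-3 answers against
the textbook, then apply the four statistical tests named in the abstract."""
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

def normalize(dx: str) -> str:
    # Crude surrogate for the matching against textbook diagnoses.
    return dx.strip().lower()

def top1_correct(model_top1: str, textbook_top1: str) -> int:
    # Binary top-1 accuracy for a single case.
    return int(normalize(model_top1) == normalize(textbook_top1))

def top3_overlap(model_top3: list[str], textbook_top3: list[str]) -> int:
    # Number of model diagnoses (0-3) appearing anywhere in the textbook top 3.
    ref = {normalize(d) for d in textbook_top3}
    return sum(normalize(d) in ref for d in model_top3)

# Hypothetical per-case score arrays: rows = 164 cases, columns = models
# (ChatGPT-4V, Claude 3.5 Sonnet, Gemini 1.5 Pro). Random data as placeholder.
rng = np.random.default_rng(0)
top1 = rng.integers(0, 2, size=(164, 3))   # binary top-1 correctness
top3 = rng.integers(0, 4, size=(164, 3))   # 0-3 overlap counts

# Compare the three models: Cochran's Q for binary top-1 accuracy,
# Friedman test for the ordinal top-3 overlap scores.
print(cochrans_q(top1))
print(friedmanchisquare(top3[:, 0], top3[:, 1], top3[:, 2]))

# Effect of adding clinical presentation (paired conditions, per model):
# McNemar's test for binary top-1, Wilcoxon signed-rank for top-3 scores.
top1_with_clinical = rng.integers(0, 2, size=164)
table = np.zeros((2, 2), dtype=int)
for a, b in zip(top1[:, 0], top1_with_clinical):
    table[a, b] += 1                        # 2x2 paired contingency table
print(mcnemar(table, exact=True))
top3_with_clinical = rng.integers(0, 4, size=164)
print(wilcoxon(top3[:, 0], top3_with_clinical))
```

The pairing of tests with outcomes follows their standard use: Cochran's Q and McNemar's test apply to binary correctness, while the Friedman and Wilcoxon signed-rank tests handle the ordinal 0-3 overlap scores.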
There was no significant difference in top 1 accuracy between ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro when only image descriptions were provided (56.1% [95% CI 48.4-63.5], 64.6% [95% CI 57.1-71.5], 61.6% [95% CI 54.0-68.7]; P = 0.11). Adding clinical presentation to image description significantly improved top 1 accuracy for ChatGPT-4V (64.0% [95% CI 56.4-71.0], P = 0.02) and Claude 3.5 Sonnet (80.5% [95% CI 73.8-85.8], P < 0.001). For cases with both image description and clinical presentation, Claude 3.5 Sonnet significantly outperformed both ChatGPT-4V and Gemini 1.5 Pro (P < 0.001). For top 3 differential accuracy, no significant differences were observed between ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro, regardless of whether the cases included only image descriptions (1.29 [95% CI 1.16-1.41], 1.35 [95% CI 1.23-1.48], 1.37 [95% CI 1.25-1.49]; P = 0.60) or both image descriptions and clinical presentations (1.33 [95% CI 1.20-1.45], 1.52 [95% CI 1.41-1.64], 1.48 [95% CI 1.36-1.59]; P = 0.72). For top 3 differential accuracy, only Claude 3.5 Sonnet improved significantly when clinical presentation was added (P < 0.001).
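The abstract does not state how the 95% confidence intervals were computed. A Wilson score interval reproduces the reported bounds under the assumption that, e.g., Claude 3.5 Sonnet's 64.6% corresponds to 106 of 164 correct cases (an inference from the percentages, not a figure stated above):

```python
# Minimal sketch: Wilson score 95% CI for a proportion. The count 106/164
# is an assumed back-calculation from the reported 64.6%, not a stated figure.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=106, nobs=164, alpha=0.05, method="wilson")
print(f"{106/164:.1%} [95% CI {low:.1%}-{high:.1%}]")
# -> 64.6% [95% CI 57.1%-71.5%], matching the reported interval
```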
Commercial LLMs performed similarly in top 1 accuracy and top 3 differential accuracy on pediatric radiology cases when only a text-based image description was used. Adding clinical presentation significantly improved top 1 accuracy for ChatGPT-4V and Claude 3.5 Sonnet, with Claude showing the largest improvement. Claude 3.5 Sonnet outperformed both ChatGPT-4V and Gemini 1.5 Pro in top 1 accuracy when both image and clinical data were provided. No significant differences were found in top 3 differential accuracy across models in any condition.