Jung Jinho, Phillipi Michael, Tran Bryant, Chen Kasha, Chan Nathan, Ho Erwin, Sun Shawn, Houshyar Roozbeh
University of California, Irvine, 101 The City Drive South, Rt. 140, 5005, Orange, CA 92868, USA.
California University of Science and Medicine, Colton, USA.
Pediatr Radiol. 2025 Jul 12. doi: 10.1007/s00247-025-06317-z.
Large language models (LLMs) have shown promise in assisting medical decision-making. However, there is limited literature on the diagnostic accuracy of LLMs in generating differential diagnoses from text-based image descriptions and clinical presentations in pediatric radiology.
To examine the performance of multiple proprietary LLMs in producing accurate differential diagnoses for text-based pediatric radiology cases without accompanying images.
One hundred sixty-four cases were retrospectively selected from a pediatric radiology textbook and converted into two formats: (1) image description only, and (2) image description with clinical presentation. ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro were given these inputs and tasked with providing a top 1 diagnosis and a top 3 differential diagnosis. Accuracy of responses was assessed by comparison with the original literature. Top 1 accuracy was defined as whether the top 1 diagnosis matched the textbook, and top 3 differential accuracy was defined as the number of diagnoses in the model-generated top 3 differential that matched any of the top 3 diagnoses in the textbook. Cochran's Q and Friedman tests were used to compare the models, and McNemar's and Wilcoxon signed-rank tests were used to assess the impact of added clinical information.
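The sketch below illustrates one way the scoring scheme and the four statistical tests described above could be implemented. It is not the authors' code: the normalization-based string matching stands in for the study's comparison against the textbook answers, and the score arrays are random placeholders standing in for the per-case results.

```python
"""Illustrative sketch of the evaluation described above (assumed, not the
authors' published code): score each model's top-1 and top-3 answers against
the textbook, then apply the four statistical tests named in the abstract."""
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

def normalize(dx: str) -> str:
    # Crude surrogate for the matching against textbook diagnoses.
    return dx.strip().lower()

def top1_correct(model_top1: str, textbook_top1: str) -> int:
    # Binary top-1 accuracy for a single case.
    return int(normalize(model_top1) == normalize(textbook_top1))

def top3_overlap(model_top3: list[str], textbook_top3: list[str]) -> int:
    # Number of model diagnoses (0-3) appearing anywhere in the textbook top 3.
    ref = {normalize(d) for d in textbook_top3}
    return sum(normalize(d) in ref for d in model_top3)

# Hypothetical per-case score arrays: rows = 164 cases, columns = models
# (ChatGPT-4V, Claude 3.5 Sonnet, Gemini 1.5 Pro). Random data as placeholder.
rng = np.random.default_rng(0)
top1 = rng.integers(0, 2, size=(164, 3))   # binary top-1 correctness
top3 = rng.integers(0, 4, size=(164, 3))   # 0-3 overlap counts

# Compare the three models: Cochran's Q for binary top-1 accuracy,
# Friedman test for the ordinal top-3 overlap scores.
print(cochrans_q(top1))
print(friedmanchisquare(top3[:, 0], top3[:, 1], top3[:, 2]))

# Effect of adding clinical presentation (paired conditions, per model):
# McNemar's test for binary top-1, Wilcoxon signed-rank for top-3 scores.
top1_with_clinical = rng.integers(0, 2, size=164)
table = np.zeros((2, 2), dtype=int)
for a, b in zip(top1[:, 0], top1_with_clinical):
    table[a, b] += 1                        # 2x2 paired contingency table
print(mcnemar(table, exact=True))
top3_with_clinical = rng.integers(0, 4, size=164)
print(wilcoxon(top3[:, 0], top3_with_clinical))
```

The pairing of tests with outcomes follows their standard use: Cochran's Q and McNemar's test apply to binary correctness, while the Friedman and Wilcoxon signed-rank tests handle the ordinal 0-3 overlap scores.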
There was no significant difference in top 1 accuracy between ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro when only image descriptions were provided (56.1% [95% CI 48.4-63.5], 64.6% [95% CI 57.1-71.5], 61.6% [95% CI 54.0-68.7]; P = 0.11). Adding clinical presentation to image description significantly improved top 1 accuracy for ChatGPT-4V (64.0% [95% CI 56.4-71.0], P = 0.02) and Claude 3.5 Sonnet (80.5% [95% CI 73.8-85.8], P < 0.001). For cases with both image description and clinical presentation, Claude 3.5 Sonnet significantly outperformed both ChatGPT-4V and Gemini 1.5 Pro (P < 0.001). For top 3 differential accuracy, no significant differences were observed between ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro, regardless of whether the cases included only image descriptions (1.29 [95% CI 1.16-1.41], 1.35 [95% CI 1.23-1.48], 1.37 [95% CI 1.25-1.49]; P = 0.60) or both image descriptions and clinical presentations (1.33 [95% CI 1.20-1.45], 1.52 [95% CI 1.41-1.64], 1.48 [95% CI 1.36-1.59]; P = 0.72). For top 3 differential accuracy, only Claude 3.5 Sonnet improved significantly when clinical presentation was added (P < 0.001).
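The abstract does not state how the 95% confidence intervals were computed. A Wilson score interval reproduces the reported bounds under the assumption that, e.g., Claude 3.5 Sonnet's 64.6% corresponds to 106 of 164 correct cases (an inference from the percentages, not a figure stated above):

```python
# Minimal sketch: Wilson score 95% CI for a proportion. The count 106/164
# is an assumed back-calculation from the reported 64.6%, not a stated figure.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=106, nobs=164, alpha=0.05, method="wilson")
print(f"{106/164:.1%} [95% CI {low:.1%}-{high:.1%}]")
# -> 64.6% [95% CI 57.1%-71.5%], matching the reported interval
```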
Commercial LLMs performed similarly in top 1 accuracy and top 3 differential accuracy on pediatric radiology cases when only a text-based image description was used. Adding clinical presentation significantly improved top 1 accuracy for ChatGPT-4V and Claude 3.5 Sonnet, with Claude showing the largest improvement. Claude 3.5 Sonnet outperformed both ChatGPT-4V and Gemini 1.5 Pro in top 1 accuracy when both image and clinical data were provided. No significant differences were found in top 3 differential accuracy across models in any condition.