Yasin Celal Güneş, Turay Cesur, Eren Çamur, Leman Günbey Karabekmez
Kırıkkale Yüksek İhtisas Hospital, Clinic of Radiology, Kırıkkale, Türkiye.
Mamak State Hospital, Clinic of Radiology, Ankara, Türkiye.
Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.
This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations in breast radiology, using both text-based and image-based questions.
This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) based on the BI-RADS Atlas, 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations for 100 breast ultrasound images. Correct answers and accuracy by question type were compared using McNemar's and chi-squared tests; management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests.
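For readers who want to reproduce this kind of comparison, the sketch below shows how the statistical tests named above map onto standard Python tooling (SciPy and statsmodels). It is a minimal illustration using made-up per-question data, not the study's actual responses or analysis code.

    # McNemar's test for paired correctness, chi-squared for accuracy by
    # question category, Kruskal-Wallis/Wilcoxon for Likert management
    # scores. All data below are random placeholders.
    import numpy as np
    from scipy.stats import chi2_contingency, kruskal, wilcoxon
    from statsmodels.stats.contingency_tables import mcnemar

    rng = np.random.default_rng(0)
    model_a = rng.integers(0, 2, 100)  # hypothetical 0/1 correctness, responder A
    model_b = rng.integers(0, 2, 100)  # hypothetical 0/1 correctness, responder B

    # Paired 2x2 table over the same 100 MCQs; the off-diagonal cells
    # (discordant answers) drive McNemar's statistic.
    table = np.zeros((2, 2), dtype=int)
    for a, b in zip(model_a, model_b):
        table[a, b] += 1
    print("McNemar P =", mcnemar(table, exact=True).pvalue)

    # Chi-squared: correct/incorrect counts across question categories.
    counts = np.array([[40, 10],   # category 1: correct, incorrect
                       [35, 15]])  # category 2: correct, incorrect
    chi2, p, dof, expected = chi2_contingency(counts)
    print("Chi-squared P =", p)

    # 3-point Likert management scores, compared across responders.
    likert_a = rng.integers(1, 4, 100)
    likert_b = rng.integers(1, 4, 100)
    print("Kruskal-Wallis P =", kruskal(likert_a, likert_b).pvalue)
    print("Wilcoxon P =", wilcoxon(likert_a, likert_b).pvalue)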
Claude 3.5 Sonnet achieved the highest accuracy on text-based MCQs (90%), followed by ChatGPT 4o (89%); both outperformed all other LLMs and the general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). The lowest-performing LLMs were Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across question categories showed no significant variation among LLMs or radiologists (P > 0.05). On breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than the other multimodal LLMs (P < 0.05). Management recommendations were evaluated on a 3-point Likert scale, with Claude 3.5 Sonnet scoring highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories for every model except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly, and ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in those categories (P < 0.05).
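Since each accuracy above is a proportion out of 100 items, sampling uncertainty is worth keeping in mind when comparing models. The snippet below, an illustrative back-of-envelope calculation that is not part of the study, computes Wilson 95% confidence intervals for a few of the reported accuracies.

    # Wilson score intervals for reported accuracies (n = 100 each).
    from statsmodels.stats.proportion import proportion_confint

    reported = [("Claude 3.5 Sonnet, text MCQs", 90),
                ("ChatGPT 4o, text MCQs", 89),
                ("Breast radiologist, text MCQs", 82),
                ("Claude 3.5 Sonnet, ultrasound images", 59)]
    for name, correct in reported:
        lo, hi = proportion_confint(correct, 100, alpha=0.05, method="wilson")
        print(f"{name}: {correct}% (95% CI {lo:.0%}-{hi:.0%})")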
Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses.
This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.