Pristoupil Jakub, Oleaga Laura, Junquero Vanesa, Merino Cristina, Ozbek Suha Sureyya, Lambert Lukas
Department of Imaging Methods, Motol University Hospital and Second Faculty of Medicine, Charles University, Prague, Czech Republic.
Department of Radiology, Clinical Diagnostic Imaging Centre, Hospital Clínic de Barcelona, Barcelona, Spain.
Eur Radiol Exp. 2025 Aug 19;9(1):79. doi: 10.1186/s41747-025-00591-0.
We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.
ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were tested using 52 text-based multiple-response questions from two previous EDiR sessions in two iterations. Chatbots were prompted to evaluate each answer as correct or incorrect and grade its confidence level on a scale of 0 (not confident at all) to 10 (most confident). Scores per question were calculated using a weighted formula that accounted for correct and incorrect answers (range 0.0-1.0).
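The exact weighting formula is not reproduced in the abstract; the following is only a hypothetical sketch of how such a weighted per-question score might be computed for a multiple-response question, with credit for correctly selected answers, a penalty for incorrectly selected ones, and the result clipped to the stated 0.0-1.0 range. The function name and the specific credit/penalty scheme are assumptions, not the authors' formula.

```python
# Hypothetical weighted score for one multiple-response question
# (0.0-1.0 range). The real EDiR formula is not stated in the
# abstract; this illustrates one plausible credit/penalty scheme.

def question_score(marked: set[str], key: set[str], options: set[str]) -> float:
    """Score a question: reward true positives, penalise false positives."""
    n_correct = len(key)                     # answers that should be marked
    n_distractors = len(options - key)       # answers that should not be marked
    credit = len(marked & key) / n_correct               # fraction of correct picks
    penalty = len(marked - key) / max(n_distractors, 1)  # fraction of wrong picks
    return max(0.0, min(1.0, credit - penalty))          # clip to 0.0-1.0
```

For example, marking exactly the key yields 1.0, while marking one correct and one incorrect option out of a four-option question with two correct answers yields 0.0 under this particular (assumed) weighting.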
Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation) compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). Self-reported confidence in answering the questions was highest for Claude 3.5 Sonnet (9.0 ± 0.9), followed by ChatGPT-4o (8.7 ± 1.1), ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6) (p < 0.001). Claude 3.5 Sonnet demonstrated superior consistency, changing responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions, achieving a passing grade on this part of the examination.
Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.
Variation in performance, consistency, and confidence among chatbots in solving EDiR text-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.
Claude 3.5 Sonnet outperformed other chatbots in accuracy and response consistency. ChatGPT-4o ranked second, showing strong but slightly less reliable performance. All chatbots surpassed EDiR candidates in text-based EDiR questions.