Pristoupil Jakub, Oleaga Laura, Junquero Vanesa, Merino Cristina, Ozbek Suha Sureyya, Lambert Lukas
Department of Imaging Methods, Motol University Hospital and Second Faculty of Medicine, Charles University, Prague, Czech Republic.
Department of Radiology, Clinical Diagnostic Imaging Centre, Hospital Clínic de Barcelona, Barcelona, Spain.
Eur Radiol Exp. 2025 Aug 19;9(1):79. doi: 10.1186/s41747-025-00591-0.
We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.
ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were tested using 52 text-based multiple-response questions from two previous EDiR sessions in two iterations. Chatbots were prompted to evaluate each answer as correct or incorrect and grade its confidence level on a scale of 0 (not confident at all) to 10 (most confident). Scores per question were calculated using a weighted formula that accounted for correct and incorrect answers (range 0.0-1.0).
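The exact weighting formula is not reproduced in the abstract; the following is only a hypothetical sketch of how such a weighted per-question score might be computed for a multiple-response question, with credit for correctly selected answers, a penalty for incorrectly selected ones, and the result clipped to the stated 0.0-1.0 range. The function name and the specific credit/penalty scheme are assumptions, not the authors' formula.

```python
# Hypothetical weighted score for one multiple-response question
# (0.0-1.0 range). The real EDiR formula is not stated in the
# abstract; this illustrates one plausible credit/penalty scheme.

def question_score(marked: set[str], key: set[str], options: set[str]) -> float:
    """Score a question: reward true positives, penalise false positives."""
    n_correct = len(key)                     # answers that should be marked
    n_distractors = len(options - key)       # answers that should not be marked
    credit = len(marked & key) / n_correct               # fraction of correct picks
    penalty = len(marked - key) / max(n_distractors, 1)  # fraction of wrong picks
    return max(0.0, min(1.0, credit - penalty))          # clip to 0.0-1.0
```

For example, marking exactly the key yields 1.0, while marking one correct and one incorrect option out of a four-option question with two correct answers yields 0.0 under this particular (assumed) weighting.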
Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation) compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). Self-reported confidence in answering the questions was highest for Claude 3.5 Sonnet (9.0 ± 0.9), followed by ChatGPT-4o (8.7 ± 1.1), ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6) (p < 0.001). Claude 3.5 Sonnet demonstrated superior consistency, changing responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions, achieving a passing grade on this part of the examination.
Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.
Variation in performance, consistency, and confidence among chatbots in solving EDiR text-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.
Claude 3.5 Sonnet outperformed other chatbots in accuracy and response consistency. ChatGPT-4o ranked second, showing strong but slightly less reliable performance. All chatbots surpassed EDiR candidates in text-based EDiR questions.