
Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency.

Author Information

Pristoupil Jakub, Oleaga Laura, Junquero Vanesa, Merino Cristina, Ozbek Suha Sureyya, Lambert Lukas

Affiliations

Department of Imaging Methods, Motol University Hospital and Second Faculty of Medicine, Charles University, Prague, Czech Republic.

Department of Radiology, Clinical Diagnostic Imaging Centre, Hospital Clínic de Barcelona, Barcelona, Spain.

Publication Information

Eur Radiol Exp. 2025 Aug 19;9(1):79. doi: 10.1186/s41747-025-00591-0.

Abstract

BACKGROUND

We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.

METHODS

ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were tested using 52 text-based multiple-response questions from two previous EDiR sessions in two iterations. Chatbots were prompted to evaluate each answer as correct or incorrect and grade its confidence level on a scale of 0 (not confident at all) to 10 (most confident). Scores per question were calculated using a weighted formula that accounted for correct and incorrect answers (range 0.0-1.0).
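The abstract states only that each question was scored on a 0.0-1.0 scale by a weighted formula accounting for correct and incorrect answers, without giving the formula itself. One plausible scheme, sketched below purely for illustration (the function name and weighting are assumptions, not the actual EDiR formula), credits each option the chatbot judges in agreement with the answer key and applies a matching penalty for each disagreement, floored at zero:

```python
def question_score(judged: list[bool], key: list[bool]) -> float:
    """Hypothetical per-question score on a 0.0-1.0 scale.

    judged[i] is the chatbot's correct/incorrect verdict for option i;
    key[i] is the ground-truth verdict. Agreements earn credit,
    disagreements subtract an equal amount, and the result is floored
    at 0.0. This is an illustrative weighting, not the published one.
    """
    n = len(key)
    agree = sum(j == k for j, k in zip(judged, key))
    disagree = n - agree
    return max(0.0, (agree - disagree) / n)
```

Under this scheme a fully correct set of verdicts scores 1.0, and a question with as many disagreements as agreements scores 0.0.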

RESULTS

Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation), compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). Self-reported confidence was highest for Claude 3.5 Sonnet (9.0 ± 0.9), followed by ChatGPT-4o (8.7 ± 1.1), ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6) (p < 0.001). Claude 3.5 Sonnet also demonstrated superior consistency, changing responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions, achieving a passing grade on this part of the examination.
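The consistency percentages reported above can be computed from raw verdicts as the fraction of per-answer responses that differ between the two test iterations. A minimal sketch (function and variable names are illustrative assumptions):

```python
def change_rate(run1: list[bool], run2: list[bool]) -> float:
    """Fraction of per-answer verdicts that changed between two iterations.

    run1 and run2 hold a chatbot's correct/incorrect verdicts for the
    same sequence of answer options, collected in two separate runs.
    """
    if len(run1) != len(run2):
        raise ValueError("iterations must cover the same answer options")
    changed = sum(a != b for a, b in zip(run1, run2))
    return changed / len(run1)
```

A lower rate indicates more stable behavior; by this measure, Claude 3.5 Sonnet's 5.4% change rate was roughly a third of Gemini's 18.5%.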

CONCLUSION

Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.

RELEVANCE STATEMENT

Variation in performance, consistency, and confidence among chatbots in solving EDiR text-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.

KEY POINTS

Claude 3.5 Sonnet outperformed the other chatbots in accuracy and response consistency. ChatGPT-4o ranked second, showing strong but slightly less reliable performance. All chatbots surpassed human candidates on text-based EDiR questions.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c943/12364795/9035dea64e37/41747_2025_591_Fig1_HTML.jpg
