Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada.
Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.
Can Assoc Radiol J. 2024 May;75(2):344-350. doi: 10.1177/08465371231193716. Epub 2023 Aug 14.
Bard by Google, a direct competitor to ChatGPT, was recently released. Understanding the relative performance of these chatbots can provide important insight into their strengths and weaknesses, as well as the roles they are best suited to fill. In this project, we aimed to compare the most recent version of ChatGPT, ChatGPT-4, with Bard by Google in their ability to accurately respond to radiology board examination practice questions.
Text-based questions were collected from the 2017-2021 American College of Radiology's Diagnostic Radiology In-Training (DXIT) examinations. ChatGPT-4 and Bard were queried, and their comparative accuracies, response lengths, and response times were documented. Subspecialty-specific performance was analyzed as well.
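The abstract does not describe the querying setup in detail, so the following is only a minimal sketch of how per-question accuracy, response length, and response time could be logged; ask_chatbot() is a hypothetical placeholder for whatever interface or API was actually used.

```python
import time
from dataclasses import dataclass


def ask_chatbot(model: str, question: str) -> str:
    """Hypothetical helper: stands in for the actual query mechanism."""
    raise NotImplementedError("replace with a real chatbot interface")


@dataclass
class Trial:
    model: str
    question_id: int
    response: str
    chars: int      # response length in characters
    seconds: float  # wall-clock response time


def run_trial(model: str, question_id: int, question: str) -> Trial:
    start = time.perf_counter()
    response = ask_chatbot(model, question)
    elapsed = time.perf_counter() - start
    return Trial(model, question_id, response, len(response), elapsed)
```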
318 questions were included in our analysis. ChatGPT answered significantly more accurately than Bard (87.11% vs 70.44%, P < .0001). ChatGPT's responses were significantly shorter than Bard's (935.28 ± 440.88 characters vs 1437.52 ± 415.91 characters, P < .0001). ChatGPT's response time was significantly longer than Bard's (26.79 ± 3.27 seconds vs 7.55 ± 1.88 seconds, P < .0001). ChatGPT performed superiorly to Bard in neuroradiology (100.00% vs 86.21%, P = .03), general & physics (85.39% vs 68.54%, P < .001), nuclear medicine (80.00% vs 56.67%, P < .01), pediatric radiology (93.75% vs 68.75%, P = .03), and ultrasound (100.00% vs 63.64%, P < .001). In the remaining subspecialties, there were no significant differences between ChatGPT's and Bard's performance.
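The abstract does not state which statistical tests were used, so the sketch below only illustrates how comparable P values could be obtained from the reported summary figures, assuming a chi-square test for the overall accuracies and two-sample t-tests computed from the reported means and standard deviations; the authors' actual analysis may differ.

```python
from scipy import stats

N = 318  # questions analyzed

# Accuracy: 87.11% vs 70.44% of 318 questions corresponds to 277 vs 224 correct.
chatgpt_correct = round(0.8711 * N)  # 277
bard_correct = round(0.7044 * N)     # 224
table = [[chatgpt_correct, N - chatgpt_correct],
         [bard_correct, N - bard_correct]]
_, p_acc, _, _ = stats.chi2_contingency(table)

# Response length in characters, compared from the reported means and SDs.
p_len = stats.ttest_ind_from_stats(935.28, 440.88, N,
                                   1437.52, 415.91, N).pvalue

# Response time in seconds.
p_time = stats.ttest_ind_from_stats(26.79, 3.27, N,
                                     7.55, 1.88, N).pvalue

print(f"accuracy P={p_acc:.1e}, length P={p_len:.1e}, time P={p_time:.1e}")
```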
ChatGPT displayed superior radiology knowledge compared to Bard. While both chatbots demonstrate reasonable radiology knowledge, they should be used with awareness of their limitations and fallibility. Both chatbots provided incorrect or illogical answer explanations and did not always address the educational content of the question.