Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada.
Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.
Can Assoc Radiol J. 2024 May;75(2):344-350. doi: 10.1177/08465371231193716. Epub 2023 Aug 14.
Bard by Google, a direct competitor to ChatGPT, was recently released. Understanding the relative performance of these chatbots can provide important insight into their strengths and weaknesses, as well as the roles they are best suited to fill. In this project, we aimed to compare the most recent version of ChatGPT, ChatGPT-4, with Bard by Google in their ability to accurately respond to radiology board examination practice questions.
Text-based questions were collected from the 2017-2021 American College of Radiology's Diagnostic Radiology In-Training (DXIT) examinations. ChatGPT-4 and Bard were queried, and their comparative accuracies, response lengths, and response times were documented. Subspecialty-specific performance was analyzed as well.
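The abstract does not describe the querying setup in detail, so the following is only a minimal sketch of how per-question accuracy, response length, and response time could be logged; ask_chatbot() is a hypothetical placeholder for whatever interface or API was actually used.

```python
import time
from dataclasses import dataclass


def ask_chatbot(model: str, question: str) -> str:
    """Hypothetical helper: stands in for the actual query mechanism."""
    raise NotImplementedError("replace with a real chatbot interface")


@dataclass
class Trial:
    model: str
    question_id: int
    response: str
    chars: int      # response length in characters
    seconds: float  # wall-clock response time


def run_trial(model: str, question_id: int, question: str) -> Trial:
    start = time.perf_counter()
    response = ask_chatbot(model, question)
    elapsed = time.perf_counter() - start
    return Trial(model, question_id, response, len(response), elapsed)
```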
318 questions were included in our analysis. ChatGPT answered significantly more accurately than Bard (87.11% vs 70.44%, P < .0001). ChatGPT's responses were significantly shorter than Bard's (935.28 ± 440.88 characters vs 1437.52 ± 415.91 characters, P < .0001). ChatGPT's response time was significantly longer than Bard's (26.79 ± 3.27 seconds vs 7.55 ± 1.88 seconds, P < .0001). ChatGPT performed superiorly to Bard in neuroradiology (100.00% vs 86.21%, P = .03), general & physics (85.39% vs 68.54%, P < .001), nuclear medicine (80.00% vs 56.67%, P < .01), pediatric radiology (93.75% vs 68.75%, P = .03), and ultrasound (100.00% vs 63.64%, P < .001). In the remaining subspecialties, there were no significant differences between ChatGPT's and Bard's performance.
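The abstract does not state which statistical tests were used, so the sketch below only illustrates how comparable P values could be obtained from the reported summary figures, assuming a chi-square test for the overall accuracies and two-sample t-tests computed from the reported means and standard deviations; the authors' actual analysis may differ.

```python
from scipy import stats

N = 318  # questions analyzed

# Accuracy: 87.11% vs 70.44% of 318 questions corresponds to 277 vs 224 correct.
chatgpt_correct = round(0.8711 * N)  # 277
bard_correct = round(0.7044 * N)     # 224
table = [[chatgpt_correct, N - chatgpt_correct],
         [bard_correct, N - bard_correct]]
_, p_acc, _, _ = stats.chi2_contingency(table)

# Response length in characters, compared from the reported means and SDs.
p_len = stats.ttest_ind_from_stats(935.28, 440.88, N,
                                   1437.52, 415.91, N).pvalue

# Response time in seconds.
p_time = stats.ttest_ind_from_stats(26.79, 3.27, N,
                                     7.55, 1.88, N).pvalue

print(f"accuracy P={p_acc:.1e}, length P={p_len:.1e}, time P={p_time:.1e}")
```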
ChatGPT displayed superior radiology knowledge compared to Bard. While both chatbots demonstrate reasonable radiology knowledge, they should be used with awareness of their limitations and fallibility. Both chatbots provided incorrect or illogical answer explanations and did not always address the educational content of the question.