

Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society.

Author Affiliations

Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan.

Department of Radiology, Tohoku Medical and Pharmaceutical University, Sendai, Japan.

Publication Information

Jpn J Radiol. 2024 Feb;42(2):201-207. doi: 10.1007/s11604-023-01491-2. Epub 2023 Oct 4.

DOI: 10.1007/s11604-023-01491-2
PMID: 37792149
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10811006/
Abstract

PURPOSE

Herein, we assessed the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE).

MATERIALS AND METHODS

In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. McNemar's test was used to compare the proportion of correct responses between the LLMs. Fisher's exact test was used to assess the performance of GPT-4 for each topic category.
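McNemar's test compares two models on the same set of questions by looking only at the discordant pairs: questions one model answered correctly and the other did not. A minimal sketch of the two-sided exact version, using only the standard library; the discordant-pair counts in the example are hypothetical, not taken from the paper:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant-pair counts:
    b = questions only model A got right, c = questions only model B
    got right. Under the null, the b discordant successes follow a
    binomial distribution with p = 0.5 on n = b + c trials."""
    n = b + c
    k = min(b, c)
    # Double the one-sided tail probability, capped at 1.
    p = 2 * sum(comb(n, i) * 0.5 ** n for i in range(k + 1))
    return min(p, 1.0)

# Hypothetical example: of 103 paired answers, suppose 30 questions
# were answered correctly only by GPT-4 and 5 only by ChatGPT.
print(mcnemar_exact(30, 5))
```

With such a lopsided split the p-value falls well below 0.001, consistent in spirit with the significance levels reported in the abstract; the concordant pairs (both right or both wrong) do not enter the statistic at all.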

RESULTS

ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2 percentage points (p < 0.001) and Google Bard by 26.2 percentage points (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, significantly more than ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4's superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) on single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004) in nuclear medicine. No significant differences were observed between the LLMs in the remaining categories. GPT-4 performed significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001), and better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001).
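As a quick sanity check, the reported accuracies and the pairwise margins follow directly from the correct-answer counts out of 103 questions:

```python
# Reproduce the percentages quoted in the abstract from the raw counts.
total = 103
correct = {"ChatGPT": 42, "GPT-4": 67, "Google Bard": 40}

# Accuracy in percent, rounded to one decimal place as in the abstract.
acc = {model: round(100 * n / total, 1) for model, n in correct.items()}
print(acc)  # {'ChatGPT': 40.8, 'GPT-4': 65.0, 'Google Bard': 38.8}

# GPT-4's margins over the other two models, in percentage points.
print(round(acc["GPT-4"] - acc["ChatGPT"], 1))      # 24.2
print(round(acc["GPT-4"] - acc["Google Bard"], 1))  # 26.2
```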

CONCLUSION

ChatGPT Plus, based on GPT-4, scored 65% on the Japanese-language questions of the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of LLMs to address advanced clinical questions in radiology in Japan.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c94/10811006/62b4b1857ebe/11604_2023_1491_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c94/10811006/bbfb52b66153/11604_2023_1491_Fig2_HTML.jpg
