Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society.

Affiliations

Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan.

Department of Radiology, Tohoku Medical and Pharmaceutical University, Sendai, Japan.

Publication Information

Jpn J Radiol. 2024 Feb;42(2):201-207. doi: 10.1007/s11604-023-01491-2. Epub 2023 Oct 4.

Abstract

PURPOSE

Herein, we assessed the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE).

MATERIALS AND METHODS

In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. McNemar's test was used to compare the proportion of correct responses between the LLMs. Fisher's exact test was used to assess the performance of GPT-4 for each topic category.
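As an illustration of how such comparisons are typically computed, the sketch below applies McNemar's test to paired correct/incorrect outcomes and Fisher's exact test to a 2x2 topic-by-correctness table. This is not the authors' code; the statsmodels and scipy calls are standard implementations of these tests, and the contingency counts are hypothetical placeholders, not data reported in the paper.

# Sketch of the statistical comparisons described above; the counts are
# hypothetical placeholders, not figures from the study.
from statsmodels.stats.contingency_tables import mcnemar
from scipy.stats import fisher_exact

# McNemar's test: paired correct/incorrect outcomes of two LLMs on the same questions.
# Rows: GPT-4 correct / incorrect; columns: ChatGPT correct / incorrect.
paired = [[35, 32],
          [ 7, 29]]            # hypothetical counts summing to 103 questions
print(mcnemar(paired, exact=True).pvalue)   # exact test on the discordant pairs

# Fisher's exact test: GPT-4's accuracy in one topic category versus another.
# Rows: topic A / topic B; columns: correct / incorrect.
topic = [[14,  1],
         [24, 19]]             # hypothetical counts
odds_ratio, p_value = fisher_exact(topic)
print(odds_ratio, p_value)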

RESULTS

ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2 percentage points (p < 0.001) and Google Bard by 26.2 percentage points (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, which was significantly higher than the accuracy of ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4's superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) in the single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004) in nuclear medicine. No significant differences were observed between the LLMs in the categories not mentioned above. The performance of GPT-4 was significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001). GPT-4 also performed better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001).

CONCLUSION

ChatGPT Plus, based on GPT-4, scored 65% when answering Japanese-language questions from the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of using LLMs to address advanced clinical questions in the field of radiology in Japan.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c94/10811006/62b4b1857ebe/11604_2023_1491_Fig1_HTML.jpg
