Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank.

Author information

Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea.

Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea.

Publication information

Medicine (Baltimore). 2024 Mar 1;103(9):e37325. doi: 10.1097/MD.0000000000037325.

Abstract

Large language models (LLMs) have been deployed in diverse fields, and their potential for application in medicine has been explored in numerous studies. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard on the Korean-language Emergency Medicine Board Examination question bank. Of the 2353 questions in the question bank, 150 were randomly selected, and 27 containing figures were excluded, leaving 123 questions. Questions requiring abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions; those answerable through recall of memorized factual information alone were classified as lower-order questions. The 123 questions were input into each LLM, and the resulting answers and explanations were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) showed higher correct-response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 showed the highest correct-response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions, both at 71.4%. The appropriateness of the explanations accompanying the answers was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard in answering a random selection of Korean-language Emergency Medicine Board Examination questions.
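
The reported rates map back onto counts out of the 123 scored questions (e.g., 75.6% of 123 ≈ 93 correct answers for ChatGPT-4). Below is a minimal Python sketch, not taken from the paper, that reconstructs these counts from the overall rates and applies a chi-square test of homogeneity across the four models; the study's own statistical method may differ.

from scipy.stats import chi2_contingency

TOTAL = 123  # questions scored after exclusions (150 sampled - 27 with figures)

# Overall correct-response rates as reported in the abstract.
rates = {
    "ChatGPT-4": 0.756,
    "Bing Chat": 0.707,
    "ChatGPT-3.5": 0.569,
    "Bard": 0.512,
}

# Rebuild [correct, incorrect] counts per model from the reported rates.
table = [[round(r * TOTAL), TOTAL - round(r * TOTAL)] for r in rates.values()]

# Chi-square test of homogeneity: do the four models share one correct rate?
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")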

Similar articles

1
A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology.
Indian Dermatol Online J. 2025 Feb 27;16(2):241-247. doi: 10.4103/idoj.idoj_221_24. eCollection 2025 Mar-Apr.
2
Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing.
Cureus. 2023 Aug 21;15(8):e43861. doi: 10.7759/cureus.43861. eCollection 2023 Aug.
3
Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam.
Comput Biol Med. 2024 Jan;168:107794. doi: 10.1016/j.compbiomed.2023.107794. Epub 2023 Nov 30.
4
Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.
Cureus. 2023 Aug 4;15(8):e42972. doi: 10.7759/cureus.42972. eCollection 2023 Aug.
5
Performance of Generative Large Language Models on Ophthalmology Board-Style Questions.
Am J Ophthalmol. 2023 Oct;254:141-149. doi: 10.1016/j.ajo.2023.05.024. Epub 2023 Jun 18.

Cited by

1
User-centric AI: evaluating the usability of generative AI applications through user reviews on app stores.
PeerJ Comput Sci. 2024 Oct 25;10:e2421. doi: 10.7717/peerj-cs.2421. eCollection 2024.
2
Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study.
Int Dent J. 2025 Feb;75(1):176-184. doi: 10.1016/j.identj.2024.09.002. Epub 2024 Oct 6.
3
Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination.
Heliyon. 2024 Jul 18;10(14):e34851. doi: 10.1016/j.heliyon.2024.e34851. eCollection 2024 Jul 30.
