


Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank.

Affiliations

Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea.

Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea.

Publication Information

Medicine (Baltimore). 2024 Mar 1;103(9):e37325. doi: 10.1097/MD.0000000000037325.

DOI:10.1097/MD.0000000000037325
PMID:38428889
Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10906566/
Abstract

Large language models (LLMs) have been deployed in diverse fields, and the potential for their application in medicine has been explored through numerous studies. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Emergency Medicine Board Examination question bank in the Korean language. Of the 2353 questions in the question bank, 150 questions were randomly selected, and 27 containing figures were excluded. Questions that required abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions, and those that required only recall, memory, and factual information in response were classified as lower-order questions. The answers and explanations obtained by inputting the 123 questions into the LLMs were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) showed higher correct response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 showed the highest correct response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions at 71.4%. The appropriateness of the explanation for the answer was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard in answering a random selection of Emergency Medicine Board Examination questions in the Korean language.


Similar Articles

1. Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank.
Medicine (Baltimore). 2024 Mar 1;103(9):e37325. doi: 10.1097/MD.0000000000037325.
2. A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology.
Indian Dermatol Online J. 2025 Feb 27;16(2):241-247. doi: 10.4103/idoj.idoj_221_24. eCollection 2025 Mar-Apr.
3. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.
Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.
4. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.
J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.
5. Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing.
Cureus. 2023 Aug 21;15(8):e43861. doi: 10.7759/cureus.43861. eCollection 2023 Aug.
6. ChatGPT, Bard, and Bing Chat Are Large Language Processing Models That Answered Orthopaedic In-Training Examination Questions With Similar Accuracy to First-Year Orthopaedic Surgery Residents.
Arthroscopy. 2025 Mar;41(3):557-562. doi: 10.1016/j.arthro.2024.08.023. Epub 2024 Aug 28.
7. Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam.
Comput Biol Med. 2024 Jan;168:107794. doi: 10.1016/j.compbiomed.2023.107794. Epub 2023 Nov 30.
8. Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.
Cureus. 2023 Aug 4;15(8):e42972. doi: 10.7759/cureus.42972. eCollection 2023 Aug.
9. Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions.
Surg Obes Relat Dis. 2024 Jul;20(7):609-613. doi: 10.1016/j.soard.2024.04.014. Epub 2024 May 8.
10. Performance of Generative Large Language Models on Ophthalmology Board-Style Questions.
Am J Ophthalmol. 2023 Oct;254:141-149. doi: 10.1016/j.ajo.2023.05.024. Epub 2023 Jun 18.

Cited By

1. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.
J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.
2. Assessing the Current Limitations of Large Language Models in Advancing Health Care Education.
JMIR Form Res. 2025 Jan 16;9:e51319. doi: 10.2196/51319.
3. User-centric AI: evaluating the usability of generative AI applications through user reviews on app stores.
PeerJ Comput Sci. 2024 Oct 25;10:e2421. doi: 10.7717/peerj-cs.2421. eCollection 2024.
4. Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions.
Int J Dent Hyg. 2025 May;23(2):267-276. doi: 10.1111/idh.12848. Epub 2024 Oct 16.
5. Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study.
Int Dent J. 2025 Feb;75(1):176-184. doi: 10.1016/j.identj.2024.09.002. Epub 2024 Oct 6.
6. Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination.
Heliyon. 2024 Jul 18;10(14):e34851. doi: 10.1016/j.heliyon.2024.e34851. eCollection 2024 Jul 30.

References

1. Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society.
Jpn J Radiol. 2024 Feb;42(2):201-207. doi: 10.1007/s11604-023-01491-2. Epub 2023 Oct 4.
2. ChatGPT Versus Human Performance on Emergency Medicine Board Preparation Questions.
Ann Emerg Med. 2024 Jan;83(1):87-88. doi: 10.1016/j.annemergmed.2023.08.010. Epub 2023 Sep 19.
3. Fabrication and errors in the bibliographic citations generated by ChatGPT.
Sci Rep. 2023 Sep 7;13(1):14045. doi: 10.1038/s41598-023-41032-5.
4. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment.
Can Assoc Radiol J. 2024 May;75(2):344-350. doi: 10.1177/08465371231193716. Epub 2023 Aug 14.
5. ChatGPT-3.5 and ChatGPT-4 dermatological knowledge level based on the Specialty Certificate Examination in Dermatology.
Clin Exp Dermatol. 2024 Jun 25;49(7):686-691. doi: 10.1093/ced/llad255.
6. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study.
JMIR Med Educ. 2023 Jun 29;9:e48002. doi: 10.2196/48002.
7. Use of ChatGPT, GPT-4, and Bard to Improve Readability of ChatGPT's Answers to Common Questions About Lung Cancer and Lung Cancer Screening.
AJR Am J Roentgenol. 2023 Nov;221(5):701-704. doi: 10.2214/AJR.23.29622. Epub 2023 Jun 21.
8. ChatGPT: A Valuable Tool for Emergency Medical Assistance.
Ann Emerg Med. 2023 Sep;82(3):411-413. doi: 10.1016/j.annemergmed.2023.04.027. Epub 2023 Jun 17.
9. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank.
Neurosurgery. 2023 Nov 1;93(5):1090-1098. doi: 10.1227/neu.0000000000002551. Epub 2023 Jun 12.
10. ChatGPT failed Taiwan's Family Medicine Board Exam.
J Chin Med Assoc. 2023 Aug 1;86(8):762-766. doi: 10.1097/JCMA.0000000000000946. Epub 2023 Jun 9.