


Custom GPTs Enhancing Performance and Evidence Compared with GPT-3.5, GPT-4, and GPT-4o? A Study on the Emergency Medicine Specialist Examination.

Authors

Liu Chiu-Liang, Ho Chien-Ta, Wu Tzu-Chi

Affiliations

Graduate Institute of Technology Management, National Chung-Hsing University, Taichung 402202, Taiwan.

College of Health Sciences, Central Taiwan University of Science and Technology, Taichung 406053, Taiwan.

Publication

Healthcare (Basel). 2024 Aug 30;12(17):1726. doi: 10.3390/healthcare12171726.

DOI: 10.3390/healthcare12171726
PMID: 39273750
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11394718/
Abstract

Given the widespread application of ChatGPT, we aim to evaluate its proficiency in the emergency medicine specialty written examination. Additionally, we compare the performance of GPT-3.5, GPT-4, custom GPTs, and GPT-4o. The research seeks to ascertain whether custom GPTs possess the essential capabilities and access to the knowledge bases necessary for providing accurate information, and to explore the effectiveness and potential of personalized knowledge bases in supporting the education of medical residents. We evaluated the performance of ChatGPT-3.5, GPT-4, custom GPTs, and GPT-4o on the Emergency Medicine Specialist Examination in Taiwan. Two hundred single-choice exam questions were provided to these AI models, and their responses were recorded. Correct rates were compared among the four models, and the McNemar test was applied to paired model data to determine whether there were significant differences in performance. Out of 200 questions, GPT-3.5, GPT-4, custom GPTs, and GPT-4o correctly answered 77, 105, 119, and 138 questions, respectively. GPT-4o demonstrated the highest performance, significantly better than GPT-4, which in turn outperformed GPT-3.5; custom GPTs performed better than GPT-4 but worse than GPT-4o (all p < 0.05). In the emergency medicine specialty written exam, our findings highlight the value and potential of large language models (LLMs), as well as their strengths and limitations, especially regarding question types and image-inclusion capabilities. GPT-4o and custom GPTs not only facilitate exam preparation but also elevate the evidence level and source accuracy of responses, demonstrating significant potential to transform educational frameworks and clinical practices in medicine.
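The paired comparison described above uses the McNemar test, which looks only at the discordant questions: those one model answered correctly and the other missed. A minimal sketch of the exact (binomial) form of the test in Python; the discordant counts below are hypothetical, since the abstract reports only each model's total of correct answers:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test p-value from discordant counts:
    b = questions model A got right that model B missed,
    c = questions model B got right that model A missed.
    Under H0 (equal performance), b ~ Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    # Double the smaller binomial tail for a two-sided p-value, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical example: suppose GPT-4o answered 40 questions correctly
# that GPT-4 missed, while GPT-4 answered 7 correctly that GPT-4o missed.
p_value = mcnemar_exact(40, 7)
print(f"p = {p_value:.2e}")  # far below 0.05 for these counts
```

The test deliberately ignores questions both models got right or both got wrong, which is what makes it appropriate for paired accuracy comparisons on the same 200-question exam.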


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/64e2/11394718/d10f1381f3a1/healthcare-12-01726-g001.jpg

Similar Articles

1. Custom GPTs Enhancing Performance and Evidence Compared with GPT-3.5, GPT-4, and GPT-4o? A Study on the Emergency Medicine Specialist Examination. Healthcare (Basel). 2024 Aug 30;12(17):1726. doi: 10.3390/healthcare12171726.
2. The accuracy of Gemini, GPT-4, and GPT-4o in ECG analysis: A comparison with cardiologists and emergency medicine specialists. Am J Emerg Med. 2024 Oct;84:68-73. doi: 10.1016/j.ajem.2024.07.043. Epub 2024 Jul 30.
3. Evaluating AI Proficiency in Nuclear Cardiology: Large Language Models take on the Board Preparation Exam. medRxiv. 2024 Jul 16:2024.07.16.24310297. doi: 10.1101/2024.07.16.24310297.
4. GPT-4o vs. Human Candidates: Performance Analysis in the Polish Final Dentistry Examination. Cureus. 2024 Sep 6;16(9):e68813. doi: 10.7759/cureus.68813. eCollection 2024 Sep.
5. Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada. Fam Med Community Health. 2024 May 28;12(Suppl 1):e002626. doi: 10.1136/fmch-2023-002626.
6. GPT-4o's competency in answering the simulated written European Board of Interventional Radiology exam compared to a medical student and experts in Germany and its ability to generate exam items on interventional radiology: a descriptive study. J Educ Eval Health Prof. 2024;21:21. doi: 10.3352/jeehp.2024.21.21. Epub 2024 Aug 20.
7. Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study. JMIR Med Educ. 2023 Sep 19;9:e50514. doi: 10.2196/50514.
8. Capabilities of GPT-4o and Gemini 1.5 Pro in Gram stain and bacterial shape identification. Future Microbiol. 2024;19(15):1283-1292. doi: 10.1080/17460913.2024.2381967. Epub 2024 Jul 29.
9. Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations. Jpn J Radiol. 2024 Dec;42(12):1392-1398. doi: 10.1007/s11604-024-01633-0. Epub 2024 Jul 20.
10. Assessing AI efficacy in medical knowledge tests: A study using Taiwan's internal medicine exam questions from 2020 to 2023. Digit Health. 2024 Oct 18;10:20552076241291404. doi: 10.1177/20552076241291404. eCollection 2024 Jan-Dec.

Cited By

1. Role of Artificial Intelligence in Surgical Training by Assessing GPT-4 and GPT-4o on the Japan Surgical Board Examination With Text-Only and Image-Accompanied Questions: Performance Evaluation Study. JMIR Med Educ. 2025 Jul 30;11:e69313. doi: 10.2196/69313.
2. Exploring the possibilities and limitations of customized large language model to support and improve cervical cancer screening. BMC Med Inform Decis Mak. 2025 Jul 1;25(1):242. doi: 10.1186/s12911-025-03088-3.
3. A custom ChatGPT can accurately answer questions from an international expert osteotomy consensus statement. Eur J Orthop Surg Traumatol. 2025 Jun 16;35(1):247. doi: 10.1007/s00590-025-04373-7.
4. Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams. PLoS One. 2025 Jun 4;20(6):e0324841. doi: 10.1371/journal.pone.0324841. eCollection 2025.
5. Performance analysis of an emergency triage system in ophthalmology using a customized CHATBOT. Digit Health. 2025 May 11;11:20552076251320298. doi: 10.1177/20552076251320298. eCollection 2025 Jan-Dec.
6. Evaluating the performance of GPT-3.5, GPT-4, and GPT-4o in the Chinese National Medical Licensing Examination. Sci Rep. 2025 Apr 23;15(1):14119. doi: 10.1038/s41598-025-98949-2.
7. An artificial intelligence perspective on geriatric syndromes: assessing the information accuracy and readability of ChatGPT. Eur Geriatr Med. 2025 Apr 21. doi: 10.1007/s41999-025-01202-2.
8. Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry. J Esthet Restor Dent. 2025 Jul;37(7):1740-1752. doi: 10.1111/jerd.13447. Epub 2025 Mar 2.
9. Assessing the ability of GPT-4o to visually recognize medications and provide patient education. Sci Rep. 2024 Nov 5;14(1):26749. doi: 10.1038/s41598-024-78577-y.

References

1. Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination. Heliyon. 2024 Jul 18;10(14):e34851. doi: 10.1016/j.heliyon.2024.e34851. eCollection 2024 Jul 30.
2. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
3. Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations. Jpn J Radiol. 2024 Dec;42(12):1392-1398. doi: 10.1007/s11604-024-01633-0. Epub 2024 Jul 20.
4. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration. JMIR Med Inform. 2024 Apr 9;12:e55627. doi: 10.2196/55627.
5. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Med Educ. 2024 Mar 28;10:e57054. doi: 10.2196/57054.
6. Comparison of emergency medicine specialist, cardiologist, and chat-GPT in electrocardiography assessment. Am J Emerg Med. 2024 Jun;80:51-60. doi: 10.1016/j.ajem.2024.03.017. Epub 2024 Mar 15.
7. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study. JMIR Med Educ. 2024 Mar 12;10:e54393. doi: 10.2196/54393.
8. Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan Psychiatric Licensing Examination and in differential diagnosis with multi-center psychiatrists. Psychiatry Clin Neurosci. 2024 Jun;78(6):347-352. doi: 10.1111/pcn.13656. Epub 2024 Feb 26.
9. Performance of ChatGPT on Stage 1 of the Taiwanese medical licensing exam. Digit Health. 2024 Feb 16;10:20552076241233144. doi: 10.1177/20552076241233144. eCollection 2024 Jan-Dec.
10. Twelve tips on creating and using custom GPTs to enhance health professions education. Med Teach. 2024 Jun;46(6):752-756. doi: 10.1080/0142159X.2024.2305365. Epub 2024 Jan 29.