
Advancing Medical Education: Performance of Generative Artificial Intelligence Models on Otolaryngology Board Preparation Questions With Image Analysis Insights.

Author Information

Terwilliger Emma, Bcharah George, Bcharah Hend, Bcharah Estefana, Richardson Clare, Scheffler Patrick

Affiliations

Otolaryngology, Mayo Clinic Alix School of Medicine, Scottsdale, USA.

Otolaryngology, Andrew Taylor Still University School of Osteopathic Medicine, Mesa, USA.

Publication Information

Cureus. 2024 Jul 9;16(7):e64204. doi: 10.7759/cureus.64204. eCollection 2024 Jul.

DOI: 10.7759/cureus.64204
PMID: 39130878
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11315421/
Abstract

Objective: To evaluate and compare the performance of Chat Generative Pre-Trained Transformer (ChatGPT), GPT-4, and Google Bard on United States otolaryngology board-style questions, in order to assess their ability to act as an adjunctive study tool and resource for students and doctors.
Methods: A total of 1077 text-based questions and 60 image-based questions from the otolaryngology board exam preparation tool BoardVitals were entered into ChatGPT, GPT-4, and Google Bard. Each response was scored as correct or incorrect depending on whether the artificial intelligence (AI) model provided the correct answer. Data analysis was performed in RStudio.
Results: GPT-4 scored the highest at 78.7%, compared to ChatGPT at 55.3% and Bard at 61.7% (p<0.001). In terms of question difficulty, all three AI models performed best on easy questions (ChatGPT: 69.7%, GPT-4: 92.5%, Bard: 76.4%) and worst on hard questions (ChatGPT: 42.3%, GPT-4: 61.3%, Bard: 45.6%). Across all difficulty levels, GPT-4 outperformed Bard and ChatGPT (p<0.0001). GPT-4 also outperformed ChatGPT and Bard in all subspecialty sections, with significantly higher scores (p<0.05) on every section except allergy (p>0.05). On image-based questions, GPT-4 performed better than Bard (56.7% vs 46.4%, p=0.368) and showed better overall image interpretation capabilities.
Conclusion: This study showed that the GPT-4 model performed better than both ChatGPT and Bard on United States otolaryngology board practice questions. Although the GPT-4 results are promising, AI should still be used with caution when implemented in medical education or patient care settings.
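The abstract states only that responses were scored correct/incorrect and that the proportions were compared in RStudio; no code or test details are given. As a hedged illustration of that kind of proportion comparison, the Python sketch below runs a chi-square test over a model-by-outcome contingency table. The correct-answer counts are back-calculated from the reported accuracies on the 1077 text-based questions and are approximations for illustration only; the specific test is an assumption, not the authors' documented method.

```python
# Minimal sketch (not the authors' code): the kind of proportion
# comparison summarized in the abstract. Counts are back-calculated
# from the reported text-question accuracies (GPT-4 78.7%,
# Bard 61.7%, ChatGPT 55.3%) and are illustrative approximations.
from scipy.stats import chi2_contingency

TOTAL_QUESTIONS = 1077  # text-based questions from BoardVitals

# Approximate correct-answer counts implied by the reported percentages.
correct = {
    "GPT-4": round(0.787 * TOTAL_QUESTIONS),
    "Bard": round(0.617 * TOTAL_QUESTIONS),
    "ChatGPT": round(0.553 * TOTAL_QUESTIONS),
}

# Build a models x (correct, incorrect) contingency table.
table = [[n, TOTAL_QUESTIONS - n] for n in correct.values()]

chi2, p_value, dof, _ = chi2_contingency(table)
for model, n in correct.items():
    print(f"{model}: {n}/{TOTAL_QUESTIONS} correct ({n / TOTAL_QUESTIONS:.1%})")
print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p_value:.2e}")
```

With these illustrative counts the test returns a p-value far below 0.001, consistent with the between-model difference the abstract reports.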


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/29f6/11315421/39462ec6e760/cureus-0016-00000064204-i01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/29f6/11315421/5febbf557de4/cureus-0016-00000064204-i02.jpg

Similar Articles

1. Advancing Medical Education: Performance of Generative Artificial Intelligence Models on Otolaryngology Board Preparation Questions With Image Analysis Insights.
   Cureus. 2024 Jul 9;16(7):e64204. doi: 10.7759/cureus.64204. eCollection 2024 Jul.
2. Effectiveness of AI-powered Chatbots in responding to orthopaedic postgraduate exam questions - an observational study.
   Int Orthop. 2024 Aug;48(8):1963-1969. doi: 10.1007/s00264-024-06182-9. Epub 2024 Apr 15.
3. Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society.
   Jpn J Radiol. 2024 Feb;42(2):201-207. doi: 10.1007/s11604-023-01491-2. Epub 2023 Oct 4.
4. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions.
   World Neurosurg. 2023 Nov;179:e160-e165. doi: 10.1016/j.wneu.2023.08.042. Epub 2023 Aug 18.
5. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank.
   Neurosurgery. 2023 Nov 1;93(5):1090-1098. doi: 10.1227/neu.0000000000002551. Epub 2023 Jun 12.
6. Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.
   JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.
7. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.
   Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
8. Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions.
   Adv Med Educ Pract. 2024 Sep 20;15:857-871. doi: 10.2147/AMEP.S479801. eCollection 2024.
9. Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level.
   Cureus. 2024 Mar 13;16(3):e56104. doi: 10.7759/cureus.56104. eCollection 2024 Mar.
10. Performance of ChatGPT and Bard in self-assessment questions for nephrology board renewal.
    Clin Exp Nephrol. 2024 May;28(5):465-469. doi: 10.1007/s10157-023-02451-w. Epub 2024 Feb 14.

Cited By

1. Applications of Natural Language Processing in Otolaryngology: A Scoping Review.
   Laryngoscope. 2025 Sep;135(9):3049-3063. doi: 10.1002/lary.32198. Epub 2025 May 1.
2. Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry.
   J Esthet Restor Dent. 2025 Jul;37(7):1740-1752. doi: 10.1111/jerd.13447. Epub 2025 Mar 2.
3. Advancements in AI Medical Education: Assessing ChatGPT's Performance on USMLE-Style Questions Across Topics and Difficulty Levels.
   Cureus. 2024 Dec 24;16(12):e76309. doi: 10.7759/cureus.76309. eCollection 2024 Dec.
4. Response to: comparative performance of artificial intelligence models in rheumatology board-level questions: evaluating Google Gemini and ChatGPT-4o: correspondence.
   Clin Rheumatol. 2024 Dec;43(12):4023-4024. doi: 10.1007/s10067-024-07199-6. Epub 2024 Oct 22.

References

1. Feasibility of Multimodal Artificial Intelligence Using GPT-4 Vision for the Classification of Middle Ear Disease: Qualitative Study and Validation.
   JMIR AI. 2024 May 31;3:e58342. doi: 10.2196/58342.
2. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study.
   JMIR Med Educ. 2024 Mar 28;10:e57054. doi: 10.2196/57054.
3. Physician views of artificial intelligence in otolaryngology and rhinology: A mixed methods study.
   Laryngoscope Investig Otolaryngol. 2023 Oct 31;8(6):1468-1475. doi: 10.1002/lio2.1177. eCollection 2023 Dec.
4. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination.
   Sci Rep. 2023 Nov 22;13(1):20512. doi: 10.1038/s41598-023-46995-z.
5. Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination.
   JB JS Open Access. 2023 Sep 8;8(3). doi: 10.2106/JBJS.OA.23.00056. eCollection 2023 Jul-Sep.
6. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment.
   Can Assoc Radiol J. 2024 May;75(2):344-350. doi: 10.1177/08465371231193716. Epub 2023 Aug 14.
7. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study.
   JMIR Med Educ. 2023 Jun 29;9:e48002. doi: 10.2196/48002.
8. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank.
   Neurosurgery. 2023 Nov 1;93(5):1090-1098. doi: 10.1227/neu.0000000000002551. Epub 2023 Jun 12.
9. ChatGPT's quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions.
   Eur Arch Otorhinolaryngol. 2023 Sep;280(9):4271-4278. doi: 10.1007/s00405-023-08051-4. Epub 2023 Jun 7.
10. ChatGPT takes on the European Exam in Core Cardiology: an artificial intelligence success story?
    Eur Heart J Digit Health. 2023 Apr 24;4(3):279-281. doi: 10.1093/ehjdh/ztad029. eCollection 2023 May.