
Comparative Assessment of Otolaryngology Knowledge Among Large Language Models.

Authors

Merlino Dante J, Brufau Santiago R, Saieed George, Van Abel Kathryn M, Price Daniel L, Archibald David J, Ator Gregory A, Carlson Matthew L

Affiliations

Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A.

The Center for Plastic Surgery at Castle Rock, Castle Rock, Colorado, U.S.A.

Publication

Laryngoscope. 2025 Feb;135(2):629-634. doi: 10.1002/lary.31781. Epub 2024 Sep 21.

DOI: 10.1002/lary.31781
PMID: 39305216
Abstract

OBJECTIVE

The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT-3.5 and GPT-4), Google (PaLM2 and MedPaLM), and an open source model from Meta (Llama3:70b) in answering clinical test multiple choice questions in the field of otolaryngology-head and neck surgery.

METHODS

A dataset of 4566 otolaryngology questions was used; each model was provided a standardized prompt followed by a question. One hundred questions that were answered incorrectly by all models were further interrogated to gain insight into the causes of incorrect answers.

RESULTS

GPT4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%) questions, while Llama3:70b, GPT3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%) questions. Three hundred and sixty-nine questions were answered incorrectly by all models. Prompts to provide reasoning improved accuracy in all models: GPT4 changed from incorrect to correct answer 31% of the time, while GPT3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively.

CONCLUSION

Large language models vary in their understanding of otolaryngology-specific clinical knowledge. OpenAI's GPT4 has a strong understanding of core concepts as well as detailed information in the field of otolaryngology. Its baseline understanding in this field makes it well-suited to serve in roles related to head and neck surgery education provided that the appropriate precautions are taken and potential limitations are understood.

LEVEL OF EVIDENCE

NA Laryngoscope, 135:629-634, 2025.
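The protocol in METHODS (a standardized prompt followed by each multiple-choice question, with accuracy as the fraction answered correctly) can be sketched as follows. This is a minimal illustration only: the prompt wording, data format, and `ask_model` callback are assumptions, not the study's actual materials.

```python
# Sketch of an MCQ evaluation loop: render a standardized prompt per
# question, collect single-letter answers, and compute accuracy.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MCQ:
    stem: str
    options: Dict[str, str]  # answer letter -> option text
    answer: str              # correct letter, e.g. "B"

STANDARD_PROMPT = (
    "You are answering a board-style otolaryngology question. "
    "Reply with the single letter of the best answer.\n\n{stem}\n{options}"
)

def render(q: MCQ) -> str:
    opts = "\n".join(f"{letter}. {text}" for letter, text in sorted(q.options.items()))
    return STANDARD_PROMPT.format(stem=q.stem, options=opts)

def evaluate(ask_model: Callable[[str], str],
             questions: List[MCQ]) -> Tuple[float, List[MCQ]]:
    """Return (accuracy, questions answered incorrectly)."""
    misses: List[MCQ] = []
    correct = 0
    for q in questions:
        # Keep only the first character of the reply as the chosen letter.
        reply = ask_model(render(q)).strip().upper()[:1]
        if reply == q.answer:
            correct += 1
        else:
            misses.append(q)
    return correct / len(questions), misses

# Toy stand-in for a model API call: always answers "A".
def always_a(prompt: str) -> str:
    return "A"

toy = [
    MCQ("Most common cause of conductive hearing loss?",
        {"A": "Cerumen impaction", "B": "Otosclerosis"}, "A"),
    MCQ("Nerve most at risk during thyroidectomy?",
        {"A": "Facial nerve", "B": "Recurrent laryngeal nerve"}, "B"),
]
accuracy, misses = evaluate(always_a, toy)
print(f"accuracy={accuracy:.1%}, missed={len(misses)}")  # accuracy=50.0%, missed=1
```

The returned `misses` list corresponds to the study's follow-up step, in which incorrectly answered questions were re-prompted with a request to provide reasoning.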


Similar Articles

1. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Med Educ. 2024 Mar 28;10:e57054. doi: 10.2196/57054.
2. Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study. JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
3. Comparative Performance of ChatGPT 3.5 and GPT4 on Rhinology Standardized Board Examination Questions. OTO Open. 2024 Jun 27;8(2):e164. doi: 10.1002/oto2.164. eCollection 2024 Apr-Jun.
4. Performance of trauma-trained large language models on surgical assessment questions: A new approach in resource identification. Surgery. 2025 Mar;179:108793. doi: 10.1016/j.surg.2024.08.026. Epub 2024 Sep 23.
5. Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study. J Med Internet Res. 2025 Feb 7;27:e65146. doi: 10.2196/65146.
6. Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Noncustom Models for Evidence-Based Medicine. Arthroscopy. 2025 Mar;41(3):565-573.e6. doi: 10.1016/j.arthro.2024.10.042. Epub 2024 Nov 7.
7. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.
8. Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study. J Med Internet Res. 2024 Jan 23;26:e52113. doi: 10.2196/52113.
9. Large Language Models in Biochemistry Education: Comparative Evaluation of Performance. JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.

Cited By

1. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.
2. ChatGPT performance in answering medical residency questions in nephrology: a pilot study in Brazil. J Bras Nefrol. 2025 Oct-Dec;47(4):e20240254. doi: 10.1590/2175-8239-JBN-2024-0254en.
3. Harnessing advanced large language models in otolaryngology board examinations: an investigation using python and application programming interfaces. Eur Arch Otorhinolaryngol. 2025 Apr 25. doi: 10.1007/s00405-025-09404-x.