Merlino Dante J, Brufau Santiago R, Saieed George, Van Abel Kathryn M, Price Daniel L, Archibald David J, Ator Gregory A, Carlson Matthew L
Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A.
The Center for Plastic Surgery at Castle Rock, Castle Rock, Colorado, U.S.A.
Laryngoscope. 2025 Feb;135(2):629-634. doi: 10.1002/lary.31781. Epub 2024 Sep 21.
The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT-3.5 and GPT-4), Google (PaLM2 and MedPaLM), and Meta's open-source model Llama3:70b in answering clinical multiple-choice questions in the field of otolaryngology-head and neck surgery.
A dataset of 4566 otolaryngology questions was used; each model was given a standardized prompt followed by each question. One hundred of the questions answered incorrectly by all models were further interrogated to gain insight into the causes of the incorrect answers.
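As a rough illustration of this setup, the sketch below sends a standardized system prompt followed by one question to a chat model and scores the letter it returns. The prompt wording, the question schema (stem, choices, answer), and the function names are assumptions made for illustration, not the authors' exact protocol.

```python
# Minimal sketch of the evaluation loop, assuming the OpenAI Python client
# and a hypothetical question schema: {"stem": str, "choices": {letter: text},
# "answer": letter}. Prompt text is illustrative, not the study's wording.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STANDARD_PROMPT = (
    "You are answering an otolaryngology multiple-choice question. "
    "Respond with the single letter of the best answer."
)

def ask_model(question: dict, model: str = "gpt-4") -> str:
    """Send the standardized prompt plus one question; return the model's letter choice."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in question["choices"].items())
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STANDARD_PROMPT},
            {"role": "user", "content": f"{question['stem']}\n{choices}"},
        ],
    )
    # Take the first character of the reply as the chosen letter.
    return response.choices[0].message.content.strip()[:1].upper()

def score(questions: list[dict], model: str = "gpt-4") -> float:
    """Fraction of questions where the model's letter matches the answer key."""
    correct = sum(ask_model(q, model) == q["answer"] for q in questions)
    return correct / len(questions)
```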
GPT-4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%), while Llama3:70b, GPT-3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%), respectively. Three hundred sixty-nine questions were answered incorrectly by all models. Prompting the models to provide reasoning improved accuracy in all of them: GPT-4 changed an incorrect answer to a correct one 31% of the time (see the sketch below), while GPT-3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively.
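The reasoning-prompt effect can be sketched the same way: re-query the model with an instruction to reason step by step and take the letter from its final line. The prompt wording below is an assumption, not the wording used in the study.

```python
# Illustrative reasoning-prompt variant of ask_model above; the system prompt
# is hypothetical, not the study's exact instruction.
from openai import OpenAI

client = OpenAI()

def ask_with_reasoning(question: dict, model: str = "gpt-4") -> str:
    """Ask for step-by-step reasoning, then extract the letter on the final line."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in question["choices"].items())
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "Think through the clinical reasoning step by step, then give "
                "the single letter of the best answer on the last line."
            )},
            {"role": "user", "content": f"{question['stem']}\n{choices}"},
        ],
    )
    final_line = response.choices[0].message.content.strip().splitlines()[-1]
    return final_line.strip()[:1].upper()
```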
Large language models vary in their understanding of otolaryngology-specific clinical knowledge. OpenAI's GPT-4 demonstrates a strong grasp of both core concepts and detailed information in the field of otolaryngology. This baseline understanding makes it well suited to roles in head and neck surgery education, provided that appropriate precautions are taken and its potential limitations are understood.
Level of Evidence: NA. Laryngoscope, 135:629-634, 2025.