Bicknell Brenton T, Rivers Nicholas J, Skelton Adam, Sheehan Delaney, Hodges Charis, Fairburn Stevan C, Greene Benjamin J, Panuganti Bharat
UAB Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA.
Department of Otolaryngology-Head and Neck Surgery, University of Alabama at Birmingham, Birmingham, Alabama, USA.
OTO Open. 2025 May 5;9(2):e70125. doi: 10.1002/oto2.70125. eCollection 2025 Apr-Jun.
Objective: To develop and evaluate the effectiveness of domain-specific customization in large language models (LLMs) by assessing the performance of the ENT GPT Assistant (E-GPT-A), a model specifically tailored for otolaryngology.
Study Design: Comparative analysis using multiple-choice questions (MCQs) from established otolaryngology resources.
Setting: Tertiary care academic hospital.
Methods: Two hundred forty clinical vignette-style MCQs were sourced from BoardVitals Otolaryngology and OTOQuest, covering a range of otolaryngology subspecialties (n = 40 each). E-GPT-A was developed using targeted instructions that customized it to otolaryngology. Its performance was compared against top-performing, widely used artificial intelligence (AI) LLMs, including GPT-3.5, GPT-4, Claude 2.0, and Claude 2.1. Accuracy was assessed across subspecialties, across question difficulty tiers, and in diagnosis and management.
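The abstract does not describe the scoring mechanics; purely as an illustration, the following is a minimal sketch of how per-model accuracy on such an MCQ set might be tallied by subspecialty and by difficulty tier (all field names and example records are hypothetical, not taken from the study):

    # Minimal sketch of an MCQ accuracy tally; hypothetical data, not the study's code.
    from collections import defaultdict

    # One record per (model, question) pair: the MCQ's subspecialty and
    # difficulty tier, the model's chosen option, and the answer key.
    responses = [
        {"model": "E-GPT-A", "subspecialty": "laryngology",
         "difficulty": "easy", "chosen": "B", "correct": "B"},
        {"model": "GPT-4", "subspecialty": "laryngology",
         "difficulty": "easy", "chosen": "C", "correct": "B"},
        # ... 240 questions per model in the actual study
    ]

    def accuracy_by(records, field):
        """Fraction of correct answers, grouped by (model, field)."""
        hits, totals = defaultdict(int), defaultdict(int)
        for r in records:
            key = (r["model"], r[field])
            totals[key] += 1
            hits[key] += r["chosen"] == r["correct"]  # bool counts as 0/1
        return {key: hits[key] / totals[key] for key in totals}

    print(accuracy_by(responses, "subspecialty"))
    print(accuracy_by(responses, "difficulty"))

Tallying from a single response set like this keeps the per-subspecialty and per-difficulty breakdowns consistent with the same underlying answers.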
Results: E-GPT-A achieved an overall accuracy of 74.6%, outperforming GPT-3.5 (60.4%), Claude 2.0 (61.7%), Claude 2.1 (60.8%), and GPT-4 (68.3%). The model performed best in allergy and rhinology (85.0%) and laryngology (82.5%), while showing lower accuracy in pediatrics (62.5%) and facial plastics/reconstructive surgery (67.5%). Accuracy declined as question difficulty increased. For reference, the average correct response rate among otolaryngologists and otolaryngology trainees on this question set was 71.1%.
Conclusion: This pilot study of E-GPT-A demonstrates the potential benefits of domain-specific customization of language models for otolaryngology. However, further development, continuous updates, and ongoing real-world validation are needed to fully assess the capabilities of LLMs in otolaryngology.