Filali Ansary Rania, Lechien Jerome R
Department of Surgery, Faculty of Medicine, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons B-7000, Belgium.
Department of Otolaryngology-Head and Neck Surgery, School of Medicine, Foch Hospital, University Paris Saclay, Paris, France.
Eur Arch Otorhinolaryngol. 2025 Jun 6. doi: 10.1007/s00405-025-09504-8.
This systematic review evaluated the diagnostic accuracy of large language models (LLMs) in otolaryngology-head and neck surgery clinical decision-making.
The PubMed/MEDLINE, Cochrane Library, and Embase databases were searched for studies investigating the accuracy of LLMs as clinical decision support in otolaryngology.
Three investigators searched the literature for peer-reviewed studies investigating the application of LLMs as clinical decision support for real clinical cases, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The outcomes considered were diagnostic accuracy and the appropriateness of additional-examination and treatment recommendations. Study quality was assessed with the modified Methodological Index for Non-Randomized Studies (MINORS).
Of the 285 publications identified, 17 met the inclusion criteria, accounting for 734 patients across various otolaryngology subspecialties. ChatGPT-4 was the most frequently evaluated LLM (n = 14/17), followed by Claude-3/3.5 (n = 2/17) and Gemini (n = 2/17). Primary diagnostic accuracy ranged from 45.7% to 80.2% across LLMs, with Claude often outperforming ChatGPT. LLMs were less accurate at recommending appropriate additional examinations (10-29%) and treatments (16.7-60%), with substantial variability across subspecialties. Treatment recommendation accuracy was highest in head and neck oncology (55-60%) and lowest in rhinology (16.7%). Studies were substantially heterogeneous in their inclusion criteria, the information entered into the application programming interface (API), and their methods of accuracy assessment.
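To illustrate why the information entered into an API and the scoring method both matter, the sketch below shows one plausible way a study might submit a de-identified case vignette to an LLM endpoint and tally diagnostic accuracy. It is a minimal sketch, not any reviewed study's protocol: the OpenAI Python client, the model identifier, the prompt wording, the vignettes, and the exact-match scoring rule are all assumptions made for illustration.

```python
# Minimal sketch of an LLM diagnostic-accuracy evaluation loop.
# Assumptions (not from the review): the OpenAI Python client (openai>=1.0),
# hypothetical de-identified vignettes, and naive string-match scoring.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical vignettes with reference diagnoses (illustrative only).
cases = [
    {"vignette": "45-year-old with unilateral hearing loss and tinnitus ...",
     "reference": "vestibular schwannoma"},
    {"vignette": "16-year-old male with recurrent epistaxis and nasal obstruction ...",
     "reference": "juvenile nasopharyngeal angiofibroma"},
]

correct = 0
for case in cases:
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": "You are an otolaryngology decision-support tool. "
                        "Give the single most likely diagnosis."},
            {"role": "user", "content": case["vignette"]},
        ],
    )
    answer = response.choices[0].message.content.lower()
    # Naive scoring for illustration; the reviewed studies used expert
    # raters rather than string matching, which is one source of the
    # heterogeneity in accuracy assessment noted above.
    if case["reference"] in answer:
        correct += 1

print(f"Diagnostic accuracy: {correct / len(cases):.1%}")
```

Differences in any of these choices (prompt phrasing, amount of clinical detail supplied, scoring rule) can shift the resulting accuracy figure, which is why the review calls for methodological standardization.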
LLMs demonstrate moderate but promising diagnostic accuracy in otolaryngology clinical decision support, performing better at providing diagnoses than at suggesting appropriate additional examinations and treatments. Emerging findings suggest that Claude often outperforms ChatGPT. Methodological standardization is needed in future research.