Reliability of large language models in managing odontogenic sinusitis clinical scenarios: a preliminary multidisciplinary evaluation.

Affiliations

Otolaryngology Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy.

Maxillofacial Surgery Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy.

Publication information

Eur Arch Otorhinolaryngol. 2024 Apr;281(4):1835-1841. doi: 10.1007/s00405-023-08372-4. Epub 2024 Jan 8.

DOI: 10.1007/s00405-023-08372-4

PMID: 38189967

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10943141/

Abstract

PURPOSE

This study aimed to evaluate the utility of large language model (LLM) artificial intelligence tools, Chat Generative Pre-Trained Transformer (ChatGPT) versions 3.5 and 4, in managing complex otolaryngological clinical scenarios, specifically for the multidisciplinary management of odontogenic sinusitis (ODS).

METHODS

A prospective, structured multidisciplinary specialist evaluation was conducted using five ad hoc designed ODS-related clinical scenarios. LLM responses to these scenarios were critically reviewed by a multidisciplinary panel of eight specialist evaluators (2 ODS experts, 2 rhinologists, 2 general otolaryngologists, and 2 maxillofacial surgeons). Based on the level of disagreement expressed by panel members, a Total Disagreement Score (TDS) was calculated for each LLM response, and TDSs were compared between ChatGPT3.5 and ChatGPT4, as well as between evaluators.

RESULTS

Although some degree of disagreement was recorded in 73/80 evaluator reviews of the LLMs' responses, TDSs were significantly lower for ChatGPT4 than for ChatGPT3.5. The highest TDSs were found in the case of complicated ODS with orbital abscess, presumably due to increased case complexity, with dental, rhinologic, and orbital factors affecting diagnostic and therapeutic options. There were no statistically significant differences in TDSs between evaluators' specialties, though ODS experts and maxillofacial surgeons tended to assign higher TDSs.

CONCLUSIONS

LLMs like ChatGPT, especially newer versions, showed potential for complementing evidence-based clinical decision-making, but substantial disagreement was still demonstrated between LLMs and clinical specialists across most case examples, suggesting they are not yet optimal in aiding clinical management decisions. Future studies will be important to analyze LLMs' performance as they evolve over time.
