Reliability of large language models in managing odontogenic sinusitis clinical scenarios: a preliminary multidisciplinary evaluation.

Affiliations

Otolaryngology Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy.

Maxillofacial Surgery Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy.

Publication information

Eur Arch Otorhinolaryngol. 2024 Apr;281(4):1835-1841. doi: 10.1007/s00405-023-08372-4. Epub 2024 Jan 8.

Abstract

PURPOSE

This study aimed to evaluate the utility of large language model (LLM) artificial intelligence tools, Chat Generative Pre-Trained Transformer (ChatGPT) versions 3.5 and 4, in managing complex otolaryngological clinical scenarios, specifically for the multidisciplinary management of odontogenic sinusitis (ODS).

METHODS

A prospective, structured multidisciplinary specialist evaluation was conducted using five ad hoc designed ODS-related clinical scenarios. LLM responses to these scenarios were critically reviewed by a multidisciplinary panel of eight specialist evaluators (2 ODS experts, 2 rhinologists, 2 general otolaryngologists, and 2 maxillofacial surgeons). Based on the level of disagreement from panel members, a Total Disagreement Score (TDS) was calculated for each LLM response, and TDS comparisons were made between ChatGPT3.5 and ChatGPT4, as well as between different evaluators.

RESULTS

While some degree of disagreement was demonstrated in 73/80 evaluator reviews of LLMs' responses, TDSs were significantly lower for ChatGPT4 than for ChatGPT3.5. The highest TDSs were found in the case of complicated ODS with orbital abscess, presumably due to increased case complexity, with dental, rhinologic, and orbital factors affecting diagnostic and therapeutic options. There were no statistically significant differences in TDSs between evaluators' specialties, though ODS experts and maxillofacial surgeons tended to assign higher TDSs.
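The denominator of 80 reviews is consistent with the design described in the methods, assuming that each of the eight evaluators reviewed each model's response to each of the five scenarios exactly once. The abstract does not state this explicitly, so the short sketch below is only a sanity check of the arithmetic under that assumption; the function names are illustrative and not taken from the study.

```python
# Design counts as reported in the abstract.
N_SCENARIOS = 5    # ad hoc ODS-related clinical scenarios
N_MODELS = 2       # ChatGPT3.5 and ChatGPT4
N_EVALUATORS = 8   # multidisciplinary panel members


def total_reviews(scenarios: int, models: int, evaluators: int) -> int:
    """Number of evaluator reviews, ASSUMING one review per
    evaluator per model per scenario (not stated in the abstract)."""
    return scenarios * models * evaluators


def disagreement_rate(n_disagree: int, n_total: int) -> float:
    """Share of reviews in which some disagreement was recorded."""
    return n_disagree / n_total


reviews = total_reviews(N_SCENARIOS, N_MODELS, N_EVALUATORS)
rate = disagreement_rate(73, reviews)  # 73 is the count reported in RESULTS
print(f"{reviews} reviews, {rate:.2%} with some disagreement")
```

Under this assumption, roughly nine in ten reviews recorded some disagreement with the LLM response. The sketch does not reproduce the TDS itself, whose scoring rules are not given in the abstract.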

CONCLUSIONS

LLMs like ChatGPT, especially newer versions, showed potential for complementing evidence-based clinical decision-making, but substantial disagreement was still demonstrated between LLMs and clinical specialists across most case examples, suggesting they are not yet optimal in aiding clinical management decisions. Future studies will be important to analyze LLMs' performance as they evolve over time.
