Otolaryngology Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy.
Maxillofacial Surgery Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy.
Eur Arch Otorhinolaryngol. 2024 Apr;281(4):1835-1841. doi: 10.1007/s00405-023-08372-4. Epub 2024 Jan 8.
This study aimed to evaluate the utility of large language model (LLM) artificial intelligence tools, Chat Generative Pre-Trained Transformer (ChatGPT) versions 3.5 and 4, in managing complex otolaryngological clinical scenarios, specifically for the multidisciplinary management of odontogenic sinusitis (ODS).
A prospective, structured multidisciplinary specialist evaluation was conducted using five ad hoc designed ODS-related clinical scenarios. LLM responses to these scenarios were critically reviewed by a multidisciplinary panel of eight specialist evaluators (2 ODS experts, 2 rhinologists, 2 general otolaryngologists, and 2 maxillofacial surgeons). Based on the level of disagreement from panel members, a Total Disagreement Score (TDS) was calculated for each LLM response, and TDS comparisons were made between ChatGPT3.5 and ChatGPT4, as well as between different evaluators.
Although some degree of disagreement was found in 73 of 80 evaluator reviews of the LLMs' responses, TDSs were significantly lower for ChatGPT4 than for ChatGPT3.5. The highest TDSs were found in the case of complicated ODS with orbital abscess, presumably due to increased case complexity, with dental, rhinologic, and orbital factors affecting diagnostic and therapeutic options. There were no statistically significant differences in TDSs between evaluators' specialties, though ODS experts and maxillofacial surgeons tended to assign higher TDSs.
LLMs like ChatGPT, especially newer versions, showed potential for complementing evidence-based clinical decision-making, but substantial disagreement was still demonstrated between LLMs and clinical specialists across most case examples, suggesting they are not yet optimal in aiding clinical management decisions. Future studies will be important to analyze LLMs' performance as they evolve over time.