Brown Ethan D L, Maity Apratim, Ward Max, Toscano Daniel, Baum Griffin R, Mittler Mark A, Lo Sheng-Fu Larry, D'Amico Randy S
Department of Neurological Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Lake Success, New York, USA.
World Neurosurg. 2024 Nov;191:e304-e332. doi: 10.1016/j.wneu.2024.08.122. Epub 2024 Aug 28.
ChatGPT has been increasingly investigated for its ability to provide clinical decision support in the management of neurosurgical pathologies. However, concerns exist regarding the validity of its responses. To assess the reliability of ChatGPT, we compared its responses against the 2023 Congress of Neurological Surgeons (CNS) guidelines for patients with Chiari I Malformation (CIM).
ChatGPT-3.5 and ChatGPT-4 were prompted with revised questions from the 2023 CNS guidelines for patients with CIM. ChatGPT-provided responses were compared with CNS guideline recommendations using cosine similarity scores and reviewer assessments of 1) contradiction of guidelines, 2) recommendations not contained in the guidelines, and 3) failure to include guideline recommendations. A scoping review was conducted to investigate reviewer-identified discrepancies between CNS recommendations and ChatGPT-4 responses.
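For readers unfamiliar with the similarity metric used, the following is a minimal sketch of how a cosine similarity score between a guideline recommendation and a model response might be computed. The abstract does not specify how the texts were vectorized; TF-IDF vectorization is assumed here purely for illustration, and the function and variable names are hypothetical.

```python
# Minimal sketch: cosine similarity between a guideline recommendation and a
# ChatGPT response. TF-IDF vectorization is an assumption, not the authors'
# stated method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def response_similarity(guideline_text: str, chatgpt_text: str) -> float:
    """Return the cosine similarity between two texts (0 = dissimilar, 1 = identical)."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([guideline_text, chatgpt_text])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

# Hypothetical usage:
# score = response_similarity(cns_recommendation, gpt4_response)
# print(f"cosine similarity: {score:.3f}")
```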
A majority of ChatGPT responses were consistent with CNS recommendations. However, moderate contradiction was observed between responses and guidelines (15.3% of ChatGPT-3.5 responses, 38.5% of ChatGPT-4 responses). Additionally, a tendency toward over-recommendation (30.8% of ChatGPT-3.5 responses, 46.1% of ChatGPT-4 responses) rather than under-recommendation (15.4% of ChatGPT-3.5 responses, 7.7% of ChatGPT-4 responses) was observed. Cosine similarity scores revealed moderate similarity between CNS and ChatGPT recommendations (0.553 for ChatGPT-3.5, 0.549 for ChatGPT-4). The scoping review identified 19 studies relevant to substantive contradictions between CNS recommendations and ChatGPT responses, with mixed support for recommendations contradicting the official guidelines.
Moderate inconsistency was observed between ChatGPT responses and CNS guidelines on the diagnosis and management of CIM. The recency of the CNS guidelines and the mixed support for contradictory ChatGPT responses highlight the need for further refinement of large language models before they are applied as clinical decision support tools.