Blacker Samuel N, Kang Mia, Chakraborty Indranil, Chowdhury Tumul, Williams James, Lewis Carol, Zimmer Michael, Wilson Brad, Lele Abhijit V
Department of Anesthesiology, University of North Carolina at Chapel Hill.
Department of Anesthesiology, University of Arkansas.
J Neurosurg Anesthesiol. 2023 Dec 19. doi: 10.1097/ANA.0000000000000949.
We tested the ability of Chat Generative Pretrained Transformer (ChatGPT), an artificial intelligence chatbot, to answer questions relevant to scenarios covered in 3 clinical management guidelines published by the Society for Neuroscience in Anesthesiology and Critical Care (SNACC): endovascular treatment of stroke, perioperative stroke (Stroke), and care of patients undergoing complex spine surgery (Spine).
Four neuroanesthesiologists independently assessed whether ChatGPT could apply 52 high-quality recommendations (HQRs) included in the 3 SNACC guidelines. HQRs were deemed present in the ChatGPT responses if noted by at least 3 of the 4 reviewers. Reviewers also identified incorrect references, potentially harmful recommendations, and whether ChatGPT cited the SNACC guidelines.
The overall reviewer agreement for the presence of HQRs in the ChatGPT answers ranged from 0% to 100%. Only 4 of 52 (8%) HQRs were deemed present by at least 3 of the 4 reviewers after 5 generic questions, and 23 (44%) HQRs were deemed present after at least 1 additional targeted question. Potentially harmful recommendations were identified for each of the 3 clinical scenarios, and ChatGPT failed to cite the SNACC guidelines.
The ChatGPT answers were open to human interpretation regarding whether the responses included the HQRs. Although targeted questions resulted in the inclusion of more HQRs than generic questions, fewer than 50% of HQRs were noted even after targeted questioning. This suggests that ChatGPT should not currently be considered a reliable source of information for clinical decision-making. Future iterations of ChatGPT may refine its algorithms and improve its reliability as a source of clinical information.