1 Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.
2 Department of Clinical Neurosciences, Project neuroArm, Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada.
J Neurosurg. 2023 Oct 6;140(4):1041-1053. doi: 10.3171/2023.7.JNS23573. Print 2024 Apr 1.
The objective was to assess the performance of a context-enriched large language model (LLM) compared with international neurosurgical experts on questions related to the management of vestibular schwannoma. A second objective was to develop a chat-based platform that incorporates in-text citations, references, and memory to provide accurate, relevant, and reliable information in real time.
The analysis involved 1) creating a data set through web scraping; 2) developing a chat-based platform called neuroGPT-X; 3) enlisting 8 expert neurosurgeons across international centers to independently create questions (n = 1), answer them (n = 4), and evaluate the responses (n = 3) while blinded; and 4) analyzing the evaluation results on the management of vestibular schwannoma. In the blinded phase, all answers were assessed for accuracy, coherence, relevance, thoroughness, speed, and overall rating. The experts were then unblinded. In the unblinded phase, all neurosurgeons completed a Likert scale survey and long-answer questions regarding the clinical utility, likelihood of use, and limitations of the tool. The tool was then evaluated against a set of 103 consensus statements on vestibular schwannoma care from the 8th Quadrennial International Conference on Vestibular Schwannoma.
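To make "context-enriched" concrete, the following is a minimal sketch of how retrieved passages, numbered in-text citations, and conversation memory could be combined into a single prompt. It is an illustration only and not the authors' neuroGPT-X implementation: the function names, the word-overlap ranking, the prompt wording, and the placeholder corpus are all assumptions introduced here.

```python
from collections import Counter

STRIP_CHARS = ".,;:()?!"

def tokenize(text):
    """Lowercase words with surrounding punctuation removed."""
    words = (w.strip(STRIP_CHARS).lower() for w in text.split())
    return [w for w in words if w]

def overlap_score(query, passage):
    """Count words shared by query and passage (a toy relevance measure)."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum(min(q[w], p[w]) for w in q)

def build_prompt(question, corpus, history, top_k=2):
    """Return (prompt, references) with numbered citations for the top passages."""
    ranked = sorted(
        corpus,
        key=lambda doc: overlap_score(question, doc["text"]),
        reverse=True,
    )[:top_k]
    context = "\n".join(f"[{i}] {d['text']}" for i, d in enumerate(ranked, 1))
    references = [f"[{i}] {d['source']}" for i, d in enumerate(ranked, 1)]
    memory = "\n".join(history)  # earlier turns, i.e. the chat "memory"
    prompt = (
        "Answer using only the numbered context and cite it as [n].\n"
        f"Context:\n{context}\n"
        f"Conversation so far:\n{memory}\n"
        f"Question: {question}"
    )
    return prompt, references

# Hypothetical placeholder corpus; a real system would use the scraped data set.
corpus = [
    {"source": "hypothetical-source-A",
     "text": "Placeholder passage on radiosurgery for vestibular schwannoma."},
    {"source": "hypothetical-source-B",
     "text": "Placeholder passage on hearing preservation."},
    {"source": "hypothetical-source-C",
     "text": "Placeholder passage on an unrelated topic."},
]
history = ["User: hello", "Assistant: hello"]
prompt, references = build_prompt(
    "What about radiosurgery for vestibular schwannoma?", corpus, history
)
print(prompt)
print(references)
```

The prompt string would then be sent to a language model of choice; the reference list lets the interface display sources next to the cited answer.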
Responses from the naive and context-enriched Generative Pretrained Transformer (GPT) models were consistently rated as not significantly different in accuracy, coherence, relevance, thoroughness, and overall performance, and they were often rated significantly higher than expert responses. Both the naive and context-enriched GPT models responded to the standardized question set faster than the expert neurosurgeon respondents (p < 0.01). The context-enriched GPT model agreed with 98 of the 103 (95%) consensus statements. Of interest, all expert surgeons expressed concerns about the reliability of GPT in accurately addressing the nuances and controversies surrounding the management of vestibular schwannoma. Furthermore, the authors developed neuroGPT-X, a chat-based platform designed to provide point-of-care clinical support and mitigate the limitations of human memory. neuroGPT-X incorporates features such as in-text citations and references to provide accurate, relevant, and reliable information in real time.
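As a quick check of the reported agreement figure, the arithmetic can be reproduced from the two counts given above. The variable names are illustrative only.

```python
# Counts taken from the abstract: agreement with 98 of 103 consensus statements.
agreed = 98
total = 103

rate = agreed / total
disagreed = total - agreed

print(f"Agreement: {agreed}/{total} = {rate:.1%}")  # 95.1%, reported as 95%
print(f"Statements without agreement: {disagreed}")  # 5
```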
The present study, in which a context-enriched LLM showed subspecialist-level performance in generating written responses to complex neurosurgical problems for which evidence-based consensus for management is lacking, suggests that context-enriched LLMs show promise as a point-of-care medical resource. The authors anticipate that this work will be a springboard for expansion into more medical specialties, incorporating evidence-based clinical information and developing expert-level dialogue surrounding LLMs in healthcare.