neuroGPT-X: toward a clinic-ready large language model.

Author affiliations

1. Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.

2. Department of Clinical Neurosciences, Project neuroArm, Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada.

Publication information

J Neurosurg. 2023 Oct 6;140(4):1041-1053. doi: 10.3171/2023.7.JNS23573. Print 2024 Apr 1.

DOI: 10.3171/2023.7.JNS23573
PMID: 38564804
Abstract

OBJECTIVE

The objective was to assess the performance of a context-enriched large language model (LLM) compared with international neurosurgical experts on questions related to the management of vestibular schwannoma. Furthermore, another objective was to develop a chat-based platform incorporating in-text citations, references, and memory to enable accurate, relevant, and reliable information in real time.

METHODS

The analysis involved 1) creating a data set through web scraping, 2) developing a chat-based platform called neuroGPT-X, 3) enlisting 8 expert neurosurgeons across international centers to independently create questions (n = 1) and to answer (n = 4) and evaluate responses (n = 3) while blinded, and 4) analyzing the evaluation results on the management of vestibular schwannoma. In the blinded phase, all answers were assessed for accuracy, coherence, relevance, thoroughness, speed, and overall rating. All experts were unblinded and provided their thoughts on the utility and limitations of the tool. In the unblinded phase, all neurosurgeons provided answers to a Likert scale survey and long-answer questions regarding the clinical utility, likelihood of use, and limitations of the tool. The tool was then evaluated on the basis of a set of 103 consensus statements on vestibular schwannoma care from the 8th Quadrennial International Conference on Vestibular Schwannoma.

RESULTS

Responses from the naive and context-enriched Generative Pretrained Transformer (GPT) models were consistently rated not significantly different in terms of accuracy, coherence, relevance, thoroughness, and overall performance, and they were often rated significantly higher than expert responses. Both the naive and context-enriched GPT models provided faster responses to the standardized question set than expert neurosurgeon respondents (p < 0.01). The context-enriched GPT model agreed with 98 of the 103 (95%) consensus statements. Of interest, all expert surgeons expressed concerns about the reliability of GPT in accurately addressing the nuances and controversies surrounding the management of vestibular schwannoma. Furthermore, the authors developed neuroGPT-X, a chat-based platform designed to provide point-of-care clinical support and mitigate the limitations of human memory. neuroGPT-X incorporates features such as in-text citations and references to enable accurate, relevant, and reliable information in real time.
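As a quick arithmetic check on the consensus-statement result, the reported 95% can be reproduced from the two counts given in the abstract (98 agreements out of 103 statements). The sketch below is illustrative only: the function name `agreement_rate` is hypothetical and is not part of the study's tooling.

```python
def agreement_rate(agreed: int, total: int) -> float:
    """Return the percentage of statements agreed with (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= agreed <= total:
        raise ValueError("agreed must be between 0 and total")
    return 100.0 * agreed / total


# Counts taken from the abstract: 98 of 103 consensus statements.
rate = agreement_rate(98, 103)
print(f"{rate:.1f}% agreement, {103 - 98} statements without agreement")
```

Rounded to the nearest whole percent, this matches the 95% stated in the abstract.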

CONCLUSIONS

The present study, with its subspecialist-level performance in generating written responses to complex neurosurgical problems for which evidence-based consensus for management is lacking, suggests that context-enriched LLMs show promise as a point-of-care medical resource. The authors anticipate that this work will be a springboard for expansion into more medical specialties, incorporating evidence-based clinical information and developing expert-level dialogue surrounding LLMs in healthcare.
