Large language models in neurosurgery: a systematic review and meta-analysis.

Author information

Harvard Medical School, Harvard University, Boston, MA, 02115, USA.

Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.

Publication information

Acta Neurochir (Wien). 2024 Nov 23;166(1):475. doi: 10.1007/s00701-024-06372-9.

DOI:10.1007/s00701-024-06372-9

PMID:39579215

Abstract

BACKGROUND

Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature.

METHODS

We searched PubMed and Google Scholar using terms related to LLMs and neurosurgery: ("large language model" OR "LLM" OR "ChatGPT" OR "GPT-3" OR "GPT3" OR "GPT-3.5" OR "GPT3.5" OR "GPT-4" OR "GPT4" OR "LLAMA" OR "MISTRAL" OR "BARD") AND "neurosurgery". The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures.
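The Boolean search string above can be assembled programmatically, which makes it easier to rerun or extend. The following is a minimal sketch, not the authors' tooling: the names `LLM_TERMS`, `DOMAIN_TERM`, and `build_query` are hypothetical, and the sketch only builds the string rather than calling any search service.

```python
# Search terms exactly as listed in the Methods paragraph.
LLM_TERMS = [
    "large language model", "LLM", "ChatGPT",
    "GPT-3", "GPT3", "GPT-3.5", "GPT3.5",
    "GPT-4", "GPT4", "LLAMA", "MISTRAL", "BARD",
]
DOMAIN_TERM = "neurosurgery"


def build_query(llm_terms, domain_term):
    """Join the LLM terms with OR, then AND the group with the domain term.

    Hypothetical helper for illustration; it returns a plain string.
    """
    quoted = " OR ".join(f'"{term}"' for term in llm_terms)
    return f'({quoted}) AND "{domain_term}"'


query = build_query(LLM_TERMS, DOMAIN_TERM)
print(query)
```

Keeping the terms in a list means a later search can add newer model names without retyping the whole expression.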

RESULTS

Fifty-one articles met inclusion criteria and were categorized into six application areas, the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning.
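The reported percentages can be checked against the denominator of 51 included articles. The sketch below recomputes each percentage from its count; the variable and function names are illustrative only. The LLM counts sum to more than 51, which suggests that some studies evaluated more than one model, whereas the out-of-the-box and fine-tuned groups partition the 51 studies.

```python
TOTAL_ARTICLES = 51

# (count, reported percentage) pairs taken from the Results paragraph.
reported = {
    "Generation of Text for Direct Clinical Use": (14, 27.5),
    "Answering Standardized Exam Questions": (12, 23.5),
    "Clinical Judgement and Decision-Making Support": (11, 21.6),
    "GPT-3.5": (30, 58.8),
    "GPT-4": (20, 39.2),
    "Bard": (9, 17.6),
    "Bing": (6, 11.8),
    "Out-of-the-box use": (43, 84.3),
    "Pre-training or fine-tuning": (8, 15.7),
}


def percent(count, total=TOTAL_ARTICLES):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Collect any entry whose recomputed percentage differs from the reported one.
mismatches = {
    label: (n, pct, percent(n))
    for label, (n, pct) in reported.items()
    if percent(n) != pct
}
print(mismatches)
```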

CONCLUSIONS

Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. However, research typically addresses basic applications, overlooks ways to enhance LLM performance, and faces reproducibility issues. Standardizing detailed reporting, accounting for LLM stochasticity, and using advanced methods beyond basic validation are essential for progress.
