Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States.
Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States.
J Clin Neurosci. 2024 May;123:151-156. doi: 10.1016/j.jocn.2024.03.021. Epub 2024 Apr 4.
Although prior work demonstrated the surprising accuracy of Large Language Models (LLMs) on neurosurgery board-style questions, their use in day-to-day clinical situations warrants further investigation. This study assessed GPT-4.0's responses to common clinical questions across various subspecialties of neurosurgery.
A panel of attending neurosurgeons formulated 35 general neurosurgical questions spanning neuro-oncology, spine, vascular, functional, pediatrics, and trauma. All questions were input into GPT-4.0 with a prespecified, standard prompt. Responses were evaluated by two attending neurosurgeons, each on a standardized scale for accuracy, safety, and helpfulness. Citations were indexed and evaluated against identifiable database references.
GPT-4.0 responses were consistent with current medical guidelines and accounted for recent advances in the field 92.8% and 78.6% of the time, respectively. Neurosurgeons reported that GPT-4.0 responses provided unrealistic information or potentially risky information 14.3% and 7.1% of the time, respectively. Assessed on 5-point scales, responses suggested that GPT-4.0 was clinically useful (4.0 ± 0.6), relevant (4.7 ± 0.3), and coherent (4.9 ± 0.2). The depth of clinical responses varied (3.7 ± 0.6), and "red flag" symptoms were missed 7.1% of the time. Moreover, GPT-4.0 cited 86 references (2.46 citations per answer), of which only 50% were deemed valid, and 77.1% of responses contained at least one inappropriate citation.
Current general-purpose LLM technology can offer broadly accurate, safe, and helpful neurosurgical information, but may not fully evaluate the medical literature or recent advances in the field. Citation generation and usage remain unreliable. As this technology becomes more ubiquitous, clinicians will need to exercise caution when using it in practice.