评估大语言模型将腰椎影像报告简化为面向患者文本的能力：一项关于GPT-4的初步研究

Assessing the ability of large language models to simplify lumbar spine imaging reports into patient-facing text: a pilot study of GPT-4.

作者信息

Khazanchi Rushmin, Chen Austin R, Desai Parth, Herrera Daniel, Staub Jacob R, Follett Matthew A, Krushelnytskyy Mykhaylo, Kemeny Hanna, Hsu Wellington K, Patel Alpesh A, Divi Srikanth N

机构信息

Feinberg School of Medicine, Northwestern University, 676 N. St. Clair Street, Suite 1350, Chicago, IL, 60611, USA.

Department of Orthopaedic Surgery, Northwestern University, Chicago, IL, USA.

出版信息

Skeletal Radiol. 2025 Sep 9. doi: 10.1007/s00256-025-05027-9.

DOI:10.1007/s00256-025-05027-9

PMID:40921880

Abstract

OBJECTIVE

To assess the ability of large language models (LLMs) to accurately simplify lumbar spine magnetic resonance imaging (MRI) reports.

MATERIALS AND METHODS

Patients who underwent lumbar decompression and/or fusion surgery in 2022 at one tertiary academic medical center were queried using appropriate CPT codes. We then identified all patients with a preoperative ICD diagnosis of lumbar spondylolisthesis and extracted the latest preoperative spine MRI radiology report text. The GPT-4 API was deployed on deidentified reports with a prompt to produce translations and evaluated for accuracy and readability. An enhanced GPT prompt was constructed using high-scoring reports and evaluated on low-scoring reports.

RESULTS

Of 93 included reports, GPT effectively reduced the average reading level (11.47 versus 8.50, p < 0.001). While most reports had no accuracy issues, 34% of translations omitted at least one clinically relevant piece of information, while 6% produced a clinically significant inaccuracy in the translation. An enhanced prompt model using high scoring reports-maintained reading level while significantly improving omission rate (p < 0.0001). However, even in the enhanced prompt model, GPT made several errors regarding location of stenosis, description of prior spine surgery, and description of other spine pathologies.

CONCLUSION

GPT-4 effectively simplifies the reading level of lumbar spine MRI reports. The model tends to omit key information in its translations, which can be mitigated with enhanced prompting. Further validation in the domain of spine radiology needs to be performed to facilitate clinical integration.

摘要

目的

评估大语言模型（LLMs）准确简化腰椎磁共振成像（MRI）报告的能力。

材料与方法

使用适当的现行程序编码（CPT）查询2022年在一家三级学术医疗中心接受腰椎减压和/或融合手术的患者。然后，我们识别出所有术前国际疾病分类（ICD）诊断为腰椎滑脱的患者，并提取最新的术前脊柱MRI放射学报告文本。将GPT-4应用程序编程接口（API）部署在去识别化的报告上，通过提示生成译文，并对准确性和可读性进行评估。使用高分报告构建增强型GPT提示，并在低分报告上进行评估。

结果

在纳入的93份报告中，GPT有效地降低了平均阅读难度（11.47对8.50，p<0.001）。虽然大多数报告没有准确性问题，但34%的译文遗漏了至少一条临床相关信息，而6%的译文在翻译中产生了具有临床意义的不准确信息。使用高分报告的增强提示模型在保持阅读难度的同时，显著提高了遗漏率（p<0.0001）。然而，即使在增强提示模型中，GPT在狭窄部位、既往脊柱手术描述和其他脊柱病变描述方面仍出现了一些错误。