Generative Large Language Models Trained for Detecting Errors in Radiology Reports.

Author Information

Sun Cong, Teichman Kurt, Zhou Yiliang, Critelli Brian, Nauheim David, Keir Graham, Wang Xindi, Zhong Judy, Flanders Adam E, Shih George, Peng Yifan

Affiliations

Department of Population Health Sciences, Weill Cornell Medicine, 575 Lexington Ave, New York, NY 10022.

Department of Radiology, Weill Cornell Medicine, New York, NY.

Publication Information

Radiology. 2025 May;315(2):e242575. doi: 10.1148/radiol.242575.

Abstract

Background Large language models (LLMs) offer promising solutions, yet their application in medical proofreading, particularly in detecting errors within radiology reports, remains underexplored.

Purpose To develop and evaluate generative LLMs for detecting errors in radiology reports during medical proofreading.

Materials and Methods In this retrospective study, a dataset was constructed with two parts. The first part included 1656 synthetic chest radiology reports generated by GPT-4 (OpenAI) using specified prompts, with 828 being error-free synthetic reports and 828 containing errors. The second part included 614 reports: 307 error-free reports between 2011 and 2016 from the MIMIC chest radiograph (MIMIC-CXR) database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Then, several models, including Llama-3 (Meta AI), GPT-4, and BiomedBERT, were refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, the performance of these models was evaluated using F1 scores, 95% CIs, and paired-sample tests on the constructed dataset, with the prediction results further assessed by radiologists.

Results Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance, with the following F1 scores: 0.769 (95% CI: 0.757, 0.771) for negation errors, 0.772 (95% CI: 0.762, 0.780) for left/right errors, 0.750 (95% CI: 0.736, 0.763) for interval change errors, 0.828 (95% CI: 0.822, 0.832) for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model (50 for each error type). Of these, 99 were confirmed by both radiologists to contain errors detected by the models, and 163 were confirmed by at least one radiologist to contain model-detected errors.

Conclusion Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports.

© RSNA, 2025 See also the editorial by Marrocchio and Sverzellati in this issue.
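To make the zero-shot prompting strategy concrete, the sketch below shows what report-level error classification might look like with the OpenAI Python client. The prompt wording, JSON output schema, and function name are illustrative assumptions, not the prompts or code published by the authors.

```python
# Minimal zero-shot sketch of report-level error detection, assuming the
# OpenAI Python client (openai>=1.0). The system prompt and output schema
# are hypothetical, not the study's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a radiology proofreading assistant. Given a chest radiograph "
    "report, decide whether it contains any of four error types: negation, "
    "left/right, interval change, or transcription. Reply with a JSON object "
    'such as {"has_error": true, "error_type": "left/right"}.'
)

def detect_errors(report_text: str, model: str = "gpt-4") -> str:
    """Classify a single report zero-shot (no in-context examples)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for evaluation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content

print(detect_errors("No focal consolidation. There is a right pleural effusion."))
```

A few-shot variant would differ only by inserting a handful of labeled example reports into the message list before the query report.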
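The abstract reports per-error-type F1 scores with 95% CIs. One common way to obtain such intervals is percentile bootstrap resampling over reports; the sketch below assumes that approach, since the abstract does not specify the CI method.

```python
# Percentile-bootstrap 95% CI for F1, a common way to produce intervals
# like those in the abstract. That the study used this exact procedure
# is an assumption.
import numpy as np
from sklearn.metrics import f1_score

def f1_with_ci(y_true, y_pred, n_boot=2000, seed=0):
    """Point-estimate F1 plus a percentile bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = f1_score(y_true, y_pred, zero_division=0)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample reports
        scores.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return point, (lo, hi)

# Toy usage: 1 = report contains an error, 0 = error-free.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(f1_with_ci(y_true, y_pred))
```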

