Woźnicki Piotr, Laqua Caroline, Fiku Ina, Hekalo Amar, Truhn Daniel, Engelhardt Sandy, Kather Jakob, Foersch Sebastian, D'Antonoli Tugba Akinci, Pinto Dos Santos Daniel, Baeßler Bettina, Laqua Fabian Christopher
Department of Diagnostic and Interventional Radiology, University Hospital Würzburg, Würzburg, Germany.
Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany.
Eur Radiol. 2025 Apr;35(4):2018-2029. doi: 10.1007/s00330-024-11074-y. Epub 2024 Oct 10.
Structured reporting enhances comparability, readability, and content detail. Large language models (LLMs) could convert free text into structured data without disrupting radiologists' reporting workflow. This study evaluated an on-premise, privacy-preserving LLM for automatically structuring free-text radiology reports.
We developed an approach to controlling the LLM output, ensuring the validity and completeness of structured reports produced by a locally hosted Llama-2-70B-chat model. A dataset of de-identified narrative chest radiograph (CXR) reports was compiled retrospectively. It included 202 English reports from the publicly available MIMIC-CXR dataset and 197 German reports from our university hospital. A senior radiologist prepared a detailed, fully structured reporting template with 48 question-answer pairs. All reports were independently structured by the LLM and two human readers. Bayesian inference (Markov chain Monte Carlo sampling) was used to estimate the distributions of the Matthews correlation coefficient (MCC), with [-0.05, 0.05] as the region of practical equivalence (ROPE).
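The output-control and evaluation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the template field names, the fallback option, and the helper functions are hypothetical stand-ins for the actual 48-item template, and the MCC is computed here in its binary form for a single finding.

```python
# Hypothetical sketch of the study's pipeline: (1) constrain each LLM answer
# to the closed option set of a reporting template so every structured report
# is valid and complete, (2) score agreement with a human reader via the
# Matthews correlation coefficient (MCC), (3) check whether a posterior of
# MCC differences falls inside the region of practical equivalence (ROPE).
import math

TEMPLATE = {  # tiny stand-in for the paper's 48 question-answer pairs
    "pleural_effusion": {"yes", "no", "not mentioned"},
    "pneumothorax": {"yes", "no", "not mentioned"},
    "cardiomegaly": {"yes", "no", "not mentioned"},
}

def constrain(field: str, raw_answer: str) -> str:
    """Map a free-text LLM answer onto the template's allowed options;
    anything invalid falls back to 'not mentioned' (assumed fallback)."""
    answer = raw_answer.strip().lower()
    return answer if answer in TEMPLATE[field] else "not mentioned"

def structure_report(llm_answers: dict) -> dict:
    """Produce a complete structured report: every field present and valid."""
    return {f: constrain(f, llm_answers.get(f, "")) for f in TEMPLATE}

def mcc(y_true: list, y_pred: list) -> float:
    """Binary MCC ('yes' vs. everything else) over a list of reports."""
    tp = sum(t == "yes" and p == "yes" for t, p in zip(y_true, y_pred))
    tn = sum(t != "yes" and p != "yes" for t, p in zip(y_true, y_pred))
    fp = sum(t != "yes" and p == "yes" for t, p in zip(y_true, y_pred))
    fn = sum(t == "yes" and p != "yes" for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def rope_fraction(diff_samples: list, low: float = -0.05, high: float = 0.05) -> float:
    """Fraction of posterior MCC-difference samples inside the ROPE."""
    return sum(low <= d <= high for d in diff_samples) / len(diff_samples)
```

In the study, the MCC posteriors were estimated with MCMC sampling; `rope_fraction` only illustrates the final ROPE check on such posterior samples.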
The LLM generated valid structured reports in all cases, achieving an average MCC of 0.75 (94% HDI: 0.70-0.80) and F1 score of 0.70 (0.70-0.80) for English, and 0.66 (0.62-0.70) and 0.68 (0.64-0.72) for German reports, respectively. The MCC differences between LLM and humans were within ROPE for both languages: 0.01 (-0.05 to 0.07), 0.01 (-0.05 to 0.07) for English, and -0.01 (-0.07 to 0.05), 0.00 (-0.06 to 0.06) for German, indicating approximately comparable performance.
Locally hosted, open-source LLMs can automatically structure free-text radiology reports with approximately human accuracy. However, semantic understanding varied across languages and imaging findings.
Question: Why has structured reporting not been widely adopted in radiology despite clear benefits, and how can we improve this?
Findings: A locally hosted large language model successfully structured narrative reports, showing variation between languages and findings.
Critical relevance: Structured reporting provides many benefits, but its integration into the clinical routine is limited. Automating the extraction of structured information from radiology reports enables the capture of structured data while allowing the radiologist to maintain their reporting workflow.